WO1999054500A9 - Biallelic markers for use in constructing a high density disequilibrium map of the human genome - Google Patents

Biallelic markers for use in constructing a high density disequilibrium map of the human genome

Info

Publication number
WO1999054500A9
Authority
WO
WIPO (PCT)
Prior art keywords
biallelic marker
map
biallelic
markers
polynucleotide
Application number
PCT/IB1999/000822
Other languages
French (fr)
Other versions
WO1999054500A3 (en)
WO1999054500A2 (en)
Inventor
Daniel Cohen
Marta Blumenfeld
Ilya Chumakov
Original Assignee
Genset Sa
Daniel Cohen
Marta Blumenfeld
Ilya Chumakov
Application filed by Genset Sa, Daniel Cohen, Marta Blumenfeld, Ilya Chumakov
Priority to AU34386/99A (AU3438699A)
Priority to CA002324866A (CA2324866A1)
Priority to EP99915988A (EP1071817A2)
Priority to US09/422,978 (US6537751B1)
Publication of WO1999054500A2
Publication of WO1999054500A9
Publication of WO1999054500A3
Priority to US10/349,143 (US20040005584A1)
Priority to US11/370,584 (US20060177863A1)

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6827: Hybridisation assays for detection of mutation or polymorphism
    • C12Q1/6834: Enzymatic or biochemical coupling of nucleic acids to a solid phase
    • C12Q1/6837: Enzymatic or biochemical coupling of nucleic acids to a solid phase using probe arrays or probe chips
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers
    • C12Q2600/172: Haplotypes

Definitions

  • the partial sequence information available can be used to identify genes responsible for detectable human traits, such as genes associated with human diseases, and to develop diagnostic tests capable of identifying individuals who express a detectable trait as the result of a specific genotype or individuals whose genotype places them at risk of developing a detectable trait at a subsequent time.
  • the present invention relates to an ordered set of human genomic sequences comprising single nucleotide polymorphisms, as well as the use of these polymorphisms as a high resolution map of the human genome, methods of identifying genes associated with detectable human traits, and diagnostics for identifying individuals who carry a gene which causes them to express a detectable trait or which places them at risk of expressing a detectable trait in the future.
  • the map-related biallelic markers of the present invention offer a number of important advantages over other genetic markers such as RFLP (Restriction fragment length polymorphism), VNTR (Variable Number of Tandem Repeats) markers and earlier STS- (sequence tagged sites) derived markers.
  • RFLP: Restriction fragment length polymorphism
  • VNTR: Variable Number of Tandem Repeats
  • STS: sequence tagged sites
  • the first generation of markers were RFLPs, which are variations that modify the length of a restriction fragment. But methods used to identify and to type RFLPs are relatively wasteful of materials, effort, and time. Since they are biallelic markers (they present only two alleles, the restriction site being either present or absent), their maximum heterozygosity is 0.5. The theoretical number of RFLPs distributed along the entire human genome is more than 10^5, which leads to a potential average inter-marker distance of 30 kilobases. However, in reality the number of evenly distributed RFLPs which occur at a sufficient frequency in the population to make them useful for tracking of genetic polymorphisms is very limited.
  • The second generation of genetic markers were VNTRs, which can be categorized as either minisatellites or microsatellites.
  • Minisatellites are tandemly repeated DNA sequences present in units of 5-50 repeats which are distributed along regions of the human chromosomes ranging from 0.1 to 20 kilobases in length. Since they present many possible alleles, their informative content is very high.
  • Minisatellites are scored by performing Southern blots to identify the number of tandem repeats present in a nucleic acid sample from the individual being tested. However, there are only 10^4 potential VNTRs that can be typed by Southern blotting. Thus, the number of easily typed informative markers in these maps is far too small for the average distance between informative markers to fulfill the requirements for a useful genetic map.
  • both RFLP and VNTR markers are costly and time-consuming to develop and assay in large numbers.
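The heterozygosity figures quoted above (a ceiling of 0.5 for a biallelic RFLP, much higher information content for multi-allelic VNTRs) can be checked with a short calculation. The sketch below assumes Hardy-Weinberg proportions, which the text does not state explicitly, so that expected heterozygosity is 1 minus the sum of squared allele frequencies. The function name and the 10-allele VNTR example are hypothetical.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2), assuming Hardy-Weinberg proportions."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# A biallelic marker reaches its maximum when both alleles are equally frequent.
biallelic_max = expected_heterozygosity([0.5, 0.5])

# Hypothetical VNTR with 10 equally frequent alleles (illustrative numbers only).
vntr_example = expected_heterozygosity([0.1] * 10)

print(f"biallelic maximum: {biallelic_max:.2f}")
print(f"10-allele VNTR:    {vntr_example:.2f}")
```

Under these assumptions a two-allele marker cannot exceed 0.5, while a marker with many alleles approaches 1, which is the trade-off the passage describes.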
  • Single Nucleotide Polymorphisms (SNPs), more preferably non-RFLP biallelic markers therein.
  • SNPs: Single Nucleotide Polymorphisms
  • polymorphisms are identified by determining the sequence of the STSs in 5 to 10 individuals.
  • As will be further explained below, genetic studies have mostly relied in the past on a statistical approach called linkage analysis, which took advantage of microsatellite markers to study their inheritance pattern within families from which a sufficient number of individuals presented the studied trait. Because of intrinsic limitations of linkage analysis, which will be further detailed below, and because these studies necessitate the recruitment of adequate family pedigrees, they are not well suited to the genetic analysis of all traits, particularly those for which only sporadic cases are available (e.g. drug response traits), or those which have a low penetrance within the studied population.
  • association studies enabled by the biallelic markers of the present invention offer an alternative to linkage analysis. Combined with the use of a high density map of appropriately spaced, sufficiently informative markers, association studies, including linkage disequilibrium-based genome wide association studies, will enable the identification of most genes involved in complex traits.
  • Single nucleotide polymorphisms or biallelic markers can be used in the same manner as RFLPs and VNTRs but offer several advantages.
  • Single nucleotide polymorphisms are densely spaced in the human genome and represent the most frequent type of variation. An estimated number of more than 10^7 sites are scattered along the 3x10^9 base pairs of the human genome. Therefore, single nucleotide polymorphisms occur at a greater frequency and with greater uniformity than RFLP or VNTR markers, which means that there is a greater probability that such a marker will be found in close proximity to a genetic locus of interest.
  • Single nucleotide polymorphisms are less variable than VNTR markers but are mutationally more stable.
  • biallelic markers of the present invention are often easier to distinguish and can therefore be typed easily on a routine basis.
  • Biallelic markers have single nucleotide based alleles and they have only two common alleles, which allows highly parallel detection and automated scoring.
  • the biallelic markers of the present invention offer the possibility of rapid, high-throughput genotyping of a large number of individuals. Biallelic markers are densely spaced in the genome, sufficiently informative and can be assayed in large numbers. The combined effects of these advantages make biallelic markers extremely valuable in genetic studies.
  • Biallelic markers can be used in linkage studies in families, in allele sharing methods, in linkage disequilibrium studies in populations, and in association studies of case-control populations.
  • An important aspect of the present invention is that biallelic markers allow association studies to be performed to identify genes involved in complex traits. Association studies examine the frequency of marker alleles in unrelated case- and control-populations and are generally employed in the detection of polygenic or sporadic traits. Association studies may be conducted within the general population and are not limited to studies performed on related individuals in affected families.
  • the present invention relates to high density linkage disequilibrium-based genetic maps of the human genome which comprise the map-related biallelic markers of the invention and will allow the identification of genes responsible for detectable traits using genome-wide association studies and linkage disequilibrium mapping.
  • the present invention is based on the discovery of a set of novel map-related biallelic markers. See Table 1. The position of these markers and knowledge of the surrounding sequence has been used to design polynucleotide compositions which are useful in high density mapping of the human genome as well as in determining the identity of nucleotides at the marker position, and more complex association and haplotyping studies which are useful in determining the genetic basis for disease states.
  • the compositions and methods of the invention find use in the identification of the targets for the development of pharmaceutical agents and diagnostic methods, as well as the characterization of the differential efficacious responses to and side effects from pharmaceutical agents acting on a disease as well as other treatments.
  • a first embodiment of the present invention is a map of the human genome comprising an ordered array of biallelic markers, wherein at least 1, 2, 5, 10, 20, 25, 30, 50, 100, 200, 500, 1000, 2000 or 3000 of said biallelic markers are map-related biallelic markers.
  • the maps of the present invention encompass maps with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos.
  • said ordered array comprises at least 20,000, 40,000, 60,000, 80,000, 100,000, or 120,000 biallelic markers; optionally, wherein said biallelic markers are separated from one another by an average distance of 10kb-200 kb, 15kb-150 kb, 20kb-100 kb, 100kb-150 kb, 50-100kb, or 25 kb-50 kb in the human genome; optionally, said biallelic markers are distributed at an average density of at least one biallelic marker every 150kb, 50 kb, or 30 kb in the human genome; or optionally, wherein all of said biallelic markers are selected to have a heterozygosity rate of at least about 0.18, 0.32, or 0.42.
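The marker counts, spacings and heterozygosity thresholds listed above are linked by simple arithmetic. The following sketch is illustrative only. It assumes a genome of about 3x10^9 base pairs (the figure quoted earlier in this text), evenly spaced markers, and Hardy-Weinberg proportions so that a biallelic marker with minor allele frequency p has heterozygosity 2p(1-p). The function names are hypothetical.

```python
import math

GENOME_BP = 3e9  # approximate human genome size quoted earlier in the text


def average_spacing_kb(n_markers, genome_bp=GENOME_BP):
    """Average inter-marker distance in kilobases for evenly spaced markers."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return genome_bp / n_markers / 1000.0


def minor_allele_freq(heterozygosity):
    """Invert H = 2p(1-p) for the minor allele frequency p (p <= 0.5)."""
    if not 0.0 <= heterozygosity <= 0.5:
        raise ValueError("biallelic heterozygosity must lie in [0, 0.5]")
    return (1.0 - math.sqrt(1.0 - 2.0 * heterozygosity)) / 2.0


for n in (20_000, 60_000, 100_000, 120_000):
    print(f"{n:>7} markers -> one marker every {average_spacing_kb(n):.0f} kb on average")

for h in (0.18, 0.32, 0.42):
    print(f"heterozygosity {h:.2f} -> minor allele frequency {minor_allele_freq(h):.2f}")
```

Under these assumptions, 20,000, 60,000 and 100,000 markers correspond to the 150 kb, 50 kb and 30 kb densities, and the heterozygosity thresholds 0.18, 0.32 and 0.42 correspond to minor allele frequencies of 0.1, 0.2 and 0.3.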
  • a second embodiment of the invention encompasses isolated, purified or recombinant polynucleotides consisting of, consisting essentially of, or comprising a contiguous span of nucleotides of a sequence selected as an individual or in any combination from the group consisting of SEQ ID No.
  • the present invention also relates to polynucleotides hybridizing under stringent or intermediate conditions to a sequence selected from the group consisting of
  • polynucleotides of the invention encompass polynucleotides with any further limitation described in this disclosure, or those following, specified alone or in any combination: said contiguous span may optionally comprise a map-related biallelic marker; optionally either the 1st or the 2nd allele of the respective SEQ ID No., as indicated in Table 1, may be specified as being present at said map-related biallelic marker; optionally, said biallelic marker may be within 6, 5, 4, 3, 2, or 1 nucleotides of the center of said polynucleotide or at the center of said polynucleotide; optionally, said polynucleotide may consist of, or consist essentially of a contiguous span which ranges in length from 8, 10, 12, 15, 18 or 20 to 21, 25, 35, 40, 43, or 47 nucleotides; optionally, said polynucleotide may consist of, or consist essentially of a contiguous span which ranges in length from 8, 10, 12,
  • a third embodiment of the invention encompasses any polynucleotide of the invention attached to a solid support.
  • the polynucleotides of the invention which are attached to a solid support encompass polynucleotides with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said polynucleotides may be specified as attached individually or in groups of at least 2, 5, 8, 10, 12, 15, 20, 25, 50, 100, 200, or 500 distinct polynucleotides of the invention to a single solid support; optionally, polynucleotides other than those of the invention may be attached to the same solid support as polynucleotides of the invention; optionally, when multiple polynucleotides are attached to a solid support they may be attached at random locations, or in an ordered array; optionally, said ordered array may be addressable.
  • a fourth embodiment of the invention encompasses the use of any polynucleotide for, or any polynucleotide for use in, determining the identity of nucleotides at a map-related biallelic marker.
  • the polynucleotides of the invention for use in determining the identity of nucleotides at a map-related biallelic marker encompass polynucleotides with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID No.
  • said polynucleotide may comprise a sequence disclosed in the present specification; optionally, said polynucleotide may consist of, or consist essentially of any polynucleotide described in the present specification; optionally, said determining may be performed in a hybridization assay, sequencing assay, microsequencing assay, or an enzyme-based mismatch detection assay; optionally, said polynucleotide may be attached to a solid support, array, or addressable array; optionally, said polynucleotide may be labeled.
  • a fifth embodiment of the invention encompasses the use of any polynucleotide for, or any polynucleotide for use in, amplifying a segment of nucleotides comprising a map-related biallelic marker.
  • the polynucleotides of the invention for use in amplifying a segment of nucleotides comprising a map-related biallelic marker encompass polynucleotides with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos.
  • said polynucleotide may consist of, consist essentially of, or comprise a sequence selected individually or in any combination from the group consisting of SEQ ID
  • said polynucleotide may consist of, or consist essentially of any polynucleotide described in the present specification; optionally, said amplifying may be performed by a PCR or LCR.
  • said polynucleotide may be attached to a solid support, array, or addressable array.
  • said polynucleotide may be labeled.
  • a sixth embodiment of the invention encompasses methods of genotyping a biological sample comprising determining the identity of a nucleotide at a map-related biallelic marker.
  • the genotyping methods of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID No.
  • said method further comprises determining the identity of a second nucleotide at said biallelic marker, wherein said first nucleotide and second nucleotide are not base paired (by Watson & Crick base pairing) to one another; optionally, said biological sample is derived from a single individual or subject; optionally, said method is performed in vitro; optionally, said biallelic marker is determined for both copies of said biallelic marker present in said individual's genome; optionally, said biological sample is derived from multiple subjects or individuals; optionally, said method further comprises amplifying a portion of said sequence comprising the biallelic marker prior to said determining step; optionally, wherein said amplifying is performed by PCR, LCR, or replication of a recombinant vector comprising an origin of replication and said portion in a host cell; optionally, wherein said determining is performed by a hybridization assay, sequencing assay,
  • a seventh embodiment of the invention comprises methods of estimating the frequency of an allele in a population comprising genotyping individuals from said population for a map-related biallelic marker and determining the proportional representation of said biallelic marker in said population.
  • the methods of estimating the frequency of an allele in a population of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos.
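The seventh embodiment (genotype individuals, then determine the proportional representation of each allele) amounts to allele counting. A minimal sketch follows, assuming each genotype is recorded as a two-character string such as "AG", with one character per chromosome copy. That encoding, the function name and the sample data are hypothetical and are not taken from the patent.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as "AA", "AG", "GG"."""
    counts = Counter()
    for genotype in genotypes:
        if len(genotype) != 2:
            raise ValueError(f"expected a two-allele genotype, got {genotype!r}")
        counts.update(genotype)  # each character is one allele copy
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical sample of five individuals typed at one biallelic marker.
sample = ["AA", "AG", "GG", "AG", "AA"]
freqs = allele_frequencies(sample)
print(freqs)
```

Here allele A is counted 6 times and allele G 4 times among 10 chromosomes, giving frequencies of 0.6 and 0.4.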
  • An eighth embodiment of the invention comprises methods of detecting an association between an allele and a phenotype, comprising the steps of a) determining the frequency of at least one map-related biallelic marker allele in a trait positive population, b) determining the frequency of said map-related biallelic marker allele in a control population; and c) determining whether a statistically significant association exists between said genotype and said phenotype.
  • the methods of detecting an association between an allele and a phenotype of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos.
  • control population may be a trait-negative population, or a random population; optionally, wherein said phenotype is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity; optionally, the determining steps a) and b) are performed on all of the biallelic markers of SEQ ID Nos. 1 to 3908.
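Steps a) to c) of the eighth embodiment compare allele frequencies between a trait-positive and a control population. The text does not name a particular statistic here, so the sketch below swaps in a Pearson chi-square test on a 2x2 table of allele counts as one common choice. The function name and the counts are hypothetical.

```python
import math


def allele_association(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    case_counts and control_counts are (allele_1, allele_2) chromosome counts.
    Returns (chi2, p_value, odds_ratio).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / math.prod(margins)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return chi2, p_value, odds_ratio


# Hypothetical counts: 100 case chromosomes and 100 control chromosomes.
chi2, p_value, odds_ratio = allele_association((60, 40), (40, 60))
print(f"chi2={chi2:.2f}  p={p_value:.4f}  OR={odds_ratio:.2f}")
```

When many markers are tested, as in a genome-wide scan, the significance threshold would need to account for multiple comparisons; that step is outside this sketch.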
  • A ninth embodiment of the present invention encompasses methods of estimating the frequency of a haplotype for a set of biallelic markers in a population, comprising the steps of: a) genotyping each individual in said population for at least one map-related biallelic marker, b) genotyping each individual in said population for a second biallelic marker by determining the identity of the nucleotides at said second biallelic marker for both copies of said second biallelic marker present in the genome; and c) applying a haplotype determination method to the identities of the nucleotides determined in steps a) and b) to obtain an estimate of said frequency.
  • the methods of estimating the frequency of a haplotype of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally said haplotype determination method is selected from the group consisting of asymmetric PCR amplification, double PCR amplification of specific alleles, the Clark method, or an expectation maximization algorithm; optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos.
  • said second biallelic marker is a map-related biallelic marker; optionally, the identity of the nucleotides at the biallelic markers in every one of the sequences of SEQ ID No. 1 to 3908 is determined in steps a) and b).
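Among the haplotype determination methods listed for the ninth embodiment is an expectation maximization algorithm. The sketch below shows the idea for the simplest case of two biallelic markers. It is a generic textbook-style EM under assumed Hardy-Weinberg proportions, not the patent's specific procedure. The genotype coding (0, 1 or 2 copies of the second allele at each marker) and all names are hypothetical.

```python
def em_haplotype_frequencies(genotypes, max_iter=200, tol=1e-10):
    """Estimate two-marker haplotype frequencies by expectation maximization.

    genotypes: list of (g1, g2) pairs, each value 0, 1 or 2 = copies of the
    second allele at marker 1 and marker 2.
    Returns a dict mapping haplotypes "00", "01", "10", "11" to frequencies.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
    base = {"00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1  # phase ambiguous: 00/11 or 01/10
            continue
        # With at most one heterozygous marker, both haplotypes are known.
        for a, b in zip(alleles[g1], alleles[g2]):
            base[f"{a}{b}"] += 1.0
    total = 2.0 * len(genotypes)
    freqs = dict.fromkeys(base, 0.25)
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote carries 00/11.
        cis = freqs["00"] * freqs["11"]
        trans = freqs["01"] * freqs["10"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        counts = dict(base)
        counts["00"] += n_double_het * w
        counts["11"] += n_double_het * w
        counts["01"] += n_double_het * (1.0 - w)
        counts["10"] += n_double_het * (1.0 - w)
        # M-step: re-estimate frequencies from the expected counts.
        new_freqs = {h: c / total for h, c in counts.items()}
        converged = max(abs(new_freqs[h] - freqs[h]) for h in freqs) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Hypothetical sample: two double homozygotes and one double heterozygote.
example = em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)])
print(example)
```

Only double heterozygotes have ambiguous phase with two markers, so they are the only genotypes whose haplotype counts are split probabilistically at each iteration.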
  • a tenth embodiment of the present invention encompasses methods of detecting an association between a haplotype and a phenotype, comprising the steps of: a) estimating the frequency of at least one haplotype in a trait positive population according to a method of estimating the frequency of a haplotype of the invention; b) estimating the frequency of said haplotype in a control population according to the method of estimating the frequency of a haplotype of the invention; and c) determining whether a statistically significant association exists between said haplotype and said phenotype.
  • the methods of detecting an association between a haplotype and a phenotype of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID No.
  • control population may be a trait-negative population, or a random population; optionally, wherein said phenotype is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity; optionally, the identity of the nucleotides at the biallelic markers in every one of the following sequences: SEQ ID No. 1 to 3908 is included in the estimating steps a) and b).
  • An eleventh embodiment of the present invention is a method of identifying a gene associated with a detectable trait comprising the steps of: a) determining the frequency of each allele of at least one map-related biallelic marker in individuals having the detectable trait and individuals lacking the detectable trait; b) identifying at least one allele of one or more biallelic markers having a statistically significant association with the detectable trait; and c) identifying a gene in linkage disequilibrium with said allele.
  • the methods of the present invention for identifying a gene associated with a detectable trait encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, wherein the method further comprises d) identifying a mutation in the gene identified in step c) which is associated with the detectable trait; optionally, wherein the individuals having the detectable trait and the individuals lacking the detectable trait are readily distinguishable from one another; optionally, wherein the individuals having the detectable trait and the individuals lacking the detectable trait are selected from a bimodal population; optionally, wherein the individuals having the detectable trait are at one extreme of the population and the individuals lacking the detectable trait are at the other extreme of the population; optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID No.
  • a twelfth embodiment of the present invention is a method of identifying biallelic markers associated with a detectable trait comprising the steps of: a) determining the frequencies of a set of biallelic markers comprising at least one map-related biallelic marker in individuals who express said detectable trait and individuals who do not express said detectable trait; and b) identifying one or more biallelic markers in said set which are statistically associated with the expression of said detectable trait.
  • the methods of the present invention for identifying biallelic markers associated with a detectable trait encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said detectable trait is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.
  • a thirteenth embodiment of the present invention is a method of identifying biallelic marker(s) in linkage disequilibrium with a trait-causing allele or in linkage disequilibrium with a trait-associated biallelic marker comprising the steps of: a) selecting at least one map-related biallelic marker which is in the genomic region suspected of containing the trait-causing allele or the trait-associated biallelic marker; and b) determining which of the map-related biallelic markers are associated with the trait-causing allele or in linkage disequilibrium with the trait-associated biallelic marker.
  • the methods of the present invention for identifying biallelic marker(s) in linkage disequilibrium with a trait-causing allele or in linkage disequilibrium with a trait-associated biallelic marker encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said detectable trait is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.
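Deciding whether two biallelic markers are in linkage disequilibrium, as in the thirteenth embodiment, needs a numerical measure. The text does not specify one at this point, so the sketch below uses the standard pairwise measures D, |D'| and r^2 computed from two-marker haplotype frequencies. The haplotype keys ("00", "01", "10", "11") and the function name are hypothetical.

```python
def linkage_disequilibrium(hap_freqs):
    """Return (D, |D'|, r^2) from two-marker haplotype frequencies.

    hap_freqs maps "00", "01", "10", "11" to frequencies that sum to 1;
    the first character is the allele at marker A, the second at marker B.
    """
    if abs(sum(hap_freqs.values()) - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = hap_freqs["10"] + hap_freqs["11"]  # frequency of allele 1 at marker A
    p_b = hap_freqs["01"] + hap_freqs["11"]  # frequency of allele 1 at marker B
    d = hap_freqs["11"] - p_a * p_b
    # D' rescales D by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Complete disequilibrium: only the 00 and 11 haplotypes are observed.
complete = linkage_disequilibrium({"00": 0.5, "01": 0.0, "10": 0.0, "11": 0.5})
# Equilibrium: haplotype frequencies are products of allele frequencies.
independent = linkage_disequilibrium({"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25})
print(complete, independent)
```

In practice the haplotype frequencies would themselves be estimates, for example from a haplotype determination method, so these measures inherit that sampling uncertainty.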
  • a fourteenth embodiment of the present invention is a method for determining whether an individual is at risk of developing a detectable trait or suffers from a detectable trait comprising the steps of: a) obtaining a nucleic acid sample from the individual; b) screening the nucleic acid sample with at least one map-related biallelic marker; and c) determining whether the nucleic acid sample contains at least one allele of said map-related biallelic marker statistically associated with the detectable trait.
  • the methods of the present invention for determining whether an individual is at risk of developing a detectable trait or suffers from a detectable trait encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said detectable trait is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.
  • a fifteenth embodiment of the present invention is a method of administering a drug or a treatment comprising the steps of: a) obtaining a nucleic acid sample from an individual; b) determining the identity of the polymorphic base of at least one map-related biallelic marker which is associated with a positive response to the treatment or the drug; or at least one biallelic map-related marker which is associated with a negative response to the treatment or the drug; and c) administering the treatment or the drug to the individual if the nucleic acid sample contains said biallelic marker associated with a positive response to the treatment or the drug or if the nucleic acid sample lacks said biallelic marker associated with a negative response to the treatment or the drug.
  • the methods of the present invention for administering a drug or a treatment encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; or optionally, the administering step comprises administering the drug or the treatment to the individual if the nucleic acid sample contains said biallelic marker associated with a positive response to the treatment or the drug and the nucleic acid sample lacks said biallelic marker associated with a negative response to the treatment or the drug.
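The fifteenth embodiment states two decision rules: the base rule administers if the positive-response marker is present or the negative-response marker is absent, and the optional stricter rule requires both conditions. The difference is easy to miss in prose, so the sketch below encodes both as a boolean function. The function and parameter names are hypothetical, and the code only illustrates the claim logic.

```python
def should_administer(has_positive_marker, has_negative_marker, require_both=False):
    """Apply the claim's decision rule to one individual's genotyping result.

    Base rule: positive-response marker present OR negative-response marker absent.
    Optional stricter rule (require_both=True): present AND absent.
    """
    if require_both:
        return has_positive_marker and not has_negative_marker
    return has_positive_marker or not has_negative_marker


# An individual carrying both markers is treated under the base rule only.
print(should_administer(True, True))
print(should_administer(True, True, require_both=True))
```

The two rules disagree in two cases: when both markers are present, and when neither is present. In both cases the base rule administers and the stricter rule does not.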
  • a sixteenth embodiment of the present invention is a method of selecting an individual for inclusion in a clinical trial of a treatment or drug comprising the steps of: a) obtaining a nucleic acid sample from an individual; b) determining the identity of the polymorphic base of at least one map-related biallelic marker which is associated with a positive response to the treatment or the drug, or at least one map-related biallelic marker which is associated with a negative response to the treatment or the drug in the nucleic acid sample, and c) including the individual in the clinical trial if the nucleic acid sample contains said map-related biallelic marker associated with a positive response to the treatment or the drug or if the nucleic acid sample lacks said biallelic marker associated with a negative response to the treatment or the drug.
  • the methods of the present invention for selecting an individual for inclusion in a clinical trial of a treatment or drug encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination:
  • said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3735 to 3908 and the complements thereof;
  • the including step comprises administering the drug or the treatment to the individual if the nucleic acid sample contains said biallelic marker associated with a positive response to the treatment or the drug and the nucleic acid sample lacks said biallelic marker associated with a negative response to the treatment or the drug.
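The decision rule shared by the fifteenth and sixteenth embodiments can be sketched as a small predicate. This is only an illustration: the function name `passes_marker_rule` and the flag `require_both` are hypothetical, and the genotyping step that produces the two boolean inputs is assumed to have been done elsewhere.

```python
def passes_marker_rule(has_positive_marker: bool,
                       has_negative_marker: bool,
                       require_both: bool = False) -> bool:
    """Return True if the individual qualifies for treatment or trial inclusion.

    has_positive_marker: sample contains the allele associated with a positive response.
    has_negative_marker: sample contains the allele associated with a negative response.
    require_both: False models the base rule (positive marker present OR negative
                  marker absent); True models the optional stricter limitation
                  (positive marker present AND negative marker absent).
    """
    positive_ok = has_positive_marker
    negative_ok = not has_negative_marker
    if require_both:
        return positive_ok and negative_ok
    return positive_ok or negative_ok
```

Under the base rule an individual carrying both markers still qualifies; under the optional stricter rule the same individual does not.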
  • a seventeenth embodiment of the present invention is a method of identifying a gene associated with a detectable trait comprising the steps of: a) selecting a gene suspected of being associated with a detectable trait; and b) identifying at least one map-related biallelic marker within said gene which is associated with said detectable trait.
  • the methods of the present invention for identifying a gene associated with a detectable trait encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of
  • the identifying step comprises determining the frequencies of the map-related biallelic marker(s) in individuals who express said detectable trait and individuals who do not express said detectable trait and identifying one or more biallelic markers which are statistically associated with the expression of the detectable trait.
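The comparison of marker frequencies between trait-positive and trait-negative individuals can be illustrated with a simple allelic test. The text does not name a particular statistic; the sketch below assumes a Pearson chi-square test on a 2x2 table of allele counts (1 degree of freedom), and the function name `allelic_association` is hypothetical.

```python
import math

def allelic_association(case_counts, control_counts):
    """Pearson chi-square test on a 2x2 table of allele counts.

    case_counts, control_counts: (count of 1st allele, count of 2nd allele)
    observed on chromosomes from trait-positive and trait-negative individuals.
    Returns (chi2, p_value) for 1 degree of freedom.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Example: allele frequencies of 60% in cases versus 40% in controls, 100 chromosomes each.
chi2, p = allelic_association((60, 40), (40, 60))
```

A marker would then be flagged as associated with the trait when the p-value falls below a significance threshold chosen for the study.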
  • Figure 1 is a cytogenetic map of chromosome 21.
  • Figure 2a shows the results of a computer simulation of the distribution of inter-marker spacing on a randomly distributed set of biallelic markers indicating the percentage of biallelic markers which will be spaced a given distance apart for 1, 2, or 3 markers/BAC in a genomic map (assuming a set of 20,000 minimally overlapping BACs covering the genome are evaluated).
  • Figure 2b shows the results of a computer simulation of the distribution of inter-marker spacing on a randomly distributed set of biallelic markers indicating the percentage of biallelic markers which will be spaced a given distance apart for 1, 3, or 6 markers/BAC in a genomic map (assuming a set of 20,000 minimally overlapping BACs covering the genome are evaluated).
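A simulation of the kind summarized in Figures 2a and 2b could be sketched as follows. The details are assumptions rather than the procedure actually used: BACs are laid end to end with no overlap, each BAC is taken to be 150 kb long (a hypothetical value not stated here), and markers are placed uniformly at random within each BAC. The function names are illustrative.

```python
import random

def simulate_spacings(n_bacs=20_000, markers_per_bac=3, bac_len_kb=150.0, seed=0):
    """Place markers uniformly at random in each BAC and return inter-marker spacings (kb)."""
    rng = random.Random(seed)
    positions = []
    for i in range(n_bacs):
        start = i * bac_len_kb  # BACs assumed contiguous and non-overlapping
        for _ in range(markers_per_bac):
            positions.append(start + rng.uniform(0.0, bac_len_kb))
    positions.sort()
    # Distance between each marker and the next one along the map.
    return [b - a for a, b in zip(positions, positions[1:])]

def percent_within(spacings, max_kb):
    """Percentage of adjacent marker pairs separated by at most max_kb."""
    return 100.0 * sum(s <= max_kb for s in spacings) / len(spacings)

# Compare marker densities of 1, 3, and 6 markers per BAC at a 50 kb cutoff.
for k in (1, 3, 6):
    pct = percent_within(simulate_spacings(n_bacs=2_000, markers_per_bac=k), 50.0)
    print(f"{k} markers/BAC: {pct:.1f}% of gaps are 50 kb or less")
```

Increasing the number of markers per BAC shifts the spacing distribution toward shorter gaps, which is the trend the figures are described as showing.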
  • Figure 3 shows, for a series of hypothetical sample sizes, the p-value significance obtained in association studies performed using individual markers from the high-density biallelic map, according to various hypotheses regarding the difference of allelic frequencies between the trait-positive and trait-negative samples.
  • Figure 4 is a hypothetical association analysis conducted with a map comprising about
  • Figure 5 is a hypothetical association analysis conducted with a map comprising about 20,000 biallelic markers.
  • Figure 6 is a hypothetical association analysis conducted with a map comprising about 60,000 biallelic markers.
  • Figure 7 is a haplotype analysis using biallelic markers in the Apo E region.
  • Figure 8 is a simulated haplotype analysis using the biallelic markers in the Apo E region included in the haplotype analysis of Figure 7.
  • Figure 9 shows a minimal array of overlapping clones which was chosen for further studies of biallelic markers associated with prostate cancer, the positions of STS markers known to map in the candidate genomic region along the contig, and the locations of biallelic markers along the BAC contig harboring a genomic region harboring a candidate gene associated with prostate cancer which were identified using the methods of the present invention.
  • Figure 10 is a rough localization of a candidate gene for prostate cancer which was obtained by determining the frequencies of the biallelic markers of Figure 9 in affected and unaffected populations.
  • Figure 11 is a further refinement of the localization of the candidate gene for prostate cancer using additional biallelic markers which were not included in the rough localization illustrated in Figure 10.
  • Figure 12 is a haplotype analysis using the biallelic markers in the genomic region of the gene associated with prostate cancer.
  • Figure 13 is a simulated haplotype using the six markers included in haplotype 5 of Figure 12.
  • Figure 14 is a block diagram of an exemplary computer system.
  • Figure 15 is a flow diagram illustrating one embodiment of a process 200 for comparing a new nucleotide or protein sequence with a database of sequences in order to determine the homology levels between the new sequence and the sequences in the database.
  • Figure 16 is a flow diagram illustrating one embodiment of a process 250 in a computer for determining whether two sequences are homologous.
  • Figure 17 is a flow diagram illustrating one embodiment of an identifier process 300 for detecting the presence of a feature in a sequence.
  • SEQ ID Nos. 1 to 3908 contain nucleotide sequences comprising a portion of the map-related biallelic markers of the invention.
  • SEQ ID Nos. 3909 to 3934 contain nucleotide sequences comprising a portion of the map-related biallelic markers which are shown to be associated with Alzheimer's disease, prostate cancer or asthma as described in the Examples.
  • SEQ ID Nos. 3935 to 7842 contain nucleotide sequences of upstream amplification primers (PU) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 1 to 3908.
  • SEQ ID Nos. 7843 to 7865 contain nucleotide sequences of upstream amplification primers (PU) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 3909 to 3934.
  • SEQ ID Nos. 7866 to 11773 contain nucleotide sequences of downstream amplification primers (RP) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 1 to 3908.
  • SEQ ID Nos. 11774 to 11796 contain nucleotide sequences of downstream amplification primers (RP) designed to amplify sequences containing the biallelic markers of
  • nucleic acids include RNA, DNA, or RNA/DNA hybrid sequences of more than one nucleotide in either single chain or duplex form.
  • nucleotide is used herein as an adjective to describe molecules comprising RNA, DNA, or RNA/DNA hybrid sequences of any length in single-stranded or duplex form.
  • nucleotide is also used herein as a noun to refer to individual nucleotides or varieties of nucleotides, meaning a molecule, or individual unit in a larger nucleic acid molecule, comprising a purine or pyrimidine, a ribose or deoxyribose sugar moiety, and a phosphate group, or phosphodiester linkage in the case of nucleotides within an oligonucleotide or polynucleotide.
  • nucleotide is also used herein to encompass "modified nucleotides" which comprise at least one of the following modifications: (a) an alternative linking group, (b) an analogous form of purine, (c) an analogous form of pyrimidine, or (d) an analogous sugar. For examples of analogous linking groups, purines, pyrimidines, and sugars, see for example PCT publication No. WO 95/04064.
  • the polynucleotides of the invention are preferably comprised of greater than 50% conventional deoxyribose nucleotides, and most preferably greater than 90% conventional deoxyribose nucleotides.
  • polynucleotide sequences of the invention may be prepared by any known method, including synthetic, recombinant, ex vivo generation, or a combination thereof, as well as utilizing any purification methods known in the art.
  • purification is used herein to describe a polynucleotide or polynucleotide vector of the invention which has been separated from other compounds including, but not limited to other nucleic acids, carbohydrates, lipids and proteins (such as the enzymes used in the synthesis of the polynucleotide), or the separation of covalently closed polynucleotides from linear polynucleotides.
  • a polynucleotide is substantially pure when at least about 50%, preferably 60 to 75% of a sample exhibits a single polynucleotide sequence and conformation.
  • a substantially pure polynucleotide typically comprises about 50%, preferably 60 to 90% weight/weight of a nucleic acid sample, more usually about 95%, and preferably is over about 99% pure.
  • Polynucleotide purity or homogeneity may be indicated by a number of means well known in the art, such as agarose or polyacrylamide gel electrophoresis of a sample, followed by visualizing a single polynucleotide band upon staining the gel. For certain purposes higher resolution can be provided by using HPLC or other means well known in the art.
  • primer denotes a specific oligonucleotide sequence which is complementary to a target nucleotide sequence and used to hybridize to the target nucleotide sequence.
  • a primer serves as an initiation point for nucleotide polymerization catalyzed by either DNA polymerase, RNA polymerase or reverse transcriptase.
  • probe denotes a defined nucleic acid segment (or nucleotide analog segment, e.g., polynucleotide as defined herein) which can be used to identify a specific polynucleotide sequence present in samples, said nucleic acid segment comprising a nucleotide sequence complementary of the specific polynucleotide sequence to be identified.
  • "detectable trait," "trait," and "phenotype" are used interchangeably herein and refer to any visible, detectable or otherwise measurable property of an organism, such as symptoms of, or susceptibility to, a disease.
  • "detectable trait," "trait," or "phenotype" are used herein to refer to symptoms of, or susceptibility to, a disease; or to refer to an individual's response to an agent, drug, or treatment acting on a disease; or to refer to symptoms of, or susceptibility to, side effects of an agent acting on a disease.
  • treatment is used herein to encompass any medical intervention known in the art including, for example, the administration of pharmaceutical agents, medically prescribed changes in diet, or habits such as a reduction in smoking or drinking, surgery, the application of medical devices, and the application or reduction of certain physical conditions, for example, light or radiation.
  • allele is used herein to refer to variants of a nucleotide sequence.
  • a biallelic polymorphism has two forms, designated herein as the 1st allele and the 2nd allele. Diploid organisms may be homozygous or heterozygous for an allelic form.
  • heterozygosity rate is used herein to refer to the incidence of individuals in a population which are heterozygous at a particular allele. In a biallelic system the heterozygosity rate is on average equal to $2P_a(1 - P_a)$, where $P_a$ is the frequency of the least common allele. In order to be useful in genetic studies a genetic marker should have an adequate level of heterozygosity to allow a reasonable probability that a randomly selected person will be heterozygous.
  • genotype refers to the identity of the alleles present in an individual or a sample.
  • a genotype preferably refers to the description of the biallelic marker alleles present in an individual or a sample.
  • genotyping a sample or an individual for a biallelic marker consists of determining the specific allele or the specific nucleotide carried by an individual at a biallelic marker.
  • mutation refers to a difference in DNA sequence between or among different genomes or individuals which has a frequency below 1%.
  • haplotype refers to a combination of alleles present in an individual or a sample.
  • a haplotype preferably refers to a combination of biallelic marker alleles found in a given individual and which may be associated with a phenotype.
  • polymorphism refers to the occurrence of two or more alternative genomic sequences or alleles between or among different genomes or individuals.
  • polymorphic refers to the condition in which two or more variants of a specific genomic sequence can be found in a population.
  • a "polymorphic site" is the locus at which the variation occurs.
  • a single nucleotide polymorphism is a single base pair change. Typically a single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at the polymorphic site.
  • single nucleotide polymorphism preferably refers to a single nucleotide substitution.
  • the polymorphic site may be occupied by two different nucleotides.
  • "biallelic polymorphism" and "biallelic marker" are used interchangeably herein to refer to a polymorphism having two alleles at a fairly high frequency in the population, preferably a single nucleotide polymorphism.
  • a "biallelic marker allele" refers to the nucleotide variants present at a biallelic marker site.
  • the frequency of the less common allele of the biallelic markers of the present invention has been validated to be greater than 1%, preferably the frequency is greater than 10%, more preferably the frequency is at least 20% (i.e. heterozygosity rate of at least 0.32), even more preferably the frequency is at least 30% (i.e. heterozygosity rate of at least 0.42).
  • a biallelic marker wherein the frequency of the less common allele is 30% or more is termed a "high quality biallelic marker.”
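The heterozygosity figures quoted above follow directly from the expression 2P(1 - P) for a biallelic system, where P is the frequency of the less common allele. A minimal sketch that reproduces them (the function names are illustrative, not taken from the text):

```python
def heterozygosity_rate(minor_allele_freq: float) -> float:
    """Expected heterozygosity 2p(1 - p) for a biallelic marker."""
    p = minor_allele_freq
    if not 0.0 <= p <= 0.5:
        raise ValueError("frequency of the less common allele must lie in [0, 0.5]")
    return 2.0 * p * (1.0 - p)

def is_high_quality_marker(minor_allele_freq: float) -> bool:
    """A marker is termed 'high quality' when the less common allele is at 30% or more."""
    return minor_allele_freq >= 0.30

for freq in (0.01, 0.10, 0.20, 0.30):
    print(f"minor allele frequency {freq:.2f} -> heterozygosity {heterozygosity_rate(freq):.2f}")
```

A frequency of 20% gives a heterozygosity rate of 0.32 and a frequency of 30% gives 0.42, matching the values stated for the preferred markers.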
  • nucleotides in a polynucleotide with respect to the center of the polynucleotide are described herein in the following manner.
  • the nucleotide at an equal distance from the 3' and 5' ends of the polynucleotide is considered to be "at the center" of the polynucleotide, and any nucleotide immediately adjacent to the nucleotide at the center, or the nucleotide at the center itself is considered to be "within 1 nucleotide of the center.”
  • any of the five nucleotide positions in the middle of the polynucleotide would be considered to be within 2 nucleotides of the center, and so on.
  • the polymorphism, allele or biallelic marker is "at the center" of a polynucleotide if the difference between the distance from the substituted, inserted, or deleted polynucleotides of the polymorphism and the 3' end of the polynucleotide, and the distance from the substituted, inserted, or deleted polynucleotides of the polymorphism and the 5' end of the polynucleotide is zero or one nucleotide.
  • If the difference is 0 to 3, the polymorphism is considered to be "within 1 nucleotide of the center." If the difference is 0 to 5, the polymorphism is considered to be "within 2 nucleotides of the center." If the difference is 0 to 7, the polymorphism is considered to be "within 3 nucleotides of the center," and so on.
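The centering rule can be written as a short check for a single-base polymorphism. The sketch assumes 1-based positions and that the stated thresholds (0 or 1 for "at the center", 0 to 5 for within 2, 0 to 7 for within 3) follow the pattern 2k + 1, which implies 0 to 3 for "within 1 nucleotide". The function names are hypothetical.

```python
def end_distance_difference(length: int, position: int) -> int:
    """Absolute difference between a base's distances to the 3' and 5' ends.

    length: number of nucleotides in the polynucleotide.
    position: 1-based position of the polymorphic base, counted from the 5' end.
    """
    if not 1 <= position <= length:
        raise ValueError("position must lie within the polynucleotide")
    dist_to_5_prime = position - 1
    dist_to_3_prime = length - position
    return abs(dist_to_3_prime - dist_to_5_prime)

def is_within_k_of_center(length: int, position: int, k: int = 0) -> bool:
    """k = 0 means 'at the center'; k = 1 means 'within 1 nucleotide of the center', etc."""
    return end_distance_difference(length, position) <= 2 * k + 1

# A 47-nucleotide sequence with the polymorphic base at position 24 is exactly centered.
print(end_distance_difference(47, 24))   # 0
print(is_within_k_of_center(47, 22, 2))  # True: one of the five middle positions
```

For an even-length polynucleotide no base is equidistant from both ends, which is why a difference of one still counts as "at the center."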
  • upstream is used herein to refer to a location which is toward the 5' end of the polynucleotide from a specific reference point.
  • "base paired" and "Watson & Crick base paired" are used interchangeably herein to refer to nucleotides which can be hydrogen bonded to one another by virtue of their sequence identities in a manner like that found in double-helical DNA with thymine or uracil residues linked to adenine residues by two hydrogen bonds and cytosine and guanine residues linked by three hydrogen bonds (See Stryer, L., Biochemistry, 4th edition, 1995).
  • "complementary" or "complement thereof" are used herein to refer to the sequences of polynucleotides which are capable of forming Watson & Crick base pairing with another specified polynucleotide throughout the entirety of the complementary region. This term is applied to pairs of polynucleotides based solely upon their sequences and not any particular set of conditions under which the two polynucleotides would actually bind.
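Because complementarity is defined purely on sequence, it can be checked computationally. The following sketch handles plain DNA sequences written 5' to 3' using only the letters A, C, G and T; the function names are illustrative.

```python
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the Watson & Crick complement of a DNA sequence, read 5' to 3'."""
    return seq.upper().translate(_DNA_COMPLEMENT)[::-1]

def is_complementary(seq_a: str, seq_b: str) -> bool:
    """True if seq_b pairs with seq_a along the entire length of both sequences."""
    return reverse_complement(seq_a) == seq_b.upper()

print(reverse_complement("ATGC"))        # GCAT
print(is_complementary("ATGC", "GCAT"))  # True
```

The check says nothing about whether the two strands would actually hybridize under given assay conditions, consistent with the definition above.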
  • map-related biallelic marker relates to a biallelic marker in linkage disequilibrium with any of the sequences disclosed in SEQ ID Nos. 1 to 3908 which contain a biallelic marker of the map.
  • map-related biallelic marker encompasses all of the biallelic markers disclosed in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3735 to
  • the preferred map-related biallelic marker alleles of the present invention include each one of the alleles selected individually or in any combination from the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, and 3735 to 3908, as identified in field <223> of the allele feature in the appended Sequence Listing, individually or in groups consisting of all the possible combinations of the alleles.
  • the terms "1st allele" and "2nd allele" refer to the nucleotide located at the polymorphic base of a polynucleotide sequence containing a biallelic marker, as identified in field <222> of the allele feature in the appended Sequence Listing for each Sequence ID number.
  • polymorphic base is located at nucleotide position 24 for each of SEQ ID Nos. 1 to 3908, with the exception of SEQ ID Nos. 914, 1013, 2544, 3434, 3795, and
  • the polymorphic base is located at nucleotide position 23 for SEQ ID Nos. 914, 1013 and 2544, at nucleotide position 21 for SEQ ID No. 3028, at nucleotide position 20 for SEQ ID No. 3434.
  • the present invention encompasses polynucleotides for use as primers and probes in the methods of the invention. All of the polynucleotides of the invention may be specified as being isolated, purified or recombinant. These polynucleotides may consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence from any sequence in the Sequence Listing as well as sequences which are complementary thereto ("complements thereof"). The "contiguous span" may be at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID.
  • flanking sequences surrounding the polymorphic bases which are enumerated in the Sequence Listing. Rather, it will be appreciated that the flanking sequences surrounding the biallelic markers, or any of the primers or probes of the invention which are more distant from the markers, may be lengthened or shortened to any extent compatible with their intended use and the present invention specifically contemplates such sequences. It will be appreciated that the polynucleotides referred to in the Sequence Listing may be of any length compatible with their intended use. Also the flanking regions outside of the contiguous span need not be homologous to native flanking sequences which actually occur in human subjects.
  • the contiguous span may optionally include the map-related biallelic marker in said sequence.
  • Biallelic markers generally consist of a polymorphism at one single base position. Each biallelic marker therefore corresponds to two forms of a polynucleotide sequence which, when compared with one another, present a nucleotide modification at one position.
  • the nucleotide modification involves the substitution of one nucleotide for another.
  • polynucleotides may consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence from SEQ ID Nos. 1 to 2260 as well as sequences which are complementary thereto.
  • the "contiguous span" may be at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID.
  • Particularly preferred are polynucleotides which consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos. 1 to 2260, or the complements thereof, wherein the 1st allele of the biallelic marker of the SEQ ID No. is present at the map-related biallelic marker.
  • Other preferred polynucleotides consist of, consist essentially of, or comprise a contiguous span of nucleotides of any of SEQ ID Nos. 1 to 2260, or the complements thereof, wherein the 2nd allele of the biallelic marker of the SEQ ID No. is present at the map-related biallelic marker.
  • Preferred polynucleotides may consist of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID No., of a sequence from SEQ ID Nos. 2261 to 3734 as well as sequences which are complementary thereto.
  • Particularly preferred are polynucleotides which consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos.
  • Preferred polynucleotides may consist of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID No., of a sequence from SEQ ID Nos. 3735 to 3908 as well as sequences which are complementary thereto.
  • Particularly preferred are polynucleotides which consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos.
  • polynucleotides of the present invention are polynucleotides which consist of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of a sequence from SEQ ID Nos. 1201, 3242, 3907 and 3908 as well as sequences which are complementary thereto, wherein said contiguous span of SEQ ID Nos. 1201 or 3242 contains a "G" at the polymorphic base, or wherein said contiguous span of SEQ ID Nos. 3907 or 3908 contains an "A" at the polymorphic base.
  • the invention also relates to polynucleotides that hybridize, under conditions of high or intermediate stringency, to a polynucleotide of a sequence from any of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, and 3735 to 3908 as well as sequences which are complementary thereto.
  • polynucleotides are at least 8, 10, 12, 15, 18, 19, 20,
  • polynucleotides comprise a map-related biallelic marker.
  • the 1st or the 2nd allele of the biallelic markers disclosed in the SEQ ID No. may be specified as being present at the map-related biallelic marker.
  • Conditions of high and intermediate stringency are further described in LILC.4.
  • the primers of the present invention may be designed from the disclosed sequences using any method known in the art.
  • a preferred set of primers is fashioned such that the 3' end of the contiguous span of identity with the sequences of the Sequence Listing is present at the 3' end of the primer.
  • Such a configuration allows the 3' end of the primer to hybridize to a selected nucleic acid sequence and dramatically increases the efficiency of the primer for amplification or sequencing reactions.
  • the contiguous span is found in one of the sequences described in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or the complements thereof.
  • the invention also relates to polynucleotides consisting of, consisting essentially of, or comprising a contiguous span of nucleotides of a sequence from SEQ ID Nos.
  • Allele specific primers may be designed such that a biallelic marker is at the 3' end of the contiguous span and the contiguous span is present at the 3' end of the primer. Such allele specific primers tend to selectively prime an amplification or sequencing reaction so long as they are used with a nucleic acid sample that contains one of the two alleles present at a biallelic marker.
  • the 3' end of a primer of the invention may be located within, or at least 2, 4, 6, 8, or 10 nucleotides upstream of, a map-related biallelic marker in said sequence (to the extent that this distance is consistent with the particular Sequence ID), or at any other location which is appropriate for its intended use in sequencing, amplification or the location of novel sequences or markers.
  • Primers with their 3' ends located 1 nucleotide upstream of a map-related biallelic marker have a special utility in microsequencing assays.
  • Preferred microsequencing primers are described in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, and 3735 to 3908, where for each of SEQ ID Nos.
  • the sense microsequencing primer contains the complement of the 19 nucleotides having their 3' ends located 1 nucleotide upstream of the polymorphic base of the respective SEQ ID No., and where the antisense microsequencing primer contains the complement of the 19 nucleotides of the complementary strand, nucleotides of the primer having their 3' end located 1 nucleotide upstream of the polymorphic base on the complementary strand to the respective SEQ ID No.
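The geometry of a 19-nucleotide microsequencing primer pair around a polymorphic base can be sketched as below. Several points are assumptions: the marker sequence is taken to be written 5' to 3' with the polymorphic base at 1-based position 24 (the usual position given for the listed sequences), the "sense" primer is written as the 19 bases immediately 5' of the polymorphic base on the listed strand, and the "antisense" primer as the reverse complement of the 19 bases immediately 3' of it. The passage is ambiguous about which strand each primer is written on, so this is one plausible reading, and the function names are hypothetical.

```python
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement a DNA sequence and reverse it so the result reads 5' to 3'."""
    return seq.upper().translate(_DNA_COMPLEMENT)[::-1]

def microsequencing_primers(seq: str, poly_pos: int = 24, primer_len: int = 19):
    """Return (sense_primer, antisense_primer) flanking the polymorphic base.

    seq: marker sequence, 5' to 3'.
    poly_pos: 1-based position of the polymorphic base (assumed 24 by default).
    primer_len: primer length in nucleotides (19 in the text).
    """
    idx = poly_pos - 1  # 0-based index of the polymorphic base
    if idx - primer_len < 0 or idx + 1 + primer_len > len(seq):
        raise ValueError("sequence too short for primers of this length")
    # Sense primer: its 3' end sits immediately upstream of the polymorphic base.
    sense = seq[idx - primer_len:idx].upper()
    # Antisense primer: same geometry on the complementary strand.
    antisense = reverse_complement(seq[idx + 1:idx + 1 + primer_len])
    return sense, antisense
```

In a single-base extension assay, either primer would then be extended by one labeled nucleotide whose identity reveals the allele at the polymorphic base.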
  • the probes of the present invention may be designed from the disclosed sequences for any method known in the art, particularly methods which allow for testing if a particular sequence or marker disclosed herein is present.
  • a preferred set of probes may be designed for use in the hybridization assays of the invention in any manner known in the art such that they selectively bind to one allele of a biallelic marker, but not the other under any particular set of assay conditions.
  • Preferred hybridization probes may consist of, consist essentially of, or comprise a contiguous span of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, and 3735 to 3908, or the complement thereof, which ranges in length from at least 8, 10, 12, 15, 18, 19, 20,
  • the 1st allele or 2nd allele of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, and 3735 to 3908 may be specified as being present at the biallelic marker site.
  • said biallelic marker may be within 6, 5, 4, 3, 2, or 1 nucleotides of the center of the hybridization probe or at the center of said probe.
  • Any of the polynucleotides of the present invention can be labeled, if desired, by incorporating a label detectable by spectroscopic, photochemical, biochemical, immunochemical, or chemical means.
  • useful labels include radioactive substances, fluorescent dyes or biotin.
  • polynucleotides are labeled at their 3' and 5' ends.
  • a label can also be used to capture the primer, so as to facilitate the immobilization of either the primer or a primer extension product, such as amplified DNA, on a solid support.
  • a capture label is attached to the primers or probes and can be a specific binding member which forms a binding pair with the solid phase reagent's specific binding member (e.g. biotin and streptavidin). Therefore, depending upon the type of label carried by a polynucleotide or a probe, it may be employed to capture or to detect the target DNA. Further, it will be understood that the polynucleotides, primers or probes provided herein may, themselves, serve as the capture label. For example, in the case where a solid phase reagent's binding member is a nucleic acid sequence, it may be selected such that it binds a complementary portion of a primer or probe to thereby immobilize the primer or probe to the solid phase.
  • where a polynucleotide probe itself serves as the binding member, the probe will contain a sequence or "tail" that is not complementary to the target.
  • where a polynucleotide primer itself serves as the capture label, at least a portion of the primer will be free to hybridize with a nucleic acid on a solid phase.
  • DNA labeling techniques are well known to the skilled technician.
  • Solid supports are known to those skilled in the art and include the walls of wells of a reaction tray, test tubes, polystyrene beads, magnetic beads, nitrocellulose strips, membranes, microparticles such as latex particles, sheep (or other animal) red blood cells, duracytes® and others.
  • the solid support is not critical and can be selected by one skilled in the art.
  • latex particles, microparticles, magnetic or nonmagnetic beads, membranes, plastic tubes, walls of microtiter wells, glass or silicon chips, sheep (or other suitable animal's) red blood cells and duracytes are all suitable examples.
  • a solid support refers to any material which is insoluble, or can be made insoluble by a subsequent reaction.
  • the solid support can be chosen for its intrinsic ability to attract and immobilize the capture reagent.
  • the solid phase can retain an additional receptor which has the ability to attract and immobilize the capture reagent.
  • the additional receptor can include a charged substance that is oppositely charged with respect to the capture reagent itself or to a charged substance conjugated to the capture reagent.
  • the receptor molecule can be any specific binding member which is immobilized upon (attached to) the solid support and which has the ability to immobilize the capture reagent through a specific binding reaction.
  • the receptor molecule enables the indirect binding of the capture reagent to a solid support material before the performance of the assay or during the performance of the assay.
  • the solid phase thus can be a plastic, derivatized plastic, magnetic or non-magnetic metal, glass or silicon surface of a test tube, microtiter well, sheet, bead, microparticle, chip, sheep (or other suitable animal's) red blood cells, duracytes® and other configurations known to those of ordinary skill in the art.
  • polynucleotides of the invention can be attached to or immobilized on a solid support individually or in groups of at least 2, 5, 8, 10, 12, 15, 20, or 25 distinct polynucleotides of the invention to a single solid support.
  • polynucleotides other than those of the invention may be attached to the same solid support as one or more polynucleotides of the invention.
  • any polynucleotide provided herein may be attached in overlapping areas or at random locations on the solid support.
  • the polynucleotides of the invention may be attached in an ordered array wherein each polynucleotide is attached to a distinct region of the solid support which does not overlap with the attachment site of any other polynucleotide.
  • such an ordered array of polynucleotides is designed to be "addressable" where the distinct locations are recorded and can be accessed as part of an assay procedure.
  • Addressable polynucleotide arrays typically comprise a plurality of different oligonucleotide probes that are coupled to a surface of a substrate in different known locations.
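The idea of an ordered, addressable array can be sketched as a small data structure. This is only an illustration: the class name `AddressableArray`, the row/column addressing scheme and the probe sequences are assumptions made for the example and do not come from the specification.

```python
from typing import Dict, Tuple


class AddressableArray:
    """Toy model of an ordered array: at most one probe per recorded address."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        self._probes: Dict[Tuple[int, int], str] = {}

    def attach(self, row: int, col: int, probe: str) -> None:
        # Reject addresses that fall outside the support.
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("address lies outside the support")
        # Attachment regions must not overlap, so an address is used only once.
        if (row, col) in self._probes:
            raise ValueError("address already occupied")
        self._probes[(row, col)] = probe

    def probe_at(self, row: int, col: int) -> str:
        # The recorded location is what makes the array "addressable".
        return self._probes[(row, col)]


# Hypothetical probes differing at a single central base.
chip = AddressableArray(rows=2, cols=2)
chip.attach(0, 0, "ACGTACGTAC")
chip.attach(0, 1, "ACGTGCGTAC")
```

Because every probe is stored under a known address, a signal read at a given location during an assay can be traced back to a single probe.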
  • VLSIPS™, in which, typically, probes are immobilized in a high density array on a solid surface of a chip.
  • examples of VLSIPS™ technologies are provided in US Patents 5,143,854 and 5,412,087 and in PCT Publications WO 90/15070, WO 92/10092 and WO 95/11995, which describe methods for forming oligonucleotide arrays through techniques such as light-directed synthesis techniques.
  • Oligonucleotide arrays may comprise at least one of the sequences selected from the group consisting of SEQ ID Nos.
  • Oligonucleotide arrays may also comprise at least one of the sequences selected from the group consisting of SEQ ID Nos.
  • arrays may also comprise at least one of the sequences selected from the group consisting of SEQ ID Nos.
  • the oligonucleotide array may comprise at least one of the sequences selected from the group consisting of SEQ ID Nos.
  • Sequence ID for determining whether a sample contains one or more alleles of the biallelic markers of the present invention.
  • Each DNA chip can contain thousands to millions of individual synthetic DNA probes arranged in a grid-like pattern and miniaturized to the size of a dime.
  • the efficiency of hybridization of nucleic acids in the sample with the probes attached to the chip may be improved by using polyacrylamide gel pads isolated from one another by hydrophobic regions in which the DNA probes are covalently linked to an acrylamide matrix.
  • the polymorphic bases present in the biallelic marker or markers of the sample nucleic acids are determined as follows. Probes which contain at least a portion of one or more of the biallelic markers of the present invention are synthesized either in situ or by conventional synthesis and immobilized on an appropriate chip using methods known to the skilled technician.
  • any one or more alleles of the biallelic markers described herein (SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto) or fragments thereof containing the polymorphic bases, may be fixed to a solid support, such as a microchip or other immobilizing surface.
  • the fragments of these nucleic acids may comprise at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides of the biallelic markers described herein.
  • the fragments include the polymorphic bases of the biallelic markers.
  • a nucleic acid sample is applied to the immobilizing surface and analyzed to determine the identities of the polymorphic bases of one or more of the biallelic markers.
  • the solid support may also include one or more of the amplification primers described herein, or fragments comprising at least 10, at least 15, or at least 20 consecutive nucleotides thereof, for generating an amplification product containing the polymorphic bases of the biallelic markers to be analyzed in the sample.
  • Another embodiment of the present invention is a solid support which includes one or more of the microsequencing primers of the invention, or fragments comprising at least 10, at least 15, or at least 20 consecutive nucleotides thereof and having a 3' terminus immediately upstream of the polymorphic base of the corresponding biallelic marker, for determining the identity of the polymorphic base of the one or more biallelic markers fixed to the solid support.
  • one embodiment of the present invention is an array of nucleic acids fixed to a solid support, such as a microchip, bead, or other immobilizing surface, comprising one or more of the biallelic markers in the maps of the present invention or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base.
  • the array may comprise 1, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, or 3000 of the biallelic markers selected from the group consisting of SEQ ID Nos.: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto, or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base.
  • Another embodiment of the present invention is an array comprising amplification primers for generating amplification products containing the polymorphic bases of one or more, at least five, at least 10, at least 20, at least 100, at least 200, at least 300, at least 400, or more than 400 of the biallelic markers in the maps of the present invention.
  • the array may comprise amplification primers for generating amplification products containing the polymorphic bases of at least 1, 5, 10, 20, 50, 100, 200, 300, 400, 500, 1000, 2000, or 3000 of the biallelic markers selected from the group consisting of SEQ ID Nos.: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
  • the amplification primers included in the array are capable of amplifying the biallelic marker sequences to be detected in the nucleic acid sample applied to the array (i.e. the amplification primers correspond to the biallelic markers affixed to the array - see Table 1).
  • the arrays may include one or more of the amplification primers of SEQ ID Nos.: 3935 to 7842, 7866 to 11773, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125, 10126 to 11599, and 11600 to 11773 corresponding to the one or more biallelic markers of SEQ ID Nos. 1 to 3908, 1 to
  • Another embodiment of the present invention is an array which includes microsequencing primers capable of determining the identity of the polymorphic bases of at least 1, 5, 10, 20, 50, 100, 200, 300, 500, 1000, 2000, or 3000 of the biallelic markers of the present invention.
  • the array may comprise microsequencing primers capable of determining the identity of the polymorphic bases of one or more, at least five, at least 10, at least 20, at least 100, at least 200, at least 300, at least 400, or more than 400 of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
  • Arrays containing any combination of the above nucleic acids which permits the specific detection or identification of the polymorphic bases of the biallelic markers in the maps of the present invention, including any combination of biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto, are also within the scope of the present invention.
  • the array may comprise both the biallelic markers and amplification primers capable of generating amplification products containing the polymorphic bases of the biallelic markers.
  • the array may comprise both amplification primers capable of generating amplification products containing the polymorphic bases of the biallelic markers and microsequencing primers capable of determining the identities of the polymorphic bases of these markers.
  • arrays comprising specific groups of biallelic markers and, in some embodiments, specific amplification primers and microsequencing primers; it will be appreciated that the present invention encompasses arrays including any biallelic marker, group of biallelic markers, amplification primer, group of amplification primers, microsequencing primer, or group of microsequencing primers described herein, as well as any combination of the preceding nucleic acids.
  • the present invention also encompasses diagnostic kits comprising one or more polynucleotides of the invention, optionally with a portion or all of the necessary reagents and instructions for genotyping a test subject by determining the identity of a nucleotide at a map-related biallelic marker.
  • the polynucleotides of a kit may optionally be attached to a solid support, or be part of an array or addressable array of polynucleotides.
  • the kit may provide for the determination of the identity of the nucleotide at a marker position by any method known in the art including, but not limited to, a sequencing assay method, a microsequencing assay method, a hybridization assay method, or an allele specific amplification method.
  • such a kit may include instructions for scoring the results of the determination with respect to the test subject's risk of contracting a disease, likely response to an agent acting on a disease, or chances of suffering from side effects of an agent acting on a disease.
  • Any of a variety of methods can be used to screen a genomic fragment for single nucleotide polymorphisms, such as differential hybridization with oligonucleotide probes, detection of changes in the mobility measured by gel electrophoresis, or direct sequencing of the amplified nucleic acid.
  • a preferred method for identifying biallelic markers involves comparative sequencing of genomic DNA fragments from an appropriate number of unrelated individuals.
  • DNA samples from unrelated individuals are pooled together, following which the genomic DNA of interest is amplified and sequenced.
  • the nucleotide sequences thus obtained are then analyzed to identify significant polymorphisms.
  • One of the major advantages of this method resides in the fact that the pooling of the DNA samples substantially reduces the number of DNA amplification reactions and sequencing reactions that must be carried out.
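The saving from pooling can be made concrete with simple arithmetic. The sketch below assumes one amplification (or sequencing) reaction per pool per genomic fragment; the function name `reactions_needed` and the example numbers are illustrative assumptions, not figures from the specification.

```python
import math


def reactions_needed(n_individuals: int, n_fragments: int, pool_size: int = 1) -> int:
    """Reactions required when samples are grouped into pools of pool_size.

    pool_size=1 corresponds to amplifying every individual separately.
    """
    n_pools = math.ceil(n_individuals / pool_size)
    # Assumption: each pool needs one reaction per genomic fragment screened.
    return n_pools * n_fragments


# Hypothetical screen: 100 individuals, 500 genomic fragments.
unpooled = reactions_needed(100, 500)
pooled = reactions_needed(100, 500, pool_size=100)
```

Under these assumptions the reaction count scales with the number of pools rather than the number of individuals.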
  • this method is sufficiently sensitive so that a biallelic marker obtained thereby usually demonstrates a sufficient frequency of its less common allele to be useful in conducting association studies.
  • the frequency of the least common allele of a biallelic marker identified by this method is at least 10%.
  • the DNA samples are not pooled and are therefore amplified and sequenced individually.
  • This method is usually preferred when biallelic markers need to be identified in order to perform association studies within candidate genes.
  • highly relevant gene regions such as promoter regions or exon regions may be screened for biallelic markers.
  • a biallelic marker obtained using this method may show a lower degree of informativeness for conducting association studies, e.g. if the frequency of its less frequent allele is less than about 10%.
  • Such a biallelic marker will, however, be sufficiently informative to conduct association studies, and it will further be appreciated that including less informative biallelic markers in the genetic analysis studies of the present invention may allow, in some cases, the direct identification of causal mutations, which may, depending on their penetrance, be rare mutations.
Genomic DNA samples
  • The genomic DNA samples from which the biallelic markers of the present invention are generated are preferably obtained from unrelated individuals corresponding to a heterogeneous population of known ethnic background.
  • the number of individuals from whom DNA samples are obtained can vary substantially, preferably from about 10 to about 1000, more preferably from about 50 to about 200 individuals.
  • DNA samples are collected from at least about 100 individuals in order to have sufficient polymorphic diversity in a given population to identify as many markers as possible and to generate statistically significant results.
  • test samples include biological samples, which can be tested by the methods of the present invention described herein, and include human and animal body fluids such as whole blood, serum, plasma, cerebrospinal fluid, urine, lymph fluids, and various external secretions of the respiratory, intestinal and genitourinary tracts, tears, saliva, milk, white blood cells, myelomas and the like; biological fluids such as cell culture supernatants; fixed tissue specimens including tumor and non-tumor tissue and lymph node tissues; bone marrow aspirates and fixed cell specimens.
  • the preferred source of genomic DNA used in the present invention is peripheral venous blood of each donor. Techniques to prepare genomic DNA from biological samples are well known to the skilled technician. Details of a preferred embodiment are provided in Example 27. The person skilled in the art can choose to amplify pooled or unpooled DNA samples.
II.B. DNA Amplification
  • the identification of biallelic markers in a sample of genomic DNA may be facilitated through the use of DNA amplification methods.
  • DNA samples can be pooled or unpooled for the amplification step.
  • DNA amplification techniques are well known to those skilled in the art.
  • Various methods to amplify DNA fragments carrying biallelic markers are further described hereinafter in III.B.
  • the PCR technology is the preferred amplification technique used to identify new biallelic markers.
  • biallelic markers are identified using genomic sequence information generated by the inventors. Genomic DNA fragments, such as the inserts of the BAC clones described above, are sequenced and used to design primers for the amplification of 500 bp fragments.
  • Primers may be designed using the OSP software (Hillier L. and Green P., 1991). All primers may contain, upstream of the specific target bases, a common oligonucleotide tail that serves as a sequencing primer. Those skilled in the art are familiar with primer extensions, which can be used for these purposes.
  • genomic sequences of candidate genes are available in public databases, allowing direct screening for biallelic markers.
  • Preferred primers, useful for the amplification of genomic sequences encoding the candidate genes, focus on promoters, exons and splice sites of the genes.
  • a biallelic marker present in these functional regions of the gene has a higher probability of being a causal mutation.
  • Preferred primers include those disclosed in SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to
  • the amplification products generated as described above are then sequenced using any method known and available to the skilled technician.
  • Methods for sequencing DNA using either the dideoxy-mediated method (Sanger method) or the Maxam-Gilbert method are widely known to those of ordinary skill in the art. Such methods are for example disclosed in Maniatis et al. (Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press, Second Edition, 1989). Alternative approaches include hybridization to high-density DNA probe arrays as described in Chee et al. (Science 274, 610, 1996).
  • the amplified DNA is subjected to automated dideoxy terminator sequencing reactions using a dye-primer cycle sequencing protocol.
  • the products of the sequencing reactions are run on sequencing gels and the sequences are determined using gel image analysis.
  • the polymorphism search is based on the presence of superimposed peaks in the electrophoresis pattern resulting from different bases occurring at the same position.
  • the two peaks corresponding to a biallelic site present distinct colors corresponding to two different nucleotides at the same position on the sequence.
  • the presence of two peaks can be an artifact due to background noise.
  • the two DNA strands are sequenced and a comparison between the peaks is carried out.
  • In order to be registered as a polymorphic sequence, the polymorphism has to be detected on both strands.
  • the above procedure permits those amplification products which contain biallelic markers to be identified.
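The both-strand rule can be sketched as a simple filter. The representation below is an assumption made for illustration: each strand's trace is reduced to a mapping from sequence position to the set of bases called at that position, with the reverse-strand calls already converted to forward-strand coordinates and bases. The function name `confirmed_biallelic_sites` is hypothetical.

```python
from typing import Dict, Set, Tuple


def confirmed_biallelic_sites(
    forward_calls: Dict[int, Set[str]],
    reverse_calls: Dict[int, Set[str]],
) -> Dict[int, Tuple[str, ...]]:
    """Keep positions where both strands show the same two superimposed bases."""
    sites: Dict[int, Tuple[str, ...]] = {}
    for position, forward_bases in forward_calls.items():
        reverse_bases = reverse_calls.get(position, set())
        # A double peak seen on one strand only is treated as possible noise.
        if len(forward_bases) == 2 and forward_bases == reverse_bases:
            sites[position] = tuple(sorted(forward_bases))
    return sites


# Hypothetical calls: position 25 shows two peaks on the forward strand only.
forward = {10: {"A", "G"}, 25: {"C", "T"}, 40: {"A"}}
reverse = {10: {"A", "G"}, 25: {"C"}, 40: {"A"}}
sites = confirmed_biallelic_sites(forward, reverse)
```

In this toy input only position 10 is registered, since it is the only double peak observed on both strands.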
  • the detection limit for the frequency of biallelic polymorphisms detected by sequencing pools of 100 individuals is approximately 0.1 for the minor allele, as verified by sequencing pools of known allelic frequencies.
  • more than 90% of the biallelic polymorphisms detected by the pooling method have a frequency for the minor allele higher than 0.25. Therefore, the biallelic markers selected by this method have a frequency of at least 0.1 for the minor allele and less than 0.9 for the major allele.
  • Preferably at least 0.2 for the minor allele and less than 0.8 for the major allele, more preferably at least 0.3 for the minor allele and less than 0.7 for the major allele, thus a heterozygosity rate higher than 0.18, preferably higher than 0.32, more preferably higher than 0.42.
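The heterozygosity thresholds quoted above follow from the allele frequencies if one assumes Hardy-Weinberg proportions, where the expected heterozygosity of a biallelic marker is 2pq. The text does not state this formula explicitly, so the sketch below is an interpretation; the function name is illustrative.

```python
def expected_heterozygosity(minor_allele_freq: float) -> float:
    """Expected heterozygosity 2pq for a biallelic marker (Hardy-Weinberg assumption)."""
    q = minor_allele_freq
    p = 1.0 - q  # frequency of the major allele
    return 2.0 * p * q


# Minor allele frequencies named in the text, rounded to two decimals.
thresholds = {q: round(expected_heterozygosity(q), 2) for q in (0.1, 0.2, 0.3)}
```

Minor allele frequencies of 0.1, 0.2 and 0.3 give 0.18, 0.32 and 0.42, matching the three heterozygosity rates in the text.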
  • where biallelic markers are detected by sequencing individual DNA samples, the frequency of the minor allele of such a biallelic marker may be less than 0.1.
  • the markers carried by the same fragment of genomic DNA need not necessarily be ordered with respect to one another within the genomic fragment to conduct association studies. However, in some embodiments of the present invention, the order of biallelic markers carried by the same fragment of genomic DNA is determined.
  • the polymorphisms are evaluated for their usefulness as genetic markers by validating that both alleles are present in a population.
  • Validation of the biallelic markers is accomplished by genotyping a group of individuals by a method of the invention and demonstrating that both alleles are present.
  • Microsequencing is a preferred method of genotyping alleles.
  • the validation by genotyping step may be performed on individual samples derived from each individual in the group or by genotyping a pooled sample derived from more than one individual.
  • the group can be as small as one individual if that individual is heterozygous for the allele in question.
  • the group contains at least three individuals, more preferably the group contains five or six individuals, so that a single validation test will be more likely to result in the validation of more of the biallelic markers that are being tested. It should be noted, however, that when the validation test is performed on a small group it may result in a false negative result if as a result of sampling error none of the individuals tested carries one of the two alleles. Thus, the validation process is less useful in demonstrating that a particular initial result is an artifact, than it is at demonstrating that there is a bona fide biallelic marker at a particular position in a sequence. All of the genotyping, haplotyping, association, and interaction study methods of the invention may optionally be performed solely with validated biallelic markers.
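The false-negative risk of a small validation group can be estimated with a short calculation. The sketch assumes unrelated individuals sampled at random, each contributing two independent allele copies (Hardy-Weinberg proportions); these assumptions and the function name are not taken from the specification.

```python
def false_negative_probability(minor_allele_freq: float, n_individuals: int) -> float:
    """Probability that one of the two alleles is absent from the whole group."""
    n_copies = 2 * n_individuals  # two allele copies per diploid individual
    q = minor_allele_freq
    # Either every sampled copy is the major allele, or every copy is the minor one.
    return (1.0 - q) ** n_copies + q ** n_copies


# Hypothetical marker with a minor allele frequency of 0.3.
risk_three = false_negative_probability(0.3, 3)
risk_six = false_negative_probability(0.3, 6)
```

Under these assumptions, doubling the group from three to six individuals lowers the chance of missing an allele by roughly an order of magnitude, which is consistent with the preference for groups of five or six.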
  • the validated biallelic markers are further evaluated for their usefulness as genetic markers by determining the frequency of the least common allele at the biallelic marker site.
  • the determination of the frequency of the least common allele is accomplished by genotyping a group of individuals by a method of the invention and demonstrating that both alleles are present.
  • This determination of frequency by genotyping step may be performed on individual samples derived from each individual in the group or by genotyping a pooled sample derived from more than one individual.
  • the group must be large enough to be representative of the population as a whole.
  • the group contains at least 20 individuals, more preferably the group contains at least 50 individuals, most preferably the group contains at least 100 individuals. Of course, the larger the group, the greater the accuracy of the frequency determination because of reduced sampling error.
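The link between group size and accuracy can be sketched by estimating the frequency of the least common allele from individual genotypes together with a binomial standard error. The genotype encoding (two-letter strings), the function name and the use of a simple binomial error model are assumptions made for this example.

```python
import math
from collections import Counter
from typing import List, Tuple


def minor_allele_frequency(genotypes: List[str]) -> Tuple[str, float, float]:
    """Return (minor allele, its frequency, binomial standard error)."""
    # Count every allele copy, e.g. "AG" contributes one A and one G.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles in the group")
    minor, n_minor = min(counts.items(), key=lambda item: item[1])
    n_copies = sum(counts.values())
    freq = n_minor / n_copies
    # Assumption: allele copies behave as independent binomial draws.
    std_err = math.sqrt(freq * (1.0 - freq) / n_copies)
    return minor, freq, std_err


# Hypothetical group of 100 individuals.
group = ["AA"] * 49 + ["AG"] * 42 + ["GG"] * 9
minor, freq, std_err = minor_allele_frequency(group)
```

The standard error shrinks with the square root of the number of allele copies sampled, which is why larger groups give a more accurate frequency estimate.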
  • a biallelic marker wherein the frequency of the less common allele is 30% or more is termed a "high quality biallelic marker." All of the genotyping, haplotyping, association, and interaction study methods of the invention may optionally be performed solely with high quality biallelic markers.
III. Methods Of Genotyping An Individual For Biallelic Markers
  • Methods are provided to genotype a biological sample for one or more biallelic markers of the present invention, all of which may be performed in vitro.
  • Such methods of genotyping comprise determining the identity of a nucleotide at a map-related biallelic marker by any method known in the art. These methods find use in genotyping case-control populations in association studies as well as individuals in the context of detection of alleles of biallelic markers which are known to be associated with a given trait, in which case both copies of the biallelic marker present in the individual's genome are determined so that an individual may be classified as homozygous or heterozygous for a particular allele.
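Classifying an individual once both copies of a marker have been determined is a simple lookup. The function name and the label used for non-carriers are illustrative assumptions.

```python
def classify_genotype(copy_1: str, copy_2: str, allele: str) -> str:
    """Classify an individual for one allele given both copies of the marker."""
    # Booleans add as integers: 0, 1 or 2 copies of the allele of interest.
    carried = (copy_1 == allele) + (copy_2 == allele)
    return {2: "homozygous", 1: "heterozygous", 0: "allele absent"}[carried]


# Hypothetical individual carrying A on one chromosome and G on the other.
status = classify_genotype("A", "G", allele="G")
```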
  • genotyping methods can be performed on nucleic acid samples derived from a single individual or on pooled DNA samples.
  • Genotyping can be performed using similar methods as those described above for the identification of the biallelic markers, or using other genotyping methods such as those further described below. In preferred embodiments, the comparison of sequences of amplified genomic fragments from different individuals is used to identify new biallelic markers whereas microsequencing is used for genotyping known biallelic markers in diagnostic and association study applications.
  • nucleic acids, in purified or non-purified form, can be utilized as the starting nucleic acid, provided they contain or are suspected of containing the specific nucleic acid sequence desired.
  • DNA or RNA may be extracted from cells, tissues, body fluids and the like as described above in II.A. While nucleic acids for use in the genotyping methods of the invention can be derived from any mammalian source, the test subjects and individuals from which nucleic acid samples are taken are generally understood to be human.
III.B. Amplification Of DNA Fragments Comprising Biallelic Markers
  • Methods and polynucleotides are provided to amplify a segment of nucleotides comprising one or more biallelic markers of the present invention. It will be appreciated that amplification of DNA fragments comprising biallelic markers may be used in various methods and for various purposes and is not restricted to genotyping. Nevertheless, many genotyping methods, although not all, require the previous amplification of the DNA region carrying the biallelic marker of interest. Such methods specifically increase the concentration or total number of sequences that span the biallelic marker or include that site and sequences located either distal or proximal to it. Diagnostic assays may also rely on amplification of DNA segments carrying a biallelic marker of the present invention.
  • Amplification of DNA may be achieved by any method known in the art.
  • Amplification methods which can be utilized herein include but are not limited to Ligase Chain Reaction (LCR) as described in EP A 320 308 and EP A 439 182, Gap LCR (Wolcott, M.J., Clin. Microbiol. Rev. 5:370-386), the so-called "NASBA" or "3SR" technique described in Guatelli J.C. et al. (Proc. Natl. Acad. Sci. USA 87:1874-1878, 1990) and in Compton J. (Nature 350:91-92, 1991), Q-beta amplification as described in European Patent Application no 4544610, strand displacement amplification as described in Walker et al. (Clin
  • LCR and Gap LCR are exponential amplification techniques; both depend on DNA ligase to join adjacent primers annealed to a DNA molecule.
  • probe pairs are used which include two primary (first and second) and two secondary (third and fourth) probes, all of which are employed in molar excess to target.
  • the first probe hybridizes to a first segment of the target strand and the second probe hybridizes to a second segment of the target strand, the first and second segments being contiguous so that the primary probes abut one another in 5' phosphate-3' hydroxyl relationship, and so that a ligase can covalently fuse or ligate the two probes into a fused product.
  • a third (secondary) probe can hybridize to a portion of the first probe and a fourth (secondary) probe can hybridize to a portion of the second probe in a similar abutting fashion.
  • the secondary probes also will hybridize to the target complement in the first instance.
  • the third and fourth probes which can be ligated to form a complementary, secondary ligated product.
  • Gap LCR is a version of LCR where the probes are not adjacent but are separated by 2 to 3 bases.
  • RT-PCR (reverse transcription polymerase chain reaction)
  • AGLCR is a modification of GLCR that allows the amplification of RNA.
  • PCR technology is the preferred amplification technique used in the present invention.
  • a variety of PCR techniques are familiar to those skilled in the art. For a review of PCR technology, see Molecular Cloning to Genetic Engineering, White, B.A. Ed., in Methods in
  • PCR primers on either side of the nucleic acid sequences to be amplified are added to a suitably prepared nucleic acid sample along with dNTPs and a thermostable polymerase such as Taq polymerase, Pfu polymerase, or Vent polymerase.
  • the nucleic acid in the sample is denatured and the PCR primers are specifically hybridized to complementary nucleic acid sequences in the sample.
  • the hybridized primers are extended. Thereafter, another cycle of denaturation, hybridization, and extension is initiated. The cycles are repeated multiple times to produce an amplified fragment containing the nucleic acid sequence between the primer sites.
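The effect of repeating the denaturation, hybridization and extension cycle can be illustrated with an idealized growth model. The per-cycle efficiency parameter and the function name are assumptions added for the example; the text itself does not quantify yield, and real reactions plateau as reagents are consumed.

```python
def amplified_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized PCR yield: each cycle multiplies the target by (1 + efficiency).

    efficiency=1.0 models perfect doubling; lower values model incomplete extension.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# Hypothetical reaction: a single template molecule, 30 cycles, perfect doubling.
ideal_yield = amplified_copies(1, 30)
```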
  • PCR has further been described in several patents including US Patents 4,683,195,
  • the identification of biallelic markers as described above allows the design of appropriate oligonucleotides, which can be used as primers to amplify DNA fragments comprising the biallelic markers of the present invention.
  • Amplification can be performed using the primers initially used to discover new biallelic markers which are described herein or any set of primers allowing the amplification of a DNA fragment comprising a biallelic marker of the present invention.
  • Primers can be prepared by any suitable method, for example, direct chemical synthesis by a method such as the phosphodiester method of Narang S.A. et al. (Methods Enzymol.
  • the present invention provides primers for amplifying a DNA fragment containing one or more biallelic markers of the present invention.
  • Preferred amplification primers are listed in SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. It will be appreciated that the primers listed are merely exemplary and that any other set of primers which produce amplification products containing one or more biallelic markers of the present invention
  • the primers are selected to be substantially complementary to the different strands of each specific sequence to be amplified.
  • the length of the primers of the present invention can range from 8 to 100 nucleotides, preferably from 8 to 50, 8 to 30 or more preferably 8 to 25 nucleotides. Shorter primers tend to lack specificity for a target nucleic acid sequence and generally require cooler temperatures to form sufficiently stable hybrid complexes with the template.
  • the formation of stable hybrids depends on the melting temperature (Tm) of the DNA.
  • Tm depends on the length of the primer, the ionic strength of the solution and the G+C content.
  • the G+C content of the amplification primers of the present invention preferably ranges between 10 and 75%, more preferably between 35 and 60%, and most preferably between 40 and 55%.
  • the appropriate length for primers under a particular set of assay conditions may be empirically determined by one of skill in the art.
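The dependence of Tm on primer length and G+C content can be illustrated with the Wallace rule, Tm ≈ 2(A+T) + 4(G+C) in degrees Celsius. This rule is a rough approximation for short oligonucleotides that ignores ionic strength; it is brought in here as an assumption and is not the method named in the specification. The example primer is made up.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature (deg C) from the Wallace rule: 2(A+T) + 4(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def in_most_preferred_ranges(primer: str) -> bool:
    """Check the 8 to 25 nt length range and the 40 to 55% G+C range quoted above."""
    return 8 <= len(primer) <= 25 and 0.40 <= gc_fraction(primer) <= 0.55


# Hypothetical 20-mer with five each of A, T, G and C.
primer = "ATGCGTACCTGAAGCTTGAC"
tm = wallace_tm(primer)
```

Because the rule weights G and C twice as heavily as A and T, two primers of equal length can differ substantially in estimated Tm, which is one reason the text ties primer length to assay conditions.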
  • amplified segments carrying biallelic markers can range in size from at least about 25 bp to 35 kbp. Amplification fragments from 25-3000 bp are typical, fragments from 50-1000 bp are preferred and fragments from 100-600 bp are highly preferred. It will be appreciated that amplification primers for the biallelic markers may be any sequence which allows the specific amplification of any DNA fragment carrying the markers. Amplification primers may be labeled or immobilized on a solid support as described in I.
  • any method known in the art can be used to identify the nucleotide present at a biallelic marker site. Since the biallelic marker allele to be detected has been identified and specified in the present invention, detection will prove simple for one of ordinary skill in the art by employing any of a number of techniques. Many genotyping methods require the previous amplification of the DNA region carrying the biallelic marker of interest. While the amplification of target or signal is often preferred at present, ultrasensitive detection methods which do not require amplification are also encompassed by the present genotyping methods.
  • Methods well-known to those skilled in the art that can be used to detect biallelic polymorphisms include methods such as conventional dot blot analyses, single strand conformational polymorphism analysis (SSCP) described by Orita et al. (Proc. Natl. Acad.
  • Preferred methods involve directly determining the identity of the nucleotide present at a biallelic marker site by sequencing assay, enzyme-based mismatch detection assay, or hybridization assay. The following is a description of some preferred methods.
  • a highly preferred method is the microsequencing technique.
  • the term "sequencing assay" is used herein to refer to polymerase extension of duplex primer/template complexes and includes both traditional sequencing and microsequencing.
  • the nucleotide present at a polymorphic site can be determined by sequencing methods.
  • DNA samples are subjected to PCR amplification before sequencing as described above. DNA sequencing methods are described in II.C.
  • the amplified DNA is subjected to automated dideoxy terminator sequencing reactions using a dye-primer cycle sequencing protocol. Sequence analysis allows the identification of the base present at the biallelic marker site.
  • a nucleotide at the polymorphic site that is unique to one of the alleles in a target DNA is detected by a single nucleotide primer extension reaction.
  • This method involves appropriate microsequencing primers which hybridize just upstream of a polymorphic base of interest in the target nucleic acid.
  • a polymerase is used to specifically extend the 3' end of the primer with one single ddNTP (chain terminator) complementary to the selected nucleotide at the polymorphic site.
  • the identity of the incorporated nucleotide is determined in any suitable way.
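The logic of single nucleotide primer extension can be sketched in a few lines. For simplicity the sketch works entirely in the coordinates of the primer strand: the target is written 5' to 3' as the strand whose sequence the primer matches, so the incorporated terminator corresponds to the target base immediately after the 3' end of the primer. The sequences and function names are made up for illustration.

```python
from typing import Iterable, List


def incorporated_base(target: str, primer: str) -> str:
    """Base added to the primer's 3' end, read from the primer-strand target sequence."""
    start = target.find(primer)
    # Require a unique, exact match so the extension site is unambiguous.
    if start == -1 or target.find(primer, start + 1) != -1:
        raise ValueError("primer must match the target exactly once")
    site = start + len(primer)  # position just downstream of the primer 3' end
    if site >= len(target):
        raise ValueError("no base downstream of the primer 3' end")
    return target[site]


def genotype_by_extension(allele_copies: Iterable[str], primer: str) -> List[str]:
    """Distinct terminators incorporated across an individual's allele copies."""
    return sorted({incorporated_base(copy, primer) for copy in allele_copies})


# Hypothetical heterozygous sample: the two copies differ at one base (A or G).
copy_a = "GATTACAGGCTAATTCG"
copy_g = "GATTACAGGCTAGTTCG"
calls = genotype_by_extension([copy_a, copy_g], primer="CAGGCTA")
```

Two distinct terminators indicate a heterozygous sample, while a single terminator indicates that both copies carry the same allele.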
  • microsequencing reactions are carried out using fluorescent ddNTPs and the extended microsequencing primers are analyzed by electrophoresis on ABI 377 sequencing machines to determine the identity of the incorporated nucleotide as described in EP 412 883.
  • capillary electrophoresis can be used in order to process a higher number of assays simultaneously.
  • An example of a typical microsequencing procedure that can be used in the context of the present invention is provided in Example 8.
  • amplified genomic DNA fragments containing polymorphic sites are incubated with a 5'-fluorescein-labeled primer in the presence of allelic dye-labeled dideoxyribonucleoside triphosphates and a modified Taq polymerase.
  • the dye-labeled primer is extended one base by the dye-terminator specific for the allele present on the template.
  • the fluorescence intensities of the two dyes in the reaction mixture are analyzed directly without separation or purification. All these steps can be performed in the same tube and the fluorescence changes can be monitored in real time.
  • the extended primer may be analyzed by MALDI-TOF Mass Spectrometry.
  • the base at the polymorphic site is identified by the mass added onto the microsequencing primer (see Haff L.A. and Smirnov I.P., Genome Research, 7:378-388, 1997).
  • Microsequencing may be achieved by the established microsequencing method or by developments or derivatives thereof.
  • Alternative methods include several solid-phase microsequencing techniques.
  • the basic microsequencing protocol is the same as described previously, except that the method is conducted as a heterogeneous phase assay, in which the primer or the target molecule is immobilized or captured onto a solid support.
  • oligonucleotides are attached to solid supports or are modified in such ways that permit affinity separation as well as polymerase extension.
  • the 5' ends and internal nucleotides of synthetic oligonucleotides can be modified in a number of different ways to permit different affinity separation approaches, e.g., biotinylation. If a single affinity group is used on the oligonucleotides, the oligonucleotides can be separated from the incorporated terminator reagent. This eliminates the need of physical or size separation. More than one oligonucleotide can be separated from the terminator reagent and analyzed simultaneously if more than one affinity group is used.
  • the affinity group need not be on the priming oligonucleotide but could alternatively be present on the template.
  • immobilization can be carried out via an interaction between biotinylated DNA and streptavidin-coated microtitration wells or avidin-coated polystyrene particles.
  • oligonucleotides or templates may be attached to a solid support in a high-density format.
  • incorporated ddNTPs can be radiolabeled (Syvanen, Clinica Chimica Acta 226:225-236, 1994) or linked to fluorescein (Livak and Hainer, Human Mutation 3:379-385, 1994).
  • the detection of radiolabeled ddNTPs can be achieved through scintillation-based techniques.
  • the detection of fluorescein-linked ddNTPs can be based on the binding of antifluorescein antibody conjugated with alkaline phosphatase, followed by incubation with a chromogenic substrate (such as p-nitrophenyl phosphate).
  • reporter-detection pairs include: ddNTP linked to dinitrophenyl (DNP) and anti-DNP alkaline phosphatase conjugate (Harju et al., Clin. Chem. 39/11 2282-2287, 1993) or biotinylated ddNTP and horseradish peroxidase-conjugated streptavidin with o-phenylenediamine as a substrate (WO 92/15712).
  • DNP dinitrophenyl
  • Pastinen et al. (Genome Research 7:606-614, 1997) describe a method for multiplex detection of single nucleotide polymorphism in which the solid phase minisequencing principle is applied to an oligonucleotide array format.
  • High-density arrays of DNA probes attached to a solid support (DNA chips) are further described in III.C.5.
  • the present invention provides polynucleotides and methods to genotype one or more biallelic markers of the present invention by performing a microsequencing assay.
  • any primer having a 3' end immediately adjacent to a polymorphic nucleotide may be used as a microsequencing primer.
  • microsequencing analysis may be performed for any biallelic marker or any combination of biallelic markers of the present invention.
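As an illustration of the statement that any primer whose 3' end is immediately adjacent to the polymorphic nucleotide may serve as a microsequencing primer, the following minimal sketch derives such primers from a sequence and a marker position. The function names, the 0-based `marker_index` convention and the default primer length of 19 are assumptions made for this example and are not taken from the specification.

```python
# Hypothetical helpers: names and the default length are illustrative only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def upstream_primer(seq, marker_index, length=19):
    """Primer on the given strand whose 3' end is the base just 5' of the marker."""
    start = marker_index - length
    if start < 0:
        raise ValueError("not enough sequence upstream of the marker")
    return seq[start:marker_index].upper()


def downstream_primer(seq, marker_index, length=19):
    """Primer on the opposite strand whose 3' end abuts the marker from the other side."""
    end = marker_index + 1 + length
    if end > len(seq):
        raise ValueError("not enough sequence downstream of the marker")
    return reverse_complement(seq[marker_index + 1:end])
```

Either primer would then be extended by a single ddNTP complementary to the base at `marker_index` on the strand it anneals to.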
  • One aspect of the present invention is a solid support which includes one or more microsequencing primers comprising nucleotides complementary to the nucleotide sequences of SEQ ID Nos. 1 to 3908, 1 to 2260,
  • the present invention provides polynucleotides and methods to determine the allele of one or more biallelic markers of the present invention in a biological sample, by mismatch detection assays based on polymerases and/or ligases. These assays are based on the specificity of polymerases and ligases.
  • Discrimination between the two alleles of a biallelic marker can also be achieved by allele specific amplification, a selective strategy, whereby one of the alleles is amplified without amplification of the other allele. This is accomplished by placing a polymorphic base at the 3' end of one of the amplification primers. Because the extension forms from the 3' end of the primer, a mismatch at or near this position has an inhibitory effect on amplification. Therefore, under appropriate amplification conditions, these primers only direct amplification on their complementary allele. Designing the appropriate allele-specific primer and the corresponding assay conditions are well within the ordinary skill in the art.
  • OLA Oligonucleotide Ligation Assay
  • OLA uses two oligonucleotides which are designed to be capable of hybridizing to abutting sequences of a single strand of a target molecule.
  • One of the oligonucleotides is biotinylated, and the other is detectably labeled. If the precise complementary sequence is found in a target molecule, the oligonucleotides will hybridize such that their termini abut, and create a ligation substrate that can be captured and detected.
  • OLA is capable of detecting biallelic markers and may be advantageously combined with PCR as described by Nickerson D.A. et al. (Proc. Natl. Acad. Sci. USA 87:8923-8927, 1990). In this method, PCR is used to achieve the exponential amplification of target DNA, which is then detected using OLA.
  • LCR ligase chain reaction
  • GLCR Gap LCR
  • LCR uses two pairs of probes to exponentially amplify a specific target. The sequences of each pair of oligonucleotides are selected to permit the pair to hybridize to abutting sequences of the same strand of the target. Such hybridization forms a substrate for a template-dependent ligase.
  • LCR can be performed with oligonucleotides having the proximal and distal sequences of the same strand of a biallelic marker site.
  • either oligonucleotide will be designed to include the biallelic marker site.
  • the reaction conditions are selected such that the oligonucleotides can be ligated together only if the target molecule either contains or lacks the specific nucleotide(s) that is complementary to the biallelic marker on the oligonucleotide.
  • the oligonucleotides will not include the biallelic marker, such that when they hybridize to the target molecule, a "gap" is created as described in WO 90/01069.
  • each single strand has a complement capable of serving as a target during the next cycle and exponential allele-specific amplification of the desired sequence is obtained.
  • Ligase/Polymerase-mediated Genetic Bit Analysis™ is another method for determining the identity of a nucleotide at a preselected site in a nucleic acid molecule (WO 95/21271). This method involves the incorporation of a nucleoside triphosphate that is complementary to the nucleotide present at the preselected site onto the terminus of a primer molecule, and their subsequent ligation to a second oligonucleotide. The reaction is monitored by detecting a specific label attached to the reaction's solid phase or by detection in solution.
  • 4) Hybridization assay methods
  • a preferred method of determining the identity of the nucleotide present at a biallelic marker site involves nucleic acid hybridization.
  • the hybridization probes, which can be conveniently used in such reactions, preferably include the probes defined herein. Any hybridization assay may be used including Southern hybridization, Northern hybridization, dot blot hybridization and solid-phase hybridization (see Sambrook et al., Molecular Cloning - A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989).
  • Hybridization refers to the formation of a duplex structure by two single stranded nucleic acids due to complementary base pairing.
  • Hybridization can occur between exactly complementary nucleic acid strands or between nucleic acid strands that contain minor regions of mismatch.
  • Specific probes can be designed that hybridize to one form of a biallelic marker and not to the other and therefore are able to discriminate between different allelic forms. Allele-specific probes are often used in pairs, one member of a pair showing perfect match to a target sequence containing the original allele and the other showing a perfect match to the target sequence containing the alternative allele.
  • Hybridization conditions should be sufficiently stringent that there is a significant difference in hybridization intensity between alleles, and preferably an essentially binary response, whereby a probe hybridizes to only one of the alleles.
  • Stringent, sequence specific hybridization conditions, under which a probe will hybridize only to the exactly complementary target sequence, are well known in the art (Sambrook et al., Molecular Cloning - A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989). Stringent conditions are sequence dependent and will be different in different circumstances.
  • stringent conditions are selected to be about 5°C lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH.
  • Tm thermal melting point
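To make the "about 5°C below Tm" guideline concrete, the sketch below estimates Tm for a short oligonucleotide and subtracts the 5°C offset. The Tm estimate uses the Wallace rule of thumb (2°C per A/T, 4°C per G/C), which is an assumption brought in for illustration only: it is not the method of the specification, it ignores ionic strength and pH, and it is only a rough guide for short probes. The function names are hypothetical.

```python
def wallace_tm(oligo):
    """Rough Tm estimate in degrees C for a short oligo (Wallace rule; illustrative only)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def stringent_temperature(tm, offset=5):
    """Hybridization temperature chosen about `offset` degrees C below Tm."""
    return tm - offset
```

In practice the Tm would be determined for the specific sequence at the defined ionic strength and pH of the assay, as the text states.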
  • procedures using conditions of high stringency are as follows: Prehybridization of filters containing DNA is carried out for 8 h to overnight at 65°C in buffer composed of 6X SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 0.02% PVP, 0.02% Ficoll, 0.02% BSA, and 500 µg/ml denatured salmon sperm DNA.
  • Filters are hybridized for 48 h at 65°C, the preferred hybridization temperature, in prehybridization mixture containing 100 µg/ml denatured salmon sperm DNA and 5-20 x 10^6 cpm of
  • the hybridization step can be performed at 65°C in the presence of SSC buffer, 1 x SSC corresponding to 0.15 M NaCl and 0.05 M Na citrate. Subsequently, filter washes can be done at 37°C for 1 h in a solution containing 2X SSC, 0.01% PVP, 0.01% Ficoll, and 0.01% BSA, followed by a wash in 0.1X SSC at 50°C for 45 min. Alternatively, filter washes can be performed in a solution containing 2 x SSC and 0.1%
  • the hybridized probes are detectable by autoradiography.
  • procedures using conditions of intermediate stringency are as follows: Filters containing DNA are prehybridized, and then hybridized at a temperature of 60°C in the presence of a 5 x SSC buffer and labeled probe. Subsequently, filter washes are performed in a solution containing 2x SSC at 50°C and the hybridized probes are detectable by autoradiography.
  • although hybridizations can be performed in solution, it is preferred to employ a solid-phase hybridization assay.
  • the target DNA comprising a biallelic marker of the present invention may be amplified prior to the hybridization reaction.
  • the presence of a specific allele in the sample is determined by detecting the presence or the absence of stable hybrid duplexes formed between the probe and the target DNA. The detection of hybrid duplexes can be carried out by a number of methods.
  • Various detection assay formats are well known which utilize detectable labels bound to either the target or the probe to enable detection of the hybrid duplexes. Typically, hybridization duplexes are separated from unhybridized nucleic acids and the labels bound to the duplexes are then detected. Those skilled in the art will recognize that wash steps may be employed to wash away excess target DNA or probe. Standard heterogeneous assay formats are suitable for detecting the hybrids using the labels present on the primers and probes.
  • the TaqMan assay takes advantage of the 5' nuclease activity of Taq DNA polymerase to digest a DNA probe annealed specifically to the accumulating amplification product.
  • TaqMan probes are labeled with a donor-acceptor dye pair that interacts via fluorescence energy transfer. Cleavage of the TaqMan probe by the advancing polymerase during amplification dissociates the donor dye from the quenching acceptor dye, greatly increasing the donor fluorescence. All reagents necessary to detect two allelic variants can be assembled at the beginning of the reaction and the results are monitored in real time (see Livak et al., Nature Genetics, 9:341-342, 1995).
  • molecular beacons are used for allele discriminations. Molecular beacons are hairpin-shaped oligonucleotide probes that report the presence of specific nucleic acids in homogeneous solutions. When they bind to their targets they undergo a conformational reorganization that restores the fluorescence of an internally quenched fluorophore (Tyagi et al., Nature Biotechnology, 16:49-53, 1998).
  • the polynucleotides provided herein can be used in hybridization assays for the detection of biallelic marker alleles in biological samples. These probes are characterized in that they preferably comprise between 8 and 50 nucleotides, and in that they are sufficiently complementary to a sequence comprising a biallelic marker of the present invention to hybridize thereto and preferably sufficiently specific to be able to discriminate the targeted sequence for only one nucleotide variation.
  • the GC content in the probes of the invention usually ranges between 10 and 75%, preferably between 35 and 60%, and more preferably between 40 and 55%.
  • the length of these probes can range from 10, 15, 20, or 30 to at least 100 nucleotides, preferably from 10 to 50, more preferably from 18 to 35 nucleotides.
  • a particularly preferred probe is 25 nucleotides in length.
  • the biallelic marker is within 4 nucleotides of the center of the polynucleotide probe. In particularly preferred probes the biallelic marker is at the center of said polynucleotide.
  • Shorter probes may lack specificity for a target nucleic acid sequence and generally require cooler temperatures to form sufficiently stable hybrid complexes with the template. Longer probes are expensive to produce and can sometimes self-hybridize to form hairpin structures. Methods for the synthesis of oligonucleotide probes have been described above and can be applied to the probes.
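The probe preferences stated above (GC content preferably 35 to 60%, length preferably 18 to 35 nucleotides, marker within 4 nucleotides of the center) can be checked mechanically. The following is a minimal sketch; the function name `probe_profile`, the 0-based `marker_index` and the dictionary keys are hypothetical and chosen only for this example.

```python
def probe_profile(probe, marker_index):
    """Summarize a candidate probe against the preferred ranges quoted in the text."""
    probe = probe.upper()
    gc_percent = 100.0 * (probe.count("G") + probe.count("C")) / len(probe)
    center = (len(probe) - 1) / 2  # central position, 0-based
    offset = abs(marker_index - center)
    return {
        "length": len(probe),
        "gc_percent": gc_percent,
        "marker_offset_from_center": offset,
        "length_preferred": 18 <= len(probe) <= 35,
        "gc_preferred": 35 <= gc_percent <= 60,
        "marker_within_4_of_center": offset <= 4,
    }
```

For a 25-nucleotide probe, a marker at index 12 sits exactly at the center, which corresponds to the particularly preferred arrangement described above.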
  • the probes of the present invention are labeled or immobilized on a solid support. Labels and solid supports are further described in I.
  • Detection probes are generally nucleic acid sequences or uncharged nucleic acid analogs such as, for example, peptide nucleic acids, which are disclosed in International Patent Application WO 92/20702, and morpholino analogs, which are described in U.S. Patents Numbered 5,185,444; 5,034,506 and 5,142,047.
  • the probe may have to be rendered "non-extendable" in that additional dNTPs cannot be added to the probe.
  • nucleic acid probes can be rendered non-extendable by modifying the 3' end of the probe such that the hydroxyl group is no longer capable of participating in elongation.
  • the 3' end of the probe can be functionalized with the capture or detection label to thereby consume or otherwise block the hydroxyl group.
  • the 3' hydroxyl group simply can be cleaved, replaced or modified. U.S. Patent Application Serial No. 07/049,061 filed April 19, 1993 describes modifications which can be used to render a probe non-extendable.
  • the probes of the present invention are useful for a number of purposes. They can be used in Southern hybridization to genomic DNA or Northern hybridization to mRNA. The probes can also be used to detect PCR amplification products. By assaying the hybridization to an allele specific probe, one can detect the presence or absence of a biallelic marker allele in a given sample.
  • Hybridization assays based on oligonucleotide arrays rely on the differences in hybridization stability of short oligonucleotides to perfectly matched and mismatched target sequence variants. Efficient access to polymorphism information is obtained through a basic structure comprising high-density arrays of oligonucleotide probes attached to a solid support (the chip) at selected positions.
  • Each DNA chip can contain thousands to millions of individual synthetic DNA probes arranged in a grid-like pattern and miniaturized to the size of a dime.
  • Chips of various formats for use in detecting biallelic polymorphisms can be produced on a customized basis by Affymetrix (GeneChip™), Hyseq (HyChip and HyGnostics), and Protogene Laboratories. In general, these methods employ arrays of oligonucleotide probes that are complementary to target nucleic acid sequence segments from an individual, which target sequences include a polymorphic marker.
  • EP785280 describes a tiling strategy for the detection of single nucleotide polymorphisms.
  • arrays may generally be "tiled" for a large number of specific polymorphisms.
  • by "tiling" is generally meant the synthesis of a defined set of oligonucleotide probes which is made up of a sequence complementary to the target sequence of interest, as well as preselected variations of that sequence, e.g., substitution of one or more given positions with one or more members of the basis set of monomers, i.e. nucleotides. Tiling strategies are further described in PCT application No. WO 95/11995.
  • arrays are tiled for a number of specific, identified biallelic marker sequences.
  • the array is tiled to include a number of detection blocks, each detection block being specific for a specific biallelic marker or a set of biallelic markers.
  • a detection block may be tiled to include a number of probes, which span the sequence segment that includes a specific polymorphism. To ensure probes that are complementary to each allele, the probes are synthesized in pairs differing at the biallelic marker. In addition to the probes differing at the polymorphic base, monosubstituted probes are also generally tiled within the detection block.
  • These monosubstituted probes have bases at and up to a certain number of bases in either direction from the polymorphism, substituted with the remaining nucleotides (selected from A, T, G, C and U).
  • the probes in a tiled detection block will include substitutions of the sequence positions up to and including those that are 5 bases away from the biallelic marker.
  • the monosubstituted probes provide internal controls for the tiled array, to distinguish actual hybridization from artefactual cross-hybridization. Upon completion of hybridization with the target sequence and washing of the array, the array is scanned to determine the position on the array to which the target sequence hybridizes.
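The composition of a tiled detection block described above can be sketched as follows: one pair of probes differing at the biallelic marker, plus monosubstituted control probes covering the marker and the positions up to 5 bases on either side. This is only an illustration under simplifying assumptions: the function name is hypothetical, probes are DNA only (A, C, G, T), the controls are derived from the first allele's probe, and the probe sequence is supplied already in the orientation to be synthesized.

```python
def tile_detection_block(probe, marker_index, alleles, span=5):
    """Build an allele probe pair and monosubstituted controls around a marker."""
    first, second = alleles
    base = probe[:marker_index] + first + probe[marker_index + 1:]
    alt = probe[:marker_index] + second + probe[marker_index + 1:]
    controls = []
    lo = max(0, marker_index - span)
    hi = min(len(probe) - 1, marker_index + span)
    for pos in range(lo, hi + 1):
        for nt in "ACGT":
            if nt == base[pos]:
                continue  # not a substitution
            if pos == marker_index and nt in alleles:
                continue  # already covered by the allele pair
            controls.append(base[:pos] + nt + base[pos + 1:])
    return {"allele_pair": (base, alt), "monosubstituted": controls}
```

With a marker far enough from both probe ends and `span=5`, this yields 3 substitutions at each of the 10 flanking positions and 2 at the marker itself, that is 32 control probes per block.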
  • Hybridization and scanning may be carried out as described in PCT application No. WO 92/10092 and WO 95/11995 and
  • the chips may comprise an array of nucleic acid sequences of fragments of about 15 nucleotides in length.
  • the chip may comprise an array including at least one of the sequences selected from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and the sequences complementary thereto, or a fragment thereof at least about 8 consecutive nucleotides, preferably 10, 15, 20, more preferably at least 30, 35, 43, 44, 45, 46 or 47 consecutive nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID.
  • the chip may comprise an array of at least 2, 3, 4, 5, 6, 7, 8 or more of these polynucleotides of the invention.
  • Solid supports and polynucleotides of the present invention attached to solid supports are further described in I.
  • 5) Integrated Systems
  • Another technique which may be used to analyze polymorphisms includes multicomponent integrated systems, which miniaturize and compartmentalize processes such as PCR and capillary electrophoresis reactions in a single functional device.
  • An example of such technique is disclosed in US patent 5,589,136, which describes the integration of PCR amplification and capillary electrophoresis in chips.
  • integrated systems can be envisaged mainly when microfluidic systems are used. These systems comprise a pattern of microchannels designed onto a glass, silicon, quartz, or plastic wafer included on a microchip. The movements of the samples are controlled by electric, electroosmotic or hydrostatic forces applied across different areas of the microchip.
  • the microfluidic system may integrate nucleic acid amplification, microsequencing, capillary electrophoresis and a detection method such as laser-induced fluorescence detection.
  • IV. Methods Of Genetic Analysis Using The Biallelic Markers Of The Present Invention
  • the biallelic markers may be used in parametric and non-parametric linkage analysis methods.
  • the biallelic markers of the present invention are used to identify genes associated with detectable traits using association studies, an approach which does not require the use of affected families and which permits the identification of genes associated with complex and sporadic traits.
  • the genetic analysis using the biallelic markers of the present invention may be conducted on any scale.
  • the whole set of biallelic markers of the present invention or any subset of biallelic markers of the present invention may be used.
  • a subset of biallelic markers corresponding to one or several candidate genes may be used.
  • a subset of biallelic markers corresponding to candidate genes from a particular disease pathway may be used.
  • biallelic markers of the present invention may be used.
  • any set of genetic markers including a biallelic marker of the present invention may be used.
  • a set of biallelic polymorphisms that could be used as genetic markers in combination with the biallelic markers of the present invention has been described in WO 98/20165.
  • the biallelic markers of the present invention may be included in any complete or partial genetic map of the human genome.
  • Linkage analysis is based upon establishing a correlation between the transmission of genetic markers and that of a specific trait throughout generations within a family.
  • the aim of linkage analysis is to detect marker loci that show cosegregation with a trait of interest in pedigrees.
  • an advantage of non-parametric methods for linkage analysis is that they do not require specification of the mode of inheritance for the disease; they tend to be more useful for the analysis of complex traits.
  • in non-parametric methods, one tries to prove that the inheritance pattern of a chromosomal region is not consistent with random Mendelian segregation by showing that affected relatives inherit identical copies of the region more often than expected by chance. Affected relatives should show excess "allele sharing" even in the presence of incomplete penetrance and polygenic inheritance.
  • the degree of agreement at a marker locus in two individuals can be measured either by the number of alleles identical by state (IBS) or by the number of alleles identical by descent (IBD).
  • IBS number of alleles identical by state
  • IBD number of alleles identical by descent
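For a single marker locus, the number of alleles identical by state between two individuals can be computed directly from their genotypes, as in the minimal sketch below. The function name and the representation of a genotype as a pair of allele symbols are assumptions for this example. Identity by descent is not computed here because it depends on pedigree information rather than on the two genotypes alone.

```python
from collections import Counter


def ibs_count(genotype_1, genotype_2):
    """Number of alleles (0, 1 or 2) shared identical by state at one locus."""
    shared = Counter(genotype_1) & Counter(genotype_2)  # multiset intersection
    return sum(shared.values())
```

For example, two A/G heterozygotes share 2 alleles by state, an A/A and an A/G individual share 1, and A/A versus G/G share none.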
  • the biallelic markers of the present invention may be used in both parametric and non-parametric linkage analysis.
  • biallelic markers may be used in non-parametric methods which allow the mapping of genes involved in complex traits.
  • the biallelic markers of the present invention may be used in both IBD- and IBS-methods to map genes affecting a complex trait. In such studies, taking advantage of the high density of biallelic markers, several adjacent biallelic marker loci may be pooled to achieve the efficiency attained by multi-allelic markers (Zhao et al., Am. J. Hum. Genet., 63:225-240, 1998).
  • the present invention comprises methods for identifying one or several genes among a set of candidate genes that are associated with a detectable trait using the biallelic markers of the present invention.
  • the present invention comprises methods to detect an association between a biallelic marker allele or a biallelic marker haplotype and a trait.
  • the invention comprises methods to identify a trait causing allele in linkage disequilibrium with any biallelic marker allele of the present invention.
  • the biallelic markers of the present invention are used to perform candidate gene association studies.
  • the biallelic markers of the present invention may be incorporated in any map of genetic markers of the human genome in order to perform genome-wide association studies. Methods to generate a high-density map of biallelic markers have been described in US Provisional Patent application serial number 60/082,614.
  • the biallelic markers of the present invention may further be incorporated in any map of a specific candidate region of the genome (a specific chromosome or a specific chromosomal segment for example).
  • association studies may be conducted within the general population and are not limited to studies performed on related individuals in affected families. Association studies are extremely valuable as they permit the analysis of sporadic or multifactor traits. Moreover, association studies represent a powerful method for fine-scale mapping enabling much finer mapping of trait causing alleles than linkage studies. Studies based on pedigrees often only narrow the location of the trait causing allele.
  • Association studies using the biallelic markers of the present invention can therefore be used to refine the location of a trait causing allele in a candidate region identified by Linkage Analysis methods. Moreover, once a chromosome segment of interest has been identified, the presence of a candidate gene such as a candidate gene of the present invention, in the region of interest can provide a shortcut to the identification of the trait causing allele.
  • Biallelic markers of the present invention can be used to demonstrate that a candidate gene is associated with a trait. Such uses are specifically contemplated in the present invention and claims.
  • 1) Determining the frequency of a biallelic marker allele or of a biallelic marker haplotype in a population
  • Allelic frequencies of the biallelic markers in a population can be determined using one of the methods described above under the heading "Methods for genotyping an individual for biallelic markers", or any genotyping procedure suitable for this intended purpose. Genotyping pooled samples or individual samples can determine the frequency of a biallelic marker allele in a population.
  • One way to reduce the number of genotypings required is to use pooled samples.
  • a major obstacle in using pooled samples is in terms of accuracy and reproducibility for determining accurate DNA concentrations in setting up the pools.
  • Genotyping individual samples provides higher sensitivity, reproducibility and accuracy and is the preferred method used in the present invention.
  • each individual is genotyped separately and simple gene counting is applied to determine the frequency of an allele of a biallelic marker or of a genotype in a given population.
  • Determining the frequency of a haplotype in a population
  • the gametic phase of haplotypes is unknown when diploid individuals are heterozygous at more than one locus. Using genealogical information in families, gametic phase can sometimes be inferred (Perlin et al., Am. J. Hum. Genet., 55:777-787, 1994). When no genealogical information is available, different strategies may be used.
  • One possibility is that the multiple-site heterozygous diploids can be eliminated from the analysis, keeping only the homozygotes and the single-site heterozygote individuals, but this approach might lead to a possible bias in the sample composition and the underestimation of low-frequency haplotypes.
  • Another possibility is that single chromosomes can be studied independently, for example, by asymmetric PCR amplification (see Newton et al., Nucleic Acids Res., 17:2503-2516, 1989; Wu et al., Proc. Natl. Acad. Sci. USA, 86:2757, 1989) or by isolation of single chromosome by limit dilution followed
  • the principle is to start filling a preliminary list of haplotypes present in the sample by examining unambiguous individuals, that is, the complete homozygotes and the single-site heterozygotes. Then other individuals in the same sample are screened for the possible occurrence of previously recognised haplotypes. For each positive identification, the complementary haplotype is added to the list of recognised haplotypes, until the phase information for all individuals is either resolved or identified as unresolved. This method assigns a single haplotype to each multiheterozygous individual, whereas several haplotypes are possible when there is more than one heterozygous site.
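The list-filling procedure described above can be sketched in a few lines. This is an illustrative reading of the text, not a reference implementation: the function names are hypothetical, each genotype is assumed to be a list of two-allele tuples (one tuple per site), and the first compatible known haplotype is used, so the result can depend on the order in which individuals and haplotypes are examined.

```python
def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if incompatible."""
    other = []
    for (a, b), h in zip(genotype, hap):
        if h == a:
            other.append(b)
        elif h == b:
            other.append(a)
        else:
            return None
    return "".join(other)


def phase_by_known_haplotypes(genotypes):
    """Return ({index: (hap1, hap2)}, [unresolved indices])."""
    known, phased = [], {}
    # Step 1: unambiguous individuals (homozygotes and single-site heterozygotes).
    for idx, g in enumerate(genotypes):
        if sum(a != b for a, b in g) <= 1:
            pair = ("".join(a for a, _ in g), "".join(b for _, b in g))
            phased[idx] = pair
            known.extend(h for h in pair if h not in known)
    # Step 2: screen remaining individuals for already recognised haplotypes.
    progress = True
    while progress:
        progress = False
        for idx, g in enumerate(genotypes):
            if idx in phased:
                continue
            for h in known:
                c = complement(g, h)
                if c is not None:
                    phased[idx] = (h, c)
                    if c not in known:
                        known.append(c)
                    progress = True
                    break
    unresolved = [i for i in range(len(genotypes)) if i not in phased]
    return phased, unresolved
```

Individuals whose genotype matches no recognised haplotype remain in the unresolved list, mirroring the "resolved or identified as unresolved" outcome in the text.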
  • an EM algorithm (Dempster et al., J. R. Stat. Soc., 39B:1-38, 1977) leading to maximum-likelihood estimates of haplotype frequencies under the assumption of Hardy-Weinberg proportions (random mating) is used (see Excoffier L. and Slatkin M., Mol. Biol. Evol., 12(5):921-927, 1995).
  • the EM algorithm is a generalised iterative maximum-likelihood approach to estimation that is useful when data are ambiguous and/or incomplete.
  • the EM algorithm is used to resolve heterozygotes into haplotypes. Haplotype estimations are further described below under the heading "Statistical methods". Any other method known in the art to determine or to estimate the frequency of a haplotype in a population may also be used.
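A compact sketch of EM-based haplotype frequency estimation under Hardy-Weinberg proportions is given below. It is an illustration only: the function names, the genotype representation (a list of two-allele tuples per individual), the uniform starting frequencies and the fixed iteration count are all assumptions of this example, and the exhaustive enumeration of phases is practical only for a small number of heterozygous sites.

```python
from collections import Counter
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # The first heterozygous site is kept fixed so each pair is generated once.
    for choice in product((0, 1), repeat=max(len(het) - 1, 0)):
        flips = dict(zip(het[1:], choice))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flips.get(i):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, iterations=100):
    """Maximum-likelihood haplotype frequencies estimated by EM."""
    options = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for opt in options for pair in opt for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(iterations):
        counts = Counter()
        for opt in options:
            # E-step: probability of each phase given current frequencies.
            w = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in opt]
            total = sum(w)
            for (a, b), wi in zip(opt, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq
```

In a sample where the AC haplotype is already common among unambiguous individuals, a double heterozygote is resolved mostly as AC/GT rather than AT/GC, which is the behaviour the text describes as resolving heterozygotes into haplotypes.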
  • Linkage disequilibrium is the non-random association of alleles at two or more loci and represents a powerful tool for mapping genes involved in disease traits (see Ajioka R.S. et al., Am. J. Hum. Genet., 60:1439-1447, 1997).
  • Biallelic markers, because they are densely spaced in the human genome and can be genotyped in greater numbers than other types of genetic markers (such as RFLP or VNTR markers), are particularly useful in genetic analysis based on linkage disequilibrium.
  • the biallelic markers of the present invention may be used in any linkage disequilibrium analysis method known in the art.
  • the pattern or curve of disequilibrium between disease and marker loci is expected to exhibit a maximum that occurs at the disease locus. Consequently, the amount of linkage disequilibrium between a disease allele and closely linked genetic markers may yield valuable information regarding the location of the disease gene.
  • For fine-scale mapping of a disease locus it is useful to have some knowledge of the patterns of linkage disequilibrium that exist between markers in the studied region. As mentioned above the mapping resolution achieved through the analysis of linkage disequilibrium is much higher than that of linkage studies. The high density of biallelic markers combined with linkage disequilibrium analysis provides powerful tools for fine-scale mapping. Different methods to calculate linkage disequilibrium are described below under the heading "Statistical Methods".
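As a generic illustration of how pairwise linkage disequilibrium between two biallelic markers is commonly quantified, the sketch below computes the coefficient D, its normalized form D' and the squared correlation r² from a haplotype frequency and the two allele frequencies. These are standard textbook measures supplied here as an assumption; they are not necessarily the calculations the specification sets out under "Statistical Methods", and the function name is hypothetical.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, D', r^2) for alleles a and b at two biallelic loci.

    p_ab: frequency of the haplotype carrying both a and b
    p_a, p_b: frequencies of allele a and allele b
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

The haplotype frequency `p_ab` would typically come from phased data or from an estimate such as the EM procedure mentioned in the text.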
  • if a specific allele in a given gene is directly involved in causing a particular trait, its frequency will be statistically increased in an affected (trait positive) population, when compared to the frequency in a trait negative population or in a random control population.
  • the frequency of all other alleles present in the haplotype carrying the trait-causing allele will also be increased in trait positive individuals compared to trait negative individuals or random controls. Therefore, association between the trait and any allele (specifically a biallelic marker allele) in linkage disequilibrium with the trait-causing allele will suffice to suggest the presence of a trait-related gene in that particular region.
  • Case-control populations can be genotyped for biallelic markers to identify associations that narrowly locate a trait causing allele, as any marker in linkage disequilibrium with one given marker associated with a trait will be associated with the trait. Linkage disequilibrium allows the relative frequencies in case-control populations of a limited number of genetic polymorphisms (specifically biallelic markers) to be analysed as an alternative to screening all possible functional polymorphisms in order to find trait-causing alleles. Association studies compare the frequency of marker alleles in unrelated case-control populations, and represent powerful tools for the dissection of complex traits.
  • Population-based association studies do not concern familial inheritance but compare the prevalence of a particular genetic marker, or a set of markers, in case-control populations. They are case-control studies based on comparison of unrelated case (affected or trait positive) individuals and unrelated control (unaffected or trait negative or random) individuals.
  • the control group is composed of unaffected or trait negative individuals.
  • the control group is ethnically matched to the case population.
  • the control group is preferably matched to the case population for the main known confusion factor for the trait under study (for example age-matched for an age-dependent trait).
  • individuals in the two samples are paired in such a way that they are expected to differ only in their disease status.
  • "trait positive population", "case population" and "affected population" are used interchangeably.
  • An important step in the dissection of complex traits using association studies is the choice of case-control populations (see Lander and Schork, Science, 265, 2037-2048, 1994).
  • a major step in the choice of case-control populations is the clinical definition of a given trait or phenotype. Any genetic trait may be analysed by the association method proposed here by carefully selecting the individuals to be included in the trait positive and trait negative phenotypic groups. Four criteria are often useful: clinical phenotype, age at onset, family history and severity.
  • case-control populations consist of phenotypically homogeneous populations.
  • trait positive and trait negative populations consist of phenotypically uniform populations of individuals representing each between 1 and 98%, preferably between 1 and 80%, more preferably between 1 and 50%, and more preferably between 1 and 30%, most preferably between 1 and
  • the general strategy to perform association studies using biallelic markers derived from a region carrying a candidate gene is to scan two groups of individuals (case-control populations) in order to measure and statistically compare the allele frequencies of the biallelic markers of the present invention in both groups. If a statistically significant association with a trait is identified for at least one of the analysed biallelic markers, one can assume that either the associated allele is directly responsible for causing the trait (the associated allele is the trait-causing allele), or, more likely, the associated allele is in linkage disequilibrium with the trait-causing allele.
  • the specific characteristics of the associated allele with respect to the candidate gene function usually give further insight into the relationship between the associated allele and the trait (causal or in linkage disequilibrium). If the evidence indicates that the associated allele within the candidate gene is most probably not the trait-causing allele but is in linkage disequilibrium with the real trait-causing allele, then the trait-causing allele can be found by sequencing the vicinity of the associated marker. Association studies are usually run in two successive steps.
  • In a first phase, the frequencies of a reduced number of biallelic markers from one or several candidate genes are determined in the trait positive and trait negative populations.
  • In a second phase of the analysis, the identity of the candidate gene and the position of the genetic loci responsible for the given trait are further refined using a higher density of markers from the relevant region.
  • If the candidate gene under study is relatively small in length, as is the case for many of the candidate genes analysed in the present invention, a single phase may be sufficient to establish significant associations.
  • Haplotype analysis
  • the mutant allele necessarily resides on a chromosome having a set of linked markers: the ancestral haplotype.
  • This haplotype can be tracked through populations and its statistical association with a given trait can be analysed.
  • Complementing single point (allelic) association studies with multi-point association studies, also called haplotype studies, increases the statistical power of association studies.
  • A haplotype association study allows one to define the frequency and the type of the ancestral carrier haplotype.
  • a haplotype analysis is important in that it increases the statistical power of an analysis involving individual markers.
  • In a haplotype frequency analysis, the frequency of the possible haplotypes based on various combinations of the identified biallelic markers of the invention is determined.
  • the haplotype frequency is then compared for distinct populations of trait positive and control individuals.
  • the number of trait positive individuals that should be subjected to this analysis to obtain statistically significant results usually ranges between 30 and 300, with a preferred number of individuals ranging between 50 and 150. The same considerations apply to the number of unaffected individuals (or random control) used in the study.
  • the results of this first analysis provide haplotype frequencies in case-control populations; for each evaluated haplotype frequency, a p-value and an odds ratio are calculated.
  • the biallelic markers of the present invention may also be used to identify patterns of biallelic markers associated with detectable traits resulting from polygenic interactions.
  • the analysis of genetic interaction between alleles at unlinked loci requires individual genotyping using the techniques described herein.
  • allelic interaction among a selected set of biallelic markers with appropriate level of statistical significance can be considered as a haplotype analysis.
  • Interaction analysis consists in stratifying the case-control populations with respect to a given haplotype for the first loci and performing a haplotype analysis with the second loci with each subpopulation.
  • the biallelic markers of the present invention may further be used in TDT (transmission/disequilibrium test).
  • TDT tests for both linkage and association and is not affected by population stratification. TDT requires data for affected individuals and their parents, or data from unaffected sibs instead of from parents (see Spielmann S. et al., Am. J. Hum. Genet., 52:506-516, 1993; Schaid D.J. et al., Genet. Epidemiol., 13:423-450, 1996; Spielmann S. and Ewens W.J., Am. J. Hum. Genet., 62:450-458, 1998).
  • Such combined tests generally reduce the false-positive errors produced by separate analyses.
  • IV.C. Statistical methods
  • any method known in the art to test whether a trait and a genotype show a statistically significant correlation may be used.
  • haplotype frequencies can be estimated from the multilocus genotypic data. Any method known to a person skilled in the art can be used to estimate haplotype frequencies (see Lange K., Mathematical and Statistical Methods for Genetic Analysis, Springer, New York, 1997; Weir B.S., Genetic Data Analysis II.
  • maximum-likelihood haplotype frequencies are computed using an Expectation-Maximization (EM) algorithm (see Dempster et al., J. R. Stat. Soc., 39B:1-38, 1977; Excoffier L. and Slatkin M., Mol. Biol.
  • EM: Expectation-Maximization
  • This procedure is an iterative process aiming at obtaining maximum-likelihood estimates of haplotype frequencies from multi-locus genotype data when the gametic phase is unknown.
  • Haplotype estimations are usually performed by applying the EM algorithm using for example the EM-HAPLO program (Hawley M.E. et al., Am. J. Phys. Anthropol., 18:104, 1994) or the Arlequin program (Schneider et al., Arlequin: a software for population genetics data analysis, University of Geneva, 1997).
  • the EM algorithm is a generalised iterative maximum likelihood approach to estimation and is briefly described below.
  • phenotypes will refer to multi-locus genotypes with unknown phase. Genotypes will refer to known-phase multi-locus genotypes.
  • a stop criterion can be that the maximum difference between haplotype frequencies between two iterations is less than $10^{-7}$. These values can be adjusted according to the desired precision of estimations.
  • Equation 3, where genotype $i$ occurs in phenotype $j$, and where $h_k$ and $h_l$ constitute genotype $i$. Each probability is derived according to Eq. 1 and Eq. 2 described above.
  • $p_t^{(s+1)} = \frac{1}{2} \sum_{j} \sum_{i} \delta_{it}\, \mathrm{pr}(\text{genotype}_i)^{(s)}$ (Equation 4), where $\delta_{it}$ is an indicator variable which counts the number of times haplotype $t$ occurs in genotype $i$. It takes the values of 0, 1 or 2.
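  • The iteration described above can be sketched for the simplest case of two biallelic markers, where only double heterozygotes have an unknown phase. This is a minimal illustration under Hardy-Weinberg assumptions, not the EM-HAPLO or Arlequin implementation; the function name `em_haplotype_frequencies`, the genotype encoding and the uniform starting frequencies are assumptions made for the example.

```python
from collections import Counter

# Haplotypes are (allele at marker 1, allele at marker 2); 1 = allele a, 0 = allele b.
HAPLOTYPES = [(1, 1), (1, 0), (0, 1), (0, 0)]


def em_haplotype_frequencies(genotypes, tol=1e-7, max_iter=1000):
    """Estimate two-marker haplotype frequencies from unphased genotypes.

    Each genotype is (copies of allele a at marker 1, copies at marker 2),
    each entry in {0, 1, 2}. Hypothetical helper, for illustration only.
    """
    counts = Counter(genotypes)
    n_individuals = sum(counts.values())
    if n_individuals == 0:
        raise ValueError("at least one genotype is required")
    freqs = {h: 0.25 for h in HAPLOTYPES}  # assumed starting values

    for _ in range(max_iter):
        expected = {h: 0.0 for h in HAPLOTYPES}
        for (g1, g2), count in counts.items():
            if (g1, g2) == (1, 1):
                # E-step: split double heterozygotes between the two
                # possible phases in proportion to their probabilities.
                cis = freqs[(1, 1)] * freqs[(0, 0)]
                trans = freqs[(1, 0)] * freqs[(0, 1)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                expected[(1, 1)] += count * w
                expected[(0, 0)] += count * w
                expected[(1, 0)] += count * (1 - w)
                expected[(0, 1)] += count * (1 - w)
            else:
                # Phase is unambiguous: both haplotypes are determined.
                expected[(int(g1 >= 1), int(g2 >= 1))] += count
                expected[(int(g1 == 2), int(g2 == 2))] += count
        # M-step: haplotype counting over 2N chromosomes.
        updated = {h: expected[h] / (2 * n_individuals) for h in HAPLOTYPES}
        change = max(abs(updated[h] - freqs[h]) for h in HAPLOTYPES)
        freqs = updated
        if change < tol:  # stop criterion on the largest frequency change
            break
    return freqs
```

  • The default `tol=1e-7` mirrors the stop criterion mentioned above; with more than two markers the same scheme applies, but every phenotype must be expanded into all of its compatible genotypes.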
  • In practice, linkage disequilibrium between any two genetic positions is measured by applying a statistical association test to haplotype data taken from a population.
  • Linkage disequilibrium between any pair of biallelic markers comprising at least one of the biallelic markers of the present invention ($M_i$, $M_j$) having alleles ($a_i/b_i$) at marker $M_i$ and alleles ($a_j/b_j$) at marker $M_j$ can be calculated for every allele combination ($a_i,a_j$; $a_i,b_j$; $b_i,a_j$ and $b_i,b_j$).
  • Linkage disequilibrium (LD) between pairs of biallelic markers ($M_i$, $M_j$) can also be calculated for every allele combination ($a_i,a_j$; $a_i,b_j$; $b_i,a_j$ and $b_i,b_j$), according to the maximum-likelihood estimate (MLE) for delta (the composite genotypic disequilibrium coefficient), as described by Weir (Weir B.S., Genetic Data Analysis, Sinauer Ass. Eds, 1996).
  • MLE: maximum-likelihood estimate
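  • $n_1 = \sum$ phenotype ($a_i/a_i$, $a_j/a_j$), $n_2 = \sum$ phenotype ($a_i/a_i$, $a_j/b_j$), $n_3 = \sum$ phenotype ($a_i/b_i$, $a_j/a_j$),
  • $n_4 = \sum$ phenotype ($a_i/b_i$, $a_j/b_j$) and $N$ is the number of individuals in the sample.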
  • This formula allows linkage disequilibrium between alleles to be estimated when only genotype, and not haplotype, data are available.
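  • As an illustration of estimating disequilibrium from genotype data alone, the sketch below computes a composite genotypic disequilibrium coefficient in the form commonly attributed to Weir, delta = (2·n1 + n2 + n3 + n4/2)/N − 2·pr(a_i)·pr(a_j). The formula is not reproduced in the text here, so this form, the function name `composite_ld` and the genotype-count encoding are assumptions.

```python
def composite_ld(genotype_counts):
    """Composite genotypic disequilibrium for allele pair (a_i, a_j).

    genotype_counts maps (copies of a_i, copies of a_j) -> number of
    individuals, each copy number in {0, 1, 2}. Hypothetical helper.
    """
    n_individuals = sum(genotype_counts.values())
    if n_individuals == 0:
        raise ValueError("at least one individual is required")

    n1 = genotype_counts.get((2, 2), 0)  # a_i/a_i, a_j/a_j
    n2 = genotype_counts.get((2, 1), 0)  # a_i/a_i, a_j/b_j
    n3 = genotype_counts.get((1, 2), 0)  # a_i/b_i, a_j/a_j
    n4 = genotype_counts.get((1, 1), 0)  # a_i/b_i, a_j/b_j

    # Allele frequencies by counting copies over 2N chromosomes.
    p_ai = sum(g[0] * c for g, c in genotype_counts.items()) / (2 * n_individuals)
    p_aj = sum(g[1] * c for g, c in genotype_counts.items()) / (2 * n_individuals)

    return (2 * n1 + n2 + n3 + n4 / 2) / n_individuals - 2 * p_ai * p_aj
```

  • No phase information is needed: only the counts of the four genotype classes that carry both a_i and a_j enter the estimate.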
  • Another means of calculating the linkage disequilibrium between markers is as follows. For a pair of biallelic markers, $M_i$ ($a_i/b_i$) and $M_j$ ($a_j/b_j$), fitting the Hardy-Weinberg equilibrium, one can estimate the four possible haplotype frequencies in a given population according to the approach described above.
  • $D_{a_i a_j} = \mathrm{pr}(\text{haplotype}(a_i, a_j)) - \mathrm{pr}(a_i) \cdot \mathrm{pr}(a_j)$.
  • $\mathrm{pr}(a_i)$ is the probability of allele $a_i$
  • $\mathrm{pr}(a_j)$ is the probability of allele $a_j$
  • $\mathrm{pr}(\text{haplotype}(a_i, a_j))$ is estimated as in Equation 3 above.
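  • A short sketch of the gametic disequilibrium calculation, D = pr(haplotype(a_i, a_j)) − pr(a_i)·pr(a_j), starting from four estimated haplotype frequencies. The function name `gametic_disequilibrium` and the dictionary keys are hypothetical labels chosen for the example.

```python
def gametic_disequilibrium(hap_freqs):
    """Return D for alleles a_i and a_j from four haplotype frequencies.

    hap_freqs uses the keys "ai_aj", "ai_bj", "bi_aj" and "bi_bj"
    (hypothetical labels); the values are expected to sum to 1.
    """
    # Allele frequencies are the marginal sums of haplotype frequencies.
    p_ai = hap_freqs["ai_aj"] + hap_freqs["ai_bj"]
    p_aj = hap_freqs["ai_aj"] + hap_freqs["bi_aj"]
    return hap_freqs["ai_aj"] - p_ai * p_aj
```

  • D is zero when the two markers are in equilibrium and becomes positive when a_i and a_j occur together on the same chromosome more often than expected by chance.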
  • Linkage disequilibrium among a set of biallelic markers having an adequate heterozygosity rate can be determined by genotyping between 50 and 1000 unrelated individuals, preferably between 75 and 200, more preferably around 100.
  • The statistical significance of a correlation between a phenotype and a genotype may be determined by any statistical test known in the art and with any accepted threshold of statistical significance being required. The application of particular methods and thresholds of significance is well within the skill of the ordinary practitioner of the art. Testing for association is performed by determining the frequency of a biallelic marker allele in case and control populations and comparing these frequencies with a statistical test to determine if there is a statistically significant difference in frequency which would indicate a correlation between the trait and the biallelic marker allele under study.
  • a haplotype analysis is performed by estimating the frequencies of all possible haplotypes for a given set of biallelic markers in case and control populations, and comparing these frequencies with a statistical test to determine if there is a statistically significant correlation between the haplotype and the phenotype (trait) under study.
  • Any statistical tool useful to test for a statistically significant association between a genotype and a phenotype may be used.
  • the statistical test employed is a chi-square test with one degree of freedom. A p-value is calculated (the p-value is the probability that a statistic as large or larger than the observed one would occur by chance).
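  • A chi-square test with one degree of freedom on allele counts can be sketched as follows. This is a minimal example assuming a 2 x 2 table of allele counts in cases and controls; the function name `chi_square_1df` and its argument names are hypothetical. The p-value uses the identity that, for one degree of freedom, the upper tail probability equals erfc(sqrt(x/2)).

```python
import math


def chi_square_1df(case_a, case_b, control_a, control_b):
    """Pearson chi-square (1 df) on a 2 x 2 table of allele counts.

    Returns (statistic, p_value). Hypothetical helper, illustration only.
    """
    n = case_a + case_b + control_a + control_b
    denominator = (
        (case_a + case_b)
        * (control_a + control_b)
        * (case_a + control_a)
        * (case_b + control_b)
    )
    if denominator == 0:
        raise ValueError("every row and column total must be non-zero")

    statistic = n * (case_a * control_b - case_b * control_a) ** 2 / denominator
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

  • For example, 60 versus 40 copies of an allele in cases against 40 versus 60 in controls gives a statistic of 8 and a p-value just under 0.005.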
  • the p value related to a biallelic marker association is preferably about $1 \times 10^{-2}$ or less, more preferably about $1 \times 10^{-4}$ or less, for a single biallelic marker analysis and about $1 \times 10^{-3}$ or less, still more preferably $1 \times 10^{-6}$ or less and most preferably of about $1 \times 10^{-8}$ or less, for a haplotype analysis involving several markers.
  • Phenotypic permutation
  • In order to confirm the statistical significance of the first stage haplotype analysis described above, it might be suitable to perform further analyses in which genotyping data from case-control individuals are pooled and randomised with respect to the trait phenotype. Each individual's genotyping data are randomly allocated to two groups, which contain the same number of individuals as the case-control populations used to compile the data obtained in the first stage. A second stage haplotype analysis is preferably run on these artificial groups, preferably for the markers included in the haplotype of the first stage analysis showing the highest relative risk coefficient. This experiment is reiterated preferably at least between 100 and 10000 times. The repeated iterations allow the determination of the percentage of obtained haplotypes with a significant p-value level.
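  • The randomisation step can be sketched as a generic permutation test. This is a simplified illustration, not the procedure of the source: it pools per-individual values (here, 1 if an individual carries a given haplotype and 0 otherwise), reallocates them at random to two groups of the original sizes, and reports the fraction of permutations whose statistic is at least as large as the observed one. The names `permutation_p_value` and `carrier_frequency_difference` are hypothetical.

```python
import random


def carrier_frequency_difference(cases, controls):
    """Absolute difference in haplotype carrier frequency between groups."""
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(cases, controls, statistic, n_iter=1000, seed=0):
    """Empirical p-value from random reallocation of trait labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)

    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        # Artificial groups keep the original case and control sizes.
        if statistic(pooled[:n_cases], pooled[n_cases:]) >= observed:
            hits += 1
    return hits / n_iter
```

  • A call such as `permutation_p_value(cases, controls, carrier_frequency_difference, n_iter=10000)` corresponds to the upper end of the iteration range mentioned above.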
  • In genetic epidemiology, the risk factor is the presence or the absence of a certain allele or haplotype at marker loci.
  • RR: relative risk
  • $F^+$ is the frequency of the exposure to the risk factor in cases and $F^-$ is the frequency of the exposure to the risk factor in controls.
  • $F^+$ and $F^-$ are calculated using the allelic or haplotype frequencies of the study and further depend on the underlying genetic model (dominant, recessive, additive).
  • AR: attributable risk
  • This measure is important in quantitating the role of a specific factor in disease etiology and in terms of the public health impact of a risk factor.
  • AR is the risk attributable to a biallelic marker allele or a biallelic marker haplotype.
  • $P_E$ is the frequency of exposure to an allele or a haplotype within the population at large; and RR is the relative risk, which is approximated with the odds ratio when the trait under study has a relatively low incidence in the general population.
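  • The two risk measures can be sketched numerically. The formulas themselves are not reproduced in the text here, so the example assumes the standard epidemiological definitions: the odds ratio OR = [F+/(1 − F+)] / [F−/(1 − F−)] as the approximation of RR, and the population attributable risk AR = P_E·(RR − 1) / (P_E·(RR − 1) + 1). The function names are hypothetical.

```python
def odds_ratio(f_plus, f_minus):
    """Odds ratio from exposure frequencies in cases (F+) and controls (F-).

    Both frequencies must lie strictly between 0 and 1.
    """
    if not (0 < f_plus < 1 and 0 < f_minus < 1):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    return (f_plus / (1 - f_plus)) / (f_minus / (1 - f_minus))


def attributable_risk(p_e, rr):
    """Population attributable risk for exposure frequency P_E and risk RR."""
    excess = p_e * (rr - 1)
    return excess / (excess + 1)
```

  • For instance, with an exposure frequency of 0.5 in cases and 0.25 in controls the odds ratio is 3, and with P_E = 0.2 this gives an attributable risk of about 0.29.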
  • IV.F Identification Of Biallelic Markers In Linkage Disequilibrium With The Biallelic
  • any marker in linkage disequilibrium with a first marker associated with a trait will be associated with the trait. Therefore, once an association has been demonstrated between a given biallelic marker and a trait, the discovery of additional biallelic markers associated with this trait is of great interest in order to increase the density of biallelic markers in this particular region. The causal gene or mutation will be found in the vicinity of the marker or set of markers showing the highest correlation with the trait.
  • Identification of additional markers in linkage disequilibrium with a given marker involves: (a) amplifying a genomic fragment comprising a first biallelic marker from a plurality of individuals; (b) identifying second biallelic markers in the genomic region harboring said first biallelic marker; (c) conducting a linkage disequilibrium analysis between said first biallelic marker and second biallelic markers; and (d) selecting said second biallelic markers as being in linkage disequilibrium with said first marker. Subcombinations comprising steps (b) and (c) are also contemplated.
  • biallelic markers are described herein and can be carried out by the skilled person without undue experimentation.
  • the present invention then also concerns biallelic markers which are in linkage disequilibrium with any of the specific biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and which are expected to present similar characteristics in terms of their respective association with a given trait.
  • Example 5 illustrates the measurement of linkage disequilibrium between a publicly known biallelic marker, the "ApoE Site A", located within the Alzheimer's related ApoE gene, and other biallelic markers randomly derived from the genomic region containing the ApoE gene.
  • the associated candidate gene can be scanned for mutations by comparing the sequences of a selected number of trait positive and trait negative individuals.
  • functional regions such as exons and splice sites, promoters and other regulatory regions of the candidate gene are scanned for mutations
  • trait positive individuals carry the haplotype shown to be associated with the trait and trait negative individuals do not carry the haplotype or allele associated with the trait
  • the mutation detection procedure is essentially similar to that used for biallelic site identification
  • the method used to detect such mutations generally comprises the following steps: (a) amplification of a region of the candidate gene comprising a biallelic marker or a group of biallelic markers associated with the trait from DNA samples of trait positive patients and trait negative controls; (b) sequencing of the amplified region; (c) comparison of DNA sequences from trait-positive patients and trait-negative controls; and (d) determination of mutations specific to trait-positive patients. Subcombinations which comprise steps (b) and (c) are specifically contemplated.
  • candidate polymorphisms be then verified by screening a larger population of cases and controls by means of any genotyping procedure such as those described herein, preferably using a microsequencing technique in an individual test format.
  • Polymorphisms are considered as candidate mutations when present in cases and controls at frequencies compatible with the expected association results.
  • V. Biallelic Markers Of The Invention In Methods Of Genetic Diagnostics
  • the biallelic markers of the present invention can also be used to develop diagnostic tests capable of identifying individuals who express a detectable trait as the result of a specific genotype or individuals whose genotype places them at risk of developing a detectable trait at a subsequent time.
  • the trait analyzed using the present diagnostics may be any detectable trait, including a disease, a response to an agent acting on a disease, or side effects to an agent acting on a disease
  • the diagnostic techniques of the present invention may employ a variety of methodologies to determine whether a test subject has a biallelic marker pattern associated with an increased risk of developing a detectable trait or whether the individual suffers from a detectable trait as a result of a particular mutation, including methods which enable the analysis of individual chromosomes for haplotyping, such as family studies, single sperm DNA analysis or somatic hybrids.
  • the present invention provides diagnostic methods to determine whether an individual is at risk of developing a disease or suffers from a disease resulting from a mutation
  • These methods involve obtaining a nucleic acid sample from the individual and determining whether the nucleic acid sample contains at least one allele or at least one biallelic marker haplotype indicative of a risk of developing the trait or indicative that the individual expresses the trait as a result of possessing a particular candidate gene polymorphism or mutation (trait-causing allele).
  • a nucleic acid sample is obtained from the individual and this sample is genotyped using methods described above in IE.
  • the diagnostics may be based on a single biallelic marker or on a group of biallelic markers.
  • a nucleic acid sample is obtained from the test subject and the biallelic marker pattern of one or more of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 is determined.
  • a PCR amplification is conducted on the nucleic acid sample to amplify regions in which polymorphisms associated with a detectable phenotype have been identified.
  • the amplification products are sequenced to determine whether the individual possesses one or more polymorphisms associated with a detectable phenotype.
  • the primers used to generate amplification products may comprise the primers of SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
  • the nucleic acid sample is subjected to microsequencing reactions as described above to determine whether the individual possesses one or more polymorphisms associated with a detectable phenotype resulting from a mutation or a polymorphism in a candidate gene.
  • the nucleic acid sample is contacted with one or more allele specific oligonucleotide probes which specifically hybridize to one or more candidate gene alleles associated with a detectable phenotype.
  • Diagnostics which analyze and predict response to a drug or side effects to a drug may be used to determine whether an individual should be treated with a particular drug. For example, if the diagnostic indicates a likelihood that an individual will respond positively to treatment with a particular drug, the drug may be administered to the individual. Conversely, if the diagnostic indicates that an individual is likely to respond negatively to treatment with a particular drug, an alternative course of treatment may be prescribed. A negative response may be defined as either the absence of an efficacious response or the presence of toxic side effects.
  • Clinical drug trials represent another application for the markers of the present invention.
  • One or more markers indicative of response to an agent acting on a disease or to side effects to an agent acting on a disease may be identified using the methods described above. Thereafter, potential participants in clinical trials of such an agent may be screened to identify those individuals most likely to respond favorably to the drug and exclude those likely to experience side effects. In that way, the effectiveness of drug treatment may be measured in individuals who respond positively to the drug, without lowering the measurement as a result of the inclusion of individuals who are unlikely to respond positively in the study and without risking undesirable safety problems.
  • a computer-based system may support the on-line coordination between the identification of biallelic markers and the corresponding analysis of their frequency in the different groups.
  • 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 encompasses the nucleotide sequences of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, fragments of SEQ ID NOs.
  • nucleotide sequences comprising, consisting essentially of, or consisting of any one of the following: a) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
  • "nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773" further encompass nucleotide sequences homologous to: a) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos.
  • Homologous sequences refer to a sequence having at least 99%, 98%, 97%, 96%, 95%,
  • nucleic acid codes of the invention can be represented in the traditional single character format (See the inside back cover of Stryer, Lubert. Biochemistry, 3rd edition. W. H. Freeman & Co., New York.) or in any other format or code which records the identity of the nucleotides in a sequence.
  • nucleic acid codes of the invention further encompass all of the polynucleotides disclosed, described or claimed in the present application.
  • the present invention specifically contemplates computer readable media and computer systems wherein such codes are stored individually or in any combination.
  • nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 can be stored, recorded, and manipulated on any medium which can be read and accessed by a computer.
  • the words "recorded" and "stored" refer to a process for storing information on a computer medium.
  • a skilled artisan can readily adopt any of the presently known methods for recording information on a computer readable medium to generate embodiments comprising one or more of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842,
  • a particularly preferred embodiment of the present invention is a computer readable medium having recorded thereon at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
  • Computer readable media include magnetically readable media, optically readable media, electronically readable media and magnetic/optical media.
  • the computer readable media may be a hard disk, a floppy disk, a magnetic tape, CD-ROM, Digital Versatile Disk (DVD), Random Access Memory (RAM), or Read Only Memory (ROM) as well as other types of media known to those skilled in the art.
  • Embodiments of the present invention include systems, particularly computer systems which store and manipulate the sequence information described herein.
  • a computer system 100 is illustrated in block diagram form in Figure 14
  • a computer system refers to the hardware components, software components, and data storage components used to analyze the nucleotide sequences of the nucleic acid codes of SEQ ID NOs.
  • the computer system 100 is a Sun Enterprise 1000 server (Sun Microsystems, Palo Alto, CA).
  • the computer system 100 preferably includes a processor for processing, accessing and manipulating the sequence data
  • the processor 105 can be any well-known type of central processing unit, such as the Pentium III from Intel Corporation, or similar processor from Sun, Motorola, Compaq or International Business Machines.
  • the computer system 100 is a general purpose system that comprises the processor 105 and one or more internal data storage components 110 for storing data, and one or more data retrieving devices for retrieving the data stored on the data storage components.
  • a skilled artisan can readily appreciate that any one of the currently available computer systems is suitable.
  • the computer system 100 includes a processor 105 connected to a bus which is connected to a main memory 115 (preferably implemented as RAM) and one or more internal data storage devices 110, such as a hard drive and/or other computer readable media having data recorded thereon.
  • the computer system 100 further includes one or more data retrieving devices 118 for reading the data stored on the internal data storage devices 110.
  • the data retrieving device 118 may represent, for example, a floppy disk drive, a compact disk drive, a magnetic tape drive, etc.
  • the internal data storage device 110 is a removable computer readable medium such as a floppy disk, a compact disk, a magnetic tape, etc. containing control logic and/or data recorded thereon.
  • the computer system 100 may advantageously include or be programmed by appropriate software for reading the control logic and/or the data from the data storage component once inserted in the data retrieving device.
  • the computer system 100 includes a display 120 which is used to display output to a computer user. It should also be noted that the computer system 100 can be linked to other computer systems 125a-c in a network or wide area network to provide centralized access to the computer system 100.
  • Software for accessing and processing the nucleotide sequences of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 (such as search tools, compare tools, and modeling tools etc.) may reside in main memory 115 during execution.
  • the computer system 100 may further comprise a sequence comparer for comparing the above-described nucleic acid codes of SEQ ID Nos.
  • sequence comparer refers to one or more programs which are implemented on the computer system 100 to compare a nucleotide sequence with other nucleotide sequences and/or compounds stored within the data storage means.
  • the sequence comparer may compare the nucleotide sequences of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 stored on a computer readable medium to reference sequences stored on a computer readable medium to identify homologies or structural motifs.
  • the various sequence comparer programs identified elsewhere in this patent specification are particularly contemplated for use in this aspect of the invention.
  • Figure 15 is a flow diagram illustrating one embodiment of a process 200 for comparing a new nucleotide or protein sequence with a database of sequences in order to determine the homology levels between the new sequence and the sequences in the database.
  • the database of sequences can be a private database stored within the computer system 100, or a public database such as GENBANK that is available through the Internet.
  • the process 200 begins at a start state 201 and then moves to a state 202 wherein the new sequence to be compared is stored to a memory in a computer system 100.
  • the memory could be any type of memory, including RAM or an internal storage device.
  • the process 200 then moves to a state 204 wherein a database of sequences is opened for analysis and comparison.
  • the process 200 then moves to a state 206 wherein the first sequence stored in the database is read into a memory on the computer
  • a comparison is then performed at a state 210 to determine if the first sequence is the same as the second sequence. It is important to note that this step is not limited to performing an exact comparison between the new sequence and the first sequence in the database.
  • Methods are well known to those of skill in the art for comparing two nucleotide or protein sequences, even if they are not identical. For example, gaps can be introduced into one sequence in order to raise the homology level between the two tested sequences.
  • the process 200 moves to a state 214 wherein the name of the sequence from the database is displayed to the user. This state notifies the user that the sequence with the displayed name fulfills the homology constraints that were entered.
  • the process 200 moves to a decision state 218 wherein a determination is made whether more sequences exist in the database. If no more sequences exist in the database, then the process 200 terminates at an end state 220. However, if more sequences do exist in the database, then the process 200 moves to a state 224 wherein a pointer is moved to the next sequence in the database so that it can be compared to the new sequence. In this manner, the new sequence is aligned and compared with every sequence in the database.
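  • The loop of process 200 can be sketched in a few lines. This is an illustration only: the similarity ratio from Python's standard `difflib.SequenceMatcher` is used as a simple stand-in for a homology level and is not an alignment program such as BLAST; the function name `find_homologous`, the in-memory dictionary used as the database and the `min_homology` threshold are assumptions.

```python
from difflib import SequenceMatcher


def find_homologous(new_sequence, database, min_homology=0.9):
    """Compare a new sequence with every sequence in a database.

    database maps sequence names to sequence strings. Returns the
    (name, level) pairs whose similarity meets the threshold.
    """
    matches = []
    # Visit each database entry in turn, as with the moving pointer.
    for name, sequence in database.items():
        # Similarity ratio in [0, 1]; autojunk is disabled so repeated
        # nucleotides are never treated as noise on long sequences.
        level = SequenceMatcher(None, new_sequence, sequence, autojunk=False).ratio()
        if level >= min_homology:
            # Corresponds to displaying the name of a matching sequence.
            matches.append((name, level))
    return matches
```

  • The function returns once every entry has been visited, which mirrors the end state reached when no more sequences exist in the database.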
  • One aspect of the present invention is a computer system comprising a processor, a data storage device having stored thereon a nucleic acid code of SEQ ID Nos.
  • a data storage device having retrievably stored thereon reference nucleotide sequences or polypeptide sequences to be compared to the nucleic acid code of SEQ ID Nos.
  • The sequence comparer may indicate a homology level between the sequences compared or identify structural motifs in the above described nucleic acid code of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to
  • The data storage device may have stored thereon the sequences of at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to
  • Another aspect of the present invention is a method for determining the level of homology between a nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 and a reference nucleotide sequence, comprising the steps of reading the nucleic acid code and the reference nucleotide sequence through the use of a computer program which determines homology levels and determining homology between the nucleic acid code and the reference nucleotide sequence with the computer program.
  • The computer program may be any of a number of computer programs for determining homology levels, including those specifically enumerated herein, including BLAST2N with the default parameters or with any modified parameters.
  • The method may be implemented using the computer systems described above. The method may also be performed by reading at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the above described nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125,
  • Figure 16 is a flow diagram illustrating one embodiment of a process 250 in a computer for determining whether two sequences are homologous.
  • The process 250 begins at a start state 252 and then moves to a state 254 wherein a first sequence to be compared is stored to a memory.
  • The second sequence to be compared is then stored to a memory at a state 256.
  • The process 250 then moves to a state 260 wherein the first character in the first sequence is read and then to a state 262 wherein the first character of the second sequence is read.
  • If the sequence is a nucleotide sequence, then the character would normally be either A, T, C, G or U.
  • If the sequence is a protein sequence, then it should be in the single letter amino acid code so that the first and second sequences can be easily compared.
  • The level of homology is determined by calculating the proportion of characters between the sequences that were the same out of the total number of characters in the first sequence. Thus, if every character in a first 100 nucleotide sequence aligned with every character in a second sequence, the homology level would be 100%.
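The homology level defined above (matching characters divided by the length of the first sequence) can be written as a minimal Python sketch. The function name is hypothetical, and the sketch assumes the two sequences are compared position by position without introducing gaps.

```python
def percent_identity(first: str, second: str) -> float:
    """Share of positions in `first` whose character equals the one in `second`."""
    if not first:
        raise ValueError("first sequence is empty")
    # Read the sequences one character at a time, in parallel.
    matches = sum(1 for a, b in zip(first.upper(), second.upper()) if a == b)
    # Denominator is the total number of characters in the first sequence.
    return 100.0 * matches / len(first)
```

For two identical 100-nucleotide sequences this returns 100.0, in line with the example in the text; one mismatch in a 10-nucleotide sequence gives 90.0.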
  • The computer program may be a computer program which compares the nucleotide sequences of the nucleic acid codes of the present invention to reference nucleotide sequences in order to determine whether the nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 differs from a reference nucleic acid sequence at one or more positions.
  • Such a program records the length and identity of inserted, deleted or substituted nucleotides with respect to the sequence of either the reference polynucleotide or the nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
  • The computer program may be a program which determines whether the nucleotide sequences of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 contain a biallelic marker or single nucleotide polymorphism (SNP) with respect to a reference nucleotide sequence.
  • This single nucleotide polymorphism may comprise a single base substitution, insertion, or deletion, while this biallelic marker may comprise about one to ten consecutive bases substituted, inserted or deleted.
  • Another aspect of the present invention is a method for determining whether a nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 differs at one or more nucleotides from a reference nucleotide sequence, comprising the steps of reading the nucleic acid code and the reference nucleotide sequence through use of a computer program which identifies differences between nucleic acid sequences and identifying differences between the nucleic acid code and the reference nucleotide sequence with the computer program.
  • The computer program is a program which identifies single nucleotide polymorphisms.
  • The method may be implemented by the computer systems described above and the method illustrated in Figure 16. The method may also be performed by reading at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the nucleic acid codes of SEQ ID NOs.
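A program of the kind described, one that records inserted, deleted or substituted nucleotides relative to a reference, might look like the following Python sketch. It assumes the two sequences have already been aligned by some external tool and that gaps are written as `-`; the function name and the tuple layout are hypothetical.

```python
def list_differences(ref_aln: str, query_aln: str) -> list:
    """Return (type, reference_position, ref_base, query_base) for each difference.

    Both inputs are rows of a pairwise alignment of equal length, with '-'
    marking a gap. Positions are 0-based coordinates in the ungapped reference.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    diffs = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r == "-" and q == "-":
            continue                                   # empty column, nothing to record
        if r == "-":
            diffs.append(("insertion", ref_pos, "", q))
        elif q == "-":
            diffs.append(("deletion", ref_pos, r, ""))
        elif r != q:
            diffs.append(("substitution", ref_pos, r, q))
        if r != "-":
            ref_pos += 1                               # advance only on reference bases
    return diffs
```

A single-base substitution reported by such a routine corresponds to the simplest form of SNP discussed above; runs of adjacent insertion or deletion columns would need to be merged to report multi-base events.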
  • The computer based system may further comprise an identifier for identifying features within the nucleotide sequences of the nucleic acid codes of SEQ ID NOs.
  • An "identifier" refers to one or more programs which identify certain features within the above-described nucleotide sequences of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
  • The identifier may comprise a program which identifies an open reading frame in the cDNAs codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
  • Figure 17 is a flow diagram illustrating one embodiment of an identifier process 300 for detecting the presence of a feature in a sequence.
  • The process 300 begins at a start state 302 and then moves to a state 304 wherein a first sequence that is to be checked for features is stored to a memory 115 in the computer system 100.
  • The process 300 then moves to a state 306 wherein a database of sequence features is opened.
  • A database would include a list of each feature's attributes along with the name of the feature. For example, a feature name could be "Initiation Cod
  • The process 300 moves to a state 308 wherein the first feature is read from the database. A comparison of the attribute of the first feature with the first sequence is then made at a state 310. A determination is then made at a decision state 316 whether the attribute of the feature was found in the first sequence. If the attribute was found, then the process 300 moves to a state 318 wherein the name of the found feature is displayed to the user.
  • The process 300 then moves to a decision state 320 wherein a determination is made whether more features exist in the database. If no more features exist, then the process 300 terminates at an end state 324. However, if more features do exist in the database, then the process 300 reads the next sequence feature at a state 326 and loops back to the state 310 wherein the attribute of the next feature is compared against the first sequence.
  • If the attribute of the feature is not found in the first sequence at the decision state 316, the process 300 moves directly to the decision state 320 in order to determine if any more features exist in the database.
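The identifier loop of process 300 can be illustrated with a short Python sketch. It assumes that each feature attribute is a literal nucleotide motif and that the feature database is a small in-memory dictionary; the feature names and motifs below are hypothetical examples, not entries from any database named in this document.

```python
def find_features(sequence: str, feature_db: dict) -> list:
    """Return the names of all features whose attribute occurs in the sequence."""
    found = []
    seq = sequence.upper()
    for name, attribute in feature_db.items():  # states 308 / 326: read next feature
        if attribute in seq:                    # states 310 / 316: compare, decide
            found.append(name)                  # state 318: report the feature name
    return found                                # state 324: no more features


# Hypothetical feature table: name -> motif to search for.
feature_db = {
    "start_codon": "ATG",
    "tata_box": "TATAAA",
    "poly_a_signal": "AATAAA",
}
found = find_features("ccgTATAAAggcATGgcc", feature_db)
```

Real feature attributes are often patterns rather than fixed strings, in which case the substring test would be replaced by a pattern or profile search.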
  • Another aspect of the present invention is a method of identifying a feature within the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374,
  • computer program comprises a computer program which identifies open reading frames
  • The method may be performed by reading a single sequence or at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 through the use of the computer program and identifying features within the nucleic acid codes with the computer program.
  • The nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 may be stored and manipulated in a variety of data processor programs in a variety of formats.
  • 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 may be stored as text in a word processing file, such as MicrosoftWORD or WORDPERFECT, or as an ASCII file in a variety of database programs familiar to those of skill in the art, such as DB2, SYBASE, or ORACLE.
  • Many computer programs and databases may be used as sequence comparers, identifiers, or sources of reference nucleotide sequences to be compared to the nucleic acid codes of SEQ ID NOs.
  • The programs and databases which may be used include, but are not limited to: MacPattern (EMBL), DiscoveryBase (Molecular Applications Group), GeneMine (Molecular Applications Group), Look (Molecular Applications Group), MacLook (Molecular Applications
  • Motifs which may be detected using the above programs include sequences encoding leucine zippers, helix-turn-helix motifs, glycosylation sites, ubiquitination sites, alpha helices, and beta sheets, signal sequences encoding signal peptides which direct the secretion of the encoded proteins, sequences implicated in transcription regulation such as homeoboxes, acidic stretches, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites.
  • Nucleic acid codes of the invention further encompass all of the polynucleotides disclosed, described or claimed in the present application. Moreover, the present invention specifically contemplates the storage of such codes on computer readable media and computer systems individually or in any combination, as well as the use of such codes and combinations in the methods of VI.
VII. Mapping and Maps Comprising the Biallelic Markers of the Invention
  • The human haploid genome contains an estimated 80,000 to 100,000 or more genes scattered on a 3 x 10^9 base-long double stranded DNA shared among the 24 chromosomes. Each human being is diploid, i.e. possesses two haploid genomes, one from paternal origin, the other from maternal origin.
  • The sequence of the human genome varies among individuals in a population. About 10^7 sites scattered along the 3 x 10^9 base pairs of DNA are polymorphic, existing in at least two variant forms called alleles. Most of these polymorphic sites are generated by single base substitution mutations and are biallelic. Fewer than 10^5 polymorphic sites are due to more complex changes and are very often multi-allelic, i.e.
  • Any individual (diploid) can be either homozygous (twice the same allele) or heterozygous (two different alleles).
  • A given polymorphism or rare mutation can be either neutral (no effect on trait), or functional, i.e. responsible for a particular genetic trait.
  • The first step towards the identification of genes associated with a detectable trait consists in the localization of genomic regions containing trait-causing genes using genetic mapping methods.
  • The preferred traits contemplated within the present invention relate to fields of therapeutic interest; in particular embodiments, they will be disease traits and/or drug response traits, reflecting drug efficacy or toxicity. Traits can either be "binary", e.g. diabetic vs. non diabetic, or "quantitative", e.g. elevated blood pressure. Individuals affected by a quantitative trait can be classified according to an appropriate scale of trait values, e.g. blood pressure ranges. Each trait value range can then be analyzed as a binary trait.
  • Genetic mapping involves the analysis of the segregation of polymorphic loci in trait-positive and trait-negative populations. Polymorphic loci constitute a small fraction of the human genome (less than 1%), compared to the vast majority of human genomic DNA which is identical in sequence among the chromosomes of different individuals.
  • Genetic markers can be defined as genome-derived polynucleotides which are sufficiently polymorphic to allow a reasonable probability that a randomly selected person will be heterozygous, and thus informative for genetic analysis by methods such as linkage analysis or association studies.
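The probability that a randomly selected person is heterozygous at a biallelic marker can be estimated from the allele frequency. The following Python sketch assumes Hardy-Weinberg proportions, an assumption introduced here for illustration and not stated in the text, under which the expected heterozygosity is 2p(1 - p) for an allele of frequency p. The function name is hypothetical.

```python
def expected_heterozygosity(p: float) -> float:
    """Expected heterozygote frequency 2p(1-p) for a biallelic marker.

    `p` is the population frequency of one allele; Hardy-Weinberg
    proportions are assumed.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * p * (1.0 - p)
```

Under this assumption a marker with equally frequent alleles is the most informative (half of all individuals heterozygous), while a marker whose rarer allele has a frequency of 0.1 is heterozygous in about 18% of individuals.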
  • A genetic map consists of a collection of polymorphic markers which have been positioned on the human chromosomes. Genetic maps may be combined with physical maps, collections of ordered overlapping fragments of genomic DNA whose arrangement along the human chromosomes is known. The optimal genetic map should possess the following characteristics.
  • the density of the genetic markers scattered along the genome should be sufficient to allow the identification and localization of any trait-related polymorphism, - each marker should have an adequate level of heterozygosity, so as to be informative in a large percentage of different meioses,
  • Maps of the present invention may be used in the individual marker and haplotype association analyses described below without the necessity of determining the order of biallelic markers derived from a single BAC with respect to one another.
Construction of a Physical Map
  • The first step in constructing a high density genetic map of biallelic markers is the construction of a physical map.
  • Physical maps consist of ordered, overlapping cloned fragments of genomic DNA covering a portion of the genome, preferably covering one or all chromosomes.
  • Obtaining a physical map of the genome entails constructing and ordering a genomic DNA library. For an example of a complete explanation of the construction of a physical map from a BAC library see related PCT Application No.
  • Biallelic markers may be used in single marker and haplotype association analyses regardless of the completeness of the corresponding physical contig harboring them. Using the procedures above, 3908 biallelic markers, each having two alleles, were identified using sequences obtained from BACs which had been localized on the genome.
  • Markers were identified using pooled BACs and thereafter reassigned to individual BACs using STS screening procedures such as those described in Examples 1 and 2.
  • The sequences of these biallelic markers are provided in the accompanying Sequence Listing as SEQ ID Nos. 1 to 3908. Although the sequences of SEQ ID Nos.
  • The flanking sequences surrounding the polymorphic bases of SEQ ID Nos. 1 to 3908 may be lengthened or shortened to any extent compatible with their intended use and the present invention specifically contemplates such sequences.
  • The sequences of these biallelic markers may be used to construct genomic maps as well as in the gene identification and diagnostic techniques described herein. It will be appreciated that the biallelic markers referred to herein may be of any length compatible with their intended use provided that the markers include the polymorphic base, and the present invention specifically contemplates such sequences.
  • Biallelic markers can be ordered to determine their positions along chromosomes, preferably subchromosomal regions, by methods known in the art as well as those disclosed in
  • The positions of the biallelic markers along chromosomes may be determined using a variety of methodologies.
  • Radiation hybrid mapping is used.
  • Radiation hybrid (RH) mapping is a somatic cell genetic approach that can be used for high resolution mapping of the human genome.
  • Cell lines containing one or more human chromosomes are lethally irradiated, breaking each chromosome into fragments whose size depends on the radiation dose. These fragments are rescued by fusion with cultured rodent cells, yielding subclones containing different portions of the human genome. This technique is described by Benham et al. (Genomics 4:509-517, 1989) and Cox et al. (Science 250:245-250, 1990).
  • RH mapping has been used to generate a high-resolution whole genome radiation hybrid map of human chromosome 17q22-q25.3 across the genes for growth hormone (GH) and thymidine kinase (TK) (Foster et al., Genomics 33:185-192, 1996), the region surrounding the Gorlin syndrome gene (Obermayr et al., Eur. J. Hum. Genet.
  • PCR based techniques and human-rodent somatic cell hybrids may be used to determine the positions of the biallelic markers on the chromosomes.
  • Oligonucleotide primer pairs which are capable of generating amplification products containing the polymorphic bases of the biallelic markers are designed.
  • The oligonucleotide primers are 18-23 bp in length and are designed for PCR amplification.
  • The creation of PCR primers from known sequences is well known to those with skill in the art. For a review of PCR technology see Erlich, H.A., PCR Technology: Principles and Applications for
  • The PCR is performed in a microplate thermocycler (Techne) under the following conditions: 30 cycles of 94°C, 1.4 min; 55°C, 2 min; and 72°C, 2 min; with a final extension at 72°C for 10 min.
  • The amplified products are analyzed on a 6% polyacrylamide sequencing gel and visualized by autoradiography. If the length of the resulting PCR product is identical to the length expected for an amplification product containing the polymorphic base of the biallelic marker, then the PCR reaction is repeated with DNA templates from two panels of human-rodent somatic cell hybrids, BIOS PCRable DNA (BIOS)
  • PCR is used to screen a series of somatic cell hybrid cell lines containing defined sets of human chromosomes for the presence of a given biallelic marker.
  • DNA is isolated from the somatic hybrids and used as starting templates for PCR reactions using the primer pairs from the biallelic marker. Only those somatic cell hybrids with chromosomes containing the human sequence corresponding to the biallelic marker will yield an amplified fragment.
  • The biallelic markers are assigned to a chromosome by analysis of the segregation pattern of PCR products from the somatic hybrid DNA templates.
  • The single human chromosome present in all cell hybrids that give rise to an amplified fragment is the chromosome containing that biallelic marker.
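The segregation analysis described above amounts to a set intersection: the marker lies on the one chromosome shared by every PCR-positive hybrid and absent from every PCR-negative hybrid. The Python sketch below illustrates this under stated assumptions; the panel contents, cell line names, and function name are hypothetical and do not describe any actual hybrid panel.

```python
def assign_chromosome(panel: dict, pcr_positive: dict):
    """Return the single chromosome consistent with the PCR pattern, else None.

    panel:        cell line name -> set of human chromosomes it retains
    pcr_positive: cell line name -> True if an amplified fragment was seen
    """
    positives = [set(panel[line]) for line, hit in pcr_positive.items() if hit]
    if not positives:
        return None
    # Chromosomes present in every hybrid that gave an amplified fragment.
    candidates = set.intersection(*positives)
    # Exclude chromosomes also carried by hybrids that gave no fragment.
    for line, hit in pcr_positive.items():
        if not hit:
            candidates -= set(panel[line])
    return candidates.pop() if len(candidates) == 1 else None


# Hypothetical four-line panel and PCR results for one marker.
panel = {
    "hybrid_A": {"1", "7", "X"},
    "hybrid_B": {"7", "12"},
    "hybrid_C": {"1", "12"},
    "hybrid_D": {"3", "7", "X"},
}
pcr_positive = {"hybrid_A": True, "hybrid_B": True, "hybrid_C": False, "hybrid_D": True}
chromosome = assign_chromosome(panel, pcr_positive)
```

Returning `None` when zero or several chromosomes remain makes an ambiguous panel result explicit rather than silently picking one.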
  • Example 2 describes a preferred method for positioning of biallelic markers on clones, such as BAC clones, obtained from genomic DNA libraries. Using such procedures, a number of BAC clones carrying selected biallelic markers can be isolated. The position of these BAC clones on the human genome can be defined by performing STS screening as described in Example 1. Preferably, to decrease the number of STSs to be tested, each BAC can be localized on chromosomal or subchromosomal regions by procedures such as those described in Examples 3 and 4.
  • This localization will allow the selection of a subset of STSs corresponding to the identified chromosomal or subchromosomal region. Testing each BAC with such a subset of STSs and taking account of the position and order of the STSs along the genome will allow a refined positioning of the corresponding biallelic marker along the genome.
  • If the DNA library used to isolate BAC inserts or any type of genomic DNA fragments harboring the selected biallelic markers already constitutes a physical map of the genome or any portion thereof, using the known order of the DNA fragments will allow the order of the biallelic markers to be established.
  • Markers carried by the same fragment of genomic DNA, such as the insert in a BAC clone, need not necessarily be ordered with respect to one another within the genomic fragment to conduct single point or haplotype association analyses.
  • The order of biallelic markers carried by the same fragment of genomic DNA may be determined.
  • The positions of the biallelic markers used to construct the maps of the present invention may be assigned to subchromosomal locations using Fluorescence In Situ Hybridization (FISH) (Cherif et al., Proc. Natl. Acad. Sci. USA, 87:6639-6643 (1990)). FISH analysis is described in Example 3.
  • The ordering analyses may be conducted to generate an integrated genome wide genetic map comprising about 20,000, 40,000, 60,000, 80,000, 100,000, 120,000 biallelic markers with a roughly consistent number of biallelic markers per BAC.
  • The map includes one or more markers selected from the group consisting of the sequences of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
  • Maps having the above-specified average numbers of biallelic markers per BAC which comprise smaller portions of the genome may also be constructed using the procedures provided herein.
  • The biallelic markers in the map are separated from one another by an average distance of 10-200kb, 15-150kb, 20-100kb, 100-150kb, 50-100kb, or 25-50kb.
  • Maps having the above-specified intermarker distances which comprise smaller portions of the genome, such as a set of chromosomes, a single chromosome, a particular subchromosomal region, or any other desired portion of the genome may also be constructed using the procedures provided herein.
  • Figure 2, showing the results of computer simulations of the distribution of intermarker spacing on a randomly distributed set of biallelic markers, indicates the percentage of biallelic markers which will be spaced a given distance apart for a given number of markers/BAC in the genomic map (assuming 20,000 BACs constituting a minimally overlapping array covering the entire genome are evaluated). One hundred iterations were performed for each simulation (20,000 marker map, 40,000 marker map, 60,000 marker map, 120,000 marker map).
  • 98% of inter-marker distances will be lower than 150kb provided 60,000 evenly distributed markers are generated (3 per BAC); 90% of inter-marker distances will be lower than 150kb provided 40,000 evenly distributed markers are generated (2 per BAC); and 50% of inter-marker distances will be lower than 150kb provided 20,000 evenly distributed markers are generated (1 per BAC).
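A simulation of this kind can be approximated with the Python sketch below. The model is an assumption made here for illustration, not the simulation behind Figure 2: the genome is treated as 20,000 contiguous, non-overlapping BACs of 150 kb each (3 x 10^9 bases divided by 20,000), and a fixed number of markers is placed uniformly at random inside every BAC. The function name and parameters are hypothetical.

```python
import random


def fraction_below(markers_per_bac: int, n_bacs: int = 20_000,
                   bac_kb: float = 150.0, threshold_kb: float = 150.0,
                   seed: int = 0) -> float:
    """Fraction of adjacent inter-marker distances shorter than threshold_kb."""
    rng = random.Random(seed)              # fixed seed keeps the run reproducible
    positions = []
    for b in range(n_bacs):
        start = b * bac_kb                 # BACs laid end to end (assumption)
        positions.extend(start + rng.uniform(0.0, bac_kb)
                         for _ in range(markers_per_bac))
    positions.sort()
    gaps = [right - left for left, right in zip(positions, positions[1:])]
    return sum(gap < threshold_kb for gap in gaps) / len(gaps)
```

Under this simplified model the expected fractions work out analytically to 1/2, 11/12 and 59/60 for one, two and three markers per BAC, which is close to the 50%, 90% and 98% figures quoted above.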
  • The present invention then also concerns biallelic markers in linkage disequilibrium with the specific biallelic markers described above and which are expected to present similar characteristics in terms of their respective association with a given trait.
  • The present invention concerns the biallelic markers that are in linkage disequilibrium with the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374,
  • LD among a set of biallelic markers having an adequate heterozygosity rate can be determined by genotyping between 50 and 1000 unrelated individuals, preferably between 75 and 200, more preferably around 100. Genotyping a biallelic marker consists of determining the specific allele carried by an individual at the given polymorphic base of the biallelic marker. Genotyping can be performed using similar methods as those described above for the generation of the biallelic markers, or using other genotyping methods such as those further described below.
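As an illustration of how pairwise LD might be quantified once genotypes are available, the Python sketch below computes the disequilibrium coefficient D and the squared correlation r^2 for two biallelic markers. It assumes that phased haplotype counts are already known, which is a simplification: unphased genotypes would first require haplotype frequency estimation. The function and argument names are hypothetical, and these particular statistics are common choices rather than ones prescribed by this document.

```python
def ld_stats(n_AB: int, n_Ab: int, n_aB: int, n_ab: int):
    """Return (D, r_squared) from counts of the four two-marker haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / n                  # frequency of haplotype A-B
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at the first marker
    p_B = (n_AB + n_aB) / n          # frequency of allele B at the second marker
    d = p_AB - p_A * p_B             # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared
```

With about 100 unrelated individuals, as suggested above, each pair of markers contributes 200 haplotypes to these counts.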
  • Genome-wide linkage disequilibrium mapping aims at identifying, for any trait-causing allele being searched, at least one biallelic marker in linkage disequilibrium with said trait-causing allele.
  • The biallelic markers therein have average inter-marker distances of 150kb or less, 75kb or less, 50kb or less, 30kb or less, or 25kb or less to accommodate the fact that, in some regions of the genome, the detection of linkage disequilibrium requires lower inter-marker distances.
  • The present invention provides methods to generate biallelic marker maps with average inter-marker distances of 150kb or less.
  • The mean distance between biallelic markers constituting the high density map will be less than 75kb, preferably less than 50kb.
  • Further preferred maps according to the present invention contain markers that are less than 37.5kb apart.
  • The average inter-marker spacing for the biallelic markers constituting very high density maps is less than 30kb, most preferably less than 25kb.
  • One embodiment of the present invention comprises methods for identifying and isolating genes associated with a detectable trait using the biallelic marker maps of the present invention.
  • Linkage analysis is based upon establishing a correlation between the transmission of genetic markers and that of a specific trait throughout generations within a family.
  • All members of a series of affected families are genotyped with a few hundred markers, typically microsatellite markers, which are distributed at an average density of one every 10 Mb.
  • haplotyping parental haploid genomes
  • Linkage analysis suffers from a variety of drawbacks. First, linkage analysis is limited by its reliance on the choice of a genetic model suitable for each studied trait. Furthermore, as already mentioned, the resolution attainable using linkage analysis is limited, and complementary studies are required to refine the analysis of the typical 2Mb to 20Mb regions initially identified through linkage analysis.
  • Linkage analysis cannot be applied to the study of traits for which no large informative families are available. Typically, this will be the case in any attempt to identify trait-causing alleles involved in sporadic cases, such as alleles associated with positive or negative responses to drug treatment.
  • The present genetic maps and biallelic markers may be used to identify and isolate genes associated with detectable traits using association studies, an approach which does not require the use of affected families and which permits the identification of genes associated with sporadic traits.
Association Studies
  • Any gene responsible or partly responsible for a given trait will be in linkage disequilibrium with some flanking markers.
  • Specific alleles of these flanking markers which are associated with the gene or genes responsible for the trait are identified.
  • Although linkage disequilibrium mapping refers to locating a single gene which is responsible for the trait, it will be appreciated that the same techniques may also be used to identify genes which are partially responsible for the trait. Association studies may be conducted within the general population (as opposed to the linkage analysis techniques discussed above, which are limited to studies performed on related individuals in one or several affected families).
  • Allele a of biallelic marker A may be directly responsible for trait T (e.g., Apo E ε4 site A and Alzheimer's disease).
  • Since the majority of the biallelic markers used in genetic mapping studies are selected randomly, they mainly map outside of genes.
  • The likelihood of allele a being a functional mutation directly related to trait T is very low.
  • An association between a biallelic marker A and a trait T may also occur when the biallelic marker is very closely linked to the trait locus.
  • An association occurs when allele a is in linkage disequilibrium with the trait-causing allele.
  • the gene responsible for the trait or one of the genes responsible for the trait. As will be further exemplified below, using a group of biallelic markers which are in close proximity to the gene responsible for the trait, the location of the causal gene can be deduced from the profile of the association curve between the biallelic markers and the trait. The causal gene will usually be found in the vicinity of the marker showing the highest association with the trait.
  • An association between a biallelic marker and a trait may occur when people with the trait and people without the trait correspond to genetically different subsets of the population who, coincidentally, also differ in the frequency of allele a (population stratification). This phenomenon may be avoided by using ethnically matched large heterogeneous samples.
  • Association studies are particularly suited to the efficient identification of genes that present common polymorphisms, and are involved in multifactorial traits whose frequency is relatively higher than that of diseases with monofactorial inheritance. Association studies mainly consist of four steps: recruitment of trait-positive (T+) and control populations, preferably trait-negative (T-) populations with well-defined phenotypes; identification of a candidate region suspected of harboring a trait-causing gene; identification of said gene among candidate genes in the region; and finally validation of mutation(s) responsible for the trait in said trait-causing gene.
  • The trait-positive phenotype should be well-defined; preferably the control phenotype is a well-defined trait-negative phenotype as well.
  • The trait under study should preferably follow a bimodal distribution in the population under study, presenting two clear non-overlapping phenotypes, trait-positive and trait-negative.
  • Any genetic trait may still be analyzed using the association method proposed herein by carefully selecting the individuals to be included in the trait-positive group and preferably the trait-negative phenotypic group as well.
  • The selection procedure ideally involves selecting individuals at opposite ends of the non-bimodal phenotype spectrum of the trait under study, so as to include in these trait-positive and trait-negative populations individuals who clearly represent non-overlapping, preferably extreme phenotypes.
  • Figure 3 shows, for a series of hypothetical sample sizes, the p-value significance obtained in association studies performed using individual markers from the high-density biallelic map, according to various hypotheses regarding the difference of allelic frequencies between the trait-positive and trait-negative samples. It indicates that, in all cases, samples ranging from 150 to 500 individuals are numerous enough to achieve statistical significance. It will be appreciated that bigger or smaller groups can be used to perform association studies according to the methods of the present invention.
  • A marker/trait association study is performed that compares the genotype frequency of each biallelic marker in the above described trait-positive and trait-negative populations by means of a chi square statistical test (one degree of freedom).
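A chi square test with one degree of freedom on a single biallelic marker can be sketched in Python as follows. The sketch assumes the comparison is made on a 2 x 2 table of allele counts (two alleles by two populations), which is one way to obtain a single degree of freedom; the function name, argument names, and example counts are hypothetical. The p-value uses the identity that the upper tail of a one-degree-of-freedom chi square distribution equals erfc(sqrt(x / 2)).

```python
import math


def allele_chi_square(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int):
    """Return (chi_square, p_value) for a 2 x 2 table of allele counts.

    case_a / case_b: counts of the two alleles in the trait-positive sample
    ctrl_a / ctrl_b: counts of the two alleles in the trait-negative sample
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    if 0 in (row_case, row_ctrl, col_a, col_b):
        raise ValueError("every row and column of the table must be non-empty")
    # Pearson chi square for a 2 x 2 table, without continuity correction.
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail, 1 degree of freedom
    return chi2, p_value


# Hypothetical counts: 50 individuals (100 alleles) per group.
chi2, p_value = allele_chi_square(60, 40, 40, 60)
```

With these illustrative counts the statistic is 8.0, which corresponds to a p-value a little under 0.005.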
  • A haplotype association analysis is performed to define the frequency and the type of the ancestral carrier haplotype. Haplotype analysis, by combining the informativeness of a set of biallelic markers, increases the power of the association analysis, allowing false positive and/or negative data that may result from the single marker studies to be eliminated.
  • Genotyping can be performed using any method described in m, including the microsequencing procedure described in Example 8.
  • A third step consists of completely sequencing the BAC inserts harboring the markers identified in the association analyses.
  • The functional sequences within the candidate region (e.g. exons, splice sites, promoters, and other potential regulatory regions) are scanned for mutations which are responsible for the trait by comparing the sequences of the functional regions in a selected number of trait-positive and trait-negative individuals using appropriate software. Tools for sequence analysis are further described in Example 9.
  • Candidate mutations are then validated by screening a larger population of trait-positive and trait-negative individuals using genotyping techniques described below.
  • Polymorphisms are confirmed as candidate mutations when the validation population shows association results compatible with those found between the mutation and the trait in the test population.
  • The trait-positive and trait-negative populations are genotyped using an appropriate number of biallelic markers.
  • the markers may include one or more of the markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
  • the markers used to define a region bearing a candidate gene may be distributed at an average density of 1 marker per 10-200 kb.
  • the markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 15-150 kb.
  • the markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 20-100kb.
  • the markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 100 to 150kb.
  • the markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 50 to 100kb.
  • the biallelic markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 25-50 kilobases.
  • the marker density of the map will be adapted to take the linkage disequilibrium distribution in the genomic region of interest into account.
  • the initial identification of a candidate genomic region harboring a gene associated with a detectable phenotype may be conducted using a preliminary map containing a few thousand biallelic markers. Thereafter, the genomic region harboring the gene responsible for the detectable trait may be better delineated using a map containing a larger number of biallelic markers. Furthermore, the genomic region harboring the gene responsible for the detectable trait may be further delineated using a high density map of biallelic markers. Finally, the gene associated with the detectable trait may be identified and isolated using a very high density biallelic marker map. Example 6 describes a procedure for identifying a candidate region harboring a gene associated with a detectable trait and provides simulated results for this procedure.
  • Although Example 6 compares the results of simulated analyses using markers derived from maps having 3,000, 20,000, and 60,000 markers, the number of markers contained in the map is not restricted to these exemplary figures. Rather, Example 6 exemplifies the increasing refinement of the candidate region with increasing marker density.
  • haplotype studies can be performed using groups of markers located in proximity to one another within regions of the genome. For example, using the methods described above in which the association of an individual marker with a detectable phenotype was analyzed using maps of 3,000 markers, 20,000 markers, and 60,000 markers, a series of haplotype studies can be performed using groups of contiguous markers from such maps or from maps having higher marker densities.
  • a series of successive haplotype studies including groups of markers spanning regions of more than 1 Mb may be performed.
  • the biallelic markers included in each of these groups may be located within a genomic region spanning less than 1kb, from 1 to 5kb, from 5 to 10kb, from 10 to 25kb, from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to 500kb, from 500kb to 1Mb, or more than 1Mb.
  • the genomic regions containing the groups of biallelic markers used in the successive haplotype analyses are overlapping.
  • biallelic markers need not completely cover the genomic regions of the above-specified lengths but may instead be obtained from incomplete contigs having one or more gaps therein.
  • biallelic markers may be used in single point and haplotype association analyses regardless of the completeness of the corresponding physical contig harboring them.
  • Genome-wide mapping using association studies with dense enough arrays of markers permits a case-by-case best estimate of p-value significance thresholds.
  • a corresponding association between the trait and a studied marker will be deemed not significant, while for a p-value below such a threshold, said association will be deemed significant. If the p-value is significant, the genomic region around the marker will be further scrutinized for a trait-causing gene.
  • p-value significance thresholds be assessed for each case/control population comparison. Both the genetic distance between sampled populations ("stratification") and the dispersion due to random selection of samples may indeed influence the p-value significance thresholds. It will be appreciated that the above approaches may be conducted on any scale (i.e. over the whole genome, a set of chromosomes, a single chromosome, a particular subchromosomal region, or any other desired portion of the genome). As mentioned above, once significance thresholds have been assessed, population sample sizes may be adapted as exemplified in Figure 3. Example 7 below illustrates the increase in statistical power brought to an association study by a haplotype analysis.
  • Examples 5 and 7 generated from individual and haplotype studies using a biallelic marker set of an average density equal to ca. 40kb in the region of an Alzheimer's disease trait causing gene, indicate that all biallelic markers of sufficient informative content located within a ca. 200 kb genomic region around a trait-causing allele can potentially be successfully used to localize a trait causing gene with the methods provided by the present invention.
  • a sequence analysis process will allow the detection of all genes located within said region, together with a potential functional characterization of said genes.
  • the identified functional features may allow preferred trait-causing candidates to be chosen from among the identified genes.
  • More biallelic markers may then be generated within said candidate genes, and used to perform refined association studies that will support the identification of the trait causing gene. Sequence analysis processes are described in Example 9.
  • Examples 10-18 illustrate the application of the above methods using biallelic markers to identify a gene associated with a complex disease, prostate cancer, within a ca. 450 kb candidate region. Additional details of the identification of the gene associated with prostate cancer are provided in the U.S. Patent Application entitled “Prostate Cancer Gene” Serial No. 08/996,306.
  • genes associated with detectable traits may be identified as follows.
  • Candidate genomic regions suspected of harboring a gene associated with the trait may be identified using techniques such as those described herein. In such techniques, the allelic frequencies of biallelic markers are compared in nucleic acid samples derived from individuals expressing the detectable trait and individuals who do not express the detectable trait. In this manner, candidate genomic regions suspected of harboring a gene associated with the detectable trait under investigation are identified.
  • the existence of one or more genes associated with the detectable trait within the candidate region is confirmed by identifying more biallelic markers lying in the candidate region.
  • a first haplotype analysis is performed for each possible combination of groups of biallelic markers within the genomic region suspected of harboring a trait-associated gene.
  • each group may comprise three biallelic markers.
  • the frequency of each possible haplotype (for groups of three markers there are 8 possible haplotypes) in individuals expressing the trait and individuals who do not express the trait is estimated.
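The count of 8 possible haplotypes for a group of three biallelic markers follows from 2 x 2 x 2 allele combinations. A minimal sketch of the enumeration is given below; the allele letters and the function name are hypothetical placeholders chosen for illustration only.

```python
from itertools import product

def enumerate_haplotypes(markers):
    """Return every possible haplotype for a group of biallelic markers.

    `markers` is a list of (allele_1, allele_2) pairs, one pair per marker,
    given in map order. Each haplotype is returned as a string of alleles.
    """
    return ["".join(alleles) for alleles in product(*markers)]

# Hypothetical group of three biallelic markers.
group = [("A", "G"), ("C", "T"), ("A", "T")]
haplotypes = enumerate_haplotypes(group)
print(len(haplotypes))  # 2 ** 3 possible haplotypes
```

The same enumeration scales as 2 to the power of the number of markers in the group, which is one practical reason for keeping the groups small.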
  • a haplotype estimation method is applied as described in IV. For example, the haplotype frequencies may be estimated using the Expectation-
  • a second haplotype analysis is performed for each possible combination of groups of biallelic markers within the genomic regions which are not suspected of harboring a trait-associated gene.
  • each group may comprise three biallelic markers.
  • the frequency of each possible haplotype (for groups of three markers there are 8 possible haplotypes) in individuals expressing the trait and individuals who do not express the trait is estimated.
  • the frequencies of each of the possible haplotypes of the grouped markers (or each allele of individual markers) in individuals expressing the trait and individuals who do not express the trait are compared. For example, the frequencies may be compared by performing a chi-squared analysis.
  • the haplotype (or the allele of each individual marker) having the greatest association with the trait is selected. This process is repeated for each group of biallelic markers (or each allele of the individual markers) to generate a distribution of association values, which will be referred to herein as the "random" distribution.
  • the trait-associated distribution and the random distribution are then compared to one another to determine if there are significant differences between them.
  • the trait-associated distribution and the random distribution can be compared using either the
  • the candidate genomic region is unlikely to contain a gene associated with the detectable trait. Accordingly, no further analysis of the candidate genomic region is performed. While Examples 10 to 26 illustrate the use of the maps and markers of the present invention for identifying a new gene associated with a complex disease within a 2Mb genomic region for establishing that a candidate gene is, at least partially, responsible for a disease, the maps and markers of the present invention may also be used to identify one or more biallelic markers or one or more genes associated with other detectable phenotypes, including drug response, drug toxicity, or drug efficacy.
  • a "positive response" to a medicament can be defined as comprising a reduction of the symptoms related to the disease or condition to be treated.
  • a "negative response" to a medicament can be defined as comprising either a lack of positive response to the medicament which does not lead to a symptom reduction or to a side-effect observed following administration of the medicament.
  • Drug efficacy, response and tolerance/toxicity can be considered as multifactorial traits involving a genetic component in the same way as complex diseases such as Alzheimer's disease, prostate cancer, hypertension or diabetes.
  • the identification of genes involved in drug efficacy and toxicity could be achieved following a positional cloning approach, e.g. performing linkage analysis within families in order to obtain the subchromosomal location of the gene(s).
  • this type of analysis is actually impractical in the case of drug responsiveness, due to the lack of availability of familial cases.
  • the likelihood of having more than one individual in a particular family being exposed to the same drug at the same time is very low. Therefore, drug efficacy and toxicity can only be analyzed as sporadic traits.
  • the above mentioned groups are recruited according to phenotyping criteria having the characteristics described above, so that the phenotypes defining the different groups are non-overlapping, preferably extreme phenotypes.
  • phenotyping criteria have the bimodal distribution described above.
  • association and haplotype analyses may be performed as described herein to identify one or more biallelic markers associated with drug response, preferably drug toxicity or drug efficacy.
  • identification of such one or more biallelic markers allows one to conduct diagnostic tests to determine whether the administration of a drug to an individual will result in drug response, preferably drug toxicity, or drug efficacy.
  • the methods described above for identifying a gene associated with prostate cancer and biallelic markers indicative of a risk of suffering from asthma may be utilized to identify genes associated with other detectable phenotypes.
  • the above methods may be used with any marker or combination of markers included in the maps of the present invention, including the biallelic markers of SEQ ID Nos.: 1 to 3809 or the sequences complementary thereto.
  • the general strategy to perform the association studies using the maps and markers of the present invention is to scan two groups of individuals (trait-positive individuals and trait-negative controls) characterized by a well defined phenotype in order to measure the allele frequencies of the biallelic markers in each of these groups.
  • the frequencies of markers with inter-marker spacing of about 150 kb are determined in each group. More preferably, the frequencies of markers with inter-marker spacing of about 75 kb are determined in each group. Even more preferably, markers with inter-marker spacing of about 50 kb, about 37.5kb, about 30kb, or about 25kb will be tested in each population. In some embodiments the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, 2000, 3000, or all of the biallelic markers of SEQ ID Nos.: 1 to 3908 or the sequences complementary thereto are measured in each population.
  • the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, 2000, or 3000 biallelic markers selected from the group consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers of 1 to 3908 or the sequences complementary thereto are measured in each population.
  • the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, 2000, or all of the biallelic markers of SEQ ID Nos.: 1 to 2260 or the sequences complementary thereto are measured in each population.
  • the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, or 2000 biallelic markers selected from the group consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers of 1 to 2260 or the sequences complementary thereto are measured in each population.
  • the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, or all of the biallelic markers of SEQ ID Nos.: 2261 to 3734 or the sequences complementary thereto are measured in each population.
  • the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000 biallelic markers selected from the group consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers of 2261 to 3734 or the sequences complementary thereto are measured in each population.
  • the frequencies of 1, 5, 10, 20, 50, 100, or all of the biallelic markers of SEQ ID Nos.: 3735 to 3908 or the sequences complementary thereto are measured in each population.
  • the frequencies of 1, 5, 10, 20, 50, or 100 biallelic markers selected from the group consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers of 3735 to 3908 or the sequences complementary thereto are measured in each population.
  • the frequencies of about 20,000, or about 40,000 biallelic markers are determined in each population. In a highly preferred embodiment, the frequencies of about 60,000, about 80,000, about 100,000, or about 120,000 biallelic markers are determined in each population. In some embodiments, haplotype analyses may be run using groups of markers located within regions spanning less than 1kb, from 1 to 5kb, from 5 to 10kb, from 10 to 25kb, from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to 500kb, from 500kb to 1Mb, or more than 1Mb.
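The inter-marker spacings of about 150, 75, 50, 37.5, 30 and 25 kb and the marker counts of about 20,000 to about 120,000 are consistent with one another if a haploid genome of roughly 3,000 Mb is assumed. That genome size is an assumption of the sketch below rather than a figure stated in this passage, and the function name is hypothetical.

```python
# Assumption: haploid human genome of roughly 3,000 Mb, expressed in kb.
GENOME_SIZE_KB = 3_000_000

def markers_for_spacing(spacing_kb, genome_kb=GENOME_SIZE_KB):
    """Approximate number of evenly spaced markers needed to cover the genome."""
    return round(genome_kb / spacing_kb)

# Spacings (in kb) quoted for the successive embodiments.
marker_counts = {s: markers_for_spacing(s) for s in (150, 75, 50, 37.5, 30, 25)}
for spacing, count in marker_counts.items():
    print(f"about {spacing} kb spacing -> about {count:,} markers")
```

Under that assumption, halving the spacing doubles the number of markers whose frequencies must be determined in each population.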
  • Allele frequency can be measured using any genotyping method described herein including microsequencing techniques; preferred high throughput microsequencing procedures are further exemplified in III; it will be further appreciated that any other large scale genotyping method suitable with the intended purpose contemplated herein may also be used. It will be appreciated that it is not necessary to use a full high density biallelic marker map in order to start a genome-wide association study. Maps having higher densities of biallelic markers (two or more markers per BAC, average inter-marker spacing of about 75kb or less) may then be generated by starting first on those BACs for which a candidate association has been established at the first step.
  • the biallelic markers included in each of these groups may be located within a genomic region spanning less than 1kb, from 1 to 5kb, from 5 to 10kb, from 10 to 25kb, from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to 500kb, from 500kb to 1Mb, or more than 1Mb. It will be appreciated that the ordered DNA fragments containing these groups of biallelic markers need not completely cover the genomic regions of these lengths but may instead be incomplete contigs having one or more gaps therein. As discussed in further detail below, biallelic markers may be used in association studies and haplotype analyses regardless of the completeness of the corresponding physical contig harboring them, provided linkage disequilibrium between the markers can be assessed.
  • the maps will provide not only the confirmation of the association, but also a shortcut towards the identification of the gene involved in the trait under study.
  • Because the markers showing positive association to the trait are in linkage disequilibrium with the trait loci, the causal gene will be physically located in the vicinity of these markers. Regions identified through association studies using high density maps will on average be 20 to 40 times shorter than those identified by linkage analysis (2 to 20 Mb).
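The order of magnitude implied by this comparison can be worked out directly. The sketch below simply divides the 2 to 20 Mb linkage interval by the 20- to 40-fold reduction; the function and parameter names are hypothetical, and the pairing of the extremes (smallest interval with largest reduction, and vice versa) is an assumption made to bracket the range.

```python
def association_region_range_kb(linkage_min_mb=2, linkage_max_mb=20,
                                fold_min=20, fold_max=40):
    """Bracket the region length (in kb) implied by a 20- to 40-fold reduction."""
    smallest_kb = linkage_min_mb * 1000 / fold_max  # shortest interval, largest reduction
    largest_kb = linkage_max_mb * 1000 / fold_min   # longest interval, smallest reduction
    return smallest_kb, largest_kb

smallest_kb, largest_kb = association_region_range_kb()
print(f"roughly {smallest_kb:.0f} kb to {largest_kb:.0f} kb")
```

On these assumptions the associated regions fall in the range of tens of kilobases up to about one megabase, a scale at which complete sequencing of the relevant BAC inserts becomes practical.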
  • BACs from which the most highly associated markers were derived are completely sequenced and the mutations in the causal gene are searched by applying genomic analysis tools.
  • genomic analysis tools e.g. exons and splice sites, promoters and other regulatory regions
  • trait-positive samples being compared to identify causal mutations are selected among those carrying the ancestral haplotype; in these embodiments, control samples are chosen from individuals not carrying said ancestral haplotype. In further embodiments, trait-positive samples being compared to identify causal mutations are selected among those showing haplotypes that are as close as possible to the ancestral haplotype; in these embodiments, control samples are chosen from individuals not carrying any of the haplotypes selected for the case population.
  • the maps and biallelic markers of the present invention may also be used to identify patterns of biallelic markers associated with detectable traits resulting from polygenic interactions.
  • the analysis of genetic interaction between alleles at unlinked loci requires individual genotyping using the techniques described herein.
  • the analysis of allelic interaction among a selected set of biallelic markers with appropriate p-values can be considered as a haplotype analysis, similar to those described in further details within the present invention.
  • the maps and biallelic markers of the present invention may be used in more targeted approaches for identifying individuals likely to exhibit a particular detectable trait or individuals who exhibit a particular detectable trait as a consequence of possessing a particular allele of a gene associated with the detectable trait.
  • the biallelic markers and maps of the present invention may be used to identify individuals who carry an allele of a known gene that is suspected of being associated with a particular detectable trait.
  • the target genes may be genes having alleles which predispose an individual to suffer from a specific disease state.
  • the target genes may be genes having alleles that predispose an individual to exhibit a desired or undesired response to a drug or other pharmaceutical composition, a food, or any administered compound.
  • the known gene may encode any of a variety of types of biomolecules.
  • the known genes targeted in such analyses may be genes known to be involved in a particular step in a metabolic pathway in which disruptions may cause a detectable trait.
  • the target genes may be genes encoding receptors or ligands which bind to receptors in which disruptions may cause a detectable trait, genes encoding transporters, genes encoding proteins with signaling activities, genes encoding proteins involved in the immune response, genes encoding proteins involved in hematopoiesis, or genes encoding proteins involved in wound healing. It will be appreciated that the target genes are not limited to those specifically enumerated above, but may be any gene known to be or suspected of being associated with a detectable trait.
  • the maps and markers of the present invention may be used to identify genes associated with drug response.
  • the biallelic markers of the present invention may also be used to select individuals for inclusion in the clinical trials of a drug.
  • the markers of SEQ ID Nos.: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto may be used in targeted approaches to identify individuals at risk of developing a detectable trait, for example a complex disease or desired/undesired drug response, or to identify individuals exhibiting said trait.
  • the present invention provides methods to establish putative associations between any of the biallelic markers described herein and any detectable traits, including those specifically described herein.
  • biallelic markers which are in linkage disequilibrium with any of the above disclosed markers may be identified.
  • more biallelic markers in linkage disequilibrium with said associated biallelic markers may be generated and used to perform targeted approaches aiming at identifying individuals exhibiting, or likely to exhibit, said detectable trait, according to the methods provided herein.
  • biallelic markers in linkage disequilibrium with said candidate gene may be identified and used in targeted approaches, such as the approaches utilized above for the asthma-associated gene and the Apo E gene.
  • Example 1: Ordering of a BAC Library. Screening Clones with STSs. The BAC library is screened with a set of PCR-typeable STSs to identify clones containing the STSs. To facilitate PCR screening of several thousand clones, for example 200,000 clones, pools of clones are prepared.
  • Three-dimensional pools of the BAC libraries are prepared as described in Chumakov et al. and are screened for the ability to generate an amplification fragment in amplification reactions conducted using primers derived from the ordered STSs. (Chumakov et al. (1995), supra).
  • a BAC library typically contains 200,000 BAC clones. Since the average size of each insert is 100-300 kb, the overall size of such a library is equivalent to the size of at least about
  • This library is stored as an array of individual clones in 518 384-well plates. It can be divided into 74 primary pools (7 plates each). Each primary pool can then be divided into 48 subpools prepared by using a three-dimensional pooling system based on the plate, row and column address of each clone (more particularly, 7 subpools consisting of all clones residing in a given microtiter plate; 16 subpools consisting of all clones in a given row;
  • the three dimensional pools may be screened with 45,000 STSs whose positions relative to one another and locations along the genome are known.
  • the three dimensional pools are screened with about 30,000 STSs whose positions relative to one another and locations along the genome are known.
  • the three dimensional pools are screened with about 20,000 STSs whose positions relative to one another and locations along the genome are known.
  • Amplification products resulting from the amplification reactions are detected by conventional agarose gel electrophoresis combined with automatic image capturing and processing.
  • PCR screening for a STS involves three steps: (1) identifying the positive primary pools; (2) for each positive primary pool, identifying the positive plate, row and column 'subpools' to obtain the address of the positive clone; (3) directly confirming the PCR assay on the identified clone.
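The address bookkeeping behind the three screening steps can be sketched as follows, using the figures given above (518 plates of 384 wells, grouped into 74 primary pools of 7 plates each). The 16 x 24 well layout is the standard geometry of a 384-well plate; the 0-based indexing and the function names are assumptions made for illustration and do not describe any particular laboratory software.

```python
PLATES_PER_PRIMARY_POOL = 7
ROWS_PER_PLATE, COLS_PER_PLATE = 16, 24  # standard 384-well plate geometry
TOTAL_PLATES = 518

def pools_for_clone(plate, row, col):
    """Map a clone address to (primary pool, plate subpool, row subpool, column subpool).

    All indices are 0-based: plate in 0..517, row in 0..15, col in 0..23.
    """
    primary_pool, plate_subpool = divmod(plate, PLATES_PER_PRIMARY_POOL)
    return primary_pool, plate_subpool, row, col

def clone_address(primary_pool, plate_subpool, row_subpool, col_subpool):
    """Recover the (plate, row, col) address from one positive pool per dimension."""
    plate = primary_pool * PLATES_PER_PRIMARY_POOL + plate_subpool
    return plate, row_subpool, col_subpool

n_primary_pools = TOTAL_PLATES // PLATES_PER_PRIMARY_POOL
print(n_primary_pools)               # number of primary pools
print(pools_for_clone(517, 15, 23))  # pools containing the last clone of the library
```

A single positive clone in a primary pool yields exactly one positive plate, row and column subpool, which is why the identified address is then confirmed directly by PCR on the clone itself.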
  • PCR assays are performed with primers specifically defining the STS. Screening is conducted as follows. First BAC DNA containing the genomic inserts is prepared as follows. Bacteria containing the BACs are grown overnight at 37°C in 120 µl of LB containing chloramphenicol (12 µg/ml). DNA is extracted by the following protocol: Centrifuge 10 min at 4°C and 2000 rpm
  • PCR assays are performed using the following protocol:
  • the amplification is performed on a Genius II thermocycler. After heating at 95°C for 10 min, 40 cycles are performed. Each cycle comprises: 30 sec at 95°C, 54°C for 1 min, and 30 sec at 72°C. For final elongation, 10 min at 72°C end the amplification.
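The programmed duration of this cycling profile can be checked with a short calculation. The sketch below only sums the hold times stated above; it ignores temperature ramping between steps, so the actual run time on any given thermocycler would be longer. The function and parameter names are hypothetical.

```python
def pcr_program_minutes(initial_denaturation_s=600, cycles=40,
                        step_seconds=(30, 60, 30), final_elongation_s=600):
    """Sum of programmed hold times, in minutes (ramp times not included)."""
    total_s = initial_denaturation_s + cycles * sum(step_seconds) + final_elongation_s
    return total_s / 60

# 10 min at 95 C, then 40 cycles of 30 s / 1 min / 30 s, then 10 min at 72 C.
programmed_minutes = pcr_program_minutes()
print(programmed_minutes)
```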
  • PCR products are analyzed on 1% agarose gel with 0.1 mg/ml ethidium bromide.
  • a YAC (Yeast Artificial Chromosome) library can be used.
  • the very large insert size, of the order of 1 megabase, is the main advantage of the YAC libraries.
  • the library can typically include about 33,000 YAC clones as described in Chumakov et al. (1995, supra).
  • the YAC screening protocol may be the same as the one used for BAC screening.
  • BAC insert size may be determined by Pulsed Field Gel Electrophoresis after digestion with the restriction enzyme NotI.
  • BAC clones may cover at least 100kb of contiguous genomic DNA, at least 250kb of contiguous genomic DNA, at least 500kb of contiguous genomic DNA, at least 2Mb of contiguous genomic DNA, at least 5Mb of contiguous genomic DNA, at least 10Mb of contiguous genomic DNA, or at least 20Mb of contiguous genomic DNA.
  • Example 2: Screening BAC libraries with biallelic markers. Amplification primers enabling the specific amplification of DNA fragments carrying the biallelic markers, including the map-related biallelic markers of the invention, may be used to screen clones in any genomic DNA library, preferably the BAC libraries described above, for the presence of the biallelic markers.
  • Pairs of primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 were designed which allow the amplification of fragments carrying the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
  • the amplification primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 may be used to screen clones in a genomic DNA library for the presence of the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to
  • amplification primers for the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 need not be identical to the primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. Rather, they can be any other primers allowing the specific amplification of any DNA fragment carrying the markers and may be designed using techniques familiar to those skilled in the art.
  • the amplification primers may be oligonucleotides of 8, 10, 15, 20 or more bases in length which enable the amplification of any fragment carrying the polymorphic site in the markers.
  • the polymorphic base may be in the center of the amplification product or, alternatively, it may be located off-center.
  • the amplification product produced using these primers may be at least 100 bases in length (i.e. 50 nucleotides on each side of the polymorphic base in amplification products in which the polymorphic base is centrally located). In other embodiments, the amplification product produced using these primers may be at least 500 bases in length (i.e.
  • the amplification product produced using these primers may be at least 1000 bases in length (i.e. 500 nucleotides on each side of the polymorphic base in amplification products in which the polymorphic base is centrally located).
  • Amplification primers such as those described above are included within the scope of the present invention.
  • the BAC clones to be screened are distributed in three dimensional pools as described in Example 1.
  • Amplification reactions are conducted on the pooled BAC clones using primers specific for the biallelic markers to identify BAC clones which contain the biallelic markers, using procedures essentially similar to those described in Example 1.
  • Amplification products resulting from the amplification reactions are detected by conventional agarose gel electrophoresis combined with automatic image capturing and processing.
  • PCR screening for a biallelic marker involves three steps: (1) identifying the positive primary pools; (2) for each positive primary pool, identifying the positive plate, row and column 'subpools' to obtain the address of the positive clone; (3) directly confirming the PCR assay on the identified clone. PCR assays are performed with primers defining the biallelic marker.
  • BAC DNA is isolated as follows. Bacteria containing the genomic inserts are grown overnight at 37°C in 120 µl of LB containing chloramphenicol (12 µg/ml). DNA is extracted by the following protocol:
  • PCR assays are performed using the following protocol:
  • the amplification is performed on a Genius II thermocycler. After heating at 95°C for 10 min, 40 cycles are performed. Each cycle comprises: 30 sec at 95°C, 54°C for 1 min, and 30 sec at 72°C. For final elongation, 10 min at 72°C end the amplification.
  • PCR products are analyzed on 1% agarose gel with 0.1 mg/ml ethidium bromide.
  • Metaphase chromosomes are prepared from phytohemagglutinin (PHA)-stimulated blood cell donors.
  • PHA-stimulated lymphocytes from healthy males are cultured for 72 h in RPMI-1640 medium.
  • methotrexate (10 mM) is added for 17 h, followed by addition of 5-bromodeoxyuridine (5-BudR, 0.1 mM) for 6 h.
  • Colcemid (1 mg/ml) is added for the last 15 min before harvesting the cells.
  • BAC clones carrying the biallelic markers used to construct the maps of the present invention including the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374,
  • BACs or portions thereof including fragments carrying said biallelic markers, obtained for example from amplification reactions using pairs of primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, can be used as probes to be hybridized with metaphasic chromosomes.
  • the hybridization probes to be used in the contemplated method may be generated using alternative methods well known to those skilled in the art.
  • Hybridization probes may have any length suitable for this intended purpose.
  • Probes are then labeled with biotin-16 dUTP by nick translation according to the manufacturer's instructions (Bethesda Research Laboratories, Bethesda, MD), purified using a
  • the slides are treated with proteinase K (10 mg/100 ml in 20 mM Tris-HCl, 2 mM CaCl2) at 37°C for 8 min and dehydrated.
  • the hybridization mixture containing the probe is placed on the slide, covered with a coverslip, sealed with rubber cement and incubated overnight in a humid chamber at 37°C.
  • the biotinylated probe is detected by avidin-FITC and amplified with additional layers of biotinylated goat anti-avidin and avidin-FITC.
  • fluorescent R-bands are obtained as previously described (Cherif et al. (1990), supra). The slides are observed under a LEICA fluorescence microscope (DMRXA). Chromosomes are counterstained with propidium iodide and the fluorescent signal of the probe appears as two symmetrical yellow-green spots on both chromatids of the fluorescent R-band chromosome (red).
  • FIG. 1 is a cytogenetic map of chromosome 21 indicating the subchromosomal regions therein.
  • Amplification primers for generating amplification products containing the polymorphic bases of these markers are also provided in the accompanying Sequence Listing.
  • microsequencing primers for use in determining the identities of the polymorphic bases of these biallelic markers are provided in the accompanying Sequence Listing.
  • the rate at which biallelic markers may be assigned to subchromosomal regions may be enhanced through automation. For example, probe preparation may be performed in a microtiter plate format, using adequate robots. The rate at which biallelic markers may be assigned to subchromosomal regions may be enhanced using techniques which permit the in situ hybridization of multiple probes on a single microscope slide, such as those disclosed in Larin et al., Nucleic Acids Research 22: 3689-3692 (1994). In the largest test format described, different probes were hybridized simultaneously by applying them directly from a 96-well microtiter dish which was inverted on a glass plate.
  • a further benefit of conducting the analysis on one slide is that it facilitates automation, since a microscope having a moving stage and the capability of detecting fluorescent signals in different metaphase chromosomes could provide the coordinates of each probe on the metaphase chromosomes distributed on the 96 well dish.
  • Example 4 describes an alternative method to position biallelic markers which allows their assignment to human chromosomes.
  • Example 4 Assignment of Biallelic Markers to Human Chromosomes The biallelic markers used to construct the maps of the present invention, including the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto, may be assigned to a human chromosome using monosomal analysis as described below.
  • the chromosomal localization of a biallelic marker can be performed through the use of somatic cell hybrid panels. For example 24 panels, each panel containing a different human chromosome, may be used (Russell et al., Somat Cell Mol. Genet 22:425-431 (1996);
  • the biallelic markers are localized as follows.
  • the DNA of each somatic cell hybrid is extracted and purified.
  • Genomic DNA samples from a somatic cell hybrid panel are prepared as follows. Cells are lysed overnight at 42°C with 3.7 ml of lysis solution composed of:
  • the pellet is dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml water.
  • PCR assay is performed on genomic DNA with primers defining the biallelic marker.
  • the PCR assay is performed as described above for BAC screening.
  • the PCR products are analyzed on a 1% agarose gel containing 0.2 mg/ml ethidium bromide.
  • the 3 major isoforms of human Apolipoprotein E (apoE2, -E3, and -E4), as identified by isoelectric focusing, are coded for by 3 alleles (ε2, ε3, and ε4).
  • the ε2, ε3, and ε4 isoforms differ in amino acid sequence at 2 sites, residue 112 (called site A) and residue 158 (called site B).
  • the ancestral isoform of the protein is Apo E3, which at sites A/B contains cysteine/arginine, while ApoE2 and -E4 contain cysteine/cysteine and arginine/arginine, respectively (Weisgraber, K.H. et al., J. Biol. Chem. 256: 9077-9083
  • Apo E ε4 is currently considered a major susceptibility risk factor for Alzheimer's disease development in individuals of different ethnic groups (especially in Caucasians and Japanese compared to Hispanics or African Americans), across all ages between 40 and 90 years, and in both men and women, as reported recently in a study performed on 5930 Alzheimer's disease patients and 8607 controls (Farrer et al., JAMA 278:1349-1356 (1997)). More specifically, the frequency of a C base coding for arginine 112 at site A is significantly increased in Alzheimer's disease patients.
  • biallelic markers that are in the vicinity of the Apo E site A were generated and the association of one of their alleles with Alzheimer's disease was analyzed.
  • An Apo E public marker (stSG94) was used to screen a human genome BAC library as previously described.
  • a BAC which gave a unique FISH hybridization signal on chromosomal region 19q13.2.3, the chromosomal region harboring the Apo E gene, was selected for finding biallelic markers in linkage disequilibrium with the Apo E gene as follows.
  • This BAC contained an insert of 205 kb that was subcloned as previously described. Fifty BAC subclones were randomly selected and sequenced. Twenty-five subclone sequences were selected and used to design twenty-five pairs of PCR primers allowing 500 bp amplicons to be generated. These PCR primers were then used to amplify the corresponding genomic sequences in a pool of DNA from 100 unrelated individuals (blood donors of French origin) as already described.
  • Amplification products from pooled DNA were sequenced and analyzed for the presence of biallelic polymorphisms, as already described.
  • Five amplicons were shown to contain a polymorphic base in the pool of 100 unrelated individuals, and therefore these polymorphisms were selected as random biallelic markers in the vicinity of the Apo E gene.
  • An additional pair of primers (SEQ ID Nos: 3124 and 4169) was designed to amplify the genomic fragment carrying the biallelic polymorphism corresponding to the ApoE marker (99-2452-54; C/T; designated SEQ ID NO: 3914 in the accompanying Sequence Listing; publicly known as Apo E site A (Weisgraber et al. (1981), supra; Rall et al. (1982), supra)).
  • the five random biallelic markers plus the Apo E site A marker were physically ordered by PCR screening of the corresponding amplicons using all available BACs originally selected from the genomic DNA libraries, as previously described, using the public Apo E marker stSG94.
  • the order of the amplicons derived from this BAC screening is as follows: (99-344-439/99-366-274) - (99-365-344/99-2452-54) - 99-359-308 - 99-355-219, where parentheses indicate that the exact order of the respective amplicons could not be established.
  • Linkage disequilibrium among the six biallelic markers (five random markers plus the Apo E site A) was determined by genotyping the same 100 unrelated individuals from whom the random biallelic markers were identified.
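The text does not state which linkage disequilibrium statistic was computed. As a minimal sketch, assuming phased two-marker haplotype frequencies are already available (with unphased genotypes they would first have to be estimated), the common measures D, D' and r² can be computed as follows. The function name and inputs are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Pairwise LD from the frequency of haplotype AB (p_ab) and the
    frequencies of allele A at marker 1 (p_a) and allele B at marker 2 (p_b).

    Returns (D, D', r^2). Both markers must be polymorphic (0 < p < 1).
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both markers must be polymorphic")
    d = p_ab - p_a * p_b
    # D' normalizes D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical frequencies: allele A always travels with allele B.
print(ld_stats(0.5, 0.5, 0.5))
```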
  • DNA samples and amplification products from genomic PCR were obtained under conditions similar to those described above for the generation of biallelic markers, and subjected to automated microsequencing reactions using fluorescent ddNTPs (specific fluorescence for each ddNTP) and the appropriate microsequencing primers having a 3' end immediately upstream of the polymorphic base in the biallelic markers.
  • Once specifically extended at the 3' end by a DNA polymerase using the complementary fluorescent dideoxynucleotide analog (thermal cycling), the microsequencing primer was precipitated to remove the unincorporated fluorescent ddNTPs.
  • the reaction products were analyzed by electrophoresis on ABI 377 sequencing machines. Results were automatically analyzed by appropriate software, further described in Example 8.
  • Alzheimer's disease patients were recruited according to clinical inclusion criteria based on the MMSE test.
  • the 248 control cases included in this study were both ethnically- and age-matched to the affected cases. Both affected and control individuals corresponded to unrelated cases.
  • the identities of the polymorphic bases of each of the biallelic markers were determined in each of these individuals using the methods described above. Techniques for conducting association studies are further described below.
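A single-marker case/control comparison of allele counts can be sketched as a 2x2 chi-square test with one degree of freedom. The text does not say which statistical test was applied, so this is an illustrative assumption, and the allele counts below are invented for the example.

```python
import math


def allele_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 d.f., no continuity correction) on a 2x2 table
    of allele counts: rows are cases/controls, columns are alleles a/b.
    Returns (chi2, p_value)."""
    n = case_a + case_b + ctrl_a + ctrl_b
    denom = (case_a + case_b) * (ctrl_a + ctrl_b) * (case_a + ctrl_a) * (case_b + ctrl_b)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / denom
    # For 1 d.f., the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: allele a is seen 30/100 times in cases, 10/100 in controls.
chi2, p = allele_chi_square(30, 70, 10, 90)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```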
  • linkage disequilibrium between 2 biallelic markers tends to decrease when their inter-marker distance is greater than 50kb, and is further decreased when the inter-marker distance is greater than 75kb. It was further observed that when 2 biallelic markers were further than 150kb apart, most often no significant linkage disequilibrium between them could be evidenced. It will be appreciated that the size and history of the sample population used to measure linkage disequilibrium between markers may influence the distance beyond which linkage disequilibrium tends not to be detectable. Assuming that linkage disequilibrium can be measured between markers spanning regions up to an average of 150kb long, biallelic marker maps will allow genome-wide linkage disequilibrium mapping, provided they have an average inter-marker distance lower than
  • the initial identification of a candidate genomic region harboring a gene associated with a detectable trait may be conducted using a genome-wide map comprising about 20,000 biallelic markers.
  • the candidate genomic region may be further defined using a map having a higher marker density, such as a map comprising about 40,000 markers, about 60,000 markers, about 80,000 markers, about 100,000 markers, or about 120,000 markers.
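The relationship between the marker counts listed above and average inter-marker distance is simple arithmetic. The sketch below assumes a genome size of roughly 3 x 10^9 bp, a figure not stated in this passage, and assumes evenly spread markers.

```python
# Assumption: approximate haploid human genome size, not given in the text.
GENOME_BP = 3_000_000_000


def avg_spacing_kb(n_markers):
    """Average inter-marker distance in kb for n evenly spread markers."""
    return GENOME_BP / n_markers / 1000


for n in (20_000, 40_000, 60_000, 80_000, 100_000, 120_000):
    print(f"{n:>7} markers: about {avg_spacing_kb(n):g} kb between markers")
```

Under this assumption a 20,000-marker map corresponds to the 150 kb average spacing discussed above.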
  • the use of high density maps such as those described above allows the identification of genes which are truly associated with detectable traits, since the coincidental associations will be randomly distributed along the genome while the true associations will map within one or more discrete genomic regions.
  • biallelic markers located in the vicinity of a gene associated with a detectable trait will give rise to broad peaks in graphs plotting the frequencies of the biallelic markers in trait-positive individuals versus control individuals.
  • biallelic markers which are not in the vicinity of the gene associated with the detectable trait will produce unique points in such a plot.
  • Figures 4, 5, and 6 provide a simulated illustration of the above principles.
  • an association analysis conducted with a map comprising about 3,000 biallelic markers yields a group of points.
  • the points become broad peaks indicative of the location of a gene associated with a detectable trait.
  • the biallelic markers used in the initial association analysis may be obtained from a map comprising about 20,000 biallelic markers, as illustrated by the simulation results shown in Figure 5.
  • the association analysis with 3,000 markers suggests peaks near markers 9 and 17.
  • a second analysis is performed using additional markers in the vicinity of markers 9 and 17, as illustrated in the simulated results of Figure 5, using a map of about 20,000 markers. This step again indicates an association in the close vicinity of marker 17, since more markers in this region show an association with the trait. However, none of the additional markers around marker 9 shows a significant association with the trait, which makes marker 9 a potential false positive.
  • one or more of the biallelic markers selected from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto are used in the second analysis.
  • a third analysis may be obtained with a map comprising about 60,000 biallelic markers.
  • one or more of the biallelic markers selected from the group consisting of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto are used in the third association analysis.
  • more markers lying around marker 17 exhibit a high degree of association with the detectable trait.
  • no association is confirmed in the vicinity of marker 9.
  • the genomic region surrounding marker 17 can thus be considered a candidate region for the potential trait of this simulation.
  • marker 99-365-344 that was already found associated with Alzheimer's disease was not included in the haplotype study. Only biallelic markers 99-344-439, 99-355-219, 99-359-308, and 99-366-274, which did not show any significant association with Alzheimer's disease when taken individually, were used.
  • This first haplotype analysis measured frequencies of all possible two-, three-, or four-marker haplotypes in the
  • Haplotype 7 comprises SEQ ID No. 3910 with the T allele of marker 99-366-274, SEQ ID No. 3911 with the G allele of marker 99-359-308 and SEQ ID No. 3912 with the G allele of marker 99-355-219.
  • the haplotype association analysis thus clearly increased the statistical power by more than four orders of magnitude compared to single-marker analysis, from p values > E-01 for the individual markers to a p value < 2 E-06 for the four-marker "haplotype 8". See Table 3.
  • marker 99-365-344 was included in the haplotype analyses.
  • the frequency differences between the affected and non-affected populations were calculated for all two-, three-, four- or five-marker haplotypes involving markers: 99-344-439; 99-355-219; 99-359-308; 99-366-274; and 99-365-344.
  • the most significant p-values obtained in each category of haplotype were examined depending on which markers were involved or not within the haplotype. This showed that all haplotypes which included marker 99-365-344 showed a significant association with Alzheimer's disease (p-values in the range of E-04 to E-11).
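The marker combinations examined in the two haplotype analyses above can be enumerated directly. This sketch only counts the combinations and the haplotypes each can carry; it does not estimate haplotype frequencies, and the function names are hypothetical.

```python
from itertools import combinations

# Marker identifiers are taken from the text.
FIRST_ANALYSIS = ["99-344-439", "99-355-219", "99-359-308", "99-366-274"]
SECOND_ANALYSIS = FIRST_ANALYSIS + ["99-365-344"]


def marker_subsets(markers, min_size=2):
    """All marker combinations of size min_size up to len(markers)."""
    return [
        combo
        for size in range(min_size, len(markers) + 1)
        for combo in combinations(markers, size)
    ]


def possible_haplotypes(subset):
    """Each biallelic marker contributes two alleles, so k markers give 2**k haplotypes."""
    return 2 ** len(subset)


first = marker_subsets(FIRST_ANALYSIS)
second = marker_subsets(SECOND_ANALYSIS)
print(len(first), "marker combinations in the four-marker analysis")
print(len(second), "marker combinations in the five-marker analysis")
print(sum(possible_haplotypes(s) for s in first), "possible haplotypes to test in the first analysis")
```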
  • Example 8 Genotyping of biallelic markers using microsequencing procedures Several microsequencing protocols conducted in liquid phase are well known to those skilled in the art.
  • a first possible detection analysis allowing the allele characterization of the microsequencing reaction products relies on detecting fluorescent ddNTP-extended microsequencing primers after gel electrophoresis.
  • a first alternative to this approach consists in performing a liquid phase microsequencing reaction, the analysis of which may be carried out in solid phase.
  • the microsequencing reaction may be performed using 5'-biotinylated oligonucleotide primers and fluorescein-dideoxynucleotides.
  • the biotinylated oligonucleotide is annealed to the target nucleic acid sequence immediately adjacent to the polymorphic nucleotide position of interest. It is then specifically extended at its 3'-end following a PCR cycle, wherein the labeled dideoxynucleotide analog complementary to the polymorphic base is incorporated.
  • the biotinylated primer is then captured on a microtiter plate coated with streptavidin. The analysis is thus entirely carried out in a microtiter plate format.
  • the incorporated ddNTP is detected by a fluorescein antibody - alkaline phosphatase conjugate.
  • this microsequencing analysis is performed as follows. 20 µl of the microsequencing reaction is added to 80 µl of capture buffer (SSC 2X, 2.5% PEG 8000, 0.25 M Tris pH 7.5, 1.8% BSA, 0.05% Tween 20) and incubated for 20 minutes on a microtiter plate coated with streptavidin (Boehringer). The plate is rinsed once with washing buffer (0.1 M Tris pH 7.5, 0.1 M NaCl, 0.1% Tween 20). 100 µl of anti-fluorescein antibody conjugated with alkaline phosphatase, diluted 1/5000 in washing buffer containing 1.8% BSA, is added to the microtiter plate.
  • the antibody is incubated on the microtiter plate for 20 minutes. After washing the microtiter plate four times, 100 µl of 4-methylumbelliferyl phosphate (Sigma) diluted to 0.4 mg/ml in 0.1 M diethanolamine pH 9.6, 10 mM MgCl2 are added. The detection of the microsequencing reaction is carried out on a fluorimeter (Dynatech) after 20 minutes of incubation.
  • solid phase microsequencing reactions have been developed, for which either the oligonucleotide microsequencing primers or the PCR-amplified products derived from the DNA fragment of interest are immobilized.
  • immobilization can be carried out via an interaction between biotinylated DNA and streptavidin-coated microtitration wells or avidin-coated polystyrene particles.
  • the PCR reaction generating the amplicons to be genotyped can be performed directly in solid phase conditions, following procedures such as those described in WO 96/13609.
  • incorporated ddNTPs can either be radiolabeled (see Syvanen, Clin. Chim. Acta 226:225-236 (1994)) or linked to fluorescein (see Livak and Hainer, Hum. Mutat. 3:379-385 (1994)).
  • the detection of radiolabeled ddNTPs can be achieved through scintillation-based techniques.
  • the detection of fluorescein-linked ddNTPs can be based on the binding of antifluorescein antibody conjugated with alkaline phosphatase, followed by incubation with a chromogenic substrate (such as p-nitrophenyl phosphate).
  • DNP (dinitrophenyl)
  • anti-DNP alkaline phosphatase conjugate (see Harju et al., Clin. Chem. 39(11 Pt 1):2282-2287 (1993))
  • Nyren et al. (Anal. Biochem. 208:171-175 (1993)) have described a solid-phase DNA sequencing procedure that relies on the detection of DNA polymerase activity by an enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA).
  • the PCR-amplified products are biotinylated and immobilized on beads.
  • the microsequencing primer is annealed and four aliquots of this mixture are separately incubated with DNA polymerase and one of the four different ddNTPs. After the reaction, the resulting fragments are washed and used as substrates in a primer extension reaction with all four dNTPs present.
  • the progress of the DNA-directed polymerization reactions is monitored with the ELIDA.
  • Incorporation of a ddNTP in the first reaction prevents the formation of pyrophosphate during the subsequent dNTP reaction.
  • if no ddNTP is incorporated in the first reaction, extensive pyrophosphate is released during the dNTP reaction, which generates light throughout the ELIDA reaction. From the ELIDA results, the identity of the first base after the primer is easily deduced.
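The deduction step of the luminometric pyrophosphate assay can be sketched as a lookup over the four ddNTP aliquots: the aliquot that stays dark in the subsequent dNTP reaction is the one whose ddNTP terminated the primer. The function name and the boolean read-out are hypothetical simplifications, and the sketch assumes a single template allele.

```python
def first_base_after_primer(light_detected):
    """light_detected maps each ddNTP ('A', 'C', 'G', 'T') to True if light was
    generated in the subsequent dNTP extension reaction of that aliquot.

    Incorporation of a ddNTP blocks extension, so no pyrophosphate is released
    and no light is seen. Returns the single ddNTP whose aliquot stayed dark.
    """
    if set(light_detected) != {"A", "C", "G", "T"}:
        raise ValueError("expected one read-out per ddNTP aliquot")
    dark = [base for base, lit in light_detected.items() if not lit]
    if len(dark) != 1:
        raise ValueError("ambiguous result: expected exactly one dark aliquot")
    return dark[0]


# Hypothetical read-out: only the ddGTP aliquot produced no light.
print(first_base_after_primer({"A": True, "C": True, "G": False, "T": True}))
```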
  • Example 9 Sequence Analysis DNA sequences, such as BAC inserts, containing the region carrying the candidate gene associated with the detectable trait are sequenced and their sequence is analyzed using automated software which eliminates repeat sequences while retaining potential gene sequences.
  • the potential gene sequences are compared to numerous databases to identify potential exons using a set of scoring algorithms such as trained Hidden Markov Models, statistical analysis models (including promoter prediction tools) and the GRAIL neural network.
  • Preferred databases for use in this analysis, the construction and use of which are further detailed in Example 17, include the following:
  • NRPU (Non-Redundant Protein-Unique database): NRPU is a non-redundant merge of the publicly available NBRF/PIR, Genpept, and SwissProt databases. Homologies found with NRPU allow the identification of regions potentially coding for already known proteins or related to known proteins (translated exons).
  • NREST (Non-Redundant EST database): NREST is a merge of the EST subsection of the publicly available GenBank database. Homologies found with NREST allow the location of potentially transcribed regions (translated or non-translated exons).
  • NRN (Non-Redundant Nucleic acid database): NRN is a merge of GenBank, EMBL and their daily updates.
  • Any sequence giving a positive hit with NRPU or NREST, or an "excellent" score using GRAIL and/or other scoring algorithms, is considered a potential functional region, and is then considered a candidate for genomic analysis.
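The selection rule stated above reduces to a simple predicate. In this sketch the record layout and field names are hypothetical; only the rule itself (a hit in NRPU or NREST, or an "excellent" exon-prediction score) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class RegionEvidence:
    """Hypothetical summary of search results for one genomic sequence."""
    nrpu_hit: bool        # positive homology hit in the protein database
    nrest_hit: bool       # positive homology hit in the EST database
    exon_score: str       # label from GRAIL or another scoring algorithm


def is_potential_functional_region(evidence):
    """Apply the rule: any database hit or an 'excellent' score qualifies."""
    return evidence.nrpu_hit or evidence.nrest_hit or evidence.exon_score == "excellent"


print(is_potential_functional_region(RegionEvidence(False, True, "marginal")))
```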
  • genes associated with detectable traits may be identified.
  • Example 10 YAC Contig Construction in the Candidate Genomic Region
  • genes associated with distinct cancer types are located within a particular region of the human genome. More specifically, this region was likely to harbor a gene associated with prostate cancer.
  • a YAC contig which contains the candidate genomic region was constructed as follows.
  • the CEPH-Genethon YAC map for the entire human genome (Chumakov et al. (1995), supra) was used for detailed contig building in the genomic region containing genetic markers known to map in the candidate genomic region.
  • Screening data available for several publicly available genetic markers were used to select a set of CEPH YACs localized within the candidate region.
  • This set of YACs was tested by PCR with the above mentioned genetic markers as well as with other publicly available markers supposedly located within the candidate region.
  • a YAC STS contig map was generated around genetic markers known to map in this genomic region.
  • Two CEPH YACs were found to constitute a minimal tiling path in this region, with an estimated size of ca. 2 Megabases.
  • Example 11 describes the identification of sets of biallelic markers within the candidate genomic region.
  • Example 11 BAC contig construction and Biallelic Markers isolation within the candidate chromosomal region.
  • A BAC contig covering the candidate genomic region was constructed as follows. BAC libraries were obtained as described in Woo et al., Nucleic Acids Res. 22:4922-
  • Figure 9 shows the locations of the biallelic markers along the BAC contig. This first set of markers corresponds to a medium density map of the candidate locus, with an intermarker distance averaging 50kb-150kb.
  • a second set of biallelic markers was then generated as described above in order to provide a very high-density map of the region identified using the first set of markers which can be used to conduct association studies, as explained below.
  • This very high density map has markers spaced on average every 2-50kb.
  • DNA samples were obtained from individuals suffering from prostate cancer and unaffected individuals as described in Example 12.
  • Example 12 Collection of DNA Samples from Affected and Non-affected Individuals
  • Prostate cancer patients were recruited according to clinical inclusion criteria based on pathological or radical prostatectomy records.
  • Control cases included in this study were both ethnically- and age-matched to the affected cases; they were checked for both the absence of all clinical and biological criteria defining the presence or the risk of prostate cancer, and for the absence of related familial prostate cancer cases. Both affected and control individuals were all unrelated.
  • the two following groups of independent individuals were used in the association studies.
  • the first group comprising individuals suffering from prostate cancer, contained 185 individuals. Of these 185 cases of prostate cancer, 47 cases were sporadic and 138 cases were familial.
  • the control group contained 104 non-diseased individuals.
  • Haplotype analysis was conducted using additional diseased (total samples: 281) and control samples (total samples: 130), from individuals recruited according to similar criteria. DNA was extracted from peripheral venous blood of all individuals as described in related WIPO application No. PCT/IB98/00193.
  • Example 13 Genotyping Affected and Control Individuals Genotyping was performed using the following microsequencing procedure.
  • Amplification was performed on each DNA sample using primers designed as previously explained.
  • the pairs of primers of SEQ ID Nos.: 7849 to 7860 and 11780 to 11791 were used to generate amplicons harboring the biallelic markers of SEQ ID Nos: 3915 to 3926 or the sequences complementary thereto (markers 99-123-381, 4-26-29, 4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, 99-135-196, 99-1482-32, 4-73-134, and 4-65-324) using the protocols described in related WIPO application No. PCT/IB98/00193.
  • Microsequencing primers were designed for each of the biallelic markers, as previously described. After purification of the amplification products, the microsequencing reaction mixture was prepared by adding, in a 20 µl final volume: 10 pmol microsequencing oligonucleotide, 1 U Thermosequenase (Amersham E79000G), 1.25 µl Thermosequenase buffer (260 mM Tris HCl pH 9.5, 65 mM MgCl2), and the two appropriate fluorescent ddNTPs (Perkin Elmer, Dye Terminator Set 401095) complementary to the nucleotides at the polymorphic site of each biallelic marker tested, following the manufacturer's recommendations.
  • the software evaluates such factors as whether the intensities of the signals resulting from the above microsequencing procedures are weak, normal, or saturated, or whether the signals are ambiguous.
  • the software identifies significant peaks (according to shape and height criteria). Among the significant peaks, peaks corresponding to the targeted site are identified based on their position. When two significant peaks are detected for the same position, each sample is categorized as homozygous or heterozygous based on the height ratio.
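The height-ratio classification described above can be sketched as follows. The thresholds and the intermediate "ambiguous" category are illustrative assumptions; the actual criteria used by the software are not given in the text.

```python
def call_genotype(height_1, height_2, het_min_ratio=0.5, hom_max_ratio=0.2):
    """Classify a sample from the heights of the two allele-specific peaks
    detected at the targeted position.

    Thresholds are hypothetical. The ratio is smaller peak / larger peak.
    """
    larger = max(height_1, height_2)
    if larger <= 0:
        return "no call"
    ratio = min(height_1, height_2) / larger
    if ratio >= het_min_ratio:
        return "heterozygous"      # both alleles give comparable signal
    if ratio <= hom_max_ratio:
        return "homozygous"        # one allele dominates
    return "ambiguous"             # between thresholds: flag for review


print(call_genotype(1000, 900))
print(call_genotype(1000, 50))
```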
  • the position of the gene responsible for prostate cancer was further refined using the very high density set of markers including the markers of SEQ ID Nos: 3915 to 3926 or the sequences complementary thereto (markers 99-123-381, 4-26-29, 4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, 99-135-196, 99-1482-32, 4-
  • the second phase of the analysis confirmed that the gene responsible for prostate cancer was near the biallelic marker designated 4-67-40, most probably within a ca. 150kb region comprising the marker.
  • a haplotype analysis was also performed as described in Example 15.
  • Example 15 Haplotype analysis The allelic frequencies of each of the alleles of biallelic markers 99-123-381, 4-26-29, 4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, and 99-135-196 were determined in the affected and unaffected populations.
  • Table 4 lists the internal identification numbers of the markers used in the haplotype analysis (SEQ ID Nos: 3915-3923), the alleles of each marker, the most frequent allele in both unaffected individuals and individuals suffering from prostate cancer, the least frequent allele in both unaffected individuals and individuals suffering from prostate cancer, and the frequencies of the least frequent alleles in each population.
  • Figures 11 and 12 aggregate association analysis results with sequencing results - generated following the procedures further described in Example 16, which permitted the physical order and the distance between markers to be estimated.
  • Example 16 Identification of the Genomic Sequence in the Candidate Region Template DNA for sequencing the PG1 gene was obtained as follows. BACs E and F from Fig. 9 were subcloned as previously described. Plasmid inserts were first amplified by PCR on PE 9600 thermocyclers (Perkin-Elmer), using appropriate primers, AmpliTaqGold (Perkin-
  • PCR products were then sequenced using automatic ABI Prism 377 sequencers (Perkin Elmer, Applied Biosystems Division, Foster City, CA). Sequencing reactions were performed using PE 9600 thermocyclers (Perkin Elmer) with standard dye-primer chemistry and ThermoSequenase (Amersham Life Science). The primers were labeled with the JOE, FAM, ROX and TAMRA dyes.
  • the dNTPs and ddNTPs used in the sequencing reactions were purchased from Boehringer. Sequencing buffer, reagent concentrations and cycling conditions were as recommended by Amersham. Following the sequencing reaction, the samples were precipitated with EtOH, resuspended in formamide loading buffer, and loaded on a standard 4% acrylamide gel. Electrophoresis was performed for 2.5 hours at 3000V on an ABI 377 sequencer, and the sequence data were collected and analyzed using the ABI Prism DNA Sequencing Analysis Software, version 2.1.2.
  • sequence data obtained as described above were transferred to a proprietary database, where quality control and validation steps were performed.
  • a proprietary base-caller flagged suspect peaks, taking into account the shape of the peaks, the inter-peak resolution, and the noise level.
  • the proprietary base-caller also performed an automatic trimming. Any stretch of 25 or fewer bases having more than 4 suspect peaks was considered unreliable and was discarded.
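The base-caller is proprietary, so the following is only one possible reading of the trimming rule: whenever five suspect peaks (more than 4) fall within a stretch of at most 25 bases, the stretch from the first to the last of them is marked unreliable. Function and parameter names are hypothetical.

```python
def unreliable_mask(suspect, max_len=25, max_suspect=4):
    """suspect is a list of booleans, one per called base (True = suspect peak).

    Returns a list of booleans marking bases that fall inside a stretch of at
    most max_len bases containing more than max_suspect suspect peaks.
    """
    positions = [i for i, flagged in enumerate(suspect) if flagged]
    mask = [False] * len(suspect)
    needed = max_suspect + 1
    # Look at every run of `needed` consecutive suspect peaks.
    for k in range(len(positions) - needed + 1):
        start, end = positions[k], positions[k + needed - 1]
        if end - start + 1 <= max_len:
            for i in range(start, end + 1):
                mask[i] = True
    return mask


# Hypothetical read of 50 bases with five suspect peaks clustered at 10..18.
flags = [False] * 50
for pos in (10, 12, 14, 16, 18):
    flags[pos] = True
print(sum(unreliable_mask(flags)), "bases marked unreliable")
```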
  • the sequence fragments from BAC subclones isolated as described above were assembled using Gap4 software from R. Staden (Bonfield et al. 1995). This software allows the reconstruction of a single sequence from sequence fragments. The sequence deduced from the alignment of different fragments is called the consensus sequence. Directed sequencing techniques (primer walking) were used to complete sequences and link contigs.
  • Example 17 Identification of Functional Sequences Potential exons in BAC-derived human genomic sequences were located by homology searches on protein, nucleic acid and EST (Expressed Sequence Tags) public databases. Main public databases were locally reconstructed as mentioned in Example 9.
  • the protein database, NRPU (Non-redundant Protein Unique), is a non-redundant merge of Genpept (Benson et al., Nucleic Acids Res. 24:1-5 (1996)), Swissprot (Bairoch, A. and Apweiler, R., Nucleic Acids Res. 24:21-25 (1996)), and PIR/NBRF (George et al., Nucleic Acids Res. 24:17-20 (1996)).
  • the EST local database is composed of the gbest section (1-9) of GenBank (Benson et al. (1996), supra), and thus contains all publicly available transcript fragments. Homologies found with this database allowed the localization of potentially transcribed regions.
  • the local nucleic acid database contained all sections of GenBank and EMBL (Rodriguez-Tome et al., Nucleic Acids Res. 24:6-12 (1996)) except the EST sections. Redundant data were eliminated as previously described. Similarity searches in protein or nucleic acid databases were performed using the
  • Example 18 Analysis of Gene Structure
  • the intron/exon structure of the gene was finally deduced by aligning the mRNA sequence from the cDNA obtained as described above with the genomic DNA sequence obtained as described above. This alignment permitted the positions of the introns and exons, the positions of the start and end nucleotides defining each of the at least 8 exons, the locations and phases of the 5' and 3' splice sites, the position of the stop codon, and the position of the polyadenylation site to be determined in the genomic sequence.
  • This analysis also yielded the positions of the coding region in the mRNA, and the locations of the polyadenylation signal and polyA stretch in the mRNA.
  • the gene identified as described above comprises at least 8 exons and spans more than 52kb.
  • a G/C rich putative promoter region was identified upstream of the coding sequence.
  • a CCAAT in the putative promoter was also identified.
  • the promoter region was identified as described in Prestridge, D.S., Predicting Pol II Promoter Sequences Using Transcription Factor Binding Sites, J. Mol. Biol. 249:923-932 (1995).
  • Additional analysis using conventional techniques, such as a 5' RACE reaction using the Marathon-Ready human prostate cDNA kit from Clontech (Catalog No. PT1156-1), may be performed to confirm that the 5' end of the cDNA obtained above is the authentic 5' end in the mRNA.
  • the 5' sequence of the transcript can be determined by conducting a PCR amplification with a series of primers extending from the 5' end of the identified coding region.
  • Detection of biallelic markers in the candidate gene. DNA extraction. Donors were unrelated and healthy. They were sufficiently diverse to be representative of a heterogeneous French population. The DNA from 100 individuals was extracted and tested for the detection of the biallelic markers. 30 ml of peripheral venous blood were taken from each donor in the presence of EDTA.
  • Cells (pellet) were collected after centrifugation for 10 minutes at 2000 rpm. Red cells were lysed by a lysis solution (50 ml final volume: 10 mM Tris pH 7.6; 5 mM MgCl2; 10 mM NaCl). The solution was centrifuged (10 minutes, 2000 rpm) as many times as necessary to eliminate the residual red cells present in the supernatant, after resuspension of the pellet in the lysis solution. The pellet of white cells was lysed overnight at 42°C with 3.7 ml of lysis solution composed of:
  • TE 10-2 (Tris-HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M - 200 µl SDS 10% - 500 µl K-proteinase (2 mg K-proteinase in TE 10-2 / NaCl 0.4 M).
  • The pellet was dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml water.
  • The OD 260 / OD 280 ratio was determined. Only DNA preparations having an OD 260 / OD 280 ratio between 1.8 and 2 were used in the subsequent examples described below.
  • The pool was constituted by mixing equivalent quantities of DNA from each individual.
  • The amplification of specific genomic sequences of the DNA samples of Example 19 was carried out on the pool of DNA obtained previously using the amplification primers of SEQ ID Nos: 7861 to 7865 and 11792 to 11796. In addition, 50 individual samples were similarly amplified.
  • Pairs of first primers were designed to amplify the promoter region, exons, and 3' end of the candidate asthma-associated gene using the sequence information of the candidate gene and the OSP software (Hillier & Green, 1991). These first primers were about 20 nucleotides in length and contained a common oligonucleotide tail upstream of the specific bases targeted for amplification which was useful for sequencing. The synthesis of these primers was performed following the phosphoramidite method, on a GENSET UFPS 24.1 synthesizer.
  • DNA amplification was performed on a Genius II thermocycler. After heating at 94°C for 10 min, 40 cycles were performed. Each cycle comprised: 30 sec at 94 °C, 55°C for 1 min, and 30 sec at 72°C. For final elongation, 7 min at 72°C ended the amplification.
  • The quantities of the amplification products obtained were determined on 96-well microtiter plates, using a fluorometer and Picogreen as intercalating agent (Molecular Probes).
  • The sequencing of the amplified DNA obtained in Example 20 was carried out on ABI 377 sequencers.
  • The sequences of the amplification products were determined using automated dideoxy terminator sequencing reactions with a dye terminator cycle sequencing protocol.
  • The products of the sequencing reactions were run on sequencing gels and the sequences were analyzed as previously described.
  • The sequence data were further evaluated using the above mentioned polymorphism analysis software designed to detect the presence of biallelic markers among the pooled amplified fragments.
  • The polymorphism search was based on the presence of superimposed peaks in the electrophoresis pattern resulting from different bases occurring at the same position as described previously.
  • The fourth fragment of amplification carrying exon 3 (not shown in the Table) was not polymorphic in the tested samples (1 pool + 50 individuals).
  • Example 21 The biallelic markers identified in Example 21 were further confirmed and their respective frequencies were determined through microsequencing. Microsequencing was carried out for each individual DNA sample described in Example 19.
  • Amplification from genomic DNA of individuals was performed by PCR as described above for the detection of the biallelic markers with the same set of PCR primers described above.
  • The preferred primers used in microsequencing were about 19 nucleotides in length and hybridized just upstream of the considered polymorphic base.
  • The microsequencing reaction was performed as described in Example 13.
  • Example 23 Association study between asthma and the biallelic markers of the candidate gene Collection of DNA samples from affected and non-affected individuals
  • The asthmatic population used to perform association studies in order to establish whether the candidate gene was an asthma-causing gene consisted of 298 individuals. More than 90% of these 298 asthmatic individuals had a Caucasian ethnic background.
  • The control population consisted of 373 unaffected individuals, of whom 279 were French (at least 70% were of Caucasian origin) and 94 were American (at least 90% were of Caucasian origin). DNA samples were obtained from asthmatic and non-asthmatic individuals as described above.
  • Example 24 Association study between asthma and the biallelic markers of the candidate gene Genotyping of affected and control individuals
  • The general strategy to perform the association studies was to individually scan the DNA samples from all individuals in each of the populations described above in order to establish the allele frequencies of the above described biallelic markers in each of these populations.
  • Allelic frequencies of the above-described biallelic markers in each population were determined by performing microsequencing reactions on amplified fragments obtained by genomic PCR performed on the DNA samples from each individual. Genomic PCR and microsequencing were performed as detailed above in Examples 20 and 22 using the described amplification and microsequencing primers.
  • Example 25 Association study between asthma and the biallelic markers of the candidate gene
  • Table 6 shows the results of the association study between five biallelic markers in the candidate gene and asthma.
  • allelic frequencies for each of the biallelic markers of Table 7 were separately measured within the French control population (279 individuals) and the American control population (94 individuals). The differences in allele frequencies between the two populations were between 1% and 7%, with p-values above 10 " '. These data confirmed that the combined French American control population (373 individuals) was homogeneous enough to be used as a control population for the present association study.
  • Haplotype association analysis. As already shown, one way of increasing the statistical power of individual markers is to perform haplotype association analysis.
  • A haplotype analysis for association of markers in the candidate gene and asthma was performed by estimating the frequencies of all possible haplotypes for biallelic markers 10-32-357, 10-33-234, 10-33-327, 10-35-358 and 10-35-390 in the asthmatic and control populations described in Example 25 (Table 6), and comparing these frequencies by means of a chi square statistical test (one degree of freedom).
  • Haplotype estimations were performed by applying the Expectation-Maximization (EM) algorithm (Excoffier L & Slatkin M, 1995, Mol. Biol. Evol. 12:921-927), using the EM-HAPLO program (Hawley ME, Pakstis AJ & Kidd KK, 1994, Am. J. Phys. Anthropol. 18:104).
  • A two-marker haplotype covering markers 10-32-357 and 10-35-390 presented a p value of 8.47×10⁻⁶, an odds ratio of 2.02 and haplotype frequencies of 0.2 for asthmatic and 0.11 for control populations respectively.
  • A three-marker haplotype covering markers 10-33-234, 10-33-327 and 10-35-358 presented a p value of 2.81×10⁻⁴, an odds ratio of 1.68 and haplotype frequencies of 0.27 for asthmatic and 0.18 for control populations respectively.
  • A five-marker haplotype covering markers 10-32-357, 10-33-234, 10-33-327, 10-35-358 and 10-35-390 presented a p value of 3.95×10⁻⁵, an odds ratio of 2.22 and haplotype frequencies of 0.18 for asthmatic and 0.09 for control populations respectively.
  • Haplotype association analysis thus increased the statistical power of the individual marker association studies when compared to single-marker analysis (from p values between 10 " ' and 2×10⁻⁵ for the individual markers to p values between 3×10⁻⁴ and 8×10⁻⁶ for the three-marker haplotype, haplotype 2).
  • The significance of the values obtained for the haplotype association analysis was evaluated by the following computer simulation test.
  • The genotype data from the asthmatic and control individuals were pooled and randomly allocated to two groups which contained the same number of individuals as the trait-positive and trait-negative groups used to produce the data summarized in Table 7.
  • A haplotype analysis was then run on these artificial groups for the three haplotypes presented in Table 6. This experiment was repeated 1000 times and the results are shown in Table 8.
  • The pellet of white cells is lysed overnight at 42°C with 3.7 ml of lysis solution composed of:
  • TE 10-2 (Tris-HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M - 200 µl SDS 10% - 500 µl K-proteinase (2 mg K-proteinase in TE 10-2 / NaCl 0.4 M).
  • 1 ml saturated NaCl (6M) (1/3.5 v/v) is added. After vigorous agitation, the solution is centrifuged for 20 minutes at 10000 rpm.
  • 2 to 3 volumes of 100% ethanol are added to the previous supernatant, and the solution is centrifuged for 30 minutes at 2000 rpm.
  • The DNA solution is rinsed three times with 70% ethanol to eliminate salts, and centrifuged for 20 minutes at 2000 rpm.
  • The pellet is dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml water.
  • The OD 260 / OD 280 ratio is determined. Only DNA preparations having an OD 260 / OD 280 ratio between 1.8 and 2 are used in the subsequent steps described below.
  • Once genomic DNA from every individual in the given population has been extracted, it is preferred that a fraction of each DNA sample is separated, after which a pool of DNA is constituted by assembling equivalent DNA amounts of the separated fractions into a single one.
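As a consistency check on the haplotype association results listed above, the reported odds ratios can be recomputed from the stated haplotype frequencies in the asthmatic and control populations. The sketch below assumes the usual definition of an odds ratio (odds of carrying the haplotype in cases divided by the odds in controls); the function name `odds_ratio` is illustrative and does not come from the original disclosure. Because the published frequencies are rounded to two decimals, agreement is only approximate.

```python
def odds_ratio(f_case: float, f_control: float) -> float:
    """Odds of carrying the haplotype in cases divided by the odds in controls."""
    return (f_case / (1.0 - f_case)) / (f_control / (1.0 - f_control))


# (case frequency, control frequency, odds ratio reported in the text)
reported = {
    "two-marker": (0.20, 0.11, 2.02),
    "three-marker": (0.27, 0.18, 1.68),
    "five-marker": (0.18, 0.09, 2.22),
}

for name, (f_case, f_ctrl, published) in reported.items():
    recomputed = odds_ratio(f_case, f_ctrl)
    print(f"{name}: recomputed {recomputed:.2f}, reported {published}")
```

The recomputed values fall within rounding error of the reported ones, which suggests the odds ratios were derived from the haplotype frequencies in this way.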

Abstract

The present invention relates to genomic maps comprising biallelic markers, new biallelic markers, and methods of using biallelic markers. Primers hybridizing to regions flanking these biallelic markers are also provided. This invention provides polynucleotides and methods suitable for genotyping a nucleic acid containing sample for one or more biallelic markers of the invention. Further, the invention provides a number of methods utilizing the biallelic markers of the invention including methods to detect a statistical correlation between a biallelic marker allele and a phenotype and/or between a biallelic marker haplotype and a phenotype.

Description

BIALLELIC MARKERS FOR USE IN CONSTRUCTING A HIGH DENSITY DISEQUILIBRIUM MAP OF THE HUMAN GENOME
Background of the Invention
Recent advances in genetic engineering and bioinformatics have enabled the manipulation and characterization of large portions of the human genome. While efforts to obtain the full sequence of the human genome are rapidly progressing, there are many practical uses for genetic information which can be implemented with partial knowledge of the sequence of the human genome.
As the full sequence of the human genome is assembled, the partial sequence information available can be used to identify genes responsible for detectable human traits, such as genes associated with human diseases, and to develop diagnostic tests capable of identifying individuals who express a detectable trait as the result of a specific genotype or individuals whose genotype places them at risk of developing a detectable trait at a subsequent time. Each of these applications for partial genomic sequence information is based upon the assembly of genetic and physical maps which order the known genomic sequences along the human chromosomes.
The present invention relates to an ordered set of human genomic sequences comprising single nucleotide polymorphisms, as well as the use of these polymorphisms as a high resolution map of the human genome, methods of identifying genes associated with detectable human traits, and diagnostics for identifying individuals who carry a gene which causes them to express a detectable trait or which places them at risk of expressing a detectable trait in the future.
Advantages of the biallelic markers of the present invention
The map-related biallelic markers of the present invention offer a number of important advantages over other genetic markers such as RFLP (Restriction fragment length polymorphism), VNTR (Variable Number of Tandem Repeats) markers and earlier STS (sequence tagged site)-derived markers.
The first generation of markers were RFLPs, which are variations that modify the length of a restriction fragment. However, methods used to identify and to type RFLPs are relatively wasteful of materials, effort, and time. Since they are biallelic markers (they present only two alleles, the restriction site being either present or absent), their maximum heterozygosity is 0.5. The theoretical number of RFLPs distributed along the entire human genome is more than 10⁵, which leads to a potential average inter-marker distance of 30 kilobases. However, in reality the number of evenly distributed RFLPs which occur at a sufficient frequency in the population to make them useful for tracking of genetic polymorphisms is very limited.
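The two figures quoted in this paragraph follow from short calculations. Assuming Hardy-Weinberg proportions (an assumption made here for illustration), a biallelic marker with allele frequencies p and 1-p has expected heterozygosity H, which cannot exceed 0.5. The 30 kilobase inter-marker distance, taken over a genome of about 3×10⁹ base pairs (the figure used elsewhere in this description), corresponds to about 10⁵ evenly spaced markers:

```latex
H = 2p(1-p) \le \tfrac{1}{2}, \quad \text{with equality at } p = \tfrac{1}{2};
\qquad
\frac{3\times 10^{9}\ \text{bp}}{3\times 10^{4}\ \text{bp per marker}} = 10^{5}\ \text{markers}.
```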
The second generation of genetic markers were VNTRs, which can be categorized as either minisatellites or microsatellites. Minisatellites are tandemly repeated DNA sequences present in units of 5-50 repeats which are distributed along regions of the human chromosomes ranging from 0.1 to 20 kilobases in length. Since they present many possible alleles, their informative content is very high. Minisatellites are scored by performing Southern blots to identify the number of tandem repeats present in a nucleic acid sample from the individual being tested. However, there are only 10 potential VNTRs that can be typed by Southern blotting. Thus, the number of easily typed informative markers in these maps is far too small for the average distance between informative markers to fulfill the requirements for a useful genetic map. Moreover, both RFLP and VNTR markers are costly and time-consuming to develop and assay in large numbers.
Initial attempts to construct genetic maps based on non-RFLP biallelic markers have focused on identifying biallelic markers lying within sequence tagged sites (STS), pieces of genomic DNA having a known sequence and averaging about 250 bases in length. More than 30,000 STSs have been identified and ordered along the genome (Hudson et al., Science 270:1945-1954 (1995); Schuler et al., Science 274:540-546 (1996)). For example, the Whitehead Institute and Genethon's integrated map contains 15,086 STSs. These sequence tagged sites can be screened to identify polymorphisms, preferably Single Nucleotide Polymorphisms (SNPs), more preferably non-RFLP biallelic markers therein. Generally polymorphisms are identified by determining the sequence of the STSs in 5 to 10 individuals.
Wang et al. (Cold Spring Harbor Laboratory: Abstracts of papers presented on Genome Mapping and Sequencing -p. \1 (May 14-18, 1997)) recently announced the identification and mapping of 750 Single Nucleotide Polymorphisms issued from the sequencing of 12,000 STSs from the Whitehead/MIT map, in eight unrelated individuals. The map was assembled using a high throughput system based on the utilization of DNA chip technology available from Affymetrix (Chee et al., Science 274:610-614 (1996)). However, according to experimental data and statistical calculations, less than one out of 10 of all STSs mapped today will contain an informative Single Nucleotide Polymorphism. This is primarily due to the short length of existing STSs (usually less than 250 bp). If one assumes 10⁶ informative SNPs spread along the human genome, there would on average be one marker of interest every 3×10⁹/10⁶, i.e. every 3,000 bp. The probability that one such marker is present on a 250 bp stretch is thus less than 1/10. While it could produce a high density map, the STS approach based on currently existing markers does not put any systematic effort into making sure that the markers obtained are optimally distributed throughout the entire genome. Instead, polymorphisms are limited to those locations for which STSs are available. The even distribution of markers along the chromosomes is critical to the future success of genetic analyses. In particular, a high density map having appropriately spaced markers is essential for conducting association studies on sporadic cases, aiming at identifying genes responsible for detectable traits such as those which are described below.
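The estimate that fewer than one STS in ten carries an informative SNP can be reproduced with the numbers given in the paragraph above. The following is a minimal sketch; the Poisson step assumes that markers fall independently and uniformly along the genome, which is an illustrative assumption and not a statement from the source.

```python
import math

GENOME_BP = 3e9          # approximate human genome length used in the text
INFORMATIVE_SNPS = 1e6   # number of informative SNPs assumed in the text
STS_LENGTH_BP = 250      # typical STS length quoted in the text

# Average distance between informative SNPs: one every 3,000 bp.
mean_spacing_bp = GENOME_BP / INFORMATIVE_SNPS

# Expected number of informative SNPs on one 250 bp STS.
expected_per_sts = STS_LENGTH_BP / mean_spacing_bp

# Assumption: markers are placed independently and uniformly (Poisson model).
p_at_least_one = 1.0 - math.exp(-expected_per_sts)

print(f"mean spacing: {mean_spacing_bp:.0f} bp")
print(f"expected markers per STS: {expected_per_sts:.3f}")
print(f"P(at least one marker on an STS): {p_at_least_one:.3f}")
```

Both the simple ratio and the Poisson probability come out below 0.1, consistent with the "less than 1/10" figure.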
As will be further explained below, genetic studies have mostly relied in the past on a statistical approach called linkage analysis, which took advantage of microsatellite markers to study their inheritance pattern within families from which a sufficient number of individuals presented the studied trait. Because of intrinsic limitations of linkage analysis, which will be further detailed below, and because these studies necessitate the recruitment of adequate family pedigrees, they are not well suited to the genetic analysis of all traits, particularly those for which only sporadic cases are available (e.g. drug response traits), or those which have a low penetrance within the studied population.
Association studies enabled by the biallelic markers of the present invention offer an alternative to linkage analysis. Combined with the use of a high density map of appropriately spaced, sufficiently informative markers, association studies, including linkage disequilibrium-based genome wide association studies, will enable the identification of most genes involved in complex traits.
Single nucleotide polymorphism or biallelic markers can be used in the same manner as RFLPs and VNTRs but offer several advantages. Single nucleotide polymorphisms are densely spaced in the human genome and represent the most frequent type of variation. An estimated number of more than 10⁷ sites are scattered along the 3×10⁹ base pairs of the human genome. Therefore, single nucleotide polymorphisms occur at a greater frequency and with greater uniformity than RFLP or VNTR markers which means that there is a greater probability that such a marker will be found in close proximity to a genetic locus of interest. Single nucleotide polymorphisms are less variable than VNTR markers but are mutationally more stable.
Also, the different forms of a characterized single nucleotide polymorphism, such as the biallelic markers of the present invention, are often easier to distinguish and can therefore be typed easily on a routine basis. Biallelic markers have single nucleotide based alleles and they have only two common alleles, which allows highly parallel detection and automated scoring. The biallelic markers of the present invention offer the possibility of rapid, high-throughput genotyping of a large number of individuals. Biallelic markers are densely spaced in the genome, sufficiently informative and can be assayed in large numbers. The combined effects of these advantages make biallelic markers extremely valuable in genetic studies. Biallelic markers can be used in linkage studies in families, in allele sharing methods, in linkage disequilibrium studies in populations, and in association studies of case-control populations. An important aspect of the present invention is that biallelic markers allow association studies to be performed to identify genes involved in complex traits. Association studies examine the frequency of marker alleles in unrelated case- and control-populations and are generally employed in the detection of polygenic or sporadic traits. Association studies may be conducted within the general population and are not limited to studies performed on related individuals in affected families (linkage studies). Biallelic markers in different genes can be screened in parallel for direct association with disease or response to a treatment. This multiple gene approach is a powerful tool for a variety of human genetic studies as it provides the necessary statistical power to examine the synergistic effect of multiple genetic factors on a particular phenotype, drug response, sporadic trait, or disease state with a complex genetic etiology.
The present invention relates to high density linkage disequilibrium-based genetic maps of the human genome which comprise the map-related biallelic markers of the invention and will allow the identification of genes responsible for detectable traits using genome-wide association studies and linkage disequilibrium mapping.
Summary of the Invention
The present invention is based on the discovery of a set of novel map-related biallelic markers. See Table 1. The position of these markers and knowledge of the surrounding sequence has been used to design polynucleotide compositions which are useful in high density mapping of the human genome as well as in determining the identity of nucleotides at the marker position, and more complex association and haplotyping studies which are useful in determining the genetic basis for disease states. In addition, the compositions and methods of the invention find use in the identification of the targets for the development of pharmaceutical agents and diagnostic methods, as well as the characterization of the differential efficacious responses to and side effects from pharmaceutical agents acting on a disease as well as other treatments.
A first embodiment of the present invention is a map of the human genome comprising an ordered array of biallelic markers, wherein at least 1, 2, 5, 10, 20, 25, 30, 50, 100, 200, 500, 1000, 2000 or 3000 of said biallelic markers are map-related biallelic markers. In addition, the maps of the present invention encompass maps with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said ordered array comprises at least 20,000, 40,000, 60,000, 80,000, 100,000, or 120,000 biallelic markers; optionally, wherein said biallelic markers are separated from one another by an average distance of 10 kb-200 kb, 15 kb-150 kb, 20 kb-100 kb, 100 kb-150 kb, 50 kb-100 kb, or 25 kb-50 kb in the human genome; optionally, said biallelic markers are distributed at an average density of at least one biallelic marker every 150 kb, 50 kb, or 30 kb in the human genome; or optionally, wherein all of said biallelic markers are selected to have a heterozygosity rate of at least about 0.18, 0.32, or 0.42.
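The marker counts, densities and heterozygosity thresholds listed in this embodiment can be related to one another numerically. The sketch below assumes a genome of about 3×10⁹ base pairs (the figure used in the Background) and Hardy-Weinberg proportions for the heterozygosity calculation; both function names are illustrative. Under these assumptions, densities of one marker every 150 kb, 50 kb and 30 kb correspond to 20,000, 60,000 and 100,000 markers, and heterozygosity rates of 0.18, 0.32 and 0.42 correspond to minor allele frequencies of 0.1, 0.2 and 0.3.

```python
GENOME_BP = 3_000_000_000  # approximate genome length (assumption taken from the description)


def markers_needed(mean_spacing_kb: float) -> int:
    """Number of evenly spaced markers for a given average spacing in kilobases."""
    return round(GENOME_BP / (mean_spacing_kb * 1000))


def heterozygosity(minor_allele_freq: float) -> float:
    """Expected heterozygosity 2p(1-p) of a biallelic marker under Hardy-Weinberg proportions."""
    return 2.0 * minor_allele_freq * (1.0 - minor_allele_freq)


for spacing_kb in (150, 50, 30):
    print(f"one marker every {spacing_kb} kb -> {markers_needed(spacing_kb):,} markers")

for maf in (0.1, 0.2, 0.3):
    print(f"minor allele frequency {maf:.1f} -> heterozygosity {heterozygosity(maf):.2f}")
```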
A second embodiment of the invention encompasses isolated, purified or recombinant polynucleotides consisting of, consisting essentially of, or comprising a contiguous span of nucleotides of a sequence selected as an individual or in any combination from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908, 3935 to 7842, 7866 to 11773, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125, 10126 to 11599, and 11600 to 11773, or the complements thereof, wherein said contiguous span is at least 8, 10, 12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID. The present invention also relates to polynucleotides hybridizing under stringent or intermediate conditions to a sequence selected from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908, 3935 to 7842, 7866 to 11773, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125, 10126 to 11599, and 11600 to 11773 and the complements thereof. In addition, the polynucleotides of the invention encompass polynucleotides with any further limitation described in this disclosure, or those following, specified alone or in any combination: said contiguous span may optionally comprise a map-related biallelic marker; optionally, either the 1ST or the 2ND allele of the respective SEQ ID No., as indicated in Table 1, may be specified as being present at said map-related biallelic marker; optionally, said biallelic marker may be within 6, 5, 4, 3, 2, or 1 nucleotides of the center of said polynucleotide or at the center of said polynucleotide; optionally, said polynucleotide may consist of, or consist essentially of a contiguous span which ranges in length from 8, 10, 12, 15, 18 or 20 to 21, 25, 35, 40, 43, or 47 nucleotides; optionally, said polynucleotide may consist of, or consist essentially of a contiguous span which ranges in length from 8, 10, 12, 15, 18 or 20 to 21, 25, 35, 40, 43, or 47 nucleotides, or be specified as being 12, 15, 18, 20, 25, 35, 40, 43, or 47 nucleotides in length and including a map-related biallelic marker of said sequence, and optionally the 1st allele of Table 1 is present at said biallelic marker; optionally, the 3' end of said contiguous span may be present at the 3' end of said polynucleotide; optionally, said biallelic marker may be present at the 3' end of said polynucleotide; optionally, the 3' end of said polynucleotide may be located within or at least 2, 4, 6, 8, or 10 nucleotides upstream of a map-related biallelic marker in said sequence, to the extent that such a distance is consistent with the lengths of the particular Sequence ID; optionally, the 3' end of said polynucleotide may be located 1 nucleotide upstream of a map-related biallelic marker in said sequence; and optionally, said polynucleotide may further comprise a label.
A third embodiment of the invention encompasses any polynucleotide of the invention attached to a solid support. In addition, the polynucleotides of the invention which are attached to a solid support encompass polynucleotides with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said polynucleotides may be specified as attached individually or in groups of at least 2, 5, 8, 10, 12, 15, 20, 25, 50, 100, 200, or 500 distinct polynucleotides of the invention to a single solid support; optionally, polynucleotides other than those of the invention may be attached to the same solid support as polynucleotides of the invention; optionally, when multiple polynucleotides are attached to a solid support they may be attached at random locations, or in an ordered array; optionally, said ordered array may be addressable.
A fourth embodiment of the invention encompasses the use of any polynucleotide for, or any polynucleotide for use in, determining the identity of nucleotides at a map-related biallelic marker. In addition, the polynucleotides of the invention for use in determining the identity of nucleotides at a map-related biallelic marker encompass polynucleotides with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said polynucleotide may comprise a sequence disclosed in the present specification; optionally, said polynucleotide may consist of, or consist essentially of any polynucleotide described in the present specification; optionally, said determining may be performed in a hybridization assay, sequencing assay, microsequencing assay, or an enzyme-based mismatch detection assay; optionally, said polynucleotide may be attached to a solid support, array, or addressable array; optionally, said polynucleotide may be labeled.
A fifth embodiment of the invention encompasses the use of any polynucleotide for, or any polynucleotide for use in, amplifying a segment of nucleotides comprising a map-related biallelic marker. In addition, the polynucleotides of the invention for use in amplifying a segment of nucleotides comprising a map-related biallelic marker encompass polynucleotides with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said polynucleotide may consist of, consist essentially of, or comprise a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 3935 to 7842, 7866 to 11773, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125, 10126 to 11599, and 11600 to 11773; optionally, said polynucleotide may consist of, or consist essentially of any polynucleotide described in the present specification; optionally, said amplifying may be performed by a PCR or LCR; optionally, said polynucleotide may be attached to a solid support, array, or addressable array; optionally, said polynucleotide may be labeled.
A sixth embodiment of the invention encompasses methods of genotyping a biological sample comprising determining the identity of a nucleotide at a map-related biallelic marker. In addition, the genotyping methods of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said method further comprises determining the identity of a second nucleotide at said biallelic marker, wherein said first nucleotide and second nucleotide are not base paired (by Watson & Crick base pairing) to one another; optionally, said biological sample is derived from a single individual or subject; optionally, said method is performed in vitro; optionally, said biallelic marker is determined for both copies of said biallelic marker present in said individual's genome; optionally, said biological sample is derived from multiple subjects or individuals; optionally, said method further comprises amplifying a portion of said sequence comprising the biallelic marker prior to said determining step; optionally, wherein said amplifying is performed by PCR, LCR, or replication of a recombinant vector comprising an origin of replication and said portion in a host cell; optionally, wherein said determining is performed by a hybridization assay, sequencing assay, microsequencing assay, or an enzyme-based mismatch detection assay.
A seventh embodiment of the invention comprises methods of estimating the frequency of an allele in a population comprising genotyping individuals from said population for a map-related biallelic marker and determining the proportional representation of said biallelic marker in said population. In addition, the methods of estimating the frequency of an allele in a population of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, determining the frequency of a biallelic marker allele in a population may be accomplished by determining the identity of the nucleotides for both copies of said biallelic marker present in the genome of each individual in said population and calculating the proportional representation of said nucleotide at said map-related biallelic marker for the population; optionally, determining the frequency of a biallelic marker allele in a population may be accomplished by performing a genotyping method on a pooled biological sample derived from a representative number of individuals, or each individual, in said population, and calculating the proportional amount of said nucleotide compared with the total.
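The individual-genotyping variant of this embodiment amounts to counting alleles over both chromosome copies of every genotyped individual. The following is a minimal sketch of that calculation; the function name `allele_frequencies` and the sample genotypes are hypothetical and serve only as an illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from a list of diploid genotypes.

    Each genotype is a 2-tuple of nucleotides, one per chromosome copy,
    e.g. ("A", "G") for a heterozygous individual.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())  # two chromosomes per individual
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes of five individuals at one A/G biallelic marker.
sample = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G"), ("A", "A")]
print(allele_frequencies(sample))
```

For a pooled sample, the same proportions would instead be read from the relative signal of each nucleotide, a step this sketch does not model.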
An eighth embodiment of the invention comprises methods of detecting an association between an allele and a phenotype, comprising the steps of a) determining the frequency of at least one map-related biallelic marker allele in a trait positive population; b) determining the frequency of said map-related biallelic marker allele in a control population; and c) determining whether a statistically significant association exists between said genotype and said phenotype. In addition, the methods of detecting an association between an allele and a phenotype of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said control population may be a trait-negative population, or a random population; optionally, wherein said phenotype is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity; optionally, the determining steps a) and b) are performed on all of the biallelic markers of SEQ ID Nos. 1 to 3908.
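Step c) does not prescribe a particular statistical test. As one possible illustration, and purely as an assumption of this sketch, a Pearson chi-square test with one degree of freedom can be applied to a 2x2 table of allele counts from the trait-positive and control populations. The function name `allelic_chi_square` and the example counts are hypothetical.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 d.f.) on a 2x2 table of allele counts.

    Each argument is a pair (count of 1st allele, count of 2nd allele).
    Returns (chi2, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a nonzero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of the chi-square distribution with 1 d.f.,
    # expressed through the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 100 chromosomes in each population.
chi2, p = allelic_chi_square((60, 40), (40, 60))
print(chi2, p)
```

The test compares allele counts rather than genotype counts, so each individual contributes two observations. Other tests could equally be substituted at step c).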
A ninth embodiment of the present invention encompasses methods of estimating the frequency of a haplotype for a set of biallelic markers in a population, comprising the steps of: a) genotyping each individual in said population for at least one map-related biallelic marker, b) genotyping each individual in said population for a second biallelic marker by determining the identity of the nucleotides at said second biallelic marker for both copies of said second biallelic marker present in the genome; and c) applying a haplotype determination method to the identities of the nucleotides determined in steps a) and b) to obtain an estimate of said frequency. In addition, the methods of estimating the frequency of a haplotype of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally said haplotype determination method is selected from the group consisting of asymmetric PCR amplification, double PCR amplification of specific alleles, the Clark method, or an expectation maximization algorithm; optionally, said map-related biallelic marker may be selected individually or in any combination from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said second biallelic marker is a map-related biallelic marker; optionally, the identity of the nucleotides at the biallelic markers in every one of the sequences of SEQ ID Nos. 1 to 3908 is determined in steps a) and b).
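For the expectation maximization option of step c), the following is a minimal sketch restricted to two biallelic markers and assuming random pairing of haplotypes within individuals. The genotype encoding (number of copies of the 2nd allele at each marker), the function name `em_haplotype_frequencies`, the fixed iteration count, and the sample data are all assumptions of this sketch rather than details taken from the disclosure.

```python
def em_haplotype_frequencies(genotypes, iterations=100):
    """Estimate frequencies of the four two-marker haplotypes by EM.

    `genotypes` is a list of (g1, g2) pairs, where g1 and g2 count the
    copies (0, 1 or 2) of the 2nd allele at marker 1 and marker 2.
    Returns a dict keyed by haplotype (a1, a2), with each a in {0, 1}.
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    haplotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haplotypes}
    n_chrom = 2 * len(genotypes)
    for _ in range(iterations):
        counts = {h: 0.0 for h in haplotypes}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10;
                # weight the two phases by the current frequencies.
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1.0 - w
                counts[(1, 0)] += 1.0 - w
            else:
                # Phase is unambiguous when at most one marker is
                # heterozygous: read the two chromosomes off directly.
                a1 = (g1 // 2, (g1 + 1) // 2)
                a2 = (g2 // 2, (g2 + 1) // 2)
                counts[(a1[0], a2[0])] += 1.0
                counts[(a1[1], a2[1])] += 1.0
        # M-step: re-estimate frequencies from the expected counts.
        freq = {h: counts[h] / n_chrom for h in haplotypes}
    return freq


# Hypothetical population of ten individuals.
population = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_frequencies(population))
```

Only double heterozygotes carry phase ambiguity in the two-marker case, which is why the E-step touches those individuals alone. Extending the sketch to more markers requires enumerating all compatible haplotype pairs per individual.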
A tenth embodiment of the present invention encompasses methods of detecting an association between a haplotype and a phenotype, comprising the steps of: a) estimating the frequency of at least one haplotype in a trait positive population according to a method of estimating the frequency of a haplotype of the invention; b) estimating the frequency of said haplotype in a control population according to the method of estimating the frequency of a haplotype of the invention; and c) determining whether a statistically significant association exists between said haplotype and said phenotype. In addition, the methods of detecting an association between a haplotype and a phenotype of the invention encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said control population may be a trait-negative population, or a random population; optionally, wherein said phenotype is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity; optionally, the identity of the nucleotides at the biallelic markers in every one of the following sequences: SEQ ID Nos. 1 to 3908 is included in the estimating steps a) and b).
An eleventh embodiment of the present invention is a method of identifying a gene associated with a detectable trait comprising the steps of: a) determining the frequency of each allele of at least one map-related biallelic marker in individuals having the detectable trait and individuals lacking the detectable trait; b) identifying at least one allele of one or more biallelic markers having a statistically significant association with the detectable trait; and c) identifying a gene in linkage disequilibrium with said allele. In addition, the methods of the present invention for identifying a gene associated with a detectable trait encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, wherein the method further comprises d) identifying a mutation in the gene identified in step c) which is associated with the detectable trait; optionally, wherein the individuals having the detectable trait and the individuals lacking the detectable trait are readily distinguishable from one another; optionally, wherein the individuals having the detectable trait and the individuals lacking the detectable trait are selected from a bimodal population; optionally, wherein the individuals having the detectable trait are at one extreme of the population and the individuals lacking the detectable trait are at the other extreme of the population; optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said detectable trait is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.
A twelfth embodiment of the present invention is a method of identifying biallelic markers associated with a detectable trait comprising the steps of: a) determining the frequencies of a set of biallelic markers comprising at least one map-related biallelic marker in individuals who express said detectable trait and individuals who do not express said detectable trait; and b) identifying one or more biallelic markers in said set which are statistically associated with the expression of said detectable trait. In addition, the methods of the present invention for identifying biallelic markers associated with a detectable trait encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said detectable trait is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.
A thirteenth embodiment of the present invention is a method of identifying biallelic marker(s) in linkage disequilibrium with a trait-causing allele or in linkage disequilibrium with a trait-associated biallelic marker comprising the steps of: a) selecting at least one map-related biallelic marker which is in the genomic region suspected of containing the trait-causing allele or the trait-associated biallelic marker; and b) determining which of the map-related biallelic markers are associated with the trait-causing allele or in linkage disequilibrium with the trait-associated biallelic marker. In addition, the methods of the present invention for identifying biallelic marker(s) in linkage disequilibrium with a trait-causing allele or in linkage disequilibrium with a trait-associated biallelic marker encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said detectable trait is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.
A fourteenth embodiment of the present invention is a method for determining whether an individual is at risk of developing a detectable trait or suffers from a detectable trait comprising the steps of: a) obtaining a nucleic acid sample from the individual; b) screening the nucleic acid sample with at least one map-related biallelic marker; and c) determining whether the nucleic acid sample contains at least one allele of said map-related biallelic marker statistically associated with the detectable trait. In addition, the methods of the present invention for determining whether an individual is at risk of developing a detectable trait or suffers from a detectable trait encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said detectable trait is selected from the group consisting of disease, treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.
A fifteenth embodiment of the present invention is a method of administering a drug or a treatment comprising the steps of: a) obtaining a nucleic acid sample from an individual; b) determining the identity of the polymorphic base of at least one map-related biallelic marker which is associated with a positive response to the treatment or the drug; or at least one map-related biallelic marker which is associated with a negative response to the treatment or the drug; and c) administering the treatment or the drug to the individual if the nucleic acid sample contains said biallelic marker associated with a positive response to the treatment or the drug or if the nucleic acid sample lacks said biallelic marker associated with a negative response to the treatment or the drug. In addition, the methods of the present invention for administering a drug or a treatment encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; or optionally, the administering step comprises administering the drug or the treatment to the individual if the nucleic acid sample contains said biallelic marker associated with a positive response to the treatment or the drug and the nucleic acid sample lacks said biallelic marker associated with a negative response to the treatment or the drug.
A sixteenth embodiment of the present invention is a method of selecting an individual for inclusion in a clinical trial of a treatment or drug comprising the steps of: a) obtaining a nucleic acid sample from an individual; b) determining the identity of the polymorphic base of at least one map-related biallelic marker which is associated with a positive response to the treatment or the drug, or at least one map-related biallelic marker which is associated with a negative response to the treatment or the drug in the nucleic acid sample, and c) including the individual in the clinical trial if the nucleic acid sample contains said map-related biallelic marker associated with a positive response to the treatment or the drug or if the nucleic acid sample lacks said biallelic marker associated with a negative response to the treatment or the drug. In addition, the methods of the present invention for selecting an individual for inclusion in a clinical trial of a treatment or drug encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, the including step comprises administering the drug or the treatment to the individual if the nucleic acid sample contains said biallelic marker associated with a positive response to the treatment or the drug and the nucleic acid sample lacks said biallelic marker associated with a negative response to the treatment or the drug.
A seventeenth embodiment of the present invention is a method of identifying a gene associated with a detectable trait comprising the steps of: a) selecting a gene suspected of being associated with a detectable trait; and b) identifying at least one map-related biallelic marker within said gene which is associated with said detectable trait. In addition, the methods of the present invention for identifying a gene associated with a detectable trait encompass methods with any further limitation described in this disclosure, or those following, specified alone or in any combination: optionally, said map-related biallelic marker may be in a sequence selected individually or in any combination from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, the identifying step comprises determining the frequencies of the map-related biallelic marker(s) in individuals who express said detectable trait and individuals who do not express said detectable trait and identifying one or more biallelic markers which are statistically associated with the expression of the detectable trait.
Additional embodiments are set forth in the Detailed Description of the Invention and in the Examples.
Brief Description of the Drawings
Figure 1 is a cytogenetic map of chromosome 21.
Figure 2a shows the results of a computer simulation of the distribution of inter-marker spacing on a randomly distributed set of biallelic markers indicating the percentage of biallelic markers which will be spaced a given distance apart for 1, 2, or 3 markers/BAC in a genomic map (assuming a set of 20,000 minimally overlapping BACs covering the genome are evaluated).
Figure 2b shows the results of a computer simulation of the distribution of inter-marker spacing on a randomly distributed set of biallelic markers indicating the percentage of biallelic markers which will be spaced a given distance apart for 1, 3, or 6 markers/BAC in a genomic map (assuming a set of 20,000 minimally overlapping BACs covering the genome are evaluated).
Figure 3 shows, for a series of hypothetical sample sizes, the p-value significance obtained in association studies performed using individual markers from the high-density biallelic map, according to various hypotheses regarding the difference of allelic frequencies between the trait-positive and trait-negative samples.
Figure 4 is a hypothetical association analysis conducted with a map comprising about 3,000 biallelic markers.
Figure 5 is a hypothetical association analysis conducted with a map comprising about 20,000 biallelic markers.
Figure 6 is a hypothetical association analysis conducted with a map comprising about 60,000 biallelic markers.
Figure 7 is a haplotype analysis using biallelic markers in the Apo E region. Figure 8 is a simulated haplotype analysis using the biallelic markers in the Apo E region included in the haplotype analysis of Figure 7.
Figure 9 shows a minimal array of overlapping clones which was chosen for further studies of biallelic markers associated with prostate cancer, the positions of STS markers known to map in the candidate genomic region along the contig, and the locations of biallelic markers along the BAC contig harboring a genomic region harboring a candidate gene associated with prostate cancer which were identified using the methods of the present invention. Figure 10 is a rough localization of a candidate gene for prostate cancer which was obtained by determining the frequencies of the biallelic markers of Figure 9 in affected and unaffected populations.
Figure 11 is a further refinement of the localization of the candidate gene for prostate cancer using additional biallelic markers which were not included in the rough localization illustrated in Figure 10. Figure 12 is a haplotype analysis using the biallelic markers in the genomic region of the gene associated with prostate cancer.
Figure 13 is a simulated haplotype using the six markers included in haplotype 5 of Figure 12. Figure 14 is a block diagram of an exemplary computer system.
Figure 15 is a flow diagram illustrating one embodiment of a process 200 for comparing a new nucleotide or protein sequence with a database of sequences in order to determine the homology levels between the new sequence and the sequences in the database.
Figure 16 is a flow diagram illustrating one embodiment of a process 250 in a computer for determining whether two sequences are homologous.
Figure 17 is a flow diagram illustrating one embodiment of an identifier process 300 for detecting the presence of a feature in a sequence.
Brief Description Of The Sequence Listing
SEQ ID Nos. 1 to 3908 contain nucleotide sequences comprising a portion of the map-related biallelic markers of the invention.
SEQ ID Nos. 3909 to 3934 contain nucleotide sequences comprising a portion of the map-related biallelic markers which are shown to be associated with Alzheimer's disease, prostate cancer or asthma as described in the Examples.
SEQ ID Nos. 3935 to 7842 contain nucleotide sequences of upstream amplification primers (PU) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 1 to 3908.
SEQ ID Nos. 7843 to 7865 contain nucleotide sequences of upstream amplification primers (PU) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 3909 to 3934.
SEQ ID Nos. 7866 to 11773 contain nucleotide sequences of downstream amplification primers (RP) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 1 to 3908.
SEQ ID Nos. 11774 to 11796 contain nucleotide sequences of downstream amplification primers (RP) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 3909 to 3934.
Detailed Description of the Embodiments
Before describing the invention in greater detail, the following definitions are set forth to illustrate and define the meaning and scope of the terms used to describe the invention herein.
Definitions
As used interchangeably herein, the terms "nucleic acids", "oligonucleotides", and "polynucleotides" include RNA, DNA, or RNA/DNA hybrid sequences of more than one nucleotide in either single chain or duplex form. The term "nucleotide" is used herein as an adjective to describe molecules comprising RNA, DNA, or RNA/DNA hybrid sequences of any length in single-stranded or duplex form. The term "nucleotide" is also used herein as a noun to refer to individual nucleotides or varieties of nucleotides, meaning a molecule, or individual unit in a larger nucleic acid molecule, comprising a purine or pyrimidine, a ribose or deoxyribose sugar moiety, and a phosphate group, or phosphodiester linkage in the case of nucleotides within an oligonucleotide or polynucleotide. The term "nucleotide" is also used herein to encompass "modified nucleotides" which comprise at least one of the following modifications: (a) an alternative linking group, (b) an analogous form of purine, (c) an analogous form of pyrimidine, or (d) an analogous sugar; for examples of analogous linking groups, purines, pyrimidines, and sugars, see PCT publication No. WO 95/04064. However, the polynucleotides of the invention are preferably comprised of greater than 50% conventional deoxyribose nucleotides, and most preferably greater than 90% conventional deoxyribose nucleotides. The polynucleotide sequences of the invention may be prepared by any known method, including synthetic, recombinant, ex vivo generation, or a combination thereof, as well as utilizing any purification methods known in the art. The term "purified" is used herein to describe a polynucleotide or polynucleotide vector of the invention which has been separated from other compounds including, but not limited to, other nucleic acids, carbohydrates, lipids and proteins (such as the enzymes used in the synthesis of the polynucleotide), or the separation of covalently closed polynucleotides from linear polynucleotides.
A polynucleotide is substantially pure when at least about 50%, preferably 60 to 75% of a sample exhibits a single polynucleotide sequence and conformation (linear versus covalently closed). A substantially pure polynucleotide typically comprises about 50%, preferably 60 to 90% weight/weight of a nucleic acid sample, more usually about 95%, and preferably is over about 99% pure. Polynucleotide purity or homogeneity may be indicated by a number of means well known in the art, such as agarose or polyacrylamide gel electrophoresis of a sample, followed by visualizing a single polynucleotide band upon staining the gel. For certain purposes higher resolution can be provided by using HPLC or other means well known in the art.
The term "primer" denotes a specific oligonucleotide sequence which is complementary to a target nucleotide sequence and used to hybridize to the target nucleotide sequence. A primer serves as an initiation point for nucleotide polymerization catalyzed by either DNA polymerase, RNA polymerase or reverse transcriptase.
The term "probe" denotes a defined nucleic acid segment (or nucleotide analog segment, e.g., polynucleotide as defined herein) which can be used to identify a specific polynucleotide sequence present in samples, said nucleic acid segment comprising a nucleotide sequence complementary to the specific polynucleotide sequence to be identified.
The terms "detectable trait", "trait" and "phenotype" are used interchangeably herein and refer to any visible, detectable or otherwise measurable property of an organism, such as symptoms of, or susceptibility to, a disease. Typically the terms "detectable trait", "trait" or "phenotype" are used herein to refer to symptoms of, or susceptibility to, a disease; or to refer to an individual's response to an agent, drug, or treatment acting on a disease; or to refer to symptoms of, or susceptibility to, side effects of an agent acting on a disease.
The term "treatment" is used herein to encompass any medical intervention known in the art including, for example, the administration of pharmaceutical agents, medically prescribed changes in diet, or habits such as a reduction in smoking or drinking, surgery, the application of medical devices, and the application or reduction of certain physical conditions, for example, light or radiation.
The term "allele" is used herein to refer to variants of a nucleotide sequence. A biallelic polymorphism has two forms, designated herein as the 1ST allele and the 2ND allele. Diploid organisms may be homozygous or heterozygous for an allelic form.
The term "heterozygosity rate" is used herein to refer to the incidence of individuals in a population which are heterozygous at a particular allele. In a biallelic system the heterozygosity rate is on average equal to 2Pa(1-Pa), where Pa is the frequency of the least common allele. In order to be useful in genetic studies a genetic marker should have an adequate level of heterozygosity to allow a reasonable probability that a randomly selected person will be heterozygous. The term "genotype" as used herein refers to the identity of the alleles present in an individual or a sample. In the context of the present invention a genotype preferably refers to the description of the biallelic marker alleles present in an individual or a sample. The term "genotyping" a sample or an individual for a biallelic marker consists of determining the specific allele or the specific nucleotide carried by an individual at a biallelic marker. The term "mutation" as used herein refers to a difference in DNA sequence between or among different genomes or individuals which has a frequency below 1%.
The term "haplotype" refers to a combination of alleles present in an individual or a sample. In the context of the present invention a haplotype preferably refers to a combination of biallelic marker alleles found in a given individual and which may be associated with a phenotype.
The term "polymorphism" as used herein refers to the occurrence of two or more alternative genomic sequences or alleles between or among different genomes or individuals. "Polymorphic" refers to the condition in which two or more variants of a specific genomic sequence can be found in a population. A "polymorphic site" is the locus at which the variation occurs. A single nucleotide polymorphism is a single base pair change. Typically a single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at the polymorphic site. Deletion of a single nucleotide or insertion of a single nucleotide also gives rise to single nucleotide polymorphisms. In the context of the present invention "single nucleotide polymorphism" preferably refers to a single nucleotide substitution. Typically, between different genomes or between different individuals, the polymorphic site may be occupied by two different nucleotides.
The terms "biallelic polymorphism" and "biallelic marker" are used interchangeably herein to refer to a polymorphism having two alleles at a fairly high frequency in the population, preferably a single nucleotide polymorphism. A "biallelic marker allele" refers to the nucleotide variants present at a biallelic marker site. Typically the frequency of the less common allele of the biallelic markers of the present invention has been validated to be greater than 1%, preferably the frequency is greater than 10%, more preferably the frequency is at least 20% (i.e. heterozygosity rate of at least 0.32), even more preferably the frequency is at least 30% (i.e. heterozygosity rate of at least 0.42). A biallelic marker wherein the frequency of the less common allele is 30% or more is termed a "high quality biallelic marker."
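The heterozygosity rates quoted above follow directly from the expression 2Pa(1-Pa) given in the definition of "heterozygosity rate". As a short illustrative check, with the function name `heterozygosity_rate` being an assumption of this sketch:

```python
def heterozygosity_rate(minor_allele_frequency):
    """Expected heterozygosity 2*Pa*(1 - Pa) for a biallelic marker,
    where Pa is the frequency of the less common allele."""
    p = minor_allele_frequency
    if not 0.0 <= p <= 0.5:
        raise ValueError("the less common allele has frequency in [0, 0.5]")
    return 2.0 * p * (1.0 - p)


# Thresholds named in the definition of a biallelic marker.
for pa in (0.01, 0.10, 0.20, 0.30):
    print(pa, round(heterozygosity_rate(pa), 4))
```

A frequency of 20% gives 2 x 0.2 x 0.8 = 0.32 and a frequency of 30% gives 2 x 0.3 x 0.7 = 0.42, matching the figures in the text. The rate reaches its maximum of 0.5 when both alleles are equally frequent.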
The location of nucleotides in a polynucleotide with respect to the center of the polynucleotide is described herein in the following manner. When a polynucleotide has an odd number of nucleotides, the nucleotide at an equal distance from the 3' and 5' ends of the polynucleotide is considered to be "at the center" of the polynucleotide, and any nucleotide immediately adjacent to the nucleotide at the center, or the nucleotide at the center itself is considered to be "within 1 nucleotide of the center." With an odd number of nucleotides in a polynucleotide any of the five nucleotide positions in the middle of the polynucleotide would be considered to be within 2 nucleotides of the center, and so on. When a polynucleotide has an even number of nucleotides, there would be a bond and not a nucleotide at the center of the polynucleotide. Thus, either of the two central nucleotides would be considered to be "within 1 nucleotide of the center" and any of the four nucleotides in the middle of the polynucleotide would be considered to be "within 2 nucleotides of the center", and so on. For polymorphisms which involve the substitution, insertion or deletion of 1 or more nucleotides, the polymorphism, allele or biallelic marker is "at the center" of a polynucleotide if the difference between the distance from the substituted, inserted, or deleted polynucleotides of the polymorphism and the 3' end of the polynucleotide, and the distance from the substituted, inserted, or deleted polynucleotides of the polymorphism and the 5' end of the polynucleotide is zero or one nucleotide. If this difference is 0 to 3, then the polymorphism is considered to be "within 1 nucleotide of the center." If the difference is 0 to 5, the polymorphism is considered to be "within 2 nucleotides of the center." If the difference is 0 to 7, the polymorphism is considered to be "within 3 nucleotides of the center," and so on.
The term "upstream" is used herein to refer to a location which is toward the 5' end of the polynucleotide from a specific reference point.
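The difference-based centering rule stated above for polymorphisms can be expressed compactly. The sketch below is limited, as an assumption, to a single-base polymorphism given by its 1-based position counted from the 5' end; the function name `nucleotides_from_center` is hypothetical. A difference of 0 or 1 maps to "at the center" (returned as 0), 2 or 3 to "within 1", 4 or 5 to "within 2", and so on, which is integer division of the difference by two.

```python
def nucleotides_from_center(length, position):
    """Smallest n such that a single-base polymorphism at 1-based
    `position` in a polynucleotide of `length` nucleotides is
    "within n nucleotides of the center"; 0 means "at the center".
    """
    if not 1 <= position <= length:
        raise ValueError("position must lie within the polynucleotide")
    to_5_prime = position - 1        # nucleotides between base and 5' end
    to_3_prime = length - position   # nucleotides between base and 3' end
    difference = abs(to_3_prime - to_5_prime)
    # difference 0-1 -> 0, 2-3 -> 1, 4-5 -> 2, 6-7 -> 3, ...
    return difference // 2


# A 47-nucleotide sequence with the polymorphic base at position 24.
print(nucleotides_from_center(47, 24))
```

For odd lengths this agrees with the position-counting description: in a 47-mer, position 24 is at the center, positions 23 and 25 are within 1 nucleotide, and positions 22 and 26 are within 2 nucleotides.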
The terms "base paired" and "Watson & Crick base paired" are used interchangeably herein to refer to nucleotides which can be hydrogen bonded to one another by virtue of their sequence identities in a manner like that found in double-helical DNA with thymine or uracil residues linked to adenine residues by two hydrogen bonds and cytosine and guanine residues linked by three hydrogen bonds (See Stryer, L., Biochemistry, 4th edition, 1995).
The terms "complementary" or "complement thereof" are used herein to refer to the sequences of polynucleotides which are capable of forming Watson & Crick base pairing with another specified polynucleotide throughout the entirety of the complementary region. This term is applied to pairs of polynucleotides based solely upon their sequences and not any particular set of conditions under which the two polynucleotides would actually bind.
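Because complementarity is defined here purely on sequence, it can be checked without reference to hybridization conditions. The following sketch assumes DNA sequences written 5' to 3' over the letters A, C, G and T, with strands pairing in antiparallel orientation as in double-helical DNA; the names `complement_of` and `are_complementary` are hypothetical.

```python
# Watson & Crick partners for DNA bases.
_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_of(sequence):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(_PAIR[base] for base in reversed(sequence.upper()))


def are_complementary(seq_a, seq_b):
    """True if the two sequences base pair along their entire length."""
    return complement_of(seq_a) == seq_b.upper()


print(complement_of("ATGC"))
print(are_complementary("AAGC", "GCTT"))
```

The sketch rejects any character outside A, C, G and T with a `KeyError`; handling uracil or ambiguity codes would require extending the pairing table.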
As used herein the term "map-related biallelic marker" relates to a biallelic marker in linkage disequilibrium with any of the sequences disclosed in SEQ ID Nos. 1 to 3908 which contain a biallelic marker of the map. The term map-related biallelic marker encompasses all of the biallelic markers disclosed in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908. The preferred map-related biallelic marker alleles of the present invention include each one of the alleles selected individually or in any combination from the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908, as identified in field <223> of the allele feature in the appended Sequence Listing, individually or in groups consisting of all the possible combinations of the alleles. The terms "1ST allele" and "2ND allele" refer to the nucleotide located at the polymorphic base of a polynucleotide sequence containing a biallelic marker, as identified in field <222> of the allele feature in the appended Sequence Listing for each Sequence ID number. As used herein, the polymorphic base is located at nucleotide position 24 for each of SEQ ID Nos. 1 to 3908, with the exception of SEQ ID Nos. 914, 1013, 2544, 3434, 3795, and 3028. The polymorphic base is located at nucleotide position 23 for SEQ ID Nos. 914, 1013 and 2544, at nucleotide position 21 for SEQ ID No. 3028, and at nucleotide position 20 for SEQ ID No. 3434.
I. Biallelic Markers And Polynucleotides Comprising Biallelic Markers
Polynucleotides of the present invention
The present invention encompasses polynucleotides for use as primers and probes in the methods of the invention. All of the polynucleotides of the invention may be specified as being isolated, purified or recombinant. These polynucleotides may consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence from any sequence in the Sequence Listing as well as sequences which are complementary thereto ("complements thereof"). The "contiguous span" may be at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID. It should be noted that the polynucleotides of the present invention are not limited to having the exact flanking sequences surrounding the polymorphic bases which are enumerated in the Sequence Listing. Rather, it will be appreciated that the flanking sequences surrounding the biallelic markers, or any of the primers or probes of the invention which are more distant from the markers, may be lengthened or shortened to any extent compatible with their intended use and the present invention specifically contemplates such sequences. It will be appreciated that the polynucleotides referred to in the Sequence Listing may be of any length compatible with their intended use. Also the flanking regions outside of the contiguous span need not be homologous to native flanking sequences which actually occur in human subjects. The addition of any nucleotide sequence which is compatible with the nucleotides' intended use is specifically contemplated. The contiguous span may optionally include the map-related biallelic marker in said sequence. Biallelic markers generally consist of a polymorphism at one single base position.
Each biallelic marker therefore corresponds to two forms of a polynucleotide sequence which, when compared with one another, present a nucleotide modification at one position. Usually, the nucleotide modification involves the substitution of one nucleotide for another. Optionally either the 1ST allele or the 2ND allele of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 may be specified as being present at the map-related biallelic marker. Preferred polynucleotides may consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence from SEQ ID Nos. 1 to 2260 as well as sequences which are complementary thereto. The "contiguous span" may be at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID. Particularly preferred are polynucleotides which consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos. 1 to 2260, or the complements thereof, wherein the 1ST allele of the biallelic marker of the SEQ ID No. is present at the map-related biallelic marker. Other preferred polynucleotides consist of, consist essentially of, or comprise a contiguous span of nucleotides of any of SEQ ID Nos. 1 to 2260, or the complements thereof, wherein the 2ND allele of the biallelic marker of the SEQ ID No. is present at the map-related biallelic marker. Preferred polynucleotides may consist of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID No., of a sequence from SEQ ID Nos. 2261 to 3734 as well as sequences which are complementary thereto. Particularly preferred are polynucleotides which consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos. 2261 to 3734, or the complements thereof, wherein the 1ST allele of the biallelic marker of the SEQ ID No. is present at the map-related biallelic marker. Other preferred polynucleotides consist of, consist essentially of, or comprise a contiguous span of nucleotides of any of SEQ ID Nos. 2261 to 3734, or the complements thereof, wherein the 2ND allele of the biallelic marker of the SEQ ID No. is present at the map-related biallelic marker.
Preferred polynucleotides may consist of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID No., of a sequence from SEQ ID Nos. 3735 to 3908 as well as sequences which are complementary thereto. Particularly preferred are polynucleotides which consist of, consist essentially of, or comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos. 3735 to 3908, or the complements thereof, wherein the 1ST allele of the biallelic marker of the SEQ ID No. is present at the map-related biallelic marker. Other preferred polynucleotides consist of, consist essentially of, or comprise a contiguous span of nucleotides of any of SEQ ID Nos. 3735 to 3908, or the complements thereof, wherein the 2ND allele of the biallelic marker of the SEQ ID No. is present at the map-related biallelic marker. Also encompassed by the polynucleotides of the present invention are polynucleotides which consist of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of a sequence from SEQ ID Nos. 1201, 3242, 3907 and 3908 as well as sequences which are complementary thereto, wherein said contiguous span of SEQ ID Nos. 1201 or 3242 contains a "G" at the polymorphic base, or wherein said contiguous span of SEQ ID Nos. 3907 or 3908 contains an "A" at the polymorphic base.
The invention also relates to polynucleotides that hybridize, under conditions of high or intermediate stringency, to a polynucleotide of a sequence from any of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 as well as sequences which are complementary thereto. Preferably such polynucleotides are at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a polynucleotide of these lengths is consistent with the lengths of the particular Sequence ID. Preferred polynucleotides comprise a map-related biallelic marker. Optionally either the 1ST or the 2ND allele of the biallelic markers disclosed in the SEQ ID No. may be specified as being present at the map-related biallelic marker. Conditions of high and intermediate stringency are further described in III.C.4.
The primers of the present invention may be designed from the disclosed sequences using any method known in the art. A preferred set of primers is fashioned such that the 3' end of the contiguous span of identity with the sequences of the Sequence Listing is present at the 3' end of the primer. Such a configuration allows the 3' end of the primer to hybridize to a selected nucleic acid sequence and dramatically increases the efficiency of the primer for amplification or sequencing reactions.
In a preferred set of primers the contiguous span is found in one of the sequences described in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or the complements thereof. The invention also relates to polynucleotides consisting of, consisting essentially of, or comprising a contiguous span of nucleotides of a sequence from SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, as well as sequences which are complementary thereto, wherein the "contiguous span" may be at least 8, 10, 12, 15, 18, 19, 20, or 21 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID No.
Allele specific primers may be designed such that a biallelic marker is at the 3' end of the contiguous span and the contiguous span is present at the 3' end of the primer. Such allele specific primers tend to selectively prime an amplification or sequencing reaction so long as they are used with a nucleic acid sample that contains one of the two alleles present at a biallelic marker. The 3' end of a primer of the invention may be located within, or at least 2, 4, 6, 8, or 10 nucleotides upstream of, a map-related biallelic marker in said sequence, to the extent that this distance is consistent with the particular Sequence ID, or at any other location which is appropriate for its intended use in sequencing, amplification or the location of novel sequences or markers. Primers with their 3' ends located 1 nucleotide upstream of a map-related biallelic marker have a special utility in microsequencing assays. Preferred microsequencing primers are described in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908, where for each of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908, the sense microsequencing primer contains the complement of the 19 nucleotides having their 3' ends located 1 nucleotide upstream of the polymorphic base of the respective SEQ ID No., and where the antisense microsequencing primer contains the complement of the 19 nucleotides of the complementary strand having their 3' end located 1 nucleotide upstream of the polymorphic base on the complementary strand to the respective SEQ ID No. The most preferred of said microsequencing primers for each of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 are microsequencing primers indicated as "A" or "S" in Table 1, which have been validated in microsequencing experiments.
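The geometry of the microsequencing primers can be sketched as follows. This is one reading of the paragraph above, stated as an assumption: the sense primer is taken to be the 19 nucleotides ending immediately 5' of the polymorphic base on the listed strand, and the antisense primer the 19 nucleotides ending immediately 5' of the polymorphic base on the complementary strand. The function name `microsequencing_primers` and its parameters are hypothetical, and the default position of 24 applies only to those SEQ ID Nos. for which the polymorphic base is at nucleotide 24.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def microsequencing_primers(seq: str, poly_pos: int = 24, length: int = 19):
    """Return (sense, antisense) primers flanking the polymorphic base.

    seq      -- listed strand, written 5' to 3'
    poly_pos -- 1-based position of the polymorphic base (assumed 24 by default)
    length   -- primer length in nucleotides (assumed 19)
    """
    i = poly_pos - 1  # 0-based index of the polymorphic base
    if i < length or len(seq) - i - 1 < length:
        raise ValueError("not enough flanking sequence for primers of this length")
    # Sense primer: bases immediately upstream of the marker on the listed strand.
    sense = seq[i - length:i]
    # Antisense primer: reverse complement of the bases immediately downstream,
    # so that its 3' end also abuts the marker, on the opposite strand.
    downstream = seq[i + 1:i + 1 + length]
    antisense = downstream.translate(_COMPLEMENT)[::-1]
    return sense, antisense
```

In both cases the first base added by a polymerase extending the primer is the polymorphic base (or its complement), which is what lets a single-base extension report the allele.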
The probes of the present invention may be designed from the disclosed sequences for any method known in the art, particularly methods which allow for testing if a particular sequence or marker disclosed herein is present. A preferred set of probes may be designed for use in the hybridization assays of the invention in any manner known in the art such that they selectively bind to one allele of a biallelic marker, but not the other under any particular set of assay conditions. Preferred hybridization probes may consist of, consist essentially of, or comprise a contiguous span of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908, or the complement thereof, which is at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID No., or be specified as being 12, 15, 18, 19, 20, 25, 35, 40, 43, 44, 45, 46 or 47 nucleotides in length and including the map-related biallelic marker of said sequence. Optionally the 1ST allele or 2ND allele of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 may be specified as being present at the biallelic marker site. Optionally, said biallelic marker may be within 6, 5, 4, 3, 2, or 1 nucleotides of the center of the hybridization probe or at the center of said probe. Any of the polynucleotides of the present invention can be labeled, if desired, by incorporating a label detectable by spectroscopic, photochemical, biochemical, immunochemical, or chemical means. For example, useful labels include radioactive substances, fluorescent dyes or biotin. Preferably, polynucleotides are labeled at their 3' and 5' ends. A label can also be used to capture the primer, so as to facilitate the immobilization of either the primer or a primer extension product, such as amplified DNA, on a solid support. A capture label is attached to the primers or probes and can be a specific binding member which forms a binding pair with the solid phase reagent's specific binding member (e.g. biotin and streptavidin). Therefore, depending upon the type of label carried by a polynucleotide or a probe, it may be employed to capture or to detect the target DNA. Further, it will be understood that the polynucleotides, primers or probes provided herein may, themselves, serve as the capture label. For example, in the case where a solid phase reagent's binding member is a nucleic acid sequence, it may be selected such that it binds a complementary portion of a primer or probe to thereby immobilize the primer or probe to the solid phase.
In cases where a polynucleotide probe itself serves as the binding member, those skilled in the art will recognize that the probe will contain a sequence or "tail" that is not complementary to the target. In the case where a polynucleotide primer itself serves as the capture label, at least a portion of the primer will be free to hybridize with a nucleic acid on a solid phase. DNA labeling techniques are well known to the skilled technician.
Any of the polynucleotides, primers and probes of the present invention can be conveniently immobilized on a solid support. Solid supports are known to those skilled in the art and include the walls of wells of a reaction tray, test tubes, polystyrene beads, magnetic beads, nitrocellulose strips, membranes, microparticles such as latex particles, sheep (or other animal) red blood cells, duracytes® and others. The solid support is not critical and can be selected by one skilled in the art. Thus, latex particles, microparticles, magnetic or nonmagnetic beads, membranes, plastic tubes, walls of microtiter wells, glass or silicon chips, sheep (or other suitable animal's) red blood cells and duracytes are all suitable examples. Suitable methods for immobilizing nucleic acids on solid phases include ionic, hydrophobic, covalent interactions and the like. A solid support, as used herein, refers to any material which is insoluble, or can be made insoluble by a subsequent reaction. The solid support can be chosen for its intrinsic ability to attract and immobilize the capture reagent. Alternatively, the solid phase can retain an additional receptor which has the ability to attract and immobilize the capture reagent. The additional receptor can include a charged substance that is oppositely charged with respect to the capture reagent itself or to a charged substance conjugated to the capture reagent. As yet another alternative, the receptor molecule can be any specific binding member which is immobilized upon (attached to) the solid support and which has the ability to immobilize the capture reagent through a specific binding reaction. The receptor molecule enables the indirect binding of the capture reagent to a solid support material before the performance of the assay or during the performance of the assay. 
The solid phase thus can be a plastic, derivatized plastic, magnetic or non-magnetic metal, glass or silicon surface of a test tube, microtiter well, sheet, bead, microparticle, chip, sheep (or other suitable animal's) red blood cells, duracytes® and other configurations known to those of ordinary skill in the art. The polynucleotides of the invention can be attached to or immobilized on a solid support individually or in groups of at least 2, 5, 8, 10, 12, 15, 20, or 25 distinct polynucleotides of the invention to a single solid support. In addition, polynucleotides other than those of the invention may be attached to the same solid support as one or more polynucleotides of the invention.
Any polynucleotide provided herein may be attached in overlapping areas or at random locations on the solid support. Alternatively the polynucleotides of the invention may be attached in an ordered array wherein each polynucleotide is attached to a distinct region of the solid support which does not overlap with the attachment site of any other polynucleotide. Preferably, such an ordered array of polynucleotides is designed to be "addressable" where the distinct locations are recorded and can be accessed as part of an assay procedure. Addressable polynucleotide arrays typically comprise a plurality of different oligonucleotide probes that are coupled to a surface of a substrate in different known locations. The knowledge of the precise location of each polynucleotide makes these "addressable" arrays particularly useful in hybridization assays. Any addressable array technology known in the art can be employed with the polynucleotides of the invention. One particular embodiment of these polynucleotide arrays is known as the Genechips™, and has been generally described in US Patent 5,143,854; PCT publications WO 90/15070 and 92/10092. These arrays may generally be produced using mechanical synthesis methods or light directed synthesis methods, which incorporate a combination of photolithographic methods and solid phase oligonucleotide synthesis (Fodor et al., Science, 251:767-777, 1991). The immobilization of arrays of oligonucleotides on solid supports has been rendered possible by the development of a technology generally identified as "Very Large Scale Immobilized Polymer Synthesis" (VLSIPS™) in which, typically, probes are immobilized in a high density array on a solid surface of a chip. Examples of VLSIPS™ technologies are provided in US Patents 5,143,854 and 5,412,087 and in PCT Publications WO 90/15070, WO 92/10092 and WO 95/11995, which describe methods for forming oligonucleotide arrays through techniques such as light-directed synthesis techniques. In designing strategies aimed at providing arrays of nucleotides immobilized on solid supports, further presentation strategies were developed to order and display the oligonucleotide arrays on the chips in an attempt to maximize hybridization patterns and sequence information. Examples of such presentation strategies are disclosed in PCT Publications WO 94/12305, WO 94/11530, WO 97/29212 and WO 97/31256. Oligonucleotide arrays may comprise at least one of the sequences selected from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and the sequences complementary thereto, or a fragment thereof of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 consecutive nucleotides, to the extent that fragments of these lengths are consistent with the lengths of the particular Sequence ID, for determining whether a sample contains one or more alleles of the biallelic markers of the present invention. Oligonucleotide arrays may also comprise at least one of the sequences selected from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908, and the sequences complementary thereto, or a fragment thereof of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 consecutive nucleotides, to the extent that fragments of these lengths are consistent with the lengths of the particular Sequence ID, for amplifying one or more alleles of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908.
In other embodiments, arrays may also comprise at least one of the sequences selected from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and the sequences complementary thereto, or a fragment thereof of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 consecutive nucleotides, to the extent that fragments of these lengths are consistent with the lengths of the particular Sequence ID, for conducting microsequencing analyses to determine whether a sample contains one or more alleles of the biallelic markers of the invention. In still further embodiments, the oligonucleotide array may comprise at least one of the sequences selected from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and the sequences complementary thereto, or a fragment thereof of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that fragments of these lengths are consistent with the lengths of the particular Sequence ID, for determining whether a sample contains one or more alleles of the biallelic markers of the present invention.
In designing strategies aimed at providing arrays of nucleotides immobilized on solid supports, further presentation strategies were developed to order and display the probe arrays on the chips in an attempt to maximize hybridization patterns and sequence information.
Examples of such presentation strategies are disclosed in PCT Publications WO 94/12305, WO 94/11530, WO 97/29212 and WO 97/31256. Each DNA chip can contain thousands to millions of individual synthetic DNA probes arranged in a grid-like pattern and miniaturized to the size of a dime. In some embodiments, the efficiency of hybridization of nucleic acids in the sample with the probes attached to the chip may be improved by using polyacrylamide gel pads isolated from one another by hydrophobic regions in which the DNA probes are covalently linked to an acrylamide matrix.
The polymorphic bases present in the biallelic marker or markers of the sample nucleic acids are determined as follows. Probes which contain at least a portion of one or more of the biallelic markers of the present invention are synthesized either in situ or by conventional synthesis and immobilized on an appropriate chip using methods known to the skilled technician.
Any one or more alleles of the biallelic markers described herein (SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto) or fragments thereof containing the polymorphic bases, may be fixed to a solid support, such as a microchip or other immobilizing surface. The fragments of these nucleic acids may comprise at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides of the biallelic markers described herein. Preferably, the fragments include the polymorphic bases of the biallelic markers.
A nucleic acid sample is applied to the immobilizing surface and analyzed to determine the identities of the polymorphic bases of one or more of the biallelic markers. In some embodiments, the solid support may also include one or more of the amplification primers described herein, or fragments comprising at least 10, at least 15, or at least 20 consecutive nucleotides thereof, for generating an amplification product containing the polymorphic bases of the biallelic markers to be analyzed in the sample.
Another embodiment of the present invention is a solid support which includes one or more of the microsequencing primers of the invention, or fragments comprising at least 10, at least 15, or at least 20 consecutive nucleotides thereof and having a 3' terminus immediately upstream of the polymorphic base of the corresponding biallelic marker, for determining the identity of the polymorphic base of the one or more biallelic markers fixed to the solid support. For example, one embodiment of the present invention is an array of nucleic acids fixed to a solid support, such as a microchip, bead, or other immobilizing surface, comprising one or more of the biallelic markers in the maps of the present invention or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base. For example, the array may comprise 1, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, or 3000 of the biallelic markers selected from the group consisting of SEQ ID Nos.: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto, or a fragment comprising at least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof including the polymorphic base.
Another embodiment of the present invention is an array comprising amplification primers for generating amplification products containing the polymorphic bases of one or more, at least five, at least 10, at least 20, at least 100, at least 200, at least 300, at least 400, or more than 400 of the biallelic markers in the maps of the present invention. For example, the array may comprise amplification primers for generating amplification products containing the polymorphic bases of at least 1, 5, 10, 20, 50, 100, 200, 300, 400, 500, 1000, 2000, or 3000 of the biallelic markers selected from the group consisting of SEQ ID Nos.: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto. In such arrays, the amplification primers included in the array are capable of amplifying the biallelic marker sequences to be detected in the nucleic acid sample applied to the array (i.e. the amplification primers correspond to the biallelic markers affixed to the array; see Table 1). Thus, the arrays may include one or more of the amplification primers of SEQ ID Nos.: 3935 to 7842, 7866 to 11773, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125, 10126 to 11599, and 11600 to 11773 corresponding to the one or more biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 which are included in the array.
Another embodiment of the present invention is an array which includes microsequencing primers capable of determining the identity of the polymorphic bases of at least 1, 5, 10, 20, 50, 100, 200, 300, 500, 1000, 2000, or 3000 of the biallelic markers of the present invention. For example, the array may comprise microsequencing primers capable of determining the identity of the polymorphic bases of one or more, at least five, at least 10, at least 20, at least 100, at least 200, at least 300, at least 400, or more than 400 of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
Arrays containing any combination of the above nucleic acids which permits the specific detection or identification of the polymorphic bases of the biallelic markers in the maps of the present invention, including any combination of biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto, are also within the scope of the present invention. For example, the array may comprise both the biallelic markers and amplification primers capable of generating amplification products containing the polymorphic bases of the biallelic markers. Alternatively, the array may comprise both amplification primers capable of generating amplification products containing the polymorphic bases of the biallelic markers and microsequencing primers capable of determining the identities of the polymorphic bases of these markers.
Although the above examples describe arrays comprising specific groups of biallelic markers and, in some embodiments, specific amplification primers and microsequencing primers, it will be appreciated that the present invention encompasses arrays including any biallelic marker, group of biallelic markers, amplification primer, group of amplification primers, microsequencing primer, or group of microsequencing primers described herein, as well as any combination of the preceding nucleic acids.
The present invention also encompasses diagnostic kits comprising one or more polynucleotides of the invention, optionally with a portion or all of the necessary reagents and instructions for genotyping a test subject by determining the identity of a nucleotide at a map-related biallelic marker. The polynucleotides of a kit may optionally be attached to a solid support, or be part of an array or addressable array of polynucleotides. The kit may provide for the determination of the identity of the nucleotide at a marker position by any method known in the art including, but not limited to, a sequencing assay method, a microsequencing assay method, a hybridization assay method, or an allele specific amplification method. Optionally such a kit may include instructions for scoring the results of the determination with respect to the test subject's risk of contracting a disease, likely response to an agent acting on a disease, or chances of suffering from side effects to an agent acting on a disease.
II. Methods For De Novo Identification Of Biallelic Markers
Any of a variety of methods can be used to screen a genomic fragment for single nucleotide polymorphisms such as differential hybridization with oligonucleotide probes, detection of changes in the mobility measured by gel electrophoresis or direct sequencing of the amplified nucleic acid. A preferred method for identifying biallelic markers involves comparative sequencing of genomic DNA fragments from an appropriate number of unrelated individuals.
In a first embodiment, DNA samples from unrelated individuals are pooled together, following which the genomic DNA of interest is amplified and sequenced. The nucleotide sequences thus obtained are then analyzed to identify significant polymorphisms. One of the major advantages of this method resides in the fact that the pooling of the DNA samples substantially reduces the number of DNA amplification reactions and sequencing reactions which must be carried out. Moreover, this method is sufficiently sensitive so that a biallelic marker obtained thereby usually demonstrates a sufficient frequency of its less common allele to be useful in conducting association studies. Usually, the frequency of the least common allele of a biallelic marker identified by this method is at least 10%.
In a second embodiment, the DNA samples are not pooled and are therefore amplified and sequenced individually. This method is usually preferred when biallelic markers need to be identified in order to perform association studies within candidate genes. Preferably, highly relevant gene regions such as promoter regions or exon regions may be screened for biallelic markers. A biallelic marker obtained using this method may show a lower degree of informativeness for conducting association studies, e.g. the frequency of its less frequent allele may be less than about 10%. Such a biallelic marker will however be sufficiently informative to conduct association studies and it will further be appreciated that including less informative biallelic markers in the genetic analysis studies of the present invention may allow in some cases the direct identification of causal mutations, which may, depending on their penetrance, be rare mutations.
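The 10% figure cited for both embodiments refers to the frequency of the less common allele. As a minimal sketch of how that frequency could be computed from individually sequenced samples, the following assumes each diploid genotype is given as a two-character string such as "AG". The function name `minor_allele_frequency` is hypothetical and is not part of the described method.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the frequency of the less common allele at a biallelic marker.

    genotypes -- iterable of two-character strings, one per diploid individual
    """
    # Each individual contributes two alleles to the count.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct alleles")
    return min(counts.values()) / sum(counts.values())
```

With eight "AA" individuals and two "AG" individuals, the G allele accounts for 2 of 20 chromosomes, so the marker sits exactly at the 10% level.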
The following is a description of the various parameters of a preferred method used by the inventors for the identification of the biallelic markers of the present invention.
II.A. Genomic DNA samples
The genomic DNA samples from which the biallelic markers of the present invention are generated are preferably obtained from unrelated individuals corresponding to a heterogeneous population of known ethnic background. The number of individuals from whom DNA samples are obtained can vary substantially, preferably from about 10 to about 1000, more preferably from about 50 to about 200 individuals. Usually, DNA samples are collected from at least about 100 individuals in order to have sufficient polymorphic diversity in a given population to identify as many markers as possible and to generate statistically significant results.
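The link between the number of individuals sampled and the polymorphic diversity captured can be illustrated with a simple probability calculation. The sketch below is not taken from the specification: it assumes that the 2n chromosomes of n unrelated diploid individuals are independent draws from the population, so that an allele of population frequency p is present at least once in the sample with probability 1 - (1 - p)^(2n). The function name `detection_probability` is hypothetical, and the result says nothing about whether a given sequencing method would actually detect the allele once it is present.

```python
def detection_probability(allele_freq: float, n_individuals: int) -> float:
    """Probability that an allele appears at least once among n diploid individuals.

    Assumes independent sampling of 2 * n_individuals chromosomes.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must lie between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** chromosomes
```

Under this assumption, an allele at 10% frequency is almost certain to be present among 100 individuals, while an allele at 1% frequency is present in roughly 87% of such samples.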
As for the source of the genomic DNA to be subjected to analysis, any test sample can be foreseen without any particular limitation. These test samples include biological samples, which can be tested by the methods of the present invention described herein, and include human and animal body fluids such as whole blood, serum, plasma, cerebrospinal fluid, urine, lymph fluids, and various external secretions of the respiratory, intestinal and genitourinary tracts, tears, saliva, milk, white blood cells, myelomas and the like; biological fluids such as cell culture supernatants; fixed tissue specimens including tumor and non-tumor tissue and lymph node tissues; bone marrow aspirates and fixed cell specimens. The preferred source of genomic DNA used in the present invention is from peripheral venous blood of each donor. Techniques to prepare genomic DNA from biological samples are well known to the skilled technician. Details of a preferred embodiment are provided in Example 27. The person skilled in the art can choose to amplify pooled or unpooled DNA samples.
II.B. DNA Amplification
The identification of biallelic markers in a sample of genomic DNA may be facilitated through the use of DNA amplification methods. DNA samples can be pooled or unpooled for the amplification step. DNA amplification techniques are well known to those skilled in the art. Various methods to amplify DNA fragments carrying biallelic markers are further described hereinafter in III.B. The PCR technology is the preferred amplification technique used to identify new biallelic markers. In a first embodiment, biallelic markers are identified using genomic sequence information generated by the inventors. Genomic DNA fragments, such as the inserts of the BAC clones described above, are sequenced and used to design primers for the amplification of 500 bp fragments. These 500 bp fragments are amplified from genomic DNA and are scanned for biallelic markers. Primers may be designed using the OSP software (Hillier L. and Green P., 1991). All primers may contain, upstream of the specific target bases, a common oligonucleotide tail that serves as a sequencing primer. Those skilled in the art are familiar with primer extensions, which can be used for these purposes.
In another embodiment of the invention, genomic sequences of candidate genes are available in public databases, allowing direct screening for biallelic markers. Preferred primers, useful for the amplification of genomic sequences encoding the candidate genes, focus on promoters, exons and splice sites of the genes. A biallelic marker present in these functional regions of the gene has a higher probability of being a causal mutation.
Preferred primers include those disclosed in SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
II.C. Sequencing Of Amplified Genomic DNA And Identification Of Single Nucleotide Polymorphisms
The amplification products generated as described above are then sequenced using any method known and available to the skilled technician. Methods for sequencing DNA using either the dideoxy-mediated method (Sanger method) or the Maxam-Gilbert method are widely known to those of ordinary skill in the art. Such methods are for example disclosed in Maniatis et al. (Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press, Second Edition, 1989). Alternative approaches include hybridization to high-density DNA probe arrays as described in Chee et al. (Science 274, 610, 1996).
Preferably, the amplified DNA is subjected to automated dideoxy terminator sequencing reactions using a dye-primer cycle sequencing protocol. The products of the sequencing reactions are run on sequencing gels and the sequences are determined using gel image analysis. The polymorphism search is based on the presence of superimposed peaks in the electrophoresis pattern resulting from different bases occurring at the same position.
Because each dideoxy terminator is labeled with a different fluorescent molecule, the two peaks corresponding to a biallelic site present distinct colors corresponding to two different nucleotides at the same position on the sequence. However, the presence of two peaks can be an artifact due to background noise. To exclude such an artifact, the two DNA strands are sequenced and a comparison between the peaks is carried out. In order to be registered as a polymorphic sequence, the polymorphism has to be detected on both strands. The above procedure permits those amplification products which contain biallelic markers to be identified. The detection limit for the frequency of biallelic polymorphisms detected by sequencing pools of 100 individuals is approximately 0.1 for the minor allele, as verified by sequencing pools of known allelic frequencies. However, more than 90% of the biallelic polymorphisms detected by the pooling method have a frequency for the minor allele higher than 0.25. Therefore, the biallelic markers selected by this method have a frequency of at least 0.1 for the minor allele and less than 0.9 for the major allele, preferably at least 0.2 for the minor allele and less than 0.8 for the major allele, more preferably at least 0.3 for the minor allele and less than 0.7 for the major allele, and thus a heterozygosity rate higher than 0.18, preferably higher than 0.32, more preferably higher than 0.42.
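The heterozygosity thresholds quoted above (0.18, 0.32 and 0.42) are what the expression 2pq gives for minor allele frequencies of 0.1, 0.2 and 0.3. The following minimal sketch reproduces that arithmetic. It assumes Hardy-Weinberg proportions, which the text does not state explicitly, and the function name is illustrative only.

```python
def expected_heterozygosity(minor_allele_freq: float) -> float:
    """Expected heterozygosity 2pq of a biallelic marker.

    Assumption (not stated in the text): genotypes follow
    Hardy-Weinberg proportions, so heterozygotes occur at rate 2pq.
    """
    if not 0.0 <= minor_allele_freq <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    q = minor_allele_freq      # minor allele frequency
    p = 1.0 - q                # major allele frequency
    return 2.0 * p * q


if __name__ == "__main__":
    # Minor allele frequency thresholds named in the text.
    for q in (0.1, 0.2, 0.3):
        print(f"minor allele {q:.1f} -> heterozygosity {expected_heterozygosity(q):.2f}")
```

Under the same assumption, heterozygosity peaks at 0.5 when both alleles are equally frequent, which is why markers with a minor allele frequency near 0.5 are the most informative.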
In another embodiment, biallelic markers are detected by sequencing individual DNA samples; the frequency of the minor allele of such a biallelic marker may be less than 0.1.
The markers carried by the same fragment of genomic DNA, such as the insert in a BAC clone, need not necessarily be ordered with respect to one another within the genomic fragment to conduct association studies. However, in some embodiments of the present invention, the order of biallelic markers carried by the same fragment of genomic DNA is determined.
II.D. Validation of the biallelic markers of the present invention
The polymorphisms are evaluated for their usefulness as genetic markers by validating that both alleles are present in a population. Validation of the biallelic markers is accomplished by genotyping a group of individuals by a method of the invention and demonstrating that both alleles are present. Microsequencing is a preferred method of genotyping alleles. The validation by genotyping step may be performed on individual samples derived from each individual in the group or by genotyping a pooled sample derived from more than one individual. The group can be as small as one individual if that individual is heterozygous for the allele in question. Preferably the group contains at least three individuals, more preferably the group contains five or six individuals, so that a single validation test will be more likely to result in the validation of more of the biallelic markers that are being tested. It should be noted, however, that when the validation test is performed on a small group it may give a false negative result if, as a result of sampling error, none of the individuals tested carries one of the two alleles. Thus, the validation process is less useful in demonstrating that a particular initial result is an artifact than it is at demonstrating that there is a bona fide biallelic marker at a particular position in a sequence. All of the genotyping, haplotyping, association, and interaction study methods of the invention may optionally be performed solely with validated biallelic markers.
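The effect of group size on false negatives can be illustrated with a short calculation. The sketch below estimates the chance that both alleles are observed in a group of n individuals. It assumes randomly sampled, unrelated individuals, Hardy-Weinberg proportions (so the 2n chromosomes can be treated as independent draws) and error-free genotyping. None of these assumptions is stated in the text, and the function name is illustrative.

```python
def prob_both_alleles_seen(minor_allele_freq: float, n_individuals: int) -> float:
    """Probability that a group of n diploid individuals carries both alleles.

    Assumptions (not from the text): the 2n sampled chromosomes are
    independent draws from the population, and genotyping is error-free.
    Validation fails (a false negative) when all 2n chromosomes carry the
    major allele or all carry the minor allele.
    """
    if not 0.0 < minor_allele_freq <= 0.5:
        raise ValueError("minor allele frequency must lie in (0, 0.5]")
    if n_individuals < 1:
        raise ValueError("group must contain at least one individual")
    q = minor_allele_freq
    p = 1.0 - q
    chromosomes = 2 * n_individuals
    return 1.0 - p ** chromosomes - q ** chromosomes


if __name__ == "__main__":
    # Group sizes mentioned in the text: one, three, and five or six individuals.
    for n in (1, 3, 5, 6):
        for q in (0.1, 0.3):
            print(f"n={n}, minor allele {q:.1f}: "
                  f"P(validated)={prob_both_alleles_seen(q, n):.3f}")
```

With a single individual the result reduces to the probability of that individual being heterozygous, which matches the statement that a group of one suffices only if the individual is heterozygous.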
II.E. Evaluation of the frequency of the biallelic markers of the present invention
The validated biallelic markers are further evaluated for their usefulness as genetic markers by determining the frequency of the least common allele at the biallelic marker site. The determination of the least common allele is accomplished by genotyping a group of individuals by a method of the invention and demonstrating that both alleles are present. This determination of frequency by genotyping step may be performed on individual samples derived from each individual in the group or by genotyping a pooled sample derived from more than one individual. The group must be large enough to be representative of the population as a whole. Preferably the group contains at least 20 individuals, more preferably the group contains at least 50 individuals, most preferably the group contains at least 100 individuals. Of course, the larger the group, the greater the accuracy of the frequency determination because of reduced sampling error. A biallelic marker wherein the frequency of the less common allele is 30% or more is termed a "high quality biallelic marker." All of the genotyping, haplotyping, association, and interaction study methods of the invention may optionally be performed solely with high quality biallelic markers.
III. Methods Of Genotyping An Individual For Biallelic Markers
Methods are provided to genotype a biological sample for one or more biallelic markers of the present invention, all of which may be performed in vitro. Such methods of genotyping comprise determining the identity of a nucleotide at a map-related biallelic marker by any method known in the art. These methods find use in genotyping case-control populations in association studies as well as individuals in the context of detection of alleles of biallelic markers which are known to be associated with a given trait, in which case both copies of the biallelic marker present in the individual's genome are determined so that an individual may be classified as homozygous or heterozygous for a particular allele.
These genotyping methods can be performed on nucleic acid samples derived from a single individual or on pooled DNA samples.
Genotyping can be performed using similar methods as those described above for the identification of the biallelic markers, or using other genotyping methods such as those further described below. In preferred embodiments, the comparison of sequences of amplified genomic fragments from different individuals is used to identify new biallelic markers whereas microsequencing is used for genotyping known biallelic markers in diagnostic and association study applications.
III.A. Source of DNA for genotyping
Any source of nucleic acids, in purified or non-purified form, can be utilized as the starting nucleic acid, provided it contains or is suspected of containing the specific nucleic acid sequence desired. DNA or RNA may be extracted from cells, tissues, body fluids and the like as described above in II.A. While nucleic acids for use in the genotyping methods of the invention can be derived from any mammalian source, the test subjects and individuals from which nucleic acid samples are taken are generally understood to be human.
III.B. Amplification Of DNA Fragments Comprising Biallelic Markers
Methods and polynucleotides are provided to amplify a segment of nucleotides comprising one or more biallelic markers of the present invention. It will be appreciated that amplification of DNA fragments comprising biallelic markers may be used in various methods and for various purposes and is not restricted to genotyping. Nevertheless, many genotyping methods, although not all, require the previous amplification of the DNA region carrying the biallelic marker of interest. Such methods specifically increase the concentration or total number of sequences that span the biallelic marker or include that site and sequences located either distal or proximal to it. Diagnostic assays may also rely on amplification of DNA segments carrying a biallelic marker of the present invention.
Amplification of DNA may be achieved by any method known in the art, including the established PCR (polymerase chain reaction) method, developments thereof, or alternatives. Amplification methods which can be utilized herein include but are not limited to Ligase Chain Reaction (LCR) as described in EP A 320 308 and EP A 439 182, Gap LCR (Wolcott, M.J., Clin. Microbiol. Rev. 5:370-386), the so-called "NASBA" or "3SR" technique described in Guatelli J.C. et al. (Proc. Natl. Acad. Sci. USA 87:1874-1878, 1990) and in Compton J. (Nature 350:91-92, 1991), Q-beta amplification as described in European Patent Application no 4544610, strand displacement amplification as described in Walker et al. (Clin. Chem. 42:9-13, 1996) and EP A 684 315, and target mediated amplification as described in PCT Publication WO 9322461.
LCR and Gap LCR are exponential amplification techniques; both depend on DNA ligase to join adjacent primers annealed to a DNA molecule. In Ligase Chain Reaction (LCR), probe pairs are used which include two primary (first and second) and two secondary (third and fourth) probes, all of which are employed in molar excess to target. The first probe hybridizes to a first segment of the target strand and the second probe hybridizes to a second segment of the target strand, the first and second segments being contiguous so that the primary probes abut one another in 5' phosphate-3' hydroxyl relationship, and so that a ligase can covalently fuse or ligate the two probes into a fused product. In addition, a third (secondary) probe can hybridize to a portion of the first probe and a fourth (secondary) probe can hybridize to a portion of the second probe in a similar abutting fashion. Of course, if the target is initially double stranded, the secondary probes also will hybridize to the target complement in the first instance. Once the ligated strand of primary probes is separated from the target strand, it will hybridize with the third and fourth probes, which can be ligated to form a complementary, secondary ligated product. It is important to realize that the ligated products are functionally equivalent to either the target or its complement. By repeated cycles of hybridization and ligation, amplification of the target sequence is achieved. A method for multiplex LCR has also been described (WO 9320227). Gap LCR (GLCR) is a version of LCR where the probes are not adjacent but are separated by 2 to 3 bases. For amplification of mRNAs, it is within the scope of the present invention to reverse transcribe mRNA into cDNA followed by polymerase chain reaction (RT-PCR), or to use a single enzyme for both steps as described in U.S. Patent No. 5,322,770, or to use Asymmetric Gap LCR (RT-AGLCR) as described by Marshall R.L. et al. (PCR Methods and Applications 4:80-84, 1994). AGLCR is a modification of GLCR that allows the amplification of RNA. Some of these amplification methods are particularly suited for the detection of single nucleotide polymorphisms and allow the simultaneous amplification of a target sequence and the identification of the polymorphic nucleotide, as is further described in III.C.
The PCR technology is the preferred amplification technique used in the present invention. A variety of PCR techniques are familiar to those skilled in the art. For a review of PCR technology, see Molecular Cloning to Genetic Engineering, White, B.A. Ed., in Methods in Molecular Biology 67: Humana Press, Totowa (1997) and the publication entitled "PCR Methods and Applications" (1991, Cold Spring Harbor Laboratory Press). In each of these PCR procedures, PCR primers on either side of the nucleic acid sequences to be amplified are added to a suitably prepared nucleic acid sample along with dNTPs and a thermostable polymerase such as Taq polymerase, Pfu polymerase, or Vent polymerase. The nucleic acid in the sample is denatured and the PCR primers are specifically hybridized to complementary nucleic acid sequences in the sample. The hybridized primers are extended. Thereafter, another cycle of denaturation, hybridization, and extension is initiated. The cycles are repeated multiple times to produce an amplified fragment containing the nucleic acid sequence between the primer sites. PCR has further been described in several patents including US Patents 4,683,195, 4,683,202 and 4,965,188.
The identification of biallelic markers as described above allows the design of appropriate oligonucleotides, which can be used as primers to amplify DNA fragments comprising the biallelic markers of the present invention. Amplification can be performed using the primers initially used to discover new biallelic markers, which are described herein, or any set of primers allowing the amplification of a DNA fragment comprising a biallelic marker of the present invention. Primers can be prepared by any suitable method, for example direct chemical synthesis by a method such as the phosphotriester method of Narang S.A. et al. (Methods Enzymol. 68:90-98, 1979), the phosphodiester method of Brown E.L. et al. (Methods Enzymol. 68:109-151, 1979), the diethylphosphoramidite method of Beaucage et al. (Tetrahedron Lett. 22:1859-1862, 1981) and the solid support method described in EP 0 707 592.
In some embodiments the present invention provides primers for amplifying a DNA fragment containing one or more biallelic markers of the present invention. Preferred amplification primers are listed in SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. It will be appreciated that the primers listed are merely exemplary and that any other set of primers which produce amplification products containing one or more biallelic markers of the present invention may also be used.
The primers are selected to be substantially complementary to the different strands of each specific sequence to be amplified. The length of the primers of the present invention can range from 8 to 100 nucleotides, preferably from 8 to 50, 8 to 30 or more preferably 8 to 25 nucleotides. Shorter primers tend to lack specificity for a target nucleic acid sequence and generally require cooler temperatures to form sufficiently stable hybrid complexes with the template. Longer primers are expensive to produce and can sometimes self-hybridize to form hairpin structures. The formation of stable hybrids depends on the melting temperature (Tm) of the DNA. The Tm depends on the length of the primer, the ionic strength of the solution and the G+C content. The higher the G+C content of the primer, the higher is the melting temperature, because G:C pairs are held by three H bonds whereas A:T pairs have only two. The G+C content of the amplification primers of the present invention preferably ranges between 10 and 75%, more preferably between 35 and 60%, and most preferably between 40 and 55%. The appropriate length for primers under a particular set of assay conditions may be empirically determined by one of skill in the art.
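The dependence of Tm on primer length and G+C content can be illustrated with a short calculation. The sketch below computes the G+C percentage of a primer, reports which of the stated preference ranges it falls in, and gives a rough Tm from the Wallace rule (2 °C per A or T, 4 °C per G or C). The Wallace rule is a common rule of thumb for short oligonucleotides and is an assumption here, not a formula given in the text. It ignores ionic strength, and the example primer is hypothetical.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in a primer."""
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G, T")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C by the Wallace rule.

    Assumption: Tm = 2*(A+T) + 4*(G+C), a rule of thumb for short
    oligonucleotides that is not taken from the text.
    """
    seq = primer.upper()
    gc = sum(base in "GC" for base in seq)
    at = sum(base in "AT" for base in seq)
    return 2 * at + 4 * gc


def gc_tier(primer: str) -> str:
    """Classify G+C content against the ranges stated in the text."""
    gc = gc_percent(primer)
    if 40.0 <= gc <= 55.0:
        return "most preferred"
    if 35.0 <= gc <= 60.0:
        return "more preferred"
    if 10.0 <= gc <= 75.0:
        return "preferred"
    return "outside stated ranges"


if __name__ == "__main__":
    primer = "AGCTTGCCATGGACTTCAGA"  # hypothetical 20-mer, not from the SEQ ID list
    print(len(primer), gc_percent(primer), wallace_tm(primer), gc_tier(primer))
```

Because each G or C contributes twice as much as each A or T in this approximation, two primers of equal length can differ substantially in estimated Tm, which is one reason the appropriate primer length is determined empirically for a given set of assay conditions.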
The spacing of the primers determines the length of the segment to be amplified. In the context of the present invention, amplified segments carrying biallelic markers can range in size from at least about 25 bp to 35 kbp. Amplification fragments from 25-3000 bp are typical, fragments from 50-1000 bp are preferred and fragments from 100-600 bp are highly preferred. It will be appreciated that amplification primers for the biallelic markers may be any sequence which allows the specific amplification of any DNA fragment carrying the markers. Amplification primers may be labeled or immobilized on a solid support as described in I.
III.C. Methods of Genotyping DNA samples for Biallelic Markers
Any method known in the art can be used to identify the nucleotide present at a biallelic marker site. Since the biallelic marker allele to be detected has been identified and specified in the present invention, detection will prove simple for one of ordinary skill in the art by employing any of a number of techniques. Many genotyping methods require the previous amplification of the DNA region carrying the biallelic marker of interest. While the amplification of target or signal is often preferred at present, ultrasensitive detection methods which do not require amplification are also encompassed by the present genotyping methods. Methods well-known to those skilled in the art that can be used to detect biallelic polymorphisms include methods such as conventional dot blot analyses, single strand conformational polymorphism analysis (SSCP) described by Orita et al. (Proc. Natl. Acad. Sci. U.S.A. 86:2766-2770, 1989), denaturing gradient gel electrophoresis (DGGE), heteroduplex analysis, mismatch cleavage detection, and other conventional techniques as described in Sheffield, V.C. et al. (Proc. Natl. Acad. Sci. USA 49:699-706, 1991), White et al. (Genomics 12:301-306, 1992), Grompe, M. et al. (Proc. Natl. Acad. Sci. USA 86:5855-5892, 1989) and Grompe, M. (Nature Genetics 5:111-117, 1993). Another method for determining the identity of the nucleotide present at a particular polymorphic site employs a specialized exonuclease-resistant nucleotide derivative as described in US patent 4,656,127.
Preferred methods involve directly determining the identity of the nucleotide present at a biallelic marker site by sequencing assay, enzyme-based mismatch detection assay, or hybridization assay. The following is a description of some preferred methods. A highly preferred method is the microsequencing technique. The term "sequencing assay" is used herein to refer to polymerase extension of duplex primer/template complexes and includes both traditional sequencing and microsequencing.
1) Sequencing assays
The nucleotide present at a polymorphic site can be determined by sequencing methods. In a preferred embodiment, DNA samples are subjected to PCR amplification before sequencing as described above. DNA sequencing methods are described in II.C.
Preferably, the amplified DNA is subjected to automated dideoxy terminator sequencing reactions using a dye-primer cycle sequencing protocol. Sequence analysis allows the identification of the base present at the biallelic marker site.
2) Microsequencing assays
In microsequencing methods, a nucleotide at the polymorphic site that is unique to one of the alleles in a target DNA is detected by a single nucleotide primer extension reaction. This method involves appropriate microsequencing primers which hybridize just upstream of a polymorphic base of interest in the target nucleic acid. A polymerase is used to specifically extend the 3' end of the primer with one single ddNTP (chain terminator) complementary to the selected nucleotide at the polymorphic site. Next, the identity of the incorporated nucleotide is determined in any suitable way.
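The single nucleotide primer extension principle described above can be sketched in a few lines. The example below is an idealized in silico model only: it assumes perfect primer annealing and error-free incorporation, and the sequences, function names and primer length are hypothetical and not taken from the sequence listing.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def microsequencing_primer(reference: str, site: int, length: int = 12) -> str:
    """Primer whose 3' end lies immediately upstream of the polymorphic base.

    `site` is the 0-based position of the polymorphic base in `reference`.
    """
    if site < length:
        raise ValueError("not enough upstream sequence for the primer")
    return reference[site - length:site]


def incorporated_ddntp(template_strand: str, primer: str) -> str:
    """Base added to the primer's 3' end when it anneals to `template_strand`.

    The extended strand is the reverse complement of the template, so the
    single ddNTP incorporated is complementary to the template base at the
    polymorphic site.
    """
    extended_strand = reverse_complement(template_strand)
    start = extended_strand.find(primer)
    if start < 0:
        raise ValueError("primer does not anneal to this template")
    next_pos = start + len(primer)
    if next_pos >= len(extended_strand):
        raise ValueError("no template base downstream of the primer")
    return extended_strand[next_pos]


def genotype(template_strands, primer: str) -> str:
    """Sorted set of ddNTPs incorporated across an individual's two copies."""
    return "".join(sorted({incorporated_ddntp(t, primer) for t in template_strands}))


if __name__ == "__main__":
    upstream, downstream = "ACGTTGCAAGCTTAGGCATC", "GATTACA"  # hypothetical flanks
    allele_a = upstream + "A" + downstream
    allele_g = upstream + "G" + downstream
    primer = microsequencing_primer(allele_a, site=len(upstream))
    templates = [reverse_complement(allele_a), reverse_complement(allele_g)]
    print(primer, genotype(templates, primer))
```

A homozygous sample yields a single incorporated base in this model, while a heterozygous sample yields two, which corresponds to the two differently labeled ddNTPs detected in the assay.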
Typically, microsequencing reactions are carried out using fluorescent ddNTPs and the extended microsequencing primers are analyzed by electrophoresis on ABI 377 sequencing machines to determine the identity of the incorporated nucleotide as described in EP 412 883. Alternatively, capillary electrophoresis can be used in order to process a higher number of assays simultaneously. An example of a typical microsequencing procedure that can be used in the context of the present invention is provided in Example 8.
Different approaches can be used to detect the nucleotide added to the microsequencing primer. A homogeneous phase detection method based on fluorescence resonance energy transfer has been described by Chen and Kwok (Nucleic Acids Research 25:347-353, 1997) and Chen et al. (Proc. Natl. Acad. Sci. USA 94/20 10756-10761, 1997). In this method, amplified genomic DNA fragments containing polymorphic sites are incubated with a 5'-fluorescein-labeled primer in the presence of allelic dye-labeled dideoxyribonucleoside triphosphates and a modified Taq polymerase. The dye-labeled primer is extended one base by the dye-terminator specific for the allele present on the template. At the end of the genotyping reaction, the fluorescence intensities of the two dyes in the reaction mixture are analyzed directly without separation or purification. All these steps can be performed in the same tube and the fluorescence changes can be monitored in real time. Alternatively, the extended primer may be analyzed by MALDI-TOF Mass Spectrometry. The base at the polymorphic site is identified by the mass added onto the microsequencing primer (see Haff L.A. and Smirnov I.P., Genome Research, 7:378-388, 1997).
Microsequencing may be achieved by the established microsequencing method or by developments or derivatives thereof. Alternative methods include several solid-phase microsequencing techniques. The basic microsequencing protocol is the same as described previously, except that the method is conducted as a heterogeneous phase assay, in which the primer or the target molecule is immobilized or captured onto a solid support. To simplify the primer separation and the terminal nucleotide addition analysis, oligonucleotides are attached to solid supports or are modified in such ways that permit affinity separation as well as polymerase extension. The 5' ends and internal nucleotides of synthetic oligonucleotides can be modified in a number of different ways to permit different affinity separation approaches, e.g., biotinylation. If a single affinity group is used on the oligonucleotides, the oligonucleotides can be separated from the incorporated terminator reagent. This eliminates the need for physical or size separation. More than one oligonucleotide can be separated from the terminator reagent and analyzed simultaneously if more than one affinity group is used. This permits the analysis of several nucleic acid species or more nucleic acid sequence information per extension reaction. The affinity group need not be on the priming oligonucleotide but could alternatively be present on the template. For example, immobilization can be carried out via an interaction between biotinylated DNA and streptavidin-coated microtitration wells or avidin-coated polystyrene particles. In the same manner, oligonucleotides or templates may be attached to a solid support in a high-density format. In such solid phase microsequencing reactions, incorporated ddNTPs can be radiolabeled (Syvanen, Clinica Chimica Acta 226:225-236, 1994) or linked to fluorescein (Livak and Hainer, Human Mutation 3:379-385, 1994). The detection of radiolabeled ddNTPs can be achieved through scintillation-based techniques. The detection of fluorescein-linked ddNTPs can be based on the binding of antifluorescein antibody conjugated with alkaline phosphatase, followed by incubation with a chromogenic substrate (such as p-nitrophenyl phosphate). Other possible reporter-detection pairs include: ddNTP linked to dinitrophenyl (DNP) and anti-DNP alkaline phosphatase conjugate (Harju et al., Clin. Chem. 39/11 2282-2287, 1993) or biotinylated ddNTP and horseradish peroxidase-conjugated streptavidin with o-phenylenediamine as a substrate (WO 92/15712). As yet another alternative solid-phase microsequencing procedure, Nyren et al. (Analytical Biochemistry 208:171-175, 1993) described a method relying on the detection of DNA polymerase activity by an enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA).
Pastinen et al. (Genome Research 7:606-614, 1997) describe a method for multiplex detection of single nucleotide polymorphisms in which the solid phase minisequencing principle is applied to an oligonucleotide array format. High-density arrays of DNA probes attached to a solid support (DNA chips) are further described in III.C.5.
In one aspect the present invention provides polynucleotides and methods to genotype one or more biallelic markers of the present invention by performing a microsequencing assay. In the preferred embodiment, it will be appreciated that any primer having a 3' end immediately adjacent to a polymorphic nucleotide may be used as a microsequencing primer. Similarly, it will be appreciated that microsequencing analysis may be performed for any biallelic marker or any combination of biallelic markers of the present invention. One aspect of the present invention is a solid support which includes one or more microsequencing primers comprising nucleotides complementary to the nucleotide sequences of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 or the complements thereof, or fragments comprising at least 8, at least 12, at least 15, or at least 20 consecutive nucleotides thereof and having a 3' terminus immediately upstream of the corresponding biallelic marker, for determining the identity of a nucleotide at a biallelic marker site.
3) Mismatch detection assays based on polymerases and ligases
In one aspect the present invention provides polynucleotides and methods to determine the allele of one or more biallelic markers of the present invention in a biological sample, by mismatch detection assays based on polymerases and/or ligases. These assays are based on the specificity of polymerases and ligases. Polymerization reactions place particularly stringent requirements on correct base pairing of the 3' end of the amplification primer, and the joining of two oligonucleotides hybridized to a target DNA sequence is quite sensitive to mismatches close to the ligation site, especially at the 3' end. The term "enzyme based mismatch detection assay" is used herein to refer to any method of determining the allele of a biallelic marker based on the specificity of ligases and polymerases. Preferred methods are described below. Methods, primers and various parameters to amplify DNA fragments comprising biallelic markers of the present invention are further described above in III.B.
Allele specific amplification
Discrimination between the two alleles of a biallelic marker can also be achieved by allele specific amplification, a selective strategy whereby one of the alleles is amplified without amplification of the other allele. This is accomplished by placing a polymorphic base at the 3' end of one of the amplification primers. Because the extension forms from the 3' end of the primer, a mismatch at or near this position has an inhibitory effect on amplification. Therefore, under appropriate amplification conditions, these primers only direct amplification on their complementary allele. Designing the appropriate allele-specific primer and the corresponding assay conditions is well within the ordinary skill in the art.
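The logic of placing the polymorphic base at the 3' end of a primer can be illustrated with a simple idealized model. In the sketch below, a primer is treated as extendable only when its body anneals and its 3' terminal base matches the allele present. Real assays depend on reaction conditions and on partial mismatch tolerance, which this model ignores. The sequences and function names are hypothetical, and each target is written as the strand identical in sequence to the primer (the primer anneals to its complement).

```python
def allele_specific_primers(upstream: str, alleles: str) -> dict:
    """One primer per allele, each ending (3' end) with the polymorphic base."""
    return {allele: upstream + allele for allele in alleles}


def primer_extends(primer: str, target: str) -> bool:
    """Idealized test of whether a primer directs amplification on a target.

    Assumption: extension occurs only if the primer body anneals and the
    3' terminal base is matched at the polymorphic position.
    """
    body, three_prime_base = primer[:-1], primer[-1]
    start = target.find(body)
    if start < 0:
        return False
    site = start + len(body)
    return site < len(target) and target[site] == three_prime_base


def call_alleles(primers: dict, targets) -> str:
    """Alleles for which at least one target copy is amplified."""
    return "".join(sorted(
        allele for allele, primer in primers.items()
        if any(primer_extends(primer, t) for t in targets)
    ))


if __name__ == "__main__":
    upstream, downstream = "GATTACAGGCTTAACG", "TTCGGA"  # hypothetical flanks
    primers = allele_specific_primers(upstream, "AG")
    copy_a = upstream + "A" + downstream
    copy_g = upstream + "G" + downstream
    print(call_alleles(primers, [copy_a, copy_a]))  # homozygous sample
    print(call_alleles(primers, [copy_a, copy_g]))  # heterozygous sample
```

Running one reaction per allele-specific primer and recording which reactions yield product gives the genotype in this model, which mirrors the selective strategy described above.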
Ligation/amplification based methods
The "Oligonucleotide Ligation Assay" (OLA) uses two oligonucleotides which are designed to be capable of hybridizing to abutting sequences of a single strand of a target molecule. One of the oligonucleotides is biotinylated, and the other is detectably labeled. If the precise complementary sequence is found in a target molecule, the oligonucleotides will hybridize such that their termini abut, and create a ligation substrate that can be captured and detected. OLA is capable of detecting biallelic markers and may be advantageously combined with PCR as described by Nickerson D.A. et al. (Proc. Natl. Acad. Sci. USA 87:8923-8927, 1990). In this method, PCR is used to achieve the exponential amplification of target DNA, which is then detected using OLA.
Other methods which are particularly suited for the detection of biallelic markers include LCR (ligase chain reaction) and Gap LCR (GLCR), which are described above in III.B. As mentioned above, LCR uses two pairs of probes to exponentially amplify a specific target. The sequences of each pair of oligonucleotides are selected to permit the pair to hybridize to abutting sequences of the same strand of the target. Such hybridization forms a substrate for a template-dependent ligase. In accordance with the present invention, LCR can be performed with oligonucleotides having the proximal and distal sequences of the same strand of a biallelic marker site. In one embodiment, either oligonucleotide will be designed to include the biallelic marker site. In such an embodiment, the reaction conditions are selected such that the oligonucleotides can be ligated together only if the target molecule either contains or lacks the specific nucleotide(s) that is complementary to the biallelic marker on the oligonucleotide. In an alternative embodiment, the oligonucleotides will not include the biallelic marker, such that when they hybridize to the target molecule, a "gap" is created as described in WO 90/01069. This gap is then "filled" with complementary dNTPs (as mediated by DNA polymerase), or by an additional pair of oligonucleotides. Thus at the end of each cycle, each single strand has a complement capable of serving as a target during the next cycle and exponential allele-specific amplification of the desired sequence is obtained.
Ligase/Polymerase-mediated Genetic Bit Analysis™ is another method for determining the identity of a nucleotide at a preselected site in a nucleic acid molecule (WO 95/21271). This method involves the incorporation of a nucleoside triphosphate that is complementary to the nucleotide present at the preselected site onto the terminus of a primer molecule, and their subsequent ligation to a second oligonucleotide. The reaction is monitored by detecting a specific label attached to the reaction's solid phase or by detection in solution.
4) Hybridization assay methods
A preferred method of determining the identity of the nucleotide present at a biallelic marker site involves nucleic acid hybridization. The hybridization probes, which can be conveniently used in such reactions, preferably include the probes defined herein. Any hybridization assay may be used including Southern hybridization, Northern hybridization, dot blot hybridization and solid-phase hybridization (see Sambrook et al., Molecular Cloning - A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989). Hybridization refers to the formation of a duplex structure by two single stranded nucleic acids due to complementary base pairing. Hybridization can occur between exactly complementary nucleic acid strands or between nucleic acid strands that contain minor regions of mismatch. Specific probes can be designed that hybridize to one form of a biallelic marker and not to the other and therefore are able to discriminate between different allelic forms. Allele-specific probes are often used in pairs, one member of a pair showing a perfect match to a target sequence containing the original allele and the other showing a perfect match to the target sequence containing the alternative allele. Hybridization conditions should be sufficiently stringent that there is a significant difference in hybridization intensity between alleles, and preferably an essentially binary response, whereby a probe hybridizes to only one of the alleles. Stringent, sequence specific hybridization conditions, under which a probe will hybridize only to the exactly complementary target sequence, are well known in the art (Sambrook et al., Molecular Cloning - A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989). Stringent conditions are sequence dependent and will be different in different circumstances. Generally, stringent conditions are selected to be about 5°C lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH.
By way of example and not limitation, procedures using conditions of high stringency are as follows: Prehybridization of filters containing DNA is carried out for 8 h to overnight at 65°C in buffer composed of 6X SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 0.02% PVP, 0.02% Ficoll, 0.02% BSA, and 500 μg/ml denatured salmon sperm DNA. Filters are hybridized for 48 h at 65°C, the preferred hybridization temperature, in prehybridization mixture containing 100 μg/ml denatured salmon sperm DNA and 5-20 x 10^6 cpm of 32P-labeled probe. Alternatively, the hybridization step can be performed at 65°C in the presence of SSC buffer, 1 x SSC corresponding to 0.15 M NaCl and 0.05 M Na citrate. Subsequently, filter washes can be done at 37°C for 1 h in a solution containing 2X SSC, 0.01% PVP, 0.01% Ficoll, and 0.01% BSA, followed by a wash in 0.1X SSC at 50°C for 45 min. Alternatively, filter washes can be performed in a solution containing 2 x SSC and 0.1% SDS, or 0.5 x SSC and 0.1% SDS, or 0.1 x SSC and 0.1% SDS at 68°C for 15 minute intervals. Following the wash steps, the hybridized probes are detectable by autoradiography. By way of example and not limitation, procedures using conditions of intermediate stringency are as follows: Filters containing DNA are prehybridized, and then hybridized at a temperature of 60°C in the presence of a 5 x SSC buffer and labeled probe. Subsequently, filter washes are performed in a solution containing 2x SSC at 50°C and the hybridized probes are detectable by autoradiography. Other conditions of high and intermediate stringency which may be used are well known in the art and as cited in Sambrook et al. (Molecular Cloning - A Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989) and Ausubel et al. (Current Protocols in Molecular Biology, Green Publishing Associates and Wiley Interscience, N.Y., 1989).
Although such hybridizations can be performed in solution, it is preferred to employ a solid-phase hybridization assay. The target DNA comprising a biallelic marker of the present invention may be amplified prior to the hybridization reaction. The presence of a specific allele in the sample is determined by detecting the presence or the absence of stable hybrid duplexes formed between the probe and the target DNA. The detection of hybrid duplexes can be carried out by a number of methods. Various detection assay formats are well known which utilize detectable labels bound to either the target or the probe to enable detection of the hybrid duplexes. Typically, hybridization duplexes are separated from unhybridized nucleic acids and the labels bound to the duplexes are then detected. Those skilled in the art will recognize that wash steps may be employed to wash away excess target DNA or probe. Standard heterogeneous assay formats are suitable for detecting the hybrids using the labels present on the primers and probes.
Two recently developed assays allow hybridization-based allele discrimination with no need for separations or washes (see Landegren U. et al., Genome Research, 8:769-776, 1998). The TaqMan assay takes advantage of the 5' nuclease activity of Taq DNA polymerase to digest a DNA probe annealed specifically to the accumulating amplification product. TaqMan probes are labeled with a donor-acceptor dye pair that interacts via fluorescence energy transfer. Cleavage of the TaqMan probe by the advancing polymerase during amplification dissociates the donor dye from the quenching acceptor dye, greatly increasing the donor fluorescence. All reagents necessary to detect two allelic variants can be assembled at the beginning of the reaction and the results are monitored in real time (see Livak et al., Nature Genetics, 9:341-342, 1995). In an alternative homogeneous hybridization-based procedure, molecular beacons are used for allele discriminations. Molecular beacons are hairpin-shaped oligonucleotide probes that report the presence of specific nucleic acids in homogeneous solutions. When they bind to their targets they undergo a conformational reorganization that restores the fluorescence of an internally quenched fluorophore (Tyagi et al., Nature Biotechnology, 16:49-53, 1998).
The polynucleotides provided herein can be used in hybridization assays for the detection of biallelic marker alleles in biological samples. These probes are characterized in that they preferably comprise between 8 and 50 nucleotides, and in that they are sufficiently complementary to a sequence comprising a biallelic marker of the present invention to hybridize thereto and preferably sufficiently specific to be able to discriminate the targeted sequence for only one nucleotide variation. The GC content in the probes of the invention usually ranges between 10 and 75 %, preferably between 35 and 60 %, and more preferably between 40 and 55 %. The length of these probes can range from 10, 15, 20, or 30 to at least 100 nucleotides, preferably from 10 to 50, more preferably from 18 to 35 nucleotides. A particularly preferred probe is 25 nucleotides in length. Preferably the biallelic marker is within 4 nucleotides of the center of the polynucleotide probe. In particularly preferred probes the biallelic marker is at the center of said polynucleotide. Shorter probes may lack specificity for a target nucleic acid sequence and generally require cooler temperatures to form sufficiently stable hybrid complexes with the template. Longer probes are expensive to produce and can sometimes self-hybridize to form hairpin structures. Methods for the synthesis of oligonucleotide probes have been described above and can be applied to the probes of the present invention.
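By way of illustration only, the more preferred probe ranges stated above (18 to 35 nucleotides, 40 to 55 % GC content, marker within 4 nucleotides of the center) can be checked with a short Python sketch. The function names, the 0-based `marker_index` convention and the example 25-mer are assumptions made for this illustration and are not sequences or procedures of the invention.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a probe sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def check_probe(seq, marker_index, min_len=18, max_len=35,
                min_gc=40.0, max_gc=55.0, max_offset=4):
    """Compare a probe with the 'more preferred' ranges quoted in the text.

    marker_index is the 0-based position of the biallelic marker in the probe
    (a convention assumed for this sketch).
    """
    center = (len(seq) - 1) / 2
    return {
        "length_ok": min_len <= len(seq) <= max_len,
        "gc_ok": min_gc <= gc_percent(seq) <= max_gc,
        "marker_centered": abs(marker_index - center) <= max_offset,
    }


# Hypothetical 25-mer with the marker at the central position (index 12).
probe = "ATGCCGTTAGCAGTCCATGGATCCA"
report = check_probe(probe, marker_index=12)
```

The defaults encode only one of the nested preference ranges given above; the broader ranges can be tested by passing different bounds.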
Preferably the probes of the present invention are labeled or immobilized on a solid support. Labels and solid supports are further described in I. Detection probes are generally nucleic acid sequences or uncharged nucleic acid analogs such as, for example, peptide nucleic acids, which are disclosed in International Patent Application WO 92/20702, and morpholino analogs, which are described in U.S. Patents Numbered 5,185,444; 5,034,506 and 5,142,047.
The probe may have to be rendered "non-extendable" in that additional dNTPs cannot be added to the probe. In and of themselves analogs usually are non-extendable, and nucleic acid probes can be rendered non-extendable by modifying the 3' end of the probe such that the hydroxyl group is no longer capable of participating in elongation. For example, the 3' end of the probe can be functionalized with the capture or detection label to thereby consume or otherwise block the hydroxyl group. Alternatively, the 3' hydroxyl group simply can be cleaved, replaced or modified. U.S. Patent Application Serial No. 07/049,061, filed April 19, 1993, describes modifications which can be used to render a probe non-extendable.
The probes of the present invention are useful for a number of purposes. They can be used in Southern hybridization to genomic DNA or Northern hybridization to mRNA. The probes can also be used to detect PCR amplification products. By assaying the hybridization to an allele specific probe, one can detect the presence or absence of a biallelic marker allele in a given sample.
High-throughput parallel hybridizations in array format are specifically encompassed within "hybridization assays" and are described below.
Hybridization to addressable arrays of oligonucleotides
Hybridization assays based on oligonucleotide arrays rely on the differences in hybridization stability of short oligonucleotides to perfectly matched and mismatched target sequence variants. Efficient access to polymorphism information is obtained through a basic structure comprising high-density arrays of oligonucleotide probes attached to a solid support (the chip) at selected positions. Each DNA chip can contain thousands to millions of individual synthetic DNA probes arranged in a grid-like pattern and miniaturized to the size of a dime.
The chip technology has already been applied with success in numerous cases. For example, the screening of mutations has been undertaken in the BRCA1 gene, in S. cerevisiae mutant strains, and in the protease gene of HIV-1 virus (Hacia et al., Nature Genetics, 14(4):441-447, 1996; Shoemaker et al., Nature Genetics, 14(4):450-456, 1996; Kozal et al., Nature Medicine, 2:753-759, 1996). Chips of various formats for use in detecting biallelic polymorphisms can be produced on a customized basis by Affymetrix (GeneChip™), Hyseq (HyChip and HyGnostics), and Protogene Laboratories. In general, these methods employ arrays of oligonucleotide probes that are complementary to target nucleic acid sequence segments from an individual, which target sequences include a polymorphic marker. EP785280 describes a tiling strategy for the detection of single nucleotide polymorphisms. Briefly, arrays may generally be "tiled" for a large number of specific polymorphisms. By "tiling" is generally meant the synthesis of a defined set of oligonucleotide probes which is made up of a sequence complementary to the target sequence of interest, as well as preselected variations of that sequence, e.g., substitution of one or more given positions with one or more members of the basis set of monomers, i.e. nucleotides. Tiling strategies are further described in PCT application No. WO 95/11995. In a particular aspect, arrays are tiled for a number of specific, identified biallelic marker sequences. In particular the array is tiled to include a number of detection blocks, each detection block being specific for a specific biallelic marker or a set of biallelic markers. For example, a detection block may be tiled to include a number of probes, which span the sequence segment that includes a specific polymorphism. To ensure probes that are complementary to each allele, the probes are synthesized in pairs differing at the biallelic marker. In addition to the probes differing at the polymorphic base, monosubstituted probes are also generally tiled within the detection block. These monosubstituted probes have bases at and up to a certain number of bases in either direction from the polymorphism, substituted with the remaining nucleotides (selected from A, T, G, C and U). Typically the probes in a tiled detection block will include substitutions of the sequence positions up to and including those that are 5 bases away from the biallelic marker. The monosubstituted probes provide internal controls for the tiled array, to distinguish actual hybridization from artefactual cross-hybridization. Upon completion of hybridization with the target sequence and washing of the array, the array is scanned to determine the position on the array to which the target sequence hybridizes. The hybridization data from the scanned array is then analyzed to identify which allele or alleles of the biallelic marker are present in the sample. Hybridization and scanning may be carried out as described in PCT application No. WO 92/10092 and WO 95/11995 and US patent No. 5,424,186.
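The composition of a tiled detection block as described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tiling procedure of EP785280 or WO 95/11995: the function names, the label strings, the restriction to the DNA alphabet (A, C, G, T) and the example target segment are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Probe sequence complementary to a target segment (DNA alphabet only)."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_detection_block(target, marker_pos, alleles, window=5):
    """Return {probe_sequence: label} for one biallelic marker.

    The block holds one perfectly matched probe per allele, plus
    monosubstituted control probes covering the marker position and every
    position up to `window` bases on either side of it.
    """
    variants = {a: target[:marker_pos] + a + target[marker_pos + 1:]
                for a in alleles}
    block = {reverse_complement(v): f"match_{a}" for a, v in variants.items()}
    lo = max(0, marker_pos - window)
    hi = min(len(target) - 1, marker_pos + window)
    for allele, variant in variants.items():
        for pos in range(lo, hi + 1):
            for base in "ACGT":
                if base == variant[pos]:
                    continue  # not a substitution
                if pos == marker_pos and base in alleles:
                    continue  # that sequence is the other allele's match probe
                control = variant[:pos] + base + variant[pos + 1:]
                block.setdefault(reverse_complement(control),
                                 f"control_{allele}_pos{pos}_{base}")
    return block


# Hypothetical 25-base target segment with an A/G marker at 0-based index 12.
block = tile_detection_block("ATGCCGTTAGCAGTCCATGGATCCA", 12, ("A", "G"))
```

Keying the block by probe sequence removes duplicates, since substitutions at the marker position yield the same control sequence whichever allele they start from.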
Thus, in some embodiments, the chips may comprise an array of nucleic acid sequences of fragments of about 15 nucleotides in length. In further embodiments, the chip may comprise an array including at least one of the sequences selected from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and the sequences complementary thereto, or a fragment thereof of at least about 8 consecutive nucleotides, preferably 10, 15, 20, more preferably at least 30, 35, 43, 44, 45, 46 or 47 consecutive nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID. In some embodiments, the chip may comprise an array of at least 2, 3, 4, 5, 6, 7, 8 or more of these polynucleotides of the invention. Solid supports and polynucleotides of the present invention attached to solid supports are further described in I.
5) Integrated Systems
Another technique, which may be used to analyze polymorphisms, includes multicomponent integrated systems, which miniaturize and compartmentalize processes such as PCR and capillary electrophoresis reactions in a single functional device. An example of such technique is disclosed in US patent 5,589,136, which describes the integration of PCR amplification and capillary electrophoresis in chips.
Integrated systems can be envisaged mainly when microfluidic systems are used. These systems comprise a pattern of microchannels designed onto a glass, silicon, quartz, or plastic wafer included on a microchip. The movements of the samples are controlled by electric, electroosmotic or hydrostatic forces applied across different areas of the microchip. For genotyping biallelic markers, the microfluidic system may integrate nucleic acid amplification, microsequencing, capillary electrophoresis and a detection method such as laser-induced fluorescence detection.
IV. Methods Of Genetic Analysis Using The Biallelic Markers Of The Present Invention
Different methods are available for the genetic analysis of complex traits (see Lander and Schork, Science, 265:2037-2048, 1994). The search for disease-susceptibility genes is conducted using two main methods: the linkage approach, in which evidence is sought for cosegregation between a locus and a putative trait locus using family studies, and the association approach, in which evidence is sought for a statistically significant association between an allele and a trait or a trait causing allele (Khoury J. et al., Fundamentals of Genetic Epidemiology, Oxford University Press, NY, 1993). In general, the biallelic markers of the present invention find use in any method known in the art to demonstrate a statistically significant correlation between a genotype and a phenotype. The biallelic markers may be used in parametric and non-parametric linkage analysis methods. Preferably, the biallelic markers of the present invention are used to identify genes associated with detectable traits using association studies, an approach which does not require the use of affected families and which permits the identification of genes associated with complex and sporadic traits. The genetic analysis using the biallelic markers of the present invention may be conducted on any scale. The whole set of biallelic markers of the present invention or any subset of biallelic markers of the present invention may be used. In some embodiments a subset of biallelic markers corresponding to one or several candidate genes may be used. In other embodiments a subset of biallelic markers corresponding to candidate genes from a particular disease pathway may be used. Alternatively, a subset of biallelic markers of the present invention localised on a specific chromosome segment may be used. Further, any set of genetic markers including a biallelic marker of the present invention may be used.
A set of biallelic polymorphisms that could be used as genetic markers in combination with the biallelic markers of the present invention has been described in WO 98/20165. As mentioned above, it should be noted that the biallelic markers of the present invention may be included in any complete or partial genetic map of the human genome. These different uses are specifically contemplated in the present invention and claims.
IV.A. Linkage analysis
Linkage analysis is based upon establishing a correlation between the transmission of genetic markers and that of a specific trait throughout generations within a family. Thus, the aim of linkage analysis is to detect marker loci that show cosegregation with a trait of interest in pedigrees.
Parametric methods
When data are available from successive generations there is the opportunity to study the degree of linkage between pairs of loci. Estimates of the recombination fraction enable loci to be ordered and placed onto a genetic map. With loci that are genetic markers, a genetic map can be established, and then the strength of linkage between markers and traits can be calculated and used to indicate the relative positions of markers and genes affecting those traits (Weir, B.S., Genetic Data Analysis II: Methods for Discrete Population Genetic Data, Sinauer Assoc., Inc., Sunderland, MA, USA, 1996). The classical method for linkage analysis is the logarithm of odds (lod) score method (see Morton N.E., Am. J. Hum. Genet., 7:277-318, 1955; Ott J., Analysis of Human Genetic Linkage, Johns Hopkins University Press, Baltimore, 1991). Calculation of lod scores requires specification of the mode of inheritance for the disease (parametric method). Generally, the length of the candidate region identified using linkage analysis is between 2 and 20 Mb. Once a candidate region is identified as described above, analysis of recombinant individuals using additional markers allows further delineation of the candidate region. Linkage analysis studies have generally relied on the use of a maximum of 5,000 microsatellite markers, thus limiting the maximum theoretical attainable resolution of linkage analysis to about 600 kb on average.
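The figure of about 600 kb follows from dividing the genome length by the number of markers. The sketch below reproduces that arithmetic in Python; the genome size of 3,000 Mb is an assumed round value that is not stated in the text, and the function name is illustrative.

```python
# Assumption for this sketch: a haploid human genome of about 3,000 Mb.
GENOME_SIZE_BP = 3_000_000_000


def mean_marker_spacing_kb(n_markers: int,
                           genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Average distance between evenly spaced markers, in kilobases."""
    return genome_size_bp / n_markers / 1_000


spacing_kb = mean_marker_spacing_kb(5_000)  # 600.0 kb for 5,000 markers
```

Under the same assumption, a denser marker set shortens the average inter-marker distance in direct proportion to the number of markers.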
Linkage analysis has been successfully applied to map simple genetic traits that show clear Mendelian inheritance patterns and which have a high penetrance (i.e., the ratio between the number of trait positive carriers of allele a and the total number of a carriers in the population). However, parametric linkage analysis suffers from a variety of drawbacks. First, it is limited by its reliance on the choice of a genetic model suitable for each studied trait. Furthermore, as already mentioned, the resolution attainable using linkage analysis is limited, and complementary studies are required to refine the analysis of the typical 2 Mb to 20 Mb regions initially identified through linkage analysis. In addition, parametric linkage analysis approaches have proven difficult when applied to complex genetic traits, such as those due to the combined action of multiple genes and/or environmental factors. It is very difficult to model these factors adequately in a lod score analysis. In such cases, too large an effort and cost are needed to recruit the adequate number of affected families required for applying linkage analysis to these situations, as recently discussed by Risch, N. and Merikangas, K. (Science, 273:1516-1517, 1996).
Non-parametric methods
The advantage of the so-called non-parametric methods for linkage analysis is that they do not require specification of the mode of inheritance for the disease; they tend to be more useful for the analysis of complex traits. In non-parametric methods, one tries to prove that the inheritance pattern of a chromosomal region is not consistent with random Mendelian segregation by showing that affected relatives inherit identical copies of the region more often than expected by chance. Affected relatives should show excess "allele sharing" even in the presence of incomplete penetrance and polygenic inheritance. In non-parametric linkage analysis the degree of agreement at a marker locus in two individuals can be measured either by the number of alleles identical by state (IBS) or by the number of alleles identical by descent (IBD). Affected sib pair analysis is a well-known special case and is the simplest form of these methods.
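As a minimal sketch of the identity-by-state measure mentioned above, the following Python function counts how many alleles (0, 1 or 2) two individuals share at one marker locus. The function name and the representation of a genotype as an unordered pair of allele symbols are assumptions of this illustration. Identity by descent is not computed here, since it additionally requires pedigree information.

```python
from collections import Counter


def ibs_count(genotype_a, genotype_b) -> int:
    """Number of alleles (0, 1 or 2) shared identical by state at one locus.

    Each genotype is an unordered pair of allele symbols, e.g. ("A", "G").
    """
    shared = Counter(genotype_a) & Counter(genotype_b)  # multiset intersection
    return sum(shared.values())


# Hypothetical sib pair genotyped at a single A/G biallelic marker.
sib_pair_sharing = ibs_count(("A", "G"), ("A", "A"))  # shares one "A" allele
```

The multiset intersection handles homozygotes correctly: two "A/A" individuals share two alleles, whereas "A/A" and "A/G" share only one.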
The biallelic markers of the present invention may be used in both parametric and non-parametric linkage analysis. Preferably biallelic markers may be used in non-parametric methods which allow the mapping of genes involved in complex traits. The biallelic markers of the present invention may be used in both IBD- and IBS- methods to map genes affecting a complex trait. In such studies, taking advantage of the high density of biallelic markers, several adjacent biallelic marker loci may be pooled to achieve the efficiency attained by multi-allelic markers (Zhao et al., Am. J. Hum. Genet., 63:225-240, 1998). However, because both parametric and non-parametric linkage analysis methods analyse affected relatives, they tend to be of limited value in the genetic analysis of drug responses or in the analysis of side effects to treatments. This type of analysis is impractical in such cases due to the lack of availability of familial cases. In fact, the likelihood of having more than one individual in a family being exposed to the same drug at the same time is extremely low.
IV.B. Population Association studies
The present invention comprises methods for identifying one or several genes among a set of candidate genes that are associated with a detectable trait using the biallelic markers of the present invention. In one embodiment the present invention comprises methods to detect an association between a biallelic marker allele or a biallelic marker haplotype and a trait. Further, the invention comprises methods to identify a trait causing allele in linkage disequilibrium with any biallelic marker allele of the present invention.
As described above, alternative approaches can be employed to perform association studies: genome-wide association studies, candidate region association studies and candidate gene association studies. In a preferred embodiment, the biallelic markers of the present invention are used to perform candidate gene association studies. Further, the biallelic markers of the present invention may be incorporated in any map of genetic markers of the human genome in order to perform genome-wide association studies. Methods to generate a high-density map of biallelic markers have been described in US Provisional Patent application serial number 60/082,614. The biallelic markers of the present invention may further be incorporated in any map of a specific candidate region of the genome (a specific chromosome or a specific chromosomal segment for example).
As mentioned above, association studies may be conducted within the general population and are not limited to studies performed on related individuals in affected families. Association studies are extremely valuable as they permit the analysis of sporadic or multifactor traits. Moreover, association studies represent a powerful method for fine-scale mapping enabling much finer mapping of trait causing alleles than linkage studies. Studies based on pedigrees often only narrow the location of the trait causing allele. Association studies using the biallelic markers of the present invention can therefore be used to refine the location of a trait causing allele in a candidate region identified by Linkage Analysis methods. Moreover, once a chromosome segment of interest has been identified, the presence of a candidate gene such as a candidate gene of the present invention, in the region of interest can provide a shortcut to the identification of the trait causing allele. Biallelic markers of the present invention can be used to demonstrate that a candidate gene is associated with a trait. Such uses are specifically contemplated in the present invention and claims.
1) Determining the frequency of a biallelic marker allele or of a biallelic marker haplotype in a population
Association studies explore the relationships among frequencies for sets of alleles between loci.
Determining the frequency of an allele in a population
Allelic frequencies of the biallelic markers in a population can be determined using one of the methods described above under the heading "Methods for genotyping an individual for biallelic markers", or any genotyping procedure suitable for this intended purpose. Genotyping pooled samples or individual samples can determine the frequency of a biallelic marker allele in a population. One way to reduce the number of genotypings required is to use pooled samples. A major obstacle in using pooled samples is the accuracy and reproducibility with which DNA concentrations can be determined when setting up the pools. Genotyping individual samples provides higher sensitivity, reproducibility and accuracy and is the preferred method used in the present invention. Preferably, each individual is genotyped separately and simple gene counting is applied to determine the frequency of an allele of a biallelic marker or of a genotype in a given population.
Determining the frequency of a haplotype in a population
The gametic phase of haplotypes is unknown when diploid individuals are heterozygous at more than one locus. Using genealogical information in families, gametic phase can sometimes be inferred (Perlin et al., Am. J. Hum. Genet., 55:777-787, 1994). When no genealogical information is available different strategies may be used. One possibility is that the multiple-site heterozygous diploids can be eliminated from the analysis, keeping only the homozygotes and the single-site heterozygote individuals, but this approach might lead to a possible bias in the sample composition and the underestimation of low-frequency haplotypes. Another possibility is that single chromosomes can be studied independently, for example, by asymmetric PCR amplification (see Newton et al., Nucleic Acids Res., 17:2503-2516, 1989; Wu et al., Proc. Natl. Acad. Sci. USA, 86:2757, 1989) or by isolation of single chromosome by limit dilution followed by PCR amplification (see Ruano et al., Proc. Natl. Acad. Sci. USA, 87:6296-6300, 1990). Further, a sample may be haplotyped for sufficiently close biallelic markers by double PCR amplification of specific alleles (Sarkar, G. and Sommer S.S., Biotechniques, 1991). These approaches are not entirely satisfying either because of their technical complexity, the additional cost they entail, their lack of generalisation at a large scale, or the possible biases they introduce. To overcome these difficulties, an algorithm to infer the phase of PCR-amplified DNA genotypes introduced by Clark A.G. (Mol. Biol. Evol., 7:111-122, 1990) may be used. Briefly, the principle is to start filling a preliminary list of haplotypes present in the sample by examining unambiguous individuals, that is, the complete homozygotes and the single-site heterozygotes. Then other individuals in the same sample are screened for the possible occurrence of previously recognised haplotypes. For each positive identification, the complementary haplotype is added to the list of recognised haplotypes, until the phase information for all individuals is either resolved or identified as unresolved. This method assigns a single haplotype to each multiheterozygous individual, whereas several haplotypes are possible when there is more than one heterozygous site. Alternatively, one can use methods estimating haplotype frequencies in a population without assigning haplotypes to each individual. Preferably, a method based on an expectation-maximization (EM) algorithm (Dempster et al., J. R. Stat. Soc., 39B:1-38, 1977) leading to maximum-likelihood estimates of haplotype frequencies under the assumption of Hardy-Weinberg proportions (random mating) is used (see Excoffier L. and Slatkin M., Mol. Biol. Evol., 12(5):921-927, 1995). The EM algorithm is a generalised iterative maximum-likelihood approach to estimation that is useful when data are ambiguous and/or incomplete. The EM algorithm is used to resolve heterozygotes into haplotypes. Haplotype estimations are further described below under the heading "Statistical methods". Any other method known in the art to determine or to estimate the frequency of a haplotype in a population may also be used.
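To make the EM principle concrete, the following Python sketch estimates haplotype frequencies for the simplest case of two biallelic markers, where only the double heterozygote has ambiguous phase. It is an illustration under stated assumptions and not the implementation of Excoffier and Slatkin: the function name, the coding of each genotype as the count (0, 1 or 2) of allele "1" at each locus, the uniform starting frequencies and the example sample are all hypothetical.

```python
from itertools import product


def em_haplotype_frequencies(genotypes, max_iter=200, tol=1e-12):
    """EM estimate of two-locus haplotype frequencies under random mating.

    genotypes: list of (g1, g2), the number of copies (0, 1 or 2) of allele
    "1" carried at locus 1 and locus 2 by each individual.
    """
    haplotypes = list(product((0, 1), repeat=2))
    freqs = {h: 0.25 for h in haplotypes}  # uniform starting point
    n_chromosomes = 2 * len(genotypes)
    for _ in range(max_iter):
        counts = dict.fromkeys(haplotypes, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: split the double heterozygote between its two
                # possible phases in proportion to their current likelihood.
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is unambiguous: count both haplotypes directly.
                counts[(g1 // 2, g2 // 2)] += 1
                counts[((g1 + 1) // 2, (g2 + 1) // 2)] += 1
        # M-step: new frequencies are the expected haplotype proportions.
        new_freqs = {h: counts[h] / n_chromosomes for h in haplotypes}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haplotypes)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


# Hypothetical sample: 4 + 4 double homozygotes and 2 double heterozygotes.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
estimated = em_haplotype_frequencies(sample)
```

In this sample the unambiguous individuals carry only the (0, 0) and (1, 1) haplotypes, so successive iterations assign the double heterozygotes almost entirely to that phase. With more than two loci the same scheme applies, but every compatible haplotype pair of each multiply heterozygous genotype must be enumerated in the E-step.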
2) Linkage Disequilibrium analysis
Linkage disequilibrium is the non-random association of alleles at two or more loci and represents a powerful tool for mapping genes involved in disease traits (see Ajioka R.S. et al., Am. J. Hum. Genet., 60:1439-1447, 1997). Biallelic markers, because they are densely spaced in the human genome and can be genotyped in greater numbers than other types of genetic markers (such as RFLP or VNTR markers), are particularly useful in genetic analysis based on linkage disequilibrium. The biallelic markers of the present invention may be used in any linkage disequilibrium analysis method known in the art.
Briefly, when a disease mutation is first introduced into a population (by a new mutation or the immigration of a mutation carrier), it necessarily resides on a single chromosome and thus on a single "background" or "ancestral" haplotype of linked markers. Consequently, there is complete disequilibrium between these markers and the disease mutation: one finds the disease mutation only in the presence of a specific set of marker alleles. Through subsequent generations recombinations occur between the disease mutation and these marker polymorphisms, and the disequilibrium gradually dissipates. The pace of this dissipation is a function of the recombination frequency, so the markers closest to the disease gene will manifest higher levels of disequilibrium than those that are further away. When not broken up by recombination, "ancestral" haplotypes and linkage disequilibrium between marker alleles at different loci can be tracked not only through pedigrees but also through populations. Linkage disequilibrium is usually seen as an association between one specific allele at one locus and another specific allele at a second locus.
The pattern or curve of disequilibrium between disease and marker loci is expected to exhibit a maximum that occurs at the disease locus. Consequently, the amount of linkage disequilibrium between a disease allele and closely linked genetic markers may yield valuable information regarding the location of the disease gene. For fine-scale mapping of a disease locus, it is useful to have some knowledge of the patterns of linkage disequilibrium that exist between markers in the studied region. As mentioned above, the mapping resolution achieved through the analysis of linkage disequilibrium is much higher than that of linkage studies. The high density of biallelic markers combined with linkage disequilibrium analysis provides powerful tools for fine-scale mapping. Different methods to calculate linkage disequilibrium are described below under the heading "Statistical Methods".
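For orientation, the Python sketch below computes two common textbook measures of pairwise disequilibrium, the coefficient D = p(AB) - p(A)p(B) and its normalized form D', together with the standard expectation that D decays by a factor (1 - theta) per generation for recombination fraction theta. These are offered as a generic illustration; they are not necessarily the measures used under the heading "Statistical Methods", and the function names and numerical inputs are assumptions.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float):
    """Return (D, D') from the AB haplotype frequency and allele frequencies.

    D' is returned with the sign of D; some authors report its absolute value.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime


def decayed_d(d0: float, theta: float, generations: int) -> float:
    """Expected D after a number of generations of random mating."""
    return d0 * (1 - theta) ** generations


# Hypothetical markers at two distances from a disease mutation that starts
# in complete disequilibrium with both (D = 0.25 at allele frequencies 0.5).
d0, d_prime0 = linkage_disequilibrium(p_ab=0.5, p_a=0.5, p_b=0.5)
near = decayed_d(d0, theta=0.001, generations=100)
far = decayed_d(d0, theta=0.01, generations=100)
```

The comparison of `near` and `far` mirrors the statement above that markers closest to the disease gene retain higher levels of disequilibrium than more distant ones.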
3) Population-based case-control studies of trait-marker associations
As mentioned above, the occurrence of pairs of specific alleles at different loci on the same chromosome is not random and the deviation from random is called linkage disequilibrium. Association studies focus on population frequencies and rely on the phenomenon of linkage disequilibrium. If a specific allele in a given gene is directly involved in causing a particular trait, its frequency will be statistically increased in an affected (trait positive) population, when compared to the frequency in a trait negative population or in a random control population. As a consequence of the existence of linkage disequilibrium, the frequency of all other alleles present in the haplotype carrying the trait-causing allele will also be increased in trait positive individuals compared to trait negative individuals or random controls. Therefore, association between the trait and any allele (specifically a biallelic marker allele) in linkage disequilibrium with the trait-causing allele will suffice to suggest the presence of a trait-related gene in that particular region. Case-control populations can be genotyped for biallelic markers to identify associations that narrowly locate a trait causing allele. As any marker in linkage disequilibrium with one given marker associated with a trait will be associated with the trait, linkage disequilibrium allows the relative frequencies in case-control populations of a limited number of genetic polymorphisms (specifically biallelic markers) to be analysed as an alternative to screening all possible functional polymorphisms in order to find trait-causing alleles. Association studies compare the frequency of marker alleles in unrelated case-control populations, and represent powerful tools for the dissection of complex traits.
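A comparison of marker allele frequencies between case and control populations can be illustrated with a Pearson chi-square test on a 2 x 2 table of allele counts. The Python sketch below is one conventional way to run such a test and is not presented as the statistical method of the invention; the function name and the allele counts are hypothetical.

```python
import math


def allele_association(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on a 2 x 2 allele table.

    case_counts and control_counts are (allele_1, allele_2) chromosome counts.
    Returns (chi2, p_value). All four marginal totals must be non-zero.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 100 cases and 100 controls, i.e. 200 chromosomes each.
chi2, p_value = allele_association(case_counts=(120, 80),
                                   control_counts=(90, 110))
```

With these invented counts the first allele is more frequent in cases (0.60) than in controls (0.45), and the test quantifies how unlikely a difference of that size would be in the absence of any association.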
Case-control populations (inclusion criteria)
Population-based association studies do not concern familial inheritance but compare the prevalence of a particular genetic marker, or a set of markers, in case-control populations. They are case-control studies based on comparison of unrelated case (affected or trait positive) individuals and unrelated control (unaffected or trait negative or random) individuals. Preferably the control group is composed of unaffected or trait negative individuals. Further, the control group is ethnically matched to the case population. Moreover, the control group is preferably matched to the case population for the main known confounding factor for the trait under study (for example age-matched for an age-dependent trait). Ideally, individuals in the two samples are paired in such a way that they are expected to differ only in their disease status. In the following "trait positive population", "case population" and "affected population" are used interchangeably.
An important step in the dissection of complex traits using association studies is the choice of case-control populations (see Lander and Schork, Science, 265, 2037-2048, 1994). A major step in the choice of case-control populations is the clinical definition of a given trait or phenotype. Any genetic trait may be analysed by the association method proposed here by carefully selecting the individuals to be included in the trait positive and trait negative phenotypic groups. Four criteria are often useful: clinical phenotype, age at onset, family history and severity. The selection procedure for continuous or quantitative traits (such as blood pressure for example) involves selecting individuals at opposite ends of the phenotype distribution of the trait under study, so as to include in these trait positive and trait negative populations individuals with non-overlapping phenotypes. Preferably, case-control populations consist of phenotypically homogeneous populations. Trait positive and trait negative populations consist of phenotypically uniform populations of individuals representing each between 1 and 98%, preferably between 1 and 80%, more preferably between 1 and 50%, and more preferably between 1 and 30%, most preferably between 1 and 20% of the total population under study, and selected among individuals exhibiting non-overlapping phenotypes. The clearer the difference between the two trait phenotypes, the greater the probability of detecting an association with biallelic markers. The selection of those drastically different but relatively uniform phenotypes enables efficient comparisons in association studies and the possible detection of marked differences at the genetic level, provided that the sample sizes of the populations under study are large enough. In preferred embodiments, a first group of between 50 and 300 trait positive individuals, preferably about 100 individuals, are recruited according to their phenotypes. A similar number of trait negative individuals are included in such studies.
Association analysis
The general strategy to perform association studies using biallelic markers derived from a region carrying a candidate gene is to scan two groups of individuals (case-control populations) in order to measure and statistically compare the allele frequencies of the biallelic markers of the present invention in both groups. If a statistically significant association with a trait is identified for one or more of the analysed biallelic markers, one can assume that either the associated allele is directly responsible for causing the trait (the associated allele is the trait-causing allele), or, more likely, the associated allele is in linkage disequilibrium with the trait-causing allele. The specific characteristics of the associated allele with respect to the candidate gene function usually give further insight into the relationship between the associated allele and the trait (causal or in linkage disequilibrium). If the evidence indicates that the associated allele within the candidate gene is most probably not the trait-causing allele but is in linkage disequilibrium with the real trait-causing allele, then the trait-causing allele can be found by sequencing the vicinity of the associated marker. Association studies are usually run in two successive steps. In a first phase, the frequencies of a reduced number of biallelic markers from one or several candidate genes are determined in the trait positive and trait negative populations. In a second phase of the analysis, the identity of the candidate gene and the position of the genetic loci responsible for the given trait are further refined using a higher density of markers from the relevant region. However, if the candidate gene under study is relatively small in length, as is the case for many of the candidate genes analysed in the present invention, a single phase may be sufficient to establish significant associations.
Haplotype analysis
As described above, when a chromosome carrying a disease allele first appears in a population as a result of either mutation or migration, the mutant allele necessarily resides on a chromosome having a set of linked markers: the ancestral haplotype. This haplotype can be tracked through populations and its statistical association with a given trait can be analysed. Complementing single point (allelic) association studies with multi-point association studies, also called haplotype studies, increases the statistical power of association studies. Thus, a haplotype association study allows one to define the frequency and the type of the ancestral carrier haplotype. A haplotype analysis is important in that it increases the statistical power of an analysis involving individual markers.
In a first stage of a haplotype frequency analysis, the frequency of the possible haplotypes based on various combinations of the identified biallelic markers of the invention is determined. The haplotype frequency is then compared for distinct populations of trait positive and control individuals. The number of trait positive individuals that should be subjected to this analysis to obtain statistically significant results usually ranges between 30 and 300, with a preferred number of individuals ranging between 50 and 150. The same considerations apply to the number of unaffected individuals (or random controls) used in the study. The results of this first analysis provide haplotype frequencies in case-control populations; for each evaluated haplotype frequency, a p-value and an odds ratio are calculated.
If a statistically significant association is found, the relative risk for an individual carrying the given haplotype of being affected with the trait under study can be approximated.
Interaction Analysis
The biallelic markers of the present invention may also be used to identify patterns of biallelic markers associated with detectable traits resulting from polygenic interactions. The analysis of genetic interaction between alleles at unlinked loci requires individual genotyping using the techniques described herein. The analysis of allelic interaction among a selected set of biallelic markers with an appropriate level of statistical significance can be considered as a haplotype analysis. Interaction analysis consists of stratifying the case-control populations with respect to a given haplotype for the first loci and performing a haplotype analysis with the second loci within each subpopulation.
Statistical methods used in association studies are further described below in IV.C.
4) Testing for linkage in the presence of association
The biallelic markers of the present invention may further be used in TDT (transmission/disequilibrium test). TDT tests for both linkage and association and is not affected by population stratification. TDT requires data for affected individuals and their parents or data from unaffected sibs instead of from parents (see Spielmann S. et al., Am. J. Hum. Genet., 52:506-516, 1993; Schaid D.J. et al., Genet. Epidemiol., 13:423-450, 1996; Spielmann S. and Ewens W.J., Am. J. Hum. Genet., 62:450-458, 1998). Such combined tests generally reduce the false-positive errors produced by separate analyses.
IV.C. Statistical methods
In general, any method known in the art to test whether a trait and a genotype show a statistically significant correlation may be used.
1) Methods in linkage analysis
Statistical methods and computer programs useful for linkage analysis are well-known to those skilled in the art (see Terwilliger J.D. and Ott J., Handbook of Human Genetic Linkage, Johns Hopkins University Press, London, 1994; Ott J., Analysis of Human Genetic Linkage, Johns Hopkins University Press, Baltimore, 1991).
2) Methods to estimate haplotype frequencies in a population
As described above, when genotypes are scored, it is often not possible to distinguish heterozygotes so that haplotype frequencies cannot be easily inferred. When the gametic phase is not known, haplotype frequencies can be estimated from the multilocus genotypic data. Any method known to a person skilled in the art can be used to estimate haplotype frequencies (see Lange K., Mathematical and Statistical Methods for Genetic Analysis, Springer, New York, 1997; Weir B.S., Genetic Data Analysis II: Methods for Discrete Population Genetic Data, Sinauer Assoc., Inc., Sunderland, MA, USA, 1996). Preferably, maximum-likelihood haplotype frequencies are computed using an Expectation-Maximization (EM) algorithm (see Dempster et al., J. R. Stat. Soc., 39B:1-38, 1977; Excoffier L. and Slatkin M., Mol. Biol. Evol., 12(5):921-927, 1995). This procedure is an iterative process aiming at obtaining maximum-likelihood estimates of haplotype frequencies from multi-locus genotype data when the gametic phase is unknown. Haplotype estimations are usually performed by applying the EM algorithm using, for example, the EM-HAPLO program (Hawley M.E. et al., Am. J. Phys. Anthropol., 18:104, 1994) or the Arlequin program (Schneider et al., Arlequin: a software for population genetics data analysis, University of Geneva, 1997). The EM algorithm is a generalised iterative maximum-likelihood approach to estimation and is briefly described below.
In the following part of this text, phenotypes will refer to multi-locus genotypes with unknown phase. Genotypes will refer to known-phase multi-locus genotypes.
Suppose a sample of N unrelated individuals typed for K markers. The data observed are the unknown-phase K-locus phenotypes that can be categorised into F different phenotypes. Suppose that we have H underlying possible haplotypes (in the case of K biallelic markers, $H = 2^K$). For phenotype j, suppose that $c_j$ genotypes are possible. We thus have the following equation:
$$P_j = \sum_{i=1}^{c_j} pr(\text{genotype}_i) = \sum_{i=1}^{c_j} pr(h_k, h_l) \qquad \text{(Equation 1)}$$
where $P_j$ is the probability of phenotype j, and $h_k$ and $h_l$ are the two haplotypes constituting genotype i. Under Hardy-Weinberg equilibrium, $pr(h_k, h_l)$ becomes:
$$pr(h_k, h_l) = pr(h_k)^2 \ \text{if } h_k = h_l, \qquad pr(h_k, h_l) = 2\,pr(h_k)\,pr(h_l) \ \text{if } h_k \neq h_l. \qquad \text{(Equation 2)}$$
The successive steps of the E-M algorithm can be described as follows:
Starting with initial values of the haplotype frequencies, noted $p_1^{(0)}, p_2^{(0)}, \ldots, p_H^{(0)}$, these initial values serve to estimate the genotype frequencies (Expectation step) and then to estimate another set of haplotype frequencies (Maximisation step), noted $p_1^{(1)}, p_2^{(1)}, \ldots, p_H^{(1)}$. These two steps are iterated until changes in the sets of haplotype frequencies are very small.
A stop criterion can be that the maximum difference between haplotype frequencies between two iterations is less than $10^{-7}$. This value can be adjusted according to the desired precision of estimations. In detail, at a given iteration s, the Expectation step consists in calculating the genotype frequencies by the following equation:
$$pr(\text{genotype}_i)^{(s)} = pr(\text{phenotype}_j) \cdot pr(\text{genotype}_i \mid \text{phenotype}_j)^{(s)} = \frac{n_j}{N} \cdot \frac{pr(h_k, h_l)^{(s)}}{P_j^{(s)}} \qquad \text{(Equation 3)}$$
where genotype i occurs in phenotype j, $n_j$ is the number of individuals in the sample showing phenotype j, and $h_k$ and $h_l$ constitute genotype i. Each probability is derived according to Equations 1 and 2 described above.
Then the Maximisation step simply estimates another set of haplotype frequencies given the genotype frequencies. This approach is also known as the gene-counting method (Smith, Ann. Hum. Genet., 21:254-276, 1957).
$$p_t^{(s+1)} = \frac{1}{2} \sum_{j=1}^{F} \sum_{i=1}^{c_j} \delta_{it}\, pr(\text{genotype}_i)^{(s)} \qquad \text{(Equation 4)}$$
where $\delta_{it}$ is an indicator variable which counts the number of times haplotype t occurs in genotype i. It takes the values 0, 1 or 2.
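The Expectation and Maximisation steps described above can be sketched in Python as follows. This is a minimal illustration and not the EM-HAPLO or Arlequin implementation: it assumes biallelic markers whose unphased genotypes are coded as the number of copies (0, 1 or 2) of one allele at each marker, it starts from uniform haplotype frequencies, and the function names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import product


def compatible_pairs(phenotype):
    """Unordered haplotype pairs (h_k, h_l) consistent with one unphased genotype.

    phenotype: tuple giving, per marker, the number of copies (0, 1 or 2) of allele 1.
    """
    het = [i for i, g in enumerate(phenotype) if g == 1]
    base = [g // 2 for g in phenotype]  # homozygous loci: 0 -> 0, 2 -> 1
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for locus, b in zip(het, bits):
            h1[locus], h2[locus] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(phenotypes, tol=1e-7, max_iter=1000):
    """Estimate haplotype frequencies from a list of unphased genotype tuples."""
    n_markers = len(phenotypes[0])
    haplotypes = list(product((0, 1), repeat=n_markers))  # H = 2**K
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start (assumption)
    counts = Counter(phenotypes)  # n_j for each distinct phenotype j
    n_total = len(phenotypes)     # N
    pairs = {ph: compatible_pairs(ph) for ph in counts}
    for _ in range(max_iter):
        new = dict.fromkeys(haplotypes, 0.0)
        for ph, n_j in counts.items():
            # Equation 2: genotype probabilities under Hardy-Weinberg equilibrium
            pr = [freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
                  for a, b in pairs[ph]]
            p_j = sum(pr)  # Equation 1
            for (a, b), p in zip(pairs[ph], pr):
                g = (n_j / n_total) * p / p_j  # Equation 3 (Expectation step)
                new[a] += 0.5 * g              # Equation 4 (gene counting)
                new[b] += 0.5 * g
        delta = max(abs(new[h] - freq[h]) for h in haplotypes)
        freq = new
        if delta < tol:  # stop criterion on the largest frequency change
            break
    return freq
```

For example, `em_haplotype_frequencies([(0, 0)] * 3 + [(2, 2)])` assigns frequency 0.75 to haplotype `(0, 0)` and 0.25 to haplotype `(1, 1)`, since no phenotype in that sample is phase-ambiguous.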
To ensure that the estimation finally obtained is the maximum-likelihood estimation, several sets of starting values are required. The estimations obtained are compared and, if they are different, the estimations leading to the best likelihood are kept.
3) Methods to calculate linkage disequilibrium between markers
A number of methods can be used to calculate linkage disequilibrium between any two genetic positions; in practice, linkage disequilibrium is measured by applying a statistical association test to haplotype data taken from a population. Linkage disequilibrium between any pair of biallelic markers comprising at least one of the biallelic markers of the present invention ($M_i$, $M_j$) having alleles ($a_i/b_i$) at marker $M_i$ and alleles ($a_j/b_j$) at marker $M_j$ can be calculated for every allele combination ($a_i,a_j$; $a_i,b_j$; $b_i,a_j$ and $b_i,b_j$), according to the Piazza formula:
$$\Delta_{a_i a_j} = \sqrt{\theta_4} - \sqrt{(\theta_4 + \theta_3)(\theta_4 + \theta_2)}$$
where:
$\theta_4$ = - - = frequency of genotypes not having allele $a_i$ at $M_i$ and not having allele $a_j$ at $M_j$
$\theta_3$ = - + = frequency of genotypes not having allele $a_i$ at $M_i$ and having allele $a_j$ at $M_j$
$\theta_2$ = + - = frequency of genotypes having allele $a_i$ at $M_i$ and not having allele $a_j$ at $M_j$
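A minimal Python sketch of the Piazza calculation follows. It assumes the square-root form of the formula, $\sqrt{\theta_4} - \sqrt{(\theta_4+\theta_3)(\theta_4+\theta_2)}$, which under Hardy-Weinberg equilibrium reduces to $pr(\text{haplotype}(b_i, b_j)) - pr(b_i)\,pr(b_j)$; "having" an allele is taken to mean carrying at least one copy. The function name `piazza_delta` is hypothetical.

```python
from math import sqrt


def piazza_delta(theta4, theta3, theta2):
    """Piazza disequilibrium estimate from genotype frequencies.

    theta4: frequency of genotypes with neither allele a_i nor allele a_j
    theta3: frequency of genotypes without a_i but with a_j
    theta2: frequency of genotypes with a_i but without a_j
    """
    return sqrt(theta4) - sqrt((theta4 + theta3) * (theta4 + theta2))
```

With two independent markers whose alleles all have frequency 0.5, `piazza_delta(0.0625, 0.1875, 0.1875)` is zero; with only the two haplotypes ($a_i a_j$) and ($b_i b_j$) at frequency 0.5 each, `piazza_delta(0.25, 0.0, 0.0)` gives 0.25.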
Linkage disequilibrium (LD) between pairs of biallelic markers ($M_i$, $M_j$) can also be calculated for every allele combination ($a_i,a_j$; $a_i,b_j$; $b_i,a_j$ and $b_i,b_j$), according to the maximum-likelihood estimate (MLE) for delta (the composite genotypic disequilibrium coefficient), as described by Weir (Weir B.S., Genetic Data Analysis, Sinauer Ass. Eds, 1996). The MLE for the composite linkage disequilibrium is:
$$D_{a_i a_j} = \frac{2n_1 + n_2 + n_3 + n_4/2}{N} - 2\,pr(a_i)\,pr(a_j)$$
where $n_1$ = Σ phenotype ($a_i/a_i$, $a_j/a_j$), $n_2$ = Σ phenotype ($a_i/a_i$, $a_j/b_j$), $n_3$ = Σ phenotype ($a_i/b_i$, $a_j/a_j$), $n_4$ = Σ phenotype ($a_i/b_i$, $a_j/b_j$) and N is the number of individuals in the sample.
This formula allows linkage disequilibrium between alleles to be estimated when only genotype, and not haplotype, data are available.
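As an illustration, the composite estimate can be computed directly from unphased two-marker genotypes. The sketch below assumes each individual is coded as a pair (copies of $a_i$, copies of $a_j$), each 0, 1 or 2, and that the four counts are taken as $n_1$ = ($a_i/a_i$, $a_j/a_j$), $n_2$ = ($a_i/a_i$, $a_j/b_j$), $n_3$ = ($a_i/b_i$, $a_j/a_j$) and $n_4$ = ($a_i/b_i$, $a_j/b_j$); the function name `composite_ld` is hypothetical.

```python
def composite_ld(genotypes):
    """Composite genotypic disequilibrium for alleles a_i and a_j.

    genotypes: list of (g_i, g_j) pairs, the number of copies (0, 1 or 2)
    of allele a_i at marker M_i and of allele a_j at marker M_j.
    """
    n = len(genotypes)
    n1 = sum(1 for gi, gj in genotypes if gi == 2 and gj == 2)
    n2 = sum(1 for gi, gj in genotypes if gi == 2 and gj == 1)
    n3 = sum(1 for gi, gj in genotypes if gi == 1 and gj == 2)
    n4 = sum(1 for gi, gj in genotypes if gi == 1 and gj == 1)
    p_i = sum(gi for gi, _ in genotypes) / (2 * n)  # pr(a_i)
    p_j = sum(gj for _, gj in genotypes) / (2 * n)  # pr(a_j)
    return (2 * n1 + n2 + n3 + n4 / 2) / n - 2 * p_i * p_j
```

A sample made only of double heterozygotes gives 0, because the phase of the two markers cannot be told apart in such individuals; a sample split evenly between (2, 2) and (0, 0) individuals gives 0.5.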
Another means of calculating the linkage disequilibrium between markers is as follows. For a couple of biallelic markers, $M_i$ ($a_i/b_i$) and $M_j$ ($a_j/b_j$), fitting the Hardy-Weinberg equilibrium, one can estimate the four possible haplotype frequencies in a given population according to the approach described above.
The estimation of gametic disequilibrium between $a_i$ and $a_j$ is simply:
$$D_{a_i a_j} = pr(\text{haplotype}(a_i, a_j)) - pr(a_i)\,pr(a_j)$$
where $pr(a_i)$ is the probability of allele $a_i$, $pr(a_j)$ is the probability of allele $a_j$, and $pr(\text{haplotype}(a_i, a_j))$ is estimated as in Equation 3 above.
For a couple of biallelic markers, only one measure of disequilibrium is necessary to describe the association between $M_i$ and $M_j$.
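The following Python sketch illustrates why a single measure suffices: computed from the four haplotype frequencies, the gametic disequilibrium has the same magnitude for every allele combination and only changes sign. The coding of alleles as the strings 'a' and 'b' and the function name `gametic_disequilibrium` are assumptions made for the example.

```python
def gametic_disequilibrium(hap_freq):
    """Gametic disequilibrium D for each allele combination of two biallelic markers.

    hap_freq: dict mapping (allele at M_i, allele at M_j) -> haplotype frequency,
    with alleles coded 'a' or 'b'; the four frequencies should sum to 1.
    """
    # Marginal allele frequencies at each marker
    p_i = {x: sum(f for (i, _), f in hap_freq.items() if i == x) for x in "ab"}
    p_j = {y: sum(f for (_, j), f in hap_freq.items() if j == y) for y in "ab"}
    return {(x, y): hap_freq.get((x, y), 0.0) - p_i[x] * p_j[y]
            for x in "ab" for y in "ab"}
```

With haplotype frequencies 0.4 (a, a), 0.1 (a, b), 0.2 (b, a) and 0.3 (b, b), the result is 0.1 for (a, a) and (b, b) and -0.1 for (a, b) and (b, a).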
Then a normalised value of the above is calculated as follows:
$$D'_{a_i a_j} = D_{a_i a_j} / \max\left(-pr(a_i)\,pr(a_j),\; -pr(b_i)\,pr(b_j)\right) \quad \text{with } D_{a_i a_j} < 0$$
$$D'_{a_i a_j} = D_{a_i a_j} / \max\left(pr(b_i)\,pr(a_j),\; pr(a_i)\,pr(b_j)\right) \quad \text{with } D_{a_i a_j} > 0$$
The skilled person will readily appreciate that other LD calculation methods can be used without undue experimentation.
Linkage disequilibrium among a set of biallelic markers having an adequate heterozygosity rate can be determined by genotyping between 50 and 1000 unrelated individuals, preferably between 75 and 200, more preferably around 100.
4) Testing for association
The statistical significance of a correlation between a phenotype and a genotype, in this case an allele at a biallelic marker or a haplotype made up of such alleles, may be determined by any statistical test known in the art and with any accepted threshold of statistical significance being required. The application of particular methods and thresholds of significance is well within the skill of the ordinary practitioner of the art. Testing for association is performed by determining the frequency of a biallelic marker allele in case and control populations and comparing these frequencies with a statistical test to determine if there is a statistically significant difference in frequency which would indicate a correlation between the trait and the biallelic marker allele under study.
Similarly, a haplotype analysis is performed by estimating the frequencies of all possible haplotypes for a given set of biallelic markers in case and control populations, and comparing these frequencies with a statistical test to determine if there is a statistically significant correlation between the haplotype and the phenotype (trait) under study. Any statistical tool useful to test for a statistically significant association between a genotype and a phenotype may be used. Preferably the statistical test employed is a chi-square test with one degree of freedom. A p-value is calculated (the p-value is the probability that a statistic as large as or larger than the observed one would occur by chance).
Statistical significance
In preferred embodiments, to establish significance for diagnosis purposes, either as a positive basis for further diagnostic tests or as a preliminary starting point for early preventive therapy, the p-value related to a biallelic marker association is preferably about $1 \times 10^{-2}$ or less, more preferably about $1 \times 10^{-4}$ or less, for a single biallelic marker analysis and about $1 \times 10^{-3}$ or less, still more preferably $1 \times 10^{-6}$ or less and most preferably about $1 \times 10^{-8}$ or less, for a haplotype analysis involving several markers. These values are believed to be applicable to any association studies involving single or multiple marker combinations.
The skilled person can use the range of values set forth above as a starting point in order to carry out association studies with biallelic markers of the present invention. In doing so, significant associations between the biallelic markers of the present invention and diseases can be revealed.
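As an illustration of the single-marker test, the sketch below computes a chi-square statistic with one degree of freedom from a 2 x 2 table of allele counts in cases and controls, together with its p-value. It assumes the plain Pearson statistic without continuity correction, and uses the identity that the upper tail of a one-degree-of-freedom chi-square distribution equals `erfc(sqrt(x / 2))`; the function name `allele_chi_square` is hypothetical.

```python
from math import erfc, sqrt


def allele_chi_square(case_a, case_b, control_a, control_b):
    """Chi-square test (1 degree of freedom) on allele counts.

    case_a, case_b: counts of alleles a and b in the case population
    control_a, control_b: counts of alleles a and b in the control population
    Returns (chi-square statistic, p-value).
    """
    n = case_a + case_b + control_a + control_b
    denom = ((case_a + case_b) * (control_a + control_b)
             * (case_a + control_a) * (case_b + control_b))
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / denom
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value
```

For 60 a and 40 b alleles in cases against 40 a and 60 b alleles in controls, the statistic is 8.0 and the p-value is a little under 0.005, which would pass a threshold of $1 \times 10^{-2}$ but not one of $1 \times 10^{-4}$.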
Phenotypic permutation
In order to confirm the statistical significance of the first stage haplotype analysis described above, it might be suitable to perform further analyses in which genotyping data from case-control individuals are pooled and randomised with respect to the trait phenotype. Each individual's genotyping data is randomly allocated to one of two groups, which contain the same number of individuals as the case-control populations used to compile the data obtained in the first stage. A second stage haplotype analysis is preferably run on these artificial groups, preferably for the markers included in the haplotype of the first stage analysis showing the highest relative risk coefficient. This experiment is reiterated preferably at least between 100 and 10000 times. The repeated iterations allow the determination of the percentage of obtained haplotypes with a significant p-value level.
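The permutation procedure can be sketched in Python as follows. To stay short, the sketch is a simplification: it permutes case and control labels for a single biallelic marker and uses an allele-count chi-square as the test statistic, whereas the procedure described above re-runs a haplotype analysis on each artificial pair of groups. The function names and the genotype coding (number of copies of allele a, 0 to 2) are assumptions.

```python
import random


def allele_chi2(cases, controls):
    """Chi-square (1 df) comparing allele a counts between two groups.

    Each individual is coded as the number of copies (0, 1 or 2) of allele a.
    """
    a, c = sum(cases), sum(controls)
    b, d = 2 * len(cases) - a, 2 * len(controls) - c
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def permutation_p_value(cases, controls, n_perm=1000, seed=0):
    """Fraction of random reallocations whose statistic reaches the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = allele_chi2(cases, controls)
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Artificial groups keep the original case and control sample sizes
        perm_cases, perm_controls = pooled[:len(cases)], pooled[len(cases):]
        if allele_chi2(perm_cases, perm_controls) >= observed:
            hits += 1
    return hits / n_perm
```

A small returned fraction indicates that the observed difference is rarely matched when the trait labels are assigned at random.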
Assessment of statistical association
To address the problem of false positives, similar analysis may be performed with the same case-control populations in random genomic regions. Results in random regions and the candidate region are compared as described in US Provisional Patent Application entitled "Methods, software and apparati for identifying genomic regions harbouring a gene associated with a detectable trait".
5) Evaluation of risk factors
The association between a risk factor (in genetic epidemiology the risk factor is the presence or the absence of a certain allele or haplotype at marker loci) and a disease is measured by the odds ratio (OR) and by the relative risk (RR). If $P(R^+)$ is the probability of developing the disease for individuals with the risk factor R and $P(R^-)$ is the probability for individuals without the risk factor, then the relative risk is simply the ratio of the two probabilities, that is:
$$RR = P(R^+) / P(R^-)$$
In case-control studies, direct measures of the relative risk cannot be obtained because of the sampling design. However, the odds ratio allows a good approximation of the relative risk for low-incidence diseases and can be calculated:
$$OR = \frac{F^+ / (1 - F^+)}{F^- / (1 - F^-)}$$
$F^+$ is the frequency of the exposure to the risk factor in cases and $F^-$ is the frequency of the exposure to the risk factor in controls. $F^+$ and $F^-$ are calculated using the allelic or haplotype frequencies of the study and further depend on the underlying genetic model (dominant, recessive, additive...).
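A short Python sketch of the odds ratio calculation is given below, assuming the usual form in which the odds of exposure in cases, F+ / (1 - F+), are divided by the odds of exposure in controls, F- / (1 - F-). The function name `odds_ratio` is hypothetical, and both frequencies are assumed to lie strictly between 0 and 1.

```python
def odds_ratio(f_cases, f_controls):
    """Odds ratio from exposure frequencies in cases (F+) and controls (F-)."""
    odds_cases = f_cases / (1 - f_cases)
    odds_controls = f_controls / (1 - f_controls)
    return odds_cases / odds_controls
```

For instance, an allele or haplotype carried at frequency 0.5 in cases and 0.25 in controls gives an odds ratio of 3.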
One can further estimate the attributable risk (AR) which describes the proportion of individuals in a population exhibiting a trait due to a given risk factor. This measure is important in quantitating the role of a specific factor in disease etiology and in terms of the public health impact of a risk factor. The public health relevance of this measure lies in estimating the proportion of cases of disease in the population that could be prevented if the exposure of interest were absent. AR is determined as follows:
$$AR = P_E (RR - 1) / (P_E (RR - 1) + 1)$$
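The attributable risk formula translates directly into code. In this sketch the function name `attributable_risk` is hypothetical; `p_e` stands for the frequency of exposure in the population at large and `rr` for the relative risk, which in a case-control setting would typically be approximated by the odds ratio.

```python
def attributable_risk(p_e, rr):
    """Attributable risk AR = P_E (RR - 1) / (P_E (RR - 1) + 1)."""
    excess = p_e * (rr - 1)
    return excess / (excess + 1)
```

With an exposure frequency of 0.5 and a relative risk of 3, half of the cases in the population would be attributed to the risk factor.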
AR is the risk attributable to a biallelic marker allele or a biallelic marker haplotype. $P_E$ is the frequency of exposure to an allele or a haplotype within the population at large; and RR is the relative risk, which is approximated with the odds ratio when the trait under study has a relatively low incidence in the general population.
IV.F. Identification Of Biallelic Markers In Linkage Disequilibrium With The Biallelic Markers of the Invention
Once a first biallelic marker has been identified in a genomic region of interest, the practitioner of ordinary skill in the art, using the teachings of the present invention, can easily identify additional biallelic markers in linkage disequilibrium with this first marker. As mentioned before, any marker in linkage disequilibrium with a first marker associated with a trait will be associated with the trait. Therefore, once an association has been demonstrated between a given biallelic marker and a trait, the discovery of additional biallelic markers associated with this trait is of great interest in order to increase the density of biallelic markers in this particular region. The causal gene or mutation will be found in the vicinity of the marker or set of markers showing the highest correlation with the trait.
Identification of additional markers in linkage disequilibrium with a given marker involves: (a) amplifying a genomic fragment comprising a first biallelic marker from a plurality of individuals; (b) identifying second biallelic markers in the genomic region harboring said first biallelic marker; (c) conducting a linkage disequilibrium analysis between said first biallelic marker and second biallelic markers; and (d) selecting said second biallelic markers as being in linkage disequilibrium with said first marker. Subcombinations comprising steps (b) and (c) are also contemplated.
Methods to identify biallelic markers and to conduct linkage disequilibrium analysis are described herein and can be carried out by the skilled person without undue experimentation. The present invention then also concerns biallelic markers which are in linkage disequilibrium with any of the specific biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and which are expected to present similar characteristics in terms of their respective association with a given trait.
Example 5 illustrates the measurement of linkage disequilibrium between a publicly known biallelic marker, the "ApoE Site A", located within the Alzheimer's related ApoE gene, and other biallelic markers randomly derived from the genomic region containing the ApoE gene.
IV.G. Identification Of Functional Mutations
Once a positive association is confirmed with a biallelic marker of the present invention, the associated candidate gene can be scanned for mutations by comparing the sequences of a selected number of trait positive and trait negative individuals. In a preferred embodiment, functional regions such as exons and splice sites, promoters and other regulatory regions of the candidate gene are scanned for mutations. Preferably, trait positive individuals carry the haplotype shown to be associated with the trait and trait negative individuals do not carry the haplotype or allele associated with the trait. The mutation detection procedure is essentially similar to that used for biallelic site identification.
The method used to detect such mutations generally comprises the following steps: (a) amplification of a region of the candidate gene comprising a biallelic marker or a group of biallelic markers associated with the trait from DNA samples of trait positive patients and trait negative controls; (b) sequencing of the amplified region; (c) comparison of DNA sequences from trait-positive patients and trait-negative controls; and (d) determination of mutations specific to trait-positive patients. Subcombinations which comprise steps (b) and (c) are specifically contemplated.
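Steps (c) and (d) can be illustrated with a very small Python sketch. It assumes that the amplified region has already been sequenced and that all sequences are aligned to the same length, and it simply reports bases seen at a position in at least one trait-positive sequence and in no trait-negative sequence; real data would also require handling of heterozygous calls and sequencing errors. The function name `case_specific_variants` is hypothetical.

```python
def case_specific_variants(case_seqs, control_seqs):
    """Positions and bases found in cases but never in controls.

    case_seqs, control_seqs: lists of aligned DNA strings of equal length.
    Returns a list of (0-based position, base) tuples.
    """
    length = len(case_seqs[0])
    if any(len(s) != length for s in list(case_seqs) + list(control_seqs)):
        raise ValueError("all sequences must be aligned to the same length")
    found = []
    for pos in range(length):
        control_bases = {s[pos] for s in control_seqs}
        case_bases = {s[pos] for s in case_seqs}
        for base in sorted(case_bases - control_bases):
            found.append((pos, base))
    return found
```

Calling `case_specific_variants(["ACGT", "ACTT"], ["ACGT", "ACGT"])` returns `[(2, "T")]`, flagging the T at the third position as a candidate to be verified in a larger population.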
It is preferred that candidate polymorphisms be then verified by screening a larger population of cases and controls by means of any genotyping procedure such as those described herein, preferably using a microsequencing technique in an individual test format.
Polymorphisms are considered as candidate mutations when present in cases and controls at frequencies compatible with the expected association results.
V. Biallelic Markers Of The Invention In Methods Of Genetic Diagnostics
The biallelic markers of the present invention can also be used to develop diagnostic tests capable of identifying individuals who express a detectable trait as the result of a specific genotype or individuals whose genotype places them at risk of developing a detectable trait at a subsequent time. The trait analyzed using the present diagnostics may be any detectable trait, including a disease, a response to an agent acting on a disease, or side effects to an agent acting on a disease. The diagnostic techniques of the present invention may employ a variety of methodologies to determine whether a test subject has a biallelic marker pattern associated with an increased risk of developing a detectable trait or whether the individual suffers from a detectable trait as a result of a particular mutation, including methods which enable the analysis of individual chromosomes for haplotyping, such as family studies, single sperm DNA analysis or somatic hybrids. The present invention provides diagnostic methods to determine whether an individual is at risk of developing a disease or suffers from a disease resulting from a mutation or a polymorphism in a candidate gene of the present invention. The present invention also provides methods to determine whether an individual is likely to respond positively to an agent acting on a disease or whether an individual is at risk of developing an adverse side effect to an agent acting on a disease.
These methods involve obtaining a nucleic acid sample from the individual and determining whether the nucleic acid sample contains at least one allele or at least one biallelic marker haplotype indicative of a risk of developing the trait or indicative that the individual expresses the trait as a result of possessing a particular candidate gene polymorphism or mutation (trait-causing allele).
Preferably, in such diagnostic methods, a nucleic acid sample is obtained from the individual and this sample is genotyped using methods described above in IE. The diagnostics may be based on a single biallelic marker or on a group of biallelic markers. In each of these methods, a nucleic acid sample is obtained from the test subject and the biallelic marker pattern of one or more of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 is determined.
In one embodiment, a PCR amplification is conducted on the nucleic acid sample to amplify regions in which polymorphisms associated with a detectable phenotype have been identified. The amplification products are sequenced to determine whether the individual possesses one or more polymorphisms associated with a detectable phenotype. The primers used to generate amplification products may comprise the primers of SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. Alternatively, the nucleic acid sample is subjected to microsequencing reactions as described above to determine whether the individual possesses one or more polymorphisms associated with a detectable phenotype resulting from a mutation or a polymorphism in a candidate gene. In another embodiment, the nucleic acid sample is contacted with one or more allele specific oligonucleotide probes which specifically hybridize to one or more candidate gene alleles associated with a detectable phenotype. These diagnostic methods are extremely valuable as they can, in certain circumstances, be used to initiate preventive treatments or to allow an individual carrying a significant haplotype to foresee warning signs such as minor symptoms. In diseases in which attacks may be extremely violent and sometimes fatal if not treated on time, such as disease, the knowledge of a potential predisposition, even if this predisposition is not absolute, might contribute in a very significant manner to treatment efficacy. Similarly, a diagnosed predisposition to a potential side effect could immediately direct the physician toward a treatment for which such side effects have not been observed during clinical trials.
Diagnostics, which analyze and predict response to a drug or side effects to a drug, may be used to determine whether an individual should be treated with a particular drug. For example, if the diagnostic indicates a likelihood that an individual will respond positively to treatment with a particular drug, the drug may be administered to the individual. Conversely, if the diagnostic indicates that an individual is likely to respond negatively to treatment with a particular drug, an alternative course of treatment may be prescribed. A negative response may be defined as either the absence of an efficacious response or the presence of toxic side effects.
Clinical drug trials represent another application for the markers of the present invention. One or more markers indicative of response to an agent acting on a disease or to side effects to an agent acting on a disease may be identified using the methods described above. Thereafter, potential participants in clinical trials of such an agent may be screened to identify those individuals most likely to respond favorably to the drug and exclude those likely to experience side effects. In that way, the effectiveness of drug treatment may be measured in individuals who respond positively to the drug, without lowering the measurement as a result of the inclusion of individuals who are unlikely to respond positively in the study and without risking undesirable safety problems.
VI. Computer-Related Embodiments
In some embodiments of the present invention, a computer-based system may support the on-line coordination between the identification of biallelic markers and the corresponding analysis of their frequency in the different groups.
As used herein the term "nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773" encompasses the nucleotide sequences of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, fragments of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, nucleotide sequences homologous to SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or homologous to fragments of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, and sequences complementary to all of the preceding sequences. As used herein the term "nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773" further encompasses the nucleotide sequences comprising, consisting essentially of, or consisting of any one of the following:
a) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the complements thereof;
b) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the complements thereof, further comprising the 1st allele of the polymorphic base of the respective SEQ ID number;
c) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the complements thereof, further comprising the 2nd allele of the polymorphic base of the respective SEQ ID number;
d) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the complements thereof;
e) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the complements thereof, further comprising the 1st allele of the polymorphic base of the respective SEQ ID number;
f) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the complements thereof, further comprising the 2nd allele of the polymorphic base of the respective SEQ ID number;
g) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the complements thereof;
h) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the complements thereof, further comprising the 1st allele of the polymorphic base of the respective SEQ ID number;
i) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the complements thereof, further comprising the 2nd allele of the polymorphic base of the respective SEQ ID number; and
j) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, or 21 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or the complements thereof.
The "nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773" further encompass nucleotide sequences homologous to:
a) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the complements thereof;
b) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the complements thereof, further comprising the 1st allele of the polymorphic base of the respective SEQ ID number;
c) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the complements thereof, further comprising the 2nd allele of the polymorphic base of the respective SEQ ID number;
d) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the complements thereof;
e) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the complements thereof, further comprising the 1st allele of the polymorphic base of the respective SEQ ID number;
f) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the complements thereof, further comprising the 2nd allele of the polymorphic base of the respective SEQ ID number;
g) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the complements thereof;
h) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the complements thereof, further comprising the 1st allele of the polymorphic base of the respective SEQ ID number;
i) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the complements thereof, further comprising the 2nd allele of the polymorphic base of the respective SEQ ID number; and
j) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, or 21 nucleotides, to the extent that a contiguous span of these lengths is consistent with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or the complements thereof.
Homologous sequences refer to a sequence having at least 99%, 98%, 97%, 96%, 95%, 90%, 85%, 80%, or 75% homology to these contiguous spans. Homology may be determined using any method described herein, including BLAST2N with the default parameters or with any modified parameters. Homologous sequences also may include RNA sequences in which uridines replace the thymines in the nucleic acid codes of the invention. It will be appreciated that the nucleic acid codes of the invention can be represented in the traditional single character format (See the inside back cover of Stryer, Lubert. Biochemistry, 3rd edition. W. H. Freeman & Co., New York.) or in any other format or code which records the identity of the nucleotides in a sequence.
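By way of illustration only, the notions of a contiguous span that includes the polymorphic base, of a complement, and of an RNA form in which uridines replace thymines can be sketched in a few lines of Python. The function names, the 0-based position convention, and the toy sequence are assumptions made for this sketch and do not appear in the specification.

```python
# Illustrative sketch only; names and conventions are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def to_rna(seq: str) -> str:
    """Return the RNA form, with uridine (U) in place of thymine (T)."""
    return seq.replace("T", "U")


def spans_containing(seq: str, pos: int, length: int) -> list[str]:
    """List every contiguous span of `length` bases that includes the
    base at 0-based index `pos` (e.g. the polymorphic base)."""
    if length > len(seq):
        # The span length is not consistent with the sequence length.
        return []
    first = max(0, pos - length + 1)
    last = min(pos, len(seq) - length)
    return [seq[i:i + length] for i in range(first, last + 1)]


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # toy sequence, not one of the SEQ ID NOs.
    for span in spans_containing(toy, pos=4, length=8):
        print(span, reverse_complement(span), to_rna(span))
```

A span length longer than the sequence yields an empty list, which mirrors the proviso that the span length be consistent with the length of the particular sequence.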
It should be noted that the nucleic acid codes of the invention further encompass all of the polynucleotides disclosed, described or claimed in the present application. Moreover, the present invention specifically contemplates computer readable media and computer systems wherein such codes are stored individually or in any combination.
It will be appreciated by those skilled in the art that the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 can be stored, recorded, and manipulated on any medium which can be read and accessed by a computer. As used herein, the words "recorded" and "stored" refer to a process for storing information on a computer medium. A skilled artisan can readily adopt any of the presently known methods for recording information on a computer readable medium to generate embodiments comprising one or more of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. A particularly preferred embodiment of the present invention is a computer readable medium having recorded thereon at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
Computer readable media include magnetically readable media, optically readable media, electronically readable media and magnetic/optical media. For example, the computer readable media may be a hard disk, a floppy disk, a magnetic tape, CD-ROM, Digital Versatile Disk (DVD), Random Access Memory (RAM), or Read Only Memory (ROM) as well as other types of media known to those skilled in the art.
Embodiments of the present invention include systems, particularly computer systems which store and manipulate the sequence information described herein. One example of a computer system 100 is illustrated in block diagram form in Figure 14. As used herein, "a computer system" refers to the hardware components, software components, and data storage components used to analyze the nucleotide sequences of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. In one embodiment, the computer system 100 is a Sun Enterprise 1000 server (Sun Microsystems, Palo Alto, CA). The computer system 100 preferably includes a processor for processing, accessing and manipulating the sequence data. The processor 105 can be any well-known type of central processing unit, such as the Pentium III from Intel Corporation, or similar processor from Sun, Motorola, Compaq or International Business Machines.
Preferably, the computer system 100 is a general purpose system that comprises the processor 105 and one or more internal data storage components 110 for storing data, and one or more data retrieving devices for retrieving the data stored on the data storage components. A skilled artisan can readily appreciate that any one of the currently available computer systems is suitable.
In one particular embodiment, the computer system 100 includes a processor 105 connected to a bus which is connected to a main memory 115 (preferably implemented as RAM) and one or more internal data storage devices 110, such as a hard drive and/or other computer readable media having data recorded thereon. In some embodiments, the computer system 100 further includes one or more data retrieving devices 118 for reading the data stored on the internal data storage devices 110.
The data retrieving device 118 may represent, for example, a floppy disk drive, a compact disk drive, a magnetic tape drive, etc. In some embodiments, the internal data storage device 110 is a removable computer readable medium such as a floppy disk, a compact disk, a magnetic tape, etc. containing control logic and/or data recorded thereon. The computer system 100 may advantageously include or be programmed by appropriate software for reading the control logic and/or the data from the data storage component once inserted in the data retrieving device.
The computer system 100 includes a display 120 which is used to display output to a computer user. It should also be noted that the computer system 100 can be linked to other computer systems 125a-c in a network or wide area network to provide centralized access to the computer system 100.
Software for accessing and processing the nucleotide sequences of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 (such as search tools, compare tools, and modeling tools etc.) may reside in main memory 115 during execution.
In some embodiments, the computer system 100 may further comprise a sequence comparer for comparing the above-described nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 stored on a computer readable medium to reference nucleotide or polypeptide sequences stored on a computer readable medium. A "sequence comparer" refers to one or more programs which are implemented on the computer system 100 to compare a nucleotide sequence with other nucleotide sequences and/or compounds stored within the data storage means. For example, the sequence comparer may compare the nucleotide sequences of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 stored on a computer readable medium to reference sequences stored on a computer readable medium to identify homologies or structural motifs. The various sequence comparer programs identified elsewhere in this patent specification are particularly contemplated for use in this aspect of the invention.
Figure 15 is a flow diagram illustrating one embodiment of a process 200 for comparing a new nucleotide or protein sequence with a database of sequences in order to determine the homology levels between the new sequence and the sequences in the database. The database of sequences can be a private database stored within the computer system 100, or a public database such as GENBANK that is available through the Internet.
The process 200 begins at a start state 201 and then moves to a state 202 wherein the new sequence to be compared is stored to a memory in a computer system 100. As discussed above, the memory could be any type of memory, including RAM or an internal storage device.
The process 200 then moves to a state 204 wherein a database of sequences is opened for analysis and comparison. The process 200 then moves to a state 206 wherein the first sequence stored in the database is read into a memory on the computer. A comparison is then performed at a state 210 to determine if the first sequence is the same as the new sequence. It is important to note that this step is not limited to performing an exact comparison between the new sequence and the first sequence in the database. Methods are well known to those of skill in the art for comparing two nucleotide or protein sequences, even if they are not identical. For example, gaps can be introduced into one sequence in order to raise the homology level between the two tested sequences. The parameters that control whether gaps or other features are introduced into a sequence during comparison are normally entered by the user of the computer system. Once a comparison of the two sequences has been performed at the state 210, a determination is made at a decision state 212 whether the two sequences are the same. Of course, the term "same" is not limited to sequences that are absolutely identical. Sequences that are within the homology parameters entered by the user will be marked as "same" in the process 200.
If a determination is made that the two sequences are the same, the process 200 moves to a state 214 wherein the name of the sequence from the database is displayed to the user. This state notifies the user that the sequence with the displayed name fulfills the homology constraints that were entered. Once the name of the stored sequence is displayed to the user, the process 200 moves to a decision state 218 wherein a determination is made whether more sequences exist in the database. If no more sequences exist in the database, then the process 200 terminates at an end state 220. However, if more sequences do exist in the database, then the process 200 moves to a state 224 wherein a pointer is moved to the next sequence in the database so that it can be compared to the new sequence. In this manner, the new sequence is aligned and compared with every sequence in the database.
It should be noted that if a determination had been made at the decision state 212 that the sequences were not homologous, then the process 200 would move immediately to the decision state 218 in order to determine if any other sequences were available in the database for comparison. Accordingly, one aspect of the present invention is a computer system comprising a processor, a data storage device having stored thereon a nucleic acid code of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, a data storage device having retrievably stored thereon reference nucleotide sequences or polypeptide sequences to be compared to the nucleic acid code of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 and a sequence comparer for conducting the comparison. The sequence comparer may indicate a homology level between the sequences compared or identify structural motifs in the above described nucleic acid code of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or it may identify structural motifs in sequences which are compared to these nucleic acid codes and polypeptide codes. In some embodiments, the data storage device may have stored thereon the sequences of at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
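The database scan of process 200 (Figure 15) may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names, the in-memory list standing in for the database, the default threshold, and the ungapped `percent_identity` comparer are all hypothetical. In practice a program such as BLAST or FASTA would serve as the comparer.

```python
from typing import Callable, Iterable


def percent_identity(a: str, b: str) -> float:
    """Hypothetical stand-in comparer: ungapped, position-by-position
    identity, expressed as a percentage of the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def scan_database(
    new_seq: str,
    database: Iterable[tuple[str, str]],
    threshold: float = 95.0,
    compare: Callable[[str, str], float] = percent_identity,
) -> list[str]:
    """Return the names of database sequences deemed the "same" as
    new_seq, i.e. those meeting the user-entered homology threshold."""
    hits = []
    for name, seq in database:                    # states 206 and 224
        if compare(new_seq, seq) >= threshold:    # state 210, decision 212
            hits.append(name)                     # state 214
    return hits                                   # end state 220


if __name__ == "__main__":
    db = [("a", "ACGTACGTAC"), ("b", "ACGTACGTAA"), ("c", "TTTTTTTTTT")]
    print(scan_database("ACGTACGTAC", db, threshold=90.0))
```

Passing the comparer as a parameter reflects the statement that the comparison step is not limited to exact matching; any homology measure can be substituted without changing the loop.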
Another aspect of the present invention is a method for determining the level of homology between a nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 and a reference nucleotide sequence, comprising the steps of reading the nucleic acid code and the reference nucleotide sequence through the use of a computer program which determines homology levels and determining homology between the nucleic acid code and the reference nucleotide sequence with the computer program. The computer program may be any of a number of computer programs for determining homology levels, including those specifically enumerated herein, including BLAST2N with the default parameters or with any modified parameters. The method may be implemented using the computer systems described above. The method may also be performed by reading at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the above described nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 through use of the computer program and determining homology between the nucleic acid codes and reference nucleotide sequences.
Figure 16 is a flow diagram illustrating one embodiment of a process 250 in a computer for determining whether two sequences are homologous. The process 250 begins at a start state 252 and then moves to a state 254 wherein a first sequence to be compared is stored to a memory. The second sequence to be compared is then stored to a memory at a state 256. The process 250 then moves to a state 260 wherein the first character in the first sequence is read and then to a state 262 wherein the first character of the second sequence is read. It should be understood that if the sequence is a nucleotide sequence, then the character would normally be either A, T, C, G or U. If the sequence is a protein sequence, then it should be in the single letter amino acid code so that the first and second sequences can be easily compared.
A determination is then made at a decision state 264 whether the two characters are the same. If they are the same, then the process 250 moves to a state 268 wherein the next characters in the first and second sequences are read. A determination is then made whether the next characters are the same. If they are, then the process 250 continues this loop until two characters are not the same. If a determination is made that the next two characters are not the same, the process 250 moves to a decision state 274 to determine whether there are any more characters in either sequence to read. If there aren't any more characters to read, then the process 250 moves to a state 276 wherein the level of homology between the first and second sequences is displayed to the user. The level of homology is determined by calculating the proportion of characters between the sequences that were the same out of the total number of characters in the first sequence. Thus, if every character in a first 100 nucleotide sequence aligned with every character in a second sequence, the homology level would be 100%.
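The homology level computed by process 250 can be illustrated by a short sketch. The function name is hypothetical, and the sketch assumes a simple ungapped, position-by-position reading of the two sequences in which comparison continues past a mismatch; the flow diagram of Figure 16 may be implemented differently.

```python
def homology_level(first: str, second: str) -> float:
    """Percentage of characters in `first` that equal the character at
    the same position in `second` (ungapped comparison)."""
    if not first:
        raise ValueError("first sequence must not be empty")
    same = 0
    for i, ch in enumerate(first):          # states 260, 262 and 268
        if i < len(second) and ch == second[i]:   # decision state 264
            same += 1
    # State 276: proportion of matching characters out of the total
    # number of characters in the first sequence.
    return 100.0 * same / len(first)


if __name__ == "__main__":
    print(homology_level("ACGT", "ACGT"))
    print(homology_level("ACGT", "ACGA"))
```

Because the denominator is the length of the first sequence, the measure is not symmetric: swapping the arguments can change the result when the sequences differ in length.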
Alternatively, the computer program may be a computer program which compares the nucleotide sequences of the nucleic acid codes of the present invention to reference nucleotide sequences in order to determine whether the nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 differs from a reference nucleic acid sequence at one or more positions. Optionally such a program records the length and identity of inserted, deleted or substituted nucleotides with respect to the sequence of either the reference polynucleotide or the nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. In one embodiment, the computer program may be a program which determines whether the nucleotide sequences of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 contain a biallelic marker or single nucleotide polymorphism (SNP) with respect to a reference nucleotide sequence. This single nucleotide polymorphism may comprise a single base substitution, insertion, or deletion, while this biallelic marker may comprise about one to ten consecutive bases substituted, inserted or deleted.
Accordingly, another aspect of the present invention is a method for determining whether a nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 differs at one or more nucleotides from a reference nucleotide sequence comprising the steps of reading the nucleic acid code and the reference nucleotide sequence through use of a computer program which identifies differences between nucleic acid sequences and identifying differences between the nucleic acid code and the reference nucleotide sequence with the computer program. In some embodiments, the computer program is a program which identifies single nucleotide polymorphisms. The method may be implemented by the computer systems described above and the method illustrated in Figure 16. The method may also be performed by reading at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 and the reference nucleotide sequences through the use of the computer program and identifying differences between the nucleic acid codes and the reference nucleotide sequences with the computer program. In other embodiments the computer based system may further comprise an identifier for identifying features within the nucleotide sequences of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
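A program that records substituted, inserted or deleted nucleotides relative to a reference may be sketched as below. The sketch assumes that the two sequences have already been aligned to equal length by some other program, with "-" marking a gap; the `Difference` record, the function name, and the gap convention are hypothetical choices made for illustration.

```python
from typing import NamedTuple


class Difference(NamedTuple):
    position: int    # 0-based column in the alignment
    kind: str        # "substitution", "insertion" or "deletion"
    reference: str   # character in the reference sequence
    code: str        # character in the nucleic acid code


def find_differences(reference: str, code: str, gap: str = "-") -> list[Difference]:
    """Compare two pre-aligned sequences column by column and record
    each difference relative to the reference."""
    if len(reference) != len(code):
        raise ValueError("aligned sequences must have equal length")
    diffs = []
    for i, (r, c) in enumerate(zip(reference, code)):
        if r == c:
            continue
        if r == gap:
            kind = "insertion"     # base present only in the code
        elif c == gap:
            kind = "deletion"      # base present only in the reference
        else:
            kind = "substitution"  # candidate single base substitution
        diffs.append(Difference(i, kind, r, c))
    return diffs


if __name__ == "__main__":
    for d in find_differences("ACGT-A", "ATGTCA"):
        print(d)
```

Consecutive differences of the same kind could be merged afterwards to report the length of an inserted or deleted run, which the sketch leaves out for brevity.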
An "identifier" refers to one or more programs which identifies certain features within the above-described nucleotide sequences of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. In one embodiment, the identifier may comprise a program which identifies an open reading frame in the cDNA codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
Figure 17 is a flow diagram illustrating one embodiment of an identifier process 300 for detecting the presence of a feature in a sequence. The process 300 begins at a start state 302 and then moves to a state 304 wherein a first sequence that is to be checked for features is stored to a memory 115 in the computer system 100. The process 300 then moves to a state 306 wherein a database of sequence features is opened. Such a database would include a list of each feature's attributes along with the name of the feature. For example, a feature name could be "Initiation Codon" and the attribute would be "ATG". Another example would be the feature name "TAATAA Box" and the feature attribute would be "TAATAA". An example of such a database is produced by the University of Wisconsin Genetics Computer Group (www.gcg.com). Once the database of features is opened at the state 306, the process 300 moves to a state 308 wherein the first feature is read from the database. A comparison of the attribute of the first feature with the first sequence is then made at a state 310. A determination is then made at a decision state 316 whether the attribute of the feature was found in the first sequence. If the attribute was found, then the process 300 moves to a state 318 wherein the name of the found feature is displayed to the user.
The process 300 then moves to a decision state 320 wherein a determination is made whether more features exist in the database. If no more features exist, then the process 300 terminates at an end state 324. However, if more features do exist in the database, then the process 300 reads the next sequence feature at a state 326 and loops back to the state 310 wherein the attribute of the next feature is compared against the first sequence.
It should be noted that if the feature attribute is not found in the first sequence at the decision state 316, the process 300 moves directly to the decision state 320 in order to determine if any more features exist in the database.
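The identifier process 300 (Figure 17) amounts to a loop over a table of named features, which may be sketched as follows. The dictionary standing in for the feature database, the function name, and the returned positions are assumptions made for this illustration; only the two feature names and attributes are taken from the description above.

```python
def identify_features(sequence: str, features: dict[str, str]) -> list[tuple[str, int]]:
    """Return (feature name, 0-based position of first occurrence) for
    every feature whose attribute is found in the sequence."""
    found = []
    for name, attribute in features.items():   # states 308 and 326
        index = sequence.find(attribute)        # state 310
        if index != -1:                         # decision state 316
            found.append((name, index))         # state 318
    return found                                # end state 324


if __name__ == "__main__":
    feature_db = {"Initiation Codon": "ATG", "TAATAA Box": "TAATAA"}
    print(identify_features("CCTAATAAGGATGCC", feature_db))
```

Exact substring search is the simplest possible attribute test; a feature database with degenerate or positional attributes would need a pattern matcher in place of `str.find`.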
Accordingly, another aspect of the present invention is a method of identifying a feature within the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 comprising reading the nucleic acid code(s) through the use of a computer program which identifies features therein and identifying features within the nucleic acid code(s) with the computer program. In one embodiment, the computer program comprises a computer program which identifies open reading frames. The method may be performed by reading a single sequence or at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 through the use of the computer program and identifying features within the nucleic acid codes with the computer program.
The nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 may be stored and manipulated in a variety of data processor programs in a variety of formats. For example, the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 may be stored as text in a word processing file, such as Microsoft WORD or WORDPERFECT or as an ASCII file in a variety of database programs familiar to those of skill in the art, such as DB2, SYBASE, or ORACLE. In addition, many computer programs and databases may be used as sequence comparers, identifiers, or sources of reference nucleotide sequences to be compared to the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. The following list is intended not to limit the invention but to provide guidance to programs and databases which are useful with the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
The programs and databases which may be used include, but are not limited to: MacPattern (EMBL), DiscoveryBase (Molecular Applications Group), GeneMine (Molecular Applications Group), Look (Molecular Applications Group), MacLook (Molecular Applications Group), BLAST and BLAST2 (NCBI), BLASTN and BLASTX (Altschul et al., J. Mol. Biol. 215: 403 (1990)), FASTA (Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85: 2444 (1988)), FASTDB (Brutlag et al., Comp. App. Biosci. 6:237-245, 1990), Catalyst (Molecular Simulations Inc.), Catalyst/SHAPE (Molecular Simulations Inc.), Cerius2.DBAccess (Molecular Simulations Inc.), HypoGen (Molecular Simulations Inc.), Insight II (Molecular Simulations Inc.), Discover (Molecular Simulations Inc.), CHARMm (Molecular Simulations Inc.), Felix (Molecular Simulations Inc.), DelPhi (Molecular Simulations Inc.), QuanteMM (Molecular Simulations Inc.), Homology (Molecular Simulations Inc.), Modeler (Molecular Simulations Inc.), ISIS (Molecular Simulations Inc.), Quanta/Protein Design (Molecular Simulations Inc.), WebLab (Molecular Simulations Inc.), WebLab Diversity Explorer (Molecular Simulations Inc.), Gene Explorer (Molecular Simulations Inc.), SeqFold (Molecular Simulations Inc.), the MDL Available Chemicals Directory database, the MDL Drug Data Report data base, the Comprehensive Medicinal Chemistry database, Derwent's World Drug Index database, the BioByteMasterFile database, the Genbank database, and the Genseqn database. Many other programs and data bases would be apparent to one of skill in the art given the present disclosure.
Motifs which may be detected using the above programs include sequences encoding leucine zippers, helix-turn-helix motifs, glycosylation sites, ubiquitination sites, alpha helices, and beta sheets, signal sequences encoding signal peptides which direct the secretion of the encoded proteins, sequences implicated in transcription regulation such as homeoboxes, acidic stretches, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites.
It should be noted that the nucleic acid codes of the invention further encompass all of the polynucleotides disclosed, described or claimed in the present application. Moreover, the present invention specifically contemplates the storage of such codes on computer readable media and computer systems individually or in any combination, as well as the use of such codes and combinations in the methods of VI.
VII. Mapping and Maps Comprising the Biallelic Markers of the Invention
The human haploid genome contains an estimated 80,000 to 100,000 or more genes scattered on a 3 x 10^9 base-long double stranded DNA shared among the 24 chromosomes. Each human being is diploid, i.e. possesses two haploid genomes, one from paternal origin, the other from maternal origin. The sequence of the human genome varies among individuals in a population. About 10^7 sites scattered along the 3 x 10^9 base pairs of DNA are polymorphic, existing in at least two variant forms called alleles. Most of these polymorphic sites are generated by single base substitution mutations and are biallelic. Less than 10^5 polymorphic sites are due to more complex changes and are very often multi-allelic, i.e. exist in more than two allelic forms. At a given polymorphic site, any individual (diploid) can be either homozygous (twice the same allele) or heterozygous (two different alleles). A given polymorphism or rare mutation can be either neutral (no effect on trait), or functional, i.e. responsible for a particular genetic trait.
Genetic Maps
The first step towards the identification of genes associated with a detectable trait, such as a disease or any other detectable trait, consists in the localization of genomic regions containing trait-causing genes using genetic mapping methods. The preferred traits contemplated within the present invention relate to fields of therapeutic interest; in particular embodiments, they will be disease traits and/or drug response traits, reflecting drug efficacy or toxicity. Traits can either be "binary", e.g. diabetic vs. non diabetic, or "quantitative", e.g. elevated blood pressure. Individuals affected by a quantitative trait can be classified according to an appropriate scale of trait values, e.g. blood pressure ranges. Each trait value range can then be analyzed as a binary trait. Patients showing a trait value within one such range will be studied in comparison with patients showing a trait value outside of this range. In such a case, genetic analysis methods will be applied to subpopulations of individuals showing trait values within defined ranges.
Genetic mapping involves the analysis of the segregation of polymorphic loci in trait-positive and trait-negative populations. Polymorphic loci constitute a small fraction of the human genome (less than 1%), compared to the vast majority of human genomic DNA which is identical in sequence among the chromosomes of different individuals. Among all existing human polymorphic loci, genetic markers can be defined as genome-derived polynucleotides which are sufficiently polymorphic to allow a reasonable probability that a randomly selected person will be heterozygous, and thus informative for genetic analysis by methods such as linkage analysis or association studies.
A genetic map consists of a collection of polymorphic markers which have been positioned on the human chromosomes. Genetic maps may be combined with physical maps, collections of ordered overlapping fragments of genomic DNA whose arrangement along the human chromosomes is known. The optimal genetic map should possess the following characteristics:
- the density of the genetic markers scattered along the genome should be sufficient to allow the identification and localization of any trait-related polymorphism,
- each marker should have an adequate level of heterozygosity, so as to be informative in a large percentage of different meioses,
- all markers should be easily typed on a routine basis, at a reasonable expense, and in a reasonable amount of time,
- the entire set of markers per chromosome should be ordered in a highly reliable fashion.
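The heterozygosity criterion in the list above can be made concrete with a small calculation. The following is only a sketch: it assumes Hardy-Weinberg proportions (an assumption not stated in the text), and the helper name `expected_heterozygosity` is hypothetical. It gives the probability that a randomly selected individual is heterozygous at a biallelic marker whose less common allele has frequency q.

```python
def expected_heterozygosity(minor_allele_freq: float) -> float:
    """Expected heterozygote fraction 2pq, assuming Hardy-Weinberg proportions."""
    if not 0.0 <= minor_allele_freq <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    q = minor_allele_freq
    p = 1.0 - q
    return 2.0 * p * q


# Illustrative allele frequencies only; these are not values from the disclosure.
for q in (0.05, 0.1, 0.3, 0.5):
    print(f"q = {q:.2f} -> expected heterozygosity = {expected_heterozygosity(q):.3f}")
```

Under this assumption a biallelic marker can never exceed a heterozygosity of 0.5, which is reached when both alleles are equally frequent.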
However, while the above maps are optimal, it will be appreciated that the maps of the present invention may be used in the individual marker and haplotype association analyses described below without the necessity of determining the order of biallelic markers derived from a single BAC with respect to one another.
Construction of a Physical Map
The first step in constructing a high density genetic map of biallelic markers is the construction of a physical map. Physical maps consist of ordered, overlapping cloned fragments of genomic DNA covering a portion of the genome, preferably covering one or all chromosomes. Obtaining a physical map of the genome entails constructing and ordering a genomic DNA library. For an example of a complete explanation of the construction of a physical map from a BAC library see related PCT Application No. PCT/IB98/00193 filed July 19, 1998. The methods disclosed therein can be used to generate larger more complete sets of markers and entire maps of the human genome comprising the map-related biallelic markers of the invention.
Biallelic Markers
It will be appreciated that the ordered DNA fragments containing these groups of biallelic markers need not completely cover the genomic regions of these lengths but may instead be incomplete contigs having one or more gaps therein. As discussed in further detail below, biallelic markers may be used in single marker and haplotype association analyses regardless of the completeness of the corresponding physical contig harboring them.
Using the procedures above, 3908 biallelic markers, each having two alleles, were identified using sequences obtained from BACs which had been localized on the genome. In some cases, markers were identified using pooled BACs and thereafter reassigned to individual BACs using STS screening procedures such as those described in Examples 1 and 2. The sequences of these biallelic markers are provided in the accompanying Sequence Listing as SEQ ID Nos. 1 to 3908. Although the sequences of SEQ ID Nos. 1 to 3908 will be used as exemplary markers throughout the present application, these markers are not limited to markers having the exact flanking sequences surrounding the polymorphic bases which are enumerated in SEQ ID Nos. 1 to 3908. Rather, it will be appreciated that the flanking sequences surrounding the polymorphic bases of SEQ ID Nos. 1 to 3908 may be lengthened or shortened to any extent compatible with their intended use and the present invention specifically contemplates such sequences. The sequences of these biallelic markers may be used to construct genomic maps as well as in the gene identification and diagnostic techniques described herein. It will be appreciated that the biallelic markers referred to herein may be of any length compatible with their intended use provided that the markers include the polymorphic base, and the present invention specifically contemplates such sequences.
Some of the markers of SEQ ID Nos. 1 to 3908 as well as related amplification and microsequencing primers were disclosed in the instant priority documents. However, some of the earlier described amplification primers and microsequencing primers did not have the precise sequence lengths disclosed in the instant application. It will be appreciated that either length of primers may be used in the methods disclosed in the present application.
In addition, the internal identification numbers used to identify the biallelic markers disclosed in U.S. Provisional Patent Application Serial No. 60/082,614 filed April 21, 1998 have been revised to include additional numbers on the end. For example, the marker formerly given the internal identification number 99-1091 was given the revised internal identification number 99-1091-446. Therefore, it will be appreciated that shortened identification numbers and extended identification numbers which overlap one another refer to the same markers.
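The overlap rule for identification numbers can be expressed as a short check. This is a minimal sketch with a hypothetical helper name, `same_marker`; it assumes the revision only appends hyphen-separated fields to the end of the earlier number, as in the 99-1091 and 99-1091-446 example.

```python
def same_marker(id_a: str, id_b: str) -> bool:
    """True when one ID equals the other or extends it with extra trailing fields."""
    parts_a, parts_b = id_a.split("-"), id_b.split("-")
    shorter, longer = sorted((parts_a, parts_b), key=len)
    # Compare whole fields, so "99-109" is not treated as a prefix of "99-1091".
    return longer[:len(shorter)] == shorter


print(same_marker("99-1091", "99-1091-446"))  # shortened vs. extended form
print(same_marker("99-109", "99-1091-446"))   # different marker
```

Comparing field by field rather than by raw string prefix avoids matching identifiers that merely share leading digits.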
Ordering of biallelic markers
Biallelic markers can be ordered to determine their positions along chromosomes, preferably subchromosomal regions, by methods known in the art as well as those disclosed in PCT Application No. PCT/IB98/00193 filed July 19, 1998, and U.S. Provisional Patent Application Serial No. 60/082,614 filed April 21, 1998.
The positions of the biallelic markers along chromosomes may be determined using a variety of methodologies. In one approach, radiation hybrid mapping is used. Radiation hybrid (RH) mapping is a somatic cell genetic approach that can be used for high resolution mapping of the human genome. In this approach, cell lines containing one or more human chromosomes are lethally irradiated, breaking each chromosome into fragments whose size depends on the radiation dose. These fragments are rescued by fusion with cultured rodent cells, yielding subclones containing different portions of the human genome. This technique is described by Benham et al. (Genomics 4:509-517, 1989) and Cox et al. (Science 250:245-250, 1990). The random and independent nature of the subclones permits efficient mapping of any human genome marker. Human DNA isolated from a panel of 80-100 cell lines provides a mapping reagent for ordering biallelic markers. In this approach, the frequency of breakage between markers is used to measure distance, allowing construction of fine resolution maps as has been done for ESTs (Schuler et al., Science 274:540-546, 1996).
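As an illustration of how breakage frequency can be turned into a distance, the sketch below uses one commonly cited two-point estimator for radiation hybrid data. The estimator itself is not given in the text, so the formula, the function names and the toy retention patterns are all assumptions: the number of hybrids retaining exactly one of two markers is divided by the number expected if the markers were always separated by a break, and the resulting frequency theta is converted to centiRays as -100 ln(1 - theta).

```python
import math


def breakage_frequency(ret_a, ret_b):
    """Two-point breakage frequency from 0/1 retention patterns across a hybrid panel."""
    if len(ret_a) != len(ret_b) or not ret_a:
        raise ValueError("retention patterns must be non-empty and of equal length")
    n = len(ret_a)
    discordant = sum(1 for a, b in zip(ret_a, ret_b) if a != b)
    r_a = sum(ret_a) / n  # retention fraction of marker A
    r_b = sum(ret_b) / n  # retention fraction of marker B
    # Discordant hybrids expected if a break always separates the two markers.
    expected_if_unlinked = n * (r_a + r_b - 2.0 * r_a * r_b)
    if expected_if_unlinked == 0:
        raise ValueError("markers retained in all or no hybrids are uninformative")
    return min(discordant / expected_if_unlinked, 1.0)


def distance_centirays(theta):
    """Map distance in centiRays for a given breakage frequency."""
    return math.inf if theta >= 1.0 else -100.0 * math.log(1.0 - theta)


# Toy retention patterns for two hypothetical markers over four hybrids.
theta = breakage_frequency([1, 1, 0, 0], [1, 0, 0, 0])
print(theta, distance_centirays(theta))
```

A real panel of 80-100 cell lines would give far more stable estimates than this four-hybrid toy input, and multipoint methods are normally preferred for ordering many markers.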
RH mapping has been used to generate a high-resolution whole genome radiation hybrid map of human chromosome 17q22-q25.3 across the genes for growth hormone (GH) and thymidine kinase (TK) (Foster et al., Genomics 33:185-192, 1996), the region surrounding the Gorlin syndrome gene (Obermayr et al., Eur. J. Hum. Genet. 4:242-245, 1996), 60 loci covering the entire short arm of chromosome 12 (Raeymaekers et al., Genomics 29:170-178, 1995), the region of human chromosome 22 containing the neurofibromatosis type 2 locus (Frazer et al., Genomics 14:574-584, 1992) and 13 loci on the long arm of chromosome 5 (Warrington et al., Genomics 11:701-708, 1991).
Alternatively, PCR based techniques and human-rodent somatic cell hybrids may be used to determine the positions of the biallelic markers on the chromosomes. In such approaches, oligonucleotide primer pairs which are capable of generating amplification products containing the polymorphic bases of the biallelic markers are designed. Preferably, the oligonucleotide primers are 18-23 bp in length and are designed for PCR amplification. The creation of PCR primers from known sequences is well known to those with skill in the art. For a review of PCR technology see Erlich, H.A., PCR Technology: Principles and Applications for DNA Amplification. 1992. W.H. Freeman and Co., New York.
The primers are used in polymerase chain reactions (PCR) to amplify templates from total human genomic DNA. PCR conditions are as follows: 60 ng of genomic DNA is used as a template for PCR with 80 ng of each oligonucleotide primer, 0.6 unit of Taq polymerase, and 1 μCi of a 32P-labeled deoxycytidine triphosphate. The PCR is performed in a microplate thermocycler (Techne) under the following conditions: 30 cycles of 94°C, 1.4 min; 55°C, 2 min; and 72°C, 2 min; with a final extension at 72°C for 10 min. The amplified products are analyzed on a 6% polyacrylamide sequencing gel and visualized by autoradiography. If the length of the resulting PCR product is identical to the length expected for an amplification product containing the polymorphic base of the biallelic marker, then the PCR reaction is repeated with DNA templates from two panels of human-rodent somatic cell hybrids, BIOS PCRable DNA (BIOS Corporation) and NIGMS Human-Rodent Somatic Cell Hybrid Mapping Panel Number 1 (NIGMS, Camden, NJ).
PCR is used to screen a series of somatic cell hybrid cell lines containing defined sets of human chromosomes for the presence of a given biallelic marker. DNA is isolated from the somatic hybrids and used as starting templates for PCR reactions using the primer pairs from the biallelic marker. Only those somatic cell hybrids with chromosomes containing the human sequence corresponding to the biallelic marker will yield an amplified fragment. The biallelic markers are assigned to a chromosome by analysis of the segregation pattern of PCR products from the somatic hybrid DNA templates. The single human chromosome present in all cell hybrids that give rise to an amplified fragment is the chromosome containing that biallelic marker. For a review of techniques and analysis of results from somatic cell gene mapping experiments, see Ledbetter et al., Genomics 6:475-481 (1990).
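The assignment rule in the preceding paragraph amounts to a set intersection over the PCR-positive hybrids. The sketch below is illustrative only: the panel contents, hybrid names and the function `assign_chromosome` are hypothetical, and the extra step of removing chromosomes carried by PCR-negative hybrids is an added assumption that the text does not spell out.

```python
def assign_chromosome(panel, positive_hybrids):
    """Return candidate chromosomes for a marker.

    panel: dict mapping hybrid name -> set of human chromosomes it retains.
    positive_hybrids: set of hybrid names that yielded an amplified fragment.
    """
    positives = [chroms for name, chroms in panel.items() if name in positive_hybrids]
    if not positives:
        return set()
    # Chromosomes present in every hybrid that gave an amplification product.
    candidates = set.intersection(*positives)
    # Assumed refinement: drop chromosomes also carried by PCR-negative hybrids.
    for name, chroms in panel.items():
        if name not in positive_hybrids:
            candidates -= chroms
    return candidates


# Hypothetical four-hybrid panel.
panel = {
    "H1": {"1", "5", "17"},
    "H2": {"5", "12"},
    "H3": {"5", "17", "X"},
    "H4": {"1", "12"},
}
print(assign_chromosome(panel, {"H1", "H2", "H3"}))
```

A result containing exactly one chromosome corresponds to an unambiguous assignment; an empty or multi-element result suggests the panel is not informative enough or a PCR result is wrong.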
Example 2 describes a preferred method for positioning of biallelic markers on clones, such as BAC clones, obtained from genomic DNA libraries. Using such procedures, a number of BAC clones carrying selected biallelic markers can be isolated. The position of these BAC clones on the human genome can be defined by performing STS screening as described in Example 1. Preferably, to decrease the number of STSs to be tested, each BAC can be localized on chromosomal or subchromosomal regions by procedures such as those described in Examples 3 and 4. This localization will allow the selection of a subset of STSs corresponding to the identified chromosomal or subchromosomal region. Testing each BAC with such a subset of STSs and taking account of the position and order of the STSs along the genome will allow a refined positioning of the corresponding biallelic marker along the genome.
In other embodiments, if the DNA library used to isolate BAC inserts or any type of genomic DNA fragments harboring the selected biallelic markers already constitutes a physical map of the genome or any portion thereof, using the known order of the DNA fragments will allow the order of the biallelic markers to be established.
As discussed above, it will be appreciated that markers carried by the same fragment of genomic DNA, such as the insert in a BAC clone, need not necessarily be ordered with respect to one another within the genomic fragment to conduct single point or haplotype association analyses. However, in other embodiments of the present maps, the order of biallelic markers carried by the same fragment of genomic DNA may be determined.
The positions of the biallelic markers used to construct the maps of the present invention, including the map-related biallelic markers of the invention, may be assigned to subchromosomal locations using Fluorescence In Situ Hybridization (FISH) (Cherif et al., Proc. Natl. Acad. Sci. USA, 87:6639-6643 (1990)). FISH analysis is described in Example 3.
The ordering analyses may be conducted to generate an integrated genome wide genetic map comprising about 20,000, 40,000, 60,000, 80,000, 100,000, or 120,000 biallelic markers with a roughly consistent number of biallelic markers per BAC. In some embodiments, the map includes one or more markers selected from the group consisting of the sequences of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
Alternatively, maps having the above-specified average numbers of biallelic markers per BAC which comprise smaller portions of the genome, such as a set of chromosomes, a single chromosome, a particular subchromosomal region, or any other desired portion of the genome, may also be constructed using the procedures provided herein.
In some embodiments, the biallelic markers in the map are separated from one another by an average distance of 10-200kb, 15-150kb, 20-100kb, 100-150kb, 50-100kb, or 25-50kb. Maps having the above-specified intermarker distances which comprise smaller portions of the genome, such as a set of chromosomes, a single chromosome, a particular subchromosomal region, or any other desired portion of the genome, may also be constructed using the procedures provided herein.
Figure 2, showing the results of computer simulations of the distribution of intermarker spacing on a randomly distributed set of biallelic markers, indicates the percentage of biallelic markers which will be spaced a given distance apart for a given number of markers/BAC in the genomic map (assuming 20,000 BACs constituting a minimally overlapping array covering the entire genome are evaluated). One hundred iterations were performed for each simulation (20,000 marker map, 40,000 marker map, 60,000 marker map, 120,000 marker map).
As illustrated in Figure 2a, 98% of inter-marker distances will be lower than 150kb provided 60,000 evenly distributed markers are generated (3 per BAC); 90% of inter-marker distances will be lower than 150kb provided 40,000 evenly distributed markers are generated (2 per BAC); and 50% of inter-marker distances will be lower than 150kb provided 20,000 evenly distributed markers are generated (1 per BAC).
As illustrated in Figure 2b, 98% of inter-marker distances will be lower than 80kb provided 120,000 evenly distributed markers are generated (6 per BAC); 80% of inter-marker distances will be lower than 80kb provided 60,000 evenly distributed markers are generated (3 per BAC); and 15% of inter-marker distances will be lower than 80kb provided 20,000 evenly distributed markers are generated (1 per BAC).
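A simulation of this kind can be sketched in a few lines. The details of the simulations behind Figure 2 are not given in the text, so the model below is an assumption: 20,000 BACs are treated as non-overlapping 150 kb tiles (3 x 10^9 bases divided by 20,000), a fixed number of markers is placed uniformly at random within each BAC, and the function name `fraction_below` is hypothetical.

```python
import random


def fraction_below(markers_per_bac, threshold_kb, n_bacs=20_000, bac_kb=150.0, seed=0):
    """Fraction of adjacent inter-marker distances shorter than threshold_kb."""
    rng = random.Random(seed)
    positions = []
    for i in range(n_bacs):
        start = i * bac_kb  # BACs modeled as contiguous, non-overlapping tiles
        positions.extend(start + rng.uniform(0.0, bac_kb) for _ in range(markers_per_bac))
    positions.sort()
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(g < threshold_kb for g in gaps) / len(gaps)


for k in (1, 2, 3):
    print(f"{k} marker(s) per BAC: {fraction_below(k, 150.0):.3f} of gaps below 150 kb")
```

Under these assumptions the fractions below 150 kb should land near 50%, 92% and 98% for 1, 2 and 3 markers per BAC, in line with the figures quoted for Figure 2a. A single run is shown here, whereas the text reports one hundred iterations per simulation.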
As already mentioned, high density biallelic marker maps allow association studies to be performed to identify genes involved in complex traits.
Linkage Disequilibrium
The present invention then also concerns biallelic markers in linkage disequilibrium with the specific biallelic markers described above and which are expected to present similar characteristics in terms of their respective association with a given trait. In a preferred embodiment, the present invention concerns the biallelic markers that are in linkage disequilibrium with the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
LD among a set of biallelic markers having an adequate heterozygosity rate can be determined by genotyping between 50 and 1000 unrelated individuals, preferably between 75 and 200, more preferably around 100. Genotyping a biallelic marker consists of determining the specific allele carried by an individual at the given polymorphic base of the biallelic marker. Genotyping can be performed using similar methods as those described above for the generation of the biallelic markers, or using other genotyping methods such as those further described below.
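Once genotypes are available, pairwise LD is usually summarized with standard measures such as D, |D'| and r^2. The sketch below is not taken from the disclosure: it assumes that counts of the four two-marker haplotypes are already known (with unrelated individuals these would first have to be estimated, which is not shown), and the function name `ld_measures` is hypothetical.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, abs(D'), r^2) from counts of the four two-marker haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total  # frequency of allele A at the first marker
    p_B = (n_AB + n_aB) / total  # frequency of allele B at the second marker
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both markers must be polymorphic in the sample")
    d = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2


print(ld_measures(50, 0, 0, 50))    # complete disequilibrium
print(ld_measures(25, 25, 25, 25))  # linkage equilibrium
```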
Genome-wide linkage disequilibrium mapping aims at identifying, for any trait-causing allele being searched, at least one biallelic marker in linkage disequilibrium with said trait-causing allele. Preferably, in order to enhance the power of linkage disequilibrium maps, in some embodiments, the biallelic markers therein have average inter-marker distances of 150kb or less, 75kb or less, 50kb or less, 30kb or less, or 25kb or less to accommodate the fact that, in some regions of the genome, the detection of linkage disequilibrium requires lower inter-marker distances.
The present invention provides methods to generate biallelic marker maps with average inter-marker distances of 150kb or less. In some embodiments, the mean distance between biallelic markers constituting the high density map will be less than 75kb, preferably less than 50kb. Further preferred maps according to the present invention contain markers that are less than 37.5kb apart. In highly preferred embodiments, the average inter-marker spacing for the biallelic markers constituting very high density maps is less than 30kb, most preferably less than 25kb.
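The spacing thresholds above follow from dividing the genome length by the number of markers. As a quick check, assuming the 3 x 10^9 base genome size stated earlier and evenly spread markers (the helper name `average_spacing_kb` is hypothetical):

```python
GENOME_BP = 3_000_000_000  # genome length used earlier in this section


def average_spacing_kb(n_markers: int) -> float:
    """Average inter-marker distance in kb for n_markers spread over the genome."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return GENOME_BP / n_markers / 1000.0


for n in (20_000, 40_000, 60_000, 80_000, 100_000, 120_000):
    print(f"{n:>7} markers -> {average_spacing_kb(n):g} kb average spacing")
```

The six map sizes listed for the integrated map correspond to average spacings of 150, 75, 50, 37.5, 30 and 25 kb, which are the thresholds named in this paragraph.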
Genetic maps containing biallelic markers (including the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto) may be used to identify and isolate genes associated with detectable traits. The use of the genetic maps of the present invention is described in more detail below.
VIII. Use of High Density Biallelic Marker Maps to Identify Genes Associated with Detectable Traits
One embodiment of the present invention comprises methods for identifying and isolating genes associated with a detectable trait using the biallelic marker maps of the present invention.
In the past, the identification of genes linked with detectable traits has relied on a statistical approach called linkage analysis. Linkage analysis is based upon establishing a correlation between the transmission of genetic markers and that of a specific trait throughout generations within a family. In this approach, all members of a series of affected families are genotyped with a few hundred markers, typically microsatellite markers, which are distributed at an average density of one every 10 Mb. By comparing genotypes in all family members, one can attribute sets of alleles to parental haploid genomes (haplotyping or phase determination). The origin of recombined fragments is then determined in the offspring of all families. Those that co-segregate with the trait are tracked. After pooling data from all families, statistical methods are used to determine the likelihood that the marker and the trait are segregating independently in all families. As a result of the statistical analysis, one or several regions having a high probability of harboring a gene linked to the trait are selected as candidates for further analysis. The result of linkage analysis is considered significant (i.e. there is a high probability that the region contains a gene involved in a detectable trait) when the chance of independent segregation of the marker and the trait is lower than 1 in 1000 (expressed as a LOD score > 3). Generally, the length of the candidate region identified using linkage analysis is between 2 and 20Mb.
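The LOD threshold can be illustrated with the simplest possible case. The sketch below assumes fully informative, phase-known meioses, which is an idealization not stated in the text, and the function name `lod_score` is hypothetical. It compares the likelihood of the observed recombinants at a recombination fraction theta with the likelihood under independent segregation (theta = 0.5), on a log10 scale.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 odds of linkage at recombination fraction theta versus theta = 0.5."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_likelihood = 0.0
    if recombinants:
        if theta == 0.0:
            return -math.inf  # any recombinant rules out theta = 0
        log_likelihood += recombinants * math.log10(theta)
    if non_recombinants:
        log_likelihood += non_recombinants * math.log10(1.0 - theta)
    return log_likelihood - meioses * math.log10(0.5)


# Ten informative meioses with no recombinants, evaluated at theta = 0.
print(round(lod_score(0, 10, 0.0), 2))
```

Ten meioses without a single recombinant give a LOD score just above 3, i.e. odds of about 1000 to 1 against independent segregation, which is the significance threshold quoted above.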
Once a candidate region is identified as described above, analysis of recombinant individuals using additional markers allows further delineation of the candidate linked region.
Linkage analysis studies have generally relied on the use of a maximum of 5,000 microsatellite markers, thus limiting the maximum theoretical attainable resolution of linkage analysis to ca. 600 kb on average.
Linkage analysis has been successfully applied to map simple genetic traits that show clear Mendelian inheritance patterns and which have a high penetrance (penetrance is the ratio between the number of trait-positive carriers of allele a and the total number of a carriers in the population). About 100 pathological trait-causing genes were discovered using linkage analysis over the last 10 years. In most of these cases, the majority of affected individuals had affected relatives and the detectable trait was rare in the general population (frequencies less than 0.1%). In about 10 cases, such as Alzheimer's Disease, breast cancer, and Type II diabetes, the detectable trait was more common but the allele associated with the detectable trait was rare in the affected population. Thus, the alleles associated with these traits were not responsible for the trait in all sporadic cases.
Linkage analysis suffers from a variety of drawbacks. First, linkage analysis is limited by its reliance on the choice of a genetic model suitable for each studied trait. Furthermore, as already mentioned, the resolution attainable using linkage analysis is limited, and complementary studies are required to refine the analysis of the typical 2Mb to 20Mb regions initially identified through linkage analysis.
In addition, linkage analysis approaches have proven difficult when applied to complex genetic traits, such as those due to the combined action of multiple genes and/or environmental factors. In such cases, too large an effort and cost are needed to recruit the adequate number of affected families required for applying linkage analysis to these situations, as recently discussed by Risch, N. and Merikangas, K. (Science 273:1516-1517 (1996)).
Finally, linkage analysis cannot be applied to the study of traits for which no large informative families are available. Typically, this will be the case in any attempt to identify trait-causing alleles involved in sporadic cases, such as alleles associated with positive or negative responses to drug treatment.
The present genetic maps and biallelic markers (including the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto) may be used to identify and isolate genes associated with detectable traits using association studies, an approach which does not require the use of affected families and which permits the identification of genes associated with sporadic traits.
Association Studies
As already mentioned, any gene responsible or partly responsible for a given trait will be in linkage disequilibrium with some flanking markers. To map such a gene, specific alleles of these flanking markers which are associated with the gene or genes responsible for the trait are identified. Although the following discussion of techniques for finding the gene or genes associated with a particular trait using linkage disequilibrium mapping refers to locating a single gene which is responsible for the trait, it will be appreciated that the same techniques may also be used to identify genes which are partially responsible for the trait. Association studies may be conducted within the general population (as opposed to the linkage analysis techniques discussed above which are limited to studies performed on related individuals in one or several affected families).
Association between a biallelic marker A and a trait T may primarily occur as a result of three possible relationships between the biallelic marker and the trait.
First, allele a of biallelic marker A may be directly responsible for trait T (e.g., Apo E ε4 site A and Alzheimer's disease). However, since the majority of the biallelic markers used in genetic mapping studies are selected randomly, they mainly map outside of genes. Thus, the likelihood of allele a being a functional mutation directly related to trait T is very low.
Second, an association between a biallelic marker A and a trait T may also occur when the biallelic marker is very closely linked to the trait locus. In other words, an association occurs when allele a is in linkage disequilibrium with the trait-causing allele. When the biallelic marker is in close proximity to a gene responsible for the trait, more extensive genetic mapping will ultimately allow a gene to be discovered near the marker locus which carries mutations in people with trait T (i.e. the gene responsible for the trait or one of the genes responsible for the trait). As will be further exemplified below, using a group of biallelic markers which are in close proximity to the gene responsible for the trait, the location of the causal gene can be deduced from the profile of the association curve between the biallelic markers and the trait. The causal gene will usually be found in the vicinity of the marker showing the highest association with the trait.
Finally, an association between a biallelic marker and a trait may occur when people with the trait and people without the trait correspond to genetically different subsets of the population who, coincidentally, also differ in the frequency of allele a (population stratification). This phenomenon may be avoided by using ethnically matched large heterogeneous samples.
Association studies are particularly suited to the efficient identification of genes that present common polymorphisms, and are involved in multifactorial traits whose frequency is relatively higher than that of diseases with monofactorial inheritance.
Association studies mainly consist of four steps: recruitment of trait-positive (T+) and control populations, preferably trait-negative (T-) populations with well-defined phenotypes; identification of a candidate region suspected of harboring a trait-causing gene; identification of said gene among candidate genes in the region; and finally validation of the mutation(s) responsible for the trait in said trait-causing gene.
In a first step, the trait-positive phenotype should be well-defined; preferably the control phenotype is a well-defined trait-negative phenotype as well. In order to perform efficient and significant association studies such as those described herein, the trait under study should preferably follow a bimodal distribution in the population under study, presenting two clear non-overlapping phenotypes, trait-positive and trait-negative.
Nevertheless, in the absence of such a bimodal distribution (as may in fact be the case for complex genetic traits), any genetic trait may still be analyzed using the association method proposed herein by carefully selecting the individuals to be included in the trait-positive group and preferably the trait-negative phenotypic group as well. The selection procedure ideally involves selecting individuals at opposite ends of the non-bimodal phenotype spectrum of the trait under study, so as to include in these trait-positive and trait-negative populations individuals who clearly represent non-overlapping, preferably extreme phenotypes.
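The selection of extreme phenotypes can be sketched as follows. Everything in this example is an assumption made for illustration: the function name `select_extremes`, the 10% cutoff, the choice of the high end of the scale as trait-positive, and the synthetic trait values.

```python
def select_extremes(trait_values, fraction=0.1):
    """Split individuals into (trait_negative_ids, trait_positive_ids).

    trait_values: dict mapping individual ID -> quantitative trait value.
    fraction: share of the sample taken from each end of the distribution.
    """
    if not 0.0 < fraction <= 0.5:
        raise ValueError("fraction must lie in (0, 0.5]")
    if len(trait_values) < 2:
        raise ValueError("need at least two individuals")
    ranked = sorted(trait_values, key=trait_values.get)  # ascending trait value
    k = max(1, int(len(ranked) * fraction))
    # Assumption: low values are trait-negative, high values are trait-positive.
    return ranked[:k], ranked[-k:]


# Synthetic data: 20 individuals with trait values 0..19.
values = {f"id{i}": float(i) for i in range(20)}
negatives, positives = select_extremes(values, fraction=0.1)
print(negatives, positives)
```

Individuals in the middle of the distribution are simply left out, which keeps the two groups non-overlapping at the cost of a smaller sample.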
As discussed above, the definition of the inclusion criteria for the trait-positive and control populations is an important aspect of the present invention.
Figure 3 shows, for a series of hypothetical sample sizes, the p-value significance obtained in association studies performed using individual markers from the high-density biallelic map, according to various hypotheses regarding the difference of allelic frequencies between the trait-positive and trait-negative samples. It indicates that, in all cases, samples ranging from 150 to 500 individuals are numerous enough to achieve statistical significance. It will be appreciated that bigger or smaller groups can be used to perform association studies according to the methods of the present invention.
In a second step, a marker/trait association study is performed that compares the genotype frequency of each biallelic marker in the above described trait-positive and trait-negative populations by means of a chi square statistical test (one degree of freedom). In addition to this single marker association analysis, a haplotype association analysis is performed to define the frequency and the type of the ancestral carrier haplotype. Haplotype analysis, by combining the informativeness of a set of biallelic markers, increases the power of the association analysis, allowing false positive and/or negative data that may result from the single marker studies to be eliminated.
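A one-degree-of-freedom chi square test of this kind can be sketched on a 2 x 2 table of allele counts. The layout of the table (allele a versus the other allele, trait-positive versus trait-negative chromosomes), the function name `allele_chi_square` and the counts are assumptions made for illustration; no continuity correction is applied.

```python
import math


def allele_chi_square(case_a, case_other, ctrl_a, ctrl_other):
    """Pearson chi square (1 d.f.) and p-value for a 2 x 2 table of allele counts."""
    n = case_a + case_other + ctrl_a + ctrl_other
    row_case, row_ctrl = case_a + case_other, ctrl_a + ctrl_other
    col_a, col_other = case_a + ctrl_a, case_other + ctrl_other
    if 0 in (row_case, row_ctrl, col_a, col_other):
        raise ValueError("every row and column must contain at least one count")
    chi2 = n * (case_a * ctrl_other - case_other * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_other
    )
    # Survival function of the chi square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 200 trait-positive and 200 trait-negative individuals
# (400 chromosomes each), allele a seen 120 and 80 times respectively.
chi2, p = allele_chi_square(120, 280, 80, 320)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

With these made-up counts the allele frequencies are 30% and 20%, and the statistic is about 10.7, corresponding to a p-value close to 0.001.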
Genotyping can be performed using any method described in III, including the microsequencing procedure described in Example 8.
If a positive association with a trait is identified using an array of biallelic markers having a high enough density, the causal gene will be physically located in the vicinity of the associated markers, since the markers showing positive association with the trait are in linkage disequilibrium with the trait locus. Regions harboring a gene responsible for a particular trait which are identified through association studies using high density sets of biallelic markers will, on average, be 20 to 40 times shorter in length than those identified by linkage analysis.
Once a positive association is confirmed as described above, a third step consists of completely sequencing the BAC inserts harboring the markers identified in the association analyses. These BACs are obtained through screening human genomic libraries with the marker probes and/or primers, as described above. Once a candidate region has been sequenced and analyzed, the functional sequences within the candidate region (e.g. exons, splice sites, promoters, and other potential regulatory regions) are scanned for mutations which are responsible for the trait by comparing the sequences of the functional regions in a selected number of trait-positive and trait-negative individuals using appropriate software. Tools for sequence analysis are further described in Example 9.
Finally, candidate mutations are validated by screening a larger population of trait-positive and trait-negative individuals using genotyping techniques described below. Polymorphisms are confirmed as candidate mutations when the validation population shows association results compatible with those found between the mutation and the trait in the test population.
In practice, in order to define a region bearing a candidate gene, the trait-positive and trait-negative populations are genotyped using an appropriate number of biallelic markers.
The markers may include one or more of the markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
The markers used to define a region bearing a candidate gene may be distributed at an average density of 1 marker per 10-200 kb. Preferably, the markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 15-150 kb. In further preferred embodiments, the markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 20-100 kb. In yet another preferred embodiment, the markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 100 to 150 kb. In a further highly preferred embodiment, the markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 50 to 100 kb. In yet another embodiment, the biallelic markers used to define a region bearing a candidate gene are distributed at an average density of 1 marker every 25-50 kilobases. As mentioned above, in order to enhance the power of linkage disequilibrium based maps, in a preferred embodiment, the marker density of the map will be adapted to take the linkage disequilibrium distribution in the genomic region of interest into account.
In some embodiments, the initial identification of a candidate genomic region harboring a gene associated with a detectable phenotype may be conducted using a preliminary map containing a few thousand biallelic markers. Thereafter, the genomic region harboring the gene responsible for the detectable trait may be better delineated using a map containing a larger number of biallelic markers. Furthermore, the genomic region harboring the gene responsible for the detectable trait may be further delineated using a high density map of biallelic markers. Finally, the gene associated with the detectable trait may be identified and isolated using a very high density biallelic marker map. Example 6 describes a procedure for identifying a candidate region harboring a gene associated with a detectable trait and provides simulated results for this procedure. It will be appreciated that although Example 6 compares the results of simulated analyses using markers derived from maps having 3,000, 20,000, and 60,000 markers, the number of markers contained in the map is not restricted to these exemplary figures. Rather, Example 6 exemplifies the increasing refinement of the candidate region with increasing marker density.
As increasing numbers of markers are used in the analysis, points in the association analysis become broad peaks. The gene associated with the detectable trait under investigation will lie within or near the region under the peak.
The statistical power of linkage disequilibrium mapping using a high density marker map is also reinforced by complementing the single point association analysis described above with a multi-marker association analysis or haplotype analysis described in IV. To improve the statistical power of the individual marker association analyses conducted as described above using maps of increasing marker densities, haplotype studies can be performed using groups of markers located in proximity to one another within regions of the genome. For example, using the methods described above in which the association of an individual marker with a detectable phenotype was analyzed using maps of 3,000 markers, 20,000 markers, and 60,000 markers, a series of haplotype studies can be performed using groups of contiguous markers from such maps or from maps having higher marker densities.
In a preferred embodiment, a series of successive haplotype studies including groups of markers spanning regions of more than 1 Mb may be performed. In some embodiments, the biallelic markers included in each of these groups may be located within a genomic region spanning less than 1 kb, from 1 to 5 kb, from 5 to 10 kb, from 10 to 25 kb, from 25 to 50 kb, from 50 to 150 kb, from 150 to 250 kb, from 250 to 500 kb, from 500 kb to 1 Mb, or more than 1 Mb. Preferably, the genomic regions containing the groups of biallelic markers used in the successive haplotype analyses are overlapping. It will be appreciated that the groups of biallelic markers need not completely cover the genomic regions of the above-specified lengths but may instead be obtained from incomplete contigs having one or more gaps therein. As discussed in further detail below, biallelic markers may be used in single point and haplotype association analyses regardless of the completeness of the corresponding physical contig harboring them.
Genome-wide mapping using association studies with dense enough arrays of markers permits a case-by-case best estimate of p-value significance thresholds. Given a test population comprising two ethnically matched trait-positive and control groups of about 50 to about 500 individuals or more, conducting the above described association studies will allow a p-value "cut-off" to be established by, for example, analyzing significant numbers of allele frequency differences or, in some cases where appropriate, running computer simulations or control studies as described in Examples 6, 15, and 26.
For a p-value above the threshold, a corresponding association between the trait and a studied marker will be deemed not significant, while for a p-value below such a threshold, said association will be deemed significant. If the p-value is significant, the genomic region around the marker will be further scrutinized for a trait-causing gene.
It is preferred that p-value significance thresholds be assessed for each case/control population comparison. Both the genetic distance between sampled populations ("stratification") and the dispersion due to random selection of samples may indeed influence the p-value significance thresholds. It will be appreciated that the above approaches may be conducted on any scale (i.e. over the whole genome, a set of chromosomes, a single chromosome, a particular subchromosomal region, or any other desired portion of the genome). As mentioned above, once significance thresholds have been assessed, population sample sizes may be adapted as exemplified in Figure 3. Example 7 below illustrates the increase in statistical power brought to an association study by a haplotype analysis.
The results described in Examples 5 and 7, generated from individual and haplotype studies using a biallelic marker set of an average density equal to ca. 40 kb in the region of an Alzheimer's disease trait-causing gene, indicate that all biallelic markers of sufficient informative content located within a ca. 200 kb genomic region around a trait-causing allele can potentially be successfully used to localize a trait-causing gene with the methods provided by the present invention. This conclusion is further supported by the results obtained through measuring the linkage disequilibrium between markers 99-365-344 or 99-359-308 and the ApoE 4 Site A marker within Alzheimer's patients: as one could predict, since linkage disequilibrium is the supporting basis for association studies, linkage disequilibrium between these pairs of markers was enhanced in the diseased population vs. the control population, in a similar way as the haplotype analysis enhanced the significance of the corresponding association studies. Once a given polymorphic site has been found and characterized as a biallelic marker according to the methods of the present invention, several methods can be used in order to determine the specific allele carried by an individual at the given polymorphic base as described in III.
Location of a Gene Associated with Detectable Traits
Once the candidate region has been delineated using the high density biallelic marker map, a sequence analysis process will allow the detection of all genes located within said region, together with a potential functional characterization of said genes. The identified functional features may allow preferred trait-causing candidates to be chosen from among the identified genes. More biallelic markers may then be generated within said candidate genes, and used to perform refined association studies that will support the identification of the trait-causing gene. Sequence analysis processes are described in Example 9.
Examples 10-18 illustrate the application of the above methods using biallelic markers to identify a gene associated with a complex disease, prostate cancer, within a ca. 450 kb candidate region. Additional details of the identification of the gene associated with prostate cancer are provided in the U.S. Patent Application entitled "Prostate Cancer Gene" Serial No. 08/996,306.
The above methods were also used to identify biallelic markers in a gene which was an attractive candidate for a gene associated with asthma. Examples 19-26 show how the use of methods of the present invention allowed this gene to be identified as a gene responsible, at least partially, for asthma in the studied populations. Additional details of the identification of the gene associated with asthma are provided in U.S. Provisional Application Serial Nos. 60/081,893. Alternatively, genes associated with detectable traits may be identified as follows.
Candidate genomic regions suspected of harboring a gene associated with the trait may be identified using techniques such as those described herein. In such techniques, the allelic frequencies of biallelic markers are compared in nucleic acid samples derived from individuals expressing the detectable trait and individuals who do not express the detectable trait. In this manner, candidate genomic regions suspected of harboring a gene associated with the detectable trait under investigation are identified.
The existence of one or more genes associated with the detectable trait within the candidate region is confirmed by identifying more biallelic markers lying in the candidate region. A first haplotype analysis is performed for each possible combination of groups of biallelic markers within the genomic region suspected of harboring a trait-associated gene.
For example, each group may comprise three biallelic markers. For each of the groups of markers, the frequency of each possible haplotype (for groups of three markers there are 8 possible haplotypes) in individuals expressing the trait and individuals who do not express the trait is estimated. For example, a haplotype estimation method is applied as described in IV; for example, the haplotype frequencies may be estimated using the Expectation-Maximization method of Excoffier L and Slatkin M, Mol. Biol. Evol. 12:921-927 (1995). The frequencies of each of the possible haplotypes of the grouped markers (or each allele of individual markers) in individuals expressing the trait and individuals who do not express the trait are compared. For example, the frequencies may be compared by performing a chi-squared analysis. Within each group, the haplotype (or the allele of each individual marker) having the greatest association with the trait is selected. This process is repeated for each group of biallelic markers (or each allele of the individual markers) to generate a distribution of association values, which will be referred to herein as the "trait-associated" distribution.
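By way of illustration only, the selection of the most strongly associated haplotype within one group of three biallelic markers may be sketched as follows. The sketch assumes that haplotype counts have already been estimated (for instance by an Expectation-Maximization procedure, which is not implemented here); the counts, the 0/1 allele coding, and the function names chi2_2x2 and best_haplotype are hypothetical and are not taken from the present disclosure. Each haplotype is compared against all other haplotypes pooled, using a Pearson chi-squared statistic with one degree of freedom.

```python
from itertools import product


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 d.f.) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def best_haplotype(case_counts, control_counts):
    """Return (haplotype, chi2) for the haplotype most associated with the trait.

    Each haplotype is tested against all other haplotypes pooled together.
    """
    n_case = sum(case_counts.values())
    n_ctrl = sum(control_counts.values())
    best = None
    for hap, a in case_counts.items():
        c = control_counts.get(hap, 0)
        stat = chi2_2x2(a, n_case - a, c, n_ctrl - c)
        if best is None or stat > best[1]:
            best = (hap, stat)
    return best


# Three biallelic markers, alleles coded 0/1: 2**3 = 8 possible haplotypes.
haplotypes = ["".join(h) for h in product("01", repeat=3)]

# Hypothetical estimated haplotype counts (numbers of chromosomes).
case_counts = dict(zip(haplotypes, [40, 10, 10, 5, 10, 5, 10, 110]))
control_counts = dict(zip(haplotypes, [60, 20, 20, 10, 20, 10, 20, 40]))

hap, stat = best_haplotype(case_counts, control_counts)
print(f"most associated haplotype: {hap}, chi-squared = {stat:.2f}")
```

Repeating this selection over every group of markers in the candidate region would yield the collection of association values that the text calls the "trait-associated" distribution.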
A second haplotype analysis is performed for each possible combination of groups of biallelic markers within the genomic regions which are not suspected of harboring a trait-associated gene. For example, each group may comprise three biallelic markers. For each of the groups of markers, the frequency of each possible haplotype (for groups of three markers there are 8 possible haplotypes) in individuals expressing the trait and individuals who do not express the trait is estimated. The frequencies of each of the possible haplotypes of the grouped markers (or each allele of individual markers) in individuals expressing the trait and individuals who do not express the trait are compared. For example, the frequencies may be compared by performing a chi-squared analysis. Within each group, the haplotype (or the allele of each individual marker) having the greatest association with the trait is selected. This process is repeated for each group of biallelic markers (or each allele of the individual markers) to generate a distribution of association values, which will be referred to herein as the "random" distribution.
The trait-associated distribution and the random distribution are then compared to one another to determine if there are significant differences between them. For example, the trait-associated distribution and the random distribution can be compared using either the Wilcoxon rank test (Noether, G.E. (1991) Introduction to statistics: "The nonparametric way", Springer-Verlag, New York, Berlin) or the Kolmogorov-Smirnov test (Saporta, G. (1990) "Probabilités, analyse des données et statistique", Éditions Technip, Paris) or both the Wilcoxon rank test and the Kolmogorov-Smirnov test. If the trait-associated distribution and the random distribution are found to be significantly different, the candidate genomic region is highly likely to contain a gene associated with the detectable trait. Accordingly, the candidate genomic region is evaluated more fully to isolate the trait-associated gene. Alternatively, if the trait-associated distribution and the random distribution are equal using the above analyses, the candidate genomic region is unlikely to contain a gene associated with the detectable trait. Accordingly, no further analysis of the candidate genomic region is performed.
While Examples 10 to 26 illustrate the use of the maps and markers of the present invention for identifying a new gene associated with a complex disease within a 2 Mb genomic region and for establishing that a candidate gene is, at least partially, responsible for a disease, the maps and markers of the present invention may also be used to identify one or more biallelic markers or one or more genes associated with other detectable phenotypes, including drug response, drug toxicity, or drug efficacy. The biallelic markers used in such drug response analyses or shown, using the methods of the present invention, to be associated with such traits may lie within or near genes responsible for or partly responsible for a particular disease, for example a disease against which the drug is meant to act, or may lie within genomic regions which are not responsible for or partly responsible for a disease.
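By way of illustration only, the comparison of a trait-associated distribution with a random distribution of association values, as described above, may be sketched as follows. The sketch computes the two-sample Kolmogorov-Smirnov statistic and a Wilcoxon rank-sum z-score under a normal approximation (midranks for ties, without tie correction of the variance). The function names and the example association values are hypothetical, and the conversion of these statistics into p-values against a chosen significance threshold is left out.

```python
import math


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past every observation equal to v.
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def rank_sum_z(x, y):
    """Wilcoxon rank-sum z-score of sample x against sample y (normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    k = 0
    while k < len(pooled):
        m = k
        while m < len(pooled) and pooled[m][0] == pooled[k][0]:
            m += 1
        midrank = (k + 1 + m) / 2  # average of the ranks k+1 .. m
        rank_sum_x += midrank * sum(1 for t in pooled[k:m] if t[1] == 0)
        k = m
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (rank_sum_x - mean) / sd


# Hypothetical association values (e.g. best chi-squared value per marker group).
trait_associated = [9.1, 7.4, 12.3, 8.8, 10.5]
random_regions = [1.2, 2.5, 0.8, 3.1, 1.9]

d_stat = ks_statistic(trait_associated, random_regions)
z_score = rank_sum_z(trait_associated, random_regions)
print(f"KS statistic D = {d_stat:.2f}, rank-sum z = {z_score:.2f}")
```

A large D or a large absolute z-score would suggest that the two distributions differ, which in the procedure above would prompt fuller evaluation of the candidate region.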
In the context of the present invention, a "positive response" to a medicament can be defined as comprising a reduction of the symptoms related to the disease or condition to be treated. In the context of the present invention, a "negative response" to a medicament can be defined as comprising either a lack of positive response to the medicament which does not lead to a symptom reduction or to a side-effect observed following administration of the medicament.
Drug efficacy, response and tolerance/toxicity can be considered as multifactorial traits involving a genetic component in the same way as complex diseases such as Alzheimer's disease, prostate cancer, hypertension or diabetes. As such, the identification of genes involved in drug efficacy and toxicity could be achieved following a positional cloning approach, e.g. performing linkage analysis within families in order to obtain the subchromosomal location of the gene(s). However, this type of analysis is actually impractical in the case of drug responsiveness, due to the lack of availability of familial cases. In fact, the likelihood of having more than one individual in a particular family being exposed to the same drug at the same time is very low. Therefore, drug efficacy and toxicity can only be analyzed as sporadic traits.
In order to conduct association studies to analyze the individual response to a given drug in groups of patients affected with a disease, up to four groups are screened to determine their patterns of biallelic markers using the techniques described above. The four groups are:
- Non-diseased or random controls,
- Diseased patients/drug responders,
- Diseased patients/drug non-responders, and
- Diseased patients/drug side effects.
In preferred embodiments, the above mentioned groups are recruited according to phenotyping criteria having the characteristics described above, so that the phenotypes defining the different groups are non-overlapping, preferably extreme phenotypes. In highly preferred embodiments, such phenotyping criteria have the bimodal distribution described above.
The final number and composition of the groups for each drug association study is adapted to the distribution of the above described phenotypes within the studied population. After selecting a suitable population, association and haplotype analyses may be performed as described herein to identify one or more biallelic markers associated with drug response, preferably drug toxicity or drug efficacy. The identification of such one or more biallelic markers allows one to conduct diagnostic tests to determine whether the administration of a drug to an individual will result in drug response, preferably drug toxicity, or drug efficacy.
The methods described above for identifying a gene associated with prostate cancer and biallelic markers indicative of a risk of suffering from asthma may be utilized to identify genes associated with other detectable phenotypes. In particular, the above methods may be used with any marker or combination of markers included in the maps of the present invention, including the biallelic markers of SEQ ID Nos.: 1 to 3809 or the sequences complementary thereto. As described above, the general strategy to perform the association studies using the maps and markers of the present invention is to scan two groups of individuals (trait-positive individuals and trait-negative controls) characterized by a well defined phenotype in order to measure the allele frequencies of the biallelic markers in each of these groups. Preferably, the frequencies of markers with inter-marker spacing of about 150 kb are determined in each group. More preferably, the frequencies of markers with inter-marker spacing of about 75 kb are determined in each group. Even more preferably, markers with inter-marker spacing of about 50 kb, about 37.5 kb, about 30 kb, or about 25 kb will be tested in each population. In some embodiments the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, 2000, 3000, or all of the biallelic markers of SEQ ID Nos.: 1 to 3908 or the sequences complementary thereto are measured in each population. In another embodiment, the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, 2000, or 3000 biallelic markers selected from the group consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers of 1 to 3908 or the sequences complementary thereto are measured in each population. In some embodiments the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, 2000, or all of the biallelic markers of SEQ ID Nos.: 1 to 2260 or the sequences complementary thereto are measured in each population.
In another embodiment, the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, or 2000 biallelic markers selected from the group consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers of 1 to 2260 or the sequences complementary thereto are measured in each population. In some embodiments the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, or all of the biallelic markers of SEQ ID Nos.: 2261 to 3734 or the sequences complementary thereto are measured in each population. In another embodiment, the frequencies of 1, 5, 10, 20, 50, 100, 500, or 1000 biallelic markers selected from the group consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers of 2261 to 3734 or the sequences complementary thereto are measured in each population. In some embodiments the frequencies of 1, 5, 10, 20, 50, 100, or all of the biallelic markers of SEQ ID Nos.: 3735 to 3908 or the sequences complementary thereto are measured in each population. In another embodiment, the frequencies of 1, 5, 10, 20, 50, or 100 biallelic markers selected from the group consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers of 3735 to 3908 or the sequences complementary thereto are measured in each population.
In some embodiments, the frequencies of about 20,000, or about 40,000 biallelic markers are determined in each population. In a highly preferred embodiment, the frequencies of about 60,000, about 80,000, about 100,000, or about 120,000 biallelic markers are determined in each population. In some embodiments, haplotype analyses may be run using groups of markers located within regions spanning less than 1 kb, from 1 to 5 kb, from 5 to 10 kb, from 10 to 25 kb, from 25 to 50 kb, from 50 to 150 kb, from 150 to 250 kb, from 250 to 500 kb, from 500 kb to 1 Mb, or more than 1 Mb.
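By way of illustration only, the relationship between the inter-marker spacings and the marker counts recited above may be checked with the short calculation below. It assumes a haploid human genome size of about 3,000 Mb, a figure that is not stated in this passage, and an even distribution of markers; the function name markers_needed is hypothetical.

```python
GENOME_KB = 3_000_000  # assumption: haploid human genome of about 3,000 Mb


def markers_needed(spacing_kb, genome_kb=GENOME_KB):
    """Approximate number of evenly spaced markers for a given average spacing."""
    return round(genome_kb / spacing_kb)


spacings_kb = (150, 75, 50, 37.5, 30, 25)
counts = [markers_needed(s) for s in spacings_kb]
for spacing, count in zip(spacings_kb, counts):
    print(f"about one marker every {spacing} kb -> about {count:,} markers")
```

Under that assumption, spacings of about 150, 75, 50, 37.5, 30, and 25 kb correspond to about 20,000, 40,000, 60,000, 80,000, 100,000, and 120,000 markers respectively.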
Allele frequency can be measured using any genotyping method described herein including microsequencing techniques; preferred high throughput microsequencing procedures are further exemplified in III; it will be further appreciated that any other large scale genotyping method suitable for the intended purpose contemplated herein may also be used. It will be appreciated that it is not necessary to use a full high density biallelic marker map in order to start a genome-wide association study. Maps having higher densities of biallelic markers (two or more markers per BAC, average inter-marker spacing of about 75 kb or less) may then be generated by starting first on those BACs for which a candidate association has been established at the first step.
In cases when one or more candidate regions have previously been delineated, such as cases where a particular gene or genomic region is suspected of being associated with a trait, local excerpts of biallelic marker maps having densities above one marker per 150 kb may be exploited using BACs harboring said genomic regions, or genes, or portions thereof. In these cases also, successive association studies may be performed using sets of biallelic markers showing increasing densities, preferably from about one every 150 kb to about one every 75 kb; more preferably, sets of markers with inter-marker spacing below about 50 kb, below about 37.5 kb, below about 30 kb, most preferably below about 25 kb, will be used. Haplotype analyses may also be conducted using groups of biallelic markers within the candidate region. The biallelic markers included in each of these groups may be located within a genomic region spanning less than 1 kb, from 1 to 5 kb, from 5 to 10 kb, from 10 to 25 kb, from 25 to 50 kb, from 50 to 150 kb, from 150 to 250 kb, from 250 to 500 kb, from 500 kb to 1 Mb, or more than 1 Mb. It will be appreciated that the ordered DNA fragments containing these groups of biallelic markers need not completely cover the genomic regions of these lengths but may instead be incomplete contigs having one or more gaps therein. As discussed in further detail below, biallelic markers may be used in association studies and haplotype analyses regardless of the completeness of the corresponding physical contig harboring them, provided linkage disequilibrium between the markers can be assessed.
As described above, if a positive association with a trait, such as a disease, or a drug efficacy and/or toxicity, is identified using the biallelic markers and maps of the present invention, the maps will provide not only the confirmation of the association, but also a shortcut towards the identification of the gene involved in the trait under study. As described above, since the markers showing positive association to the trait are in linkage disequilibrium with the trait loci, the causal gene will be physically located in the vicinity of these markers. Regions identified through association studies using high density maps will on average have a 20 - 40 times shorter length than those identified by linkage analysis (2 to 20 Mb).
As described above, once a positive association is confirmed with the high density biallelic marker maps of the present invention, BACs from which the most highly associated markers were derived are completely sequenced and the mutations in the causal gene are searched by applying genomic analysis tools. As described above, once a region harboring a gene associated with a detectable trait has been sequenced and analyzed, the candidate functional regions (e.g. exons and splice sites, promoters and other regulatory regions) are scanned for mutations by comparing the sequences of a selected number of controls and cases, using adequate software.
In some embodiments, trait-positive samples being compared to identify causal mutations are selected among those carrying the ancestral haplotype; in these embodiments, control samples are chosen from individuals not carrying said ancestral haplotype. In further embodiments, trait-positive samples being compared to identify causal mutations are selected among those showing haplotypes that are as close as possible to the ancestral haplotype; in these embodiments, control samples are chosen from individuals not carrying any of the haplotypes selected for the case population.
The maps and biallelic markers of the present invention may also be used to identify patterns of biallelic markers associated with detectable traits resulting from polygenic interactions. The analysis of genetic interaction between alleles at unlinked loci requires individual genotyping using the techniques described herein. The analysis of allelic interaction among a selected set of biallelic markers with appropriate p-values can be considered as a haplotype analysis, similar to those described in further details within the present invention.
IX. Use of Biallelic Markers to Identify Individuals Likely to Exhibit a Detectable Trait Associated with a Particular Allele of a Known Gene
In addition to their utility in searches for genes associated with detectable traits on a genome-wide, chromosome-wide, or subchromosomal level, the maps and biallelic markers of the present invention may be used in more targeted approaches for identifying individuals likely to exhibit a particular detectable trait or individuals who exhibit a particular detectable trait as a consequence of possessing a particular allele of a gene associated with the detectable trait. For example, the biallelic markers and maps of the present invention may be used to identify individuals who carry an allele of a known gene that is suspected of being associated with a particular detectable trait. In particular, the target genes may be genes having alleles which predispose an individual to suffer from a specific disease state. In other cases, the target genes may be genes having alleles that predispose an individual to exhibit a desired or undesired response to a drug or other pharmaceutical composition, a food, or any administered compound. The known gene may encode any of a variety of types of biomolecules. For example, the known genes targeted in such analyses may be genes known to be involved in a particular step in a metabolic pathway in which disruptions may cause a detectable trait. Alternatively, the target genes may be genes encoding receptors or ligands which bind to receptors in which disruptions may cause a detectable trait, genes encoding transporters, genes encoding proteins with signaling activities, genes encoding proteins involved in the immune response, genes encoding proteins involved in hematopoiesis, or genes encoding proteins involved in wound healing. It will be appreciated that the target genes are not limited to those specifically enumerated above, but may be any gene known to be or suspected of being associated with a detectable trait.
As previously mentioned, the maps and markers of the present invention may be used to identify genes associated with drug response. The biallelic markers of the present invention may also be used to select individuals for inclusion in the clinical trials of a drug. In some embodiments, the markers of SEQ ID Nos.: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto may be used in targeted approaches to identify individuals at risk of developing a detectable trait, for example a complex disease or desired/undesired drug response, or to identify individuals exhibiting said trait. The present invention provides methods to establish putative associations between any of the biallelic markers described herein and any detectable traits, including those specifically described herein. To use the maps and markers of the present invention in further targeted approaches, biallelic markers which are in linkage disequilibrium with any of the above disclosed markers may be identified. In cases where one or more biallelic markers of the present invention have been shown to be associated with a detectable trait, more biallelic markers in linkage disequilibrium with said associated biallelic markers may be generated and used to perform targeted approaches aiming at identifying individuals exhibiting, or likely to exhibit, said detectable trait, according to the methods provided herein.
Furthermore, in cases where a candidate gene is suspected of being associated with a particular detectable trait or suspected of causing the detectable trait, biallelic markers in linkage disequilibrium with said candidate gene may be identified and used in targeted approaches, such as the approaches utilized above for the asthma-associated gene and the Apo E gene.
Biallelic markers that are in linkage disequilibrium with markers associated with a detectable trait, or with genes associated with a detectable trait, or suspected of being so, are identified by performing single marker analyses, haplotype association analyses, or linkage disequilibrium measurements on samples from trait-positive and trait-negative individuals as described above using biallelic markers lying in the vicinity of the target marker or gene. In this manner, a single biallelic marker or a group of biallelic markers may be identified which indicate that an individual is likely to possess the detectable trait or does possess the detectable trait as a consequence of a particular allele of the target marker or gene. Nucleic acid samples from individuals to be tested for predisposition to a detectable trait or possession of a detectable trait as a consequence of a particular allele of the target gene may be examined using the diagnostic methods described above.
Throughout this application, various publications, patents, and published patent applications are cited. The disclosures of the publications, patents, and published patent specifications referenced in this application are hereby incorporated by reference into the present disclosure to more fully describe the state of the art to which this invention pertains.
EXAMPLES
Several of the methods of the present invention are described in the following examples, which are offered by way of illustration and not by way of limitation. Many other modifications and variations of the invention as herein set forth can be made without departing from the spirit and scope thereof and therefore only such limitations should be imposed as are indicated by the appended claims.
Example 1
Ordering of a BAC Library: Screening Clones with STSs
The BAC library is screened with a set of PCR-typeable STSs to identify clones containing the STSs. To facilitate PCR screening of several thousand clones, for example 200,000 clones, pools of clones are prepared.
Three-dimensional pools of the BAC libraries are prepared as described in Chumakov et al. and are screened for the ability to generate an amplification fragment in amplification reactions conducted using primers derived from the ordered STSs (Chumakov et al. (1995), supra). A BAC library typically contains 200,000 BAC clones. Since the average size of each insert is 100-300 kb, the overall size of such a library is equivalent to the size of at least about 7 human genomes. This library is stored as an array of individual clones in 518 384-well plates. It can be divided into 74 primary pools (7 plates each). Each primary pool can then be divided into 48 subpools prepared by using a three-dimensional pooling system based on the plate, row and column address of each clone (more particularly, 7 subpools consisting of all clones residing in a given microtiter plate; 16 subpools consisting of all clones in a given row; 24 subpools consisting of all clones in a given column).
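By way of illustration only, the plate, row and column addressing that underlies the three-dimensional pooling described above may be sketched as follows. The sketch assumes 0-based indices, a 384-well plate laid out as 16 rows by 24 columns, and consecutive assignment of plates to primary pools; the function names and the "P"/"R"/"C" labels are hypothetical. Only the plate, row and column subpools enumerated in the text are modeled.

```python
PLATES = 518
ROWS, COLS = 16, 24        # layout of one 384-well plate
PLATES_PER_POOL = 7        # plates combined into one primary pool


def pools_for_clone(plate, row, col):
    """Map a clone address to its primary pool and its three subpools.

    Returns (primary_pool, plate_subpool, row_subpool, column_subpool).
    """
    if not (0 <= plate < PLATES and 0 <= row < ROWS and 0 <= col < COLS):
        raise ValueError("address outside the library")
    primary, plate_in_pool = divmod(plate, PLATES_PER_POOL)
    return primary, ("P", plate_in_pool), ("R", row), ("C", col)


def decode_address(primary, plate_in_pool, row, col):
    """Recover the clone address from a positive primary pool and one positive
    subpool per dimension (unambiguous only if a single clone is positive)."""
    return primary * PLATES_PER_POOL + plate_in_pool, row, col


n_primary_pools = PLATES // PLATES_PER_POOL   # number of primary pools
n_clones = PLATES * ROWS * COLS               # total wells in the library

print(n_primary_pools, n_clones)
print(pools_for_clone(517, 15, 23))
print(decode_address(73, 6, 15, 23))
```

With this layout a single positive clone is located by one round of PCR on the primary pools followed by one round on the plate, row and column subpools of each positive primary pool, rather than by testing every clone individually.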
Amplification reactions are conducted on the pooled BAC clones using primers specific for the STSs. For example, the three dimensional pools may be screened with 45,000 STSs whose positions relative to one another and locations along the genome are known. Preferably, the three dimensional pools are screened with about 30,000 STSs whose positions relative to one another and locations along the genome are known. In a highly preferred embodiment, the three dimensional pools are screened with about 20,000 STSs whose positions relative to one another and locations along the genome are known.
Amplification products resulting from the amplification reactions are detected by conventional agarose gel electrophoresis combined with automatic image capturing and processing. PCR screening for a STS involves three steps: (1) identifying the positive primary pools; (2) for each positive primary pool, identifying the positive plate, row and column 'subpools' to obtain the address of the positive clone; (3) directly confirming the PCR assay on the identified clone. PCR assays are performed with primers specifically defining the STS.
Screening is conducted as follows. First, BAC DNA containing the genomic inserts is prepared as follows. Bacteria containing the BACs are grown overnight at 37°C in 120 μl of LB containing chloramphenicol (12 μg/ml). DNA is extracted by the following protocol:
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and resuspend pellet in 120 μl TE 10-2 (Tris HCl 10 mM, EDTA 2 mM)
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and incubate pellet with 20 μl lysozyme 1 mg/ml for 15 min at room temperature
Add 20 μl proteinase K 100 μg/ml and incubate 15 min at 60°C
Add 8 μl DNAse 2U/μl and incubate 1 hr at room temperature
Add 100 μl TE 10-2 and keep at -80°C
PCR assays are performed using the following protocol:
Final volume                                          15 μl
BAC DNA                                               1.7 ng/μl
MgCl2                                                 2 mM
dNTP (each)                                           200 μM
primer (each)                                         2.9 ng/μl
Ampli Taq Gold DNA polymerase                         0.05 unit/μl
PCR buffer (10x = 0.1 M TrisHCl pH 8.3, 0.5 M KCl)    1x
The amplification is performed on a Genius II thermocycler. After heating at 95°C for 10 min, 40 cycles are performed. Each cycle comprises: 30 sec at 95°C, 54°C for 1 min, and 30 sec at 72°C. For final elongation, 10 min at 72°C end the amplification. PCR products are analyzed on 1% agarose gel with 0.1 mg/ml ethidium bromide. Alternatively, a YAC (Yeast Artificial Chromosome) library can be used. The very large insert size, of the order of 1 megabase, is the main advantage of the YAC libraries. The library can typically include about 33,000 YAC clones as described in Chumakov et al. (1995, supra). The YAC screening protocol may be the same as the one used for BAC screening.
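The reaction composition and cycling programme given above for BAC screening lend themselves to a quick sanity check. The sketch below, whose function names are hypothetical, converts the per-microliter concentrations into per-reaction amounts for the 15 μl final volume and sums the programmed hold times. It ignores ramping between temperatures, so the actual run on a given thermocycler would be somewhat longer.

```python
FINAL_VOLUME_UL = 15.0

# Concentrations quoted per microliter of final reaction (from the text).
PER_UL = {
    "BAC DNA (ng)": 1.7,
    "each primer (ng)": 2.9,
    "Ampli Taq Gold (units)": 0.05,
}


def per_reaction_amounts(volume_ul: float = FINAL_VOLUME_UL) -> dict:
    """Total amount of each component in one reaction of the given volume."""
    return {name: conc * volume_ul for name, conc in PER_UL.items()}


def programmed_minutes(cycles: int = 40,
                       initial_s: int = 600,
                       steps_s: tuple = (30, 60, 30),
                       final_s: int = 600) -> float:
    """Sum of programmed hold times in minutes, excluding ramp time.

    Defaults follow the text: 10 min at 95°C, then 40 cycles of
    30 s at 95°C, 1 min at 54°C and 30 s at 72°C, then 10 min at 72°C.
    """
    return (initial_s + cycles * sum(steps_s) + final_s) / 60


if __name__ == "__main__":
    for name, amount in per_reaction_amounts().items():
        print(f"{name}: {amount:.2f}")
    print(f"programmed time: {programmed_minutes():.0f} min")
```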
The known order of the STSs is then used to align the BAC inserts in an ordered array (contig) spanning the whole human genome. If necessary, new STSs to be tested can be generated by sequencing the ends of selected BAC inserts. Subchromosomal localization of the BACs can be established and/or verified by fluorescence in situ hybridization (FISH), performed on metaphasic chromosomes as described by Cherif et al. 1990 and in Example 3 below. BAC insert size may be determined by Pulsed Field Gel Electrophoresis after digestion with the restriction enzyme NotI.
Finally, a minimally overlapping set of BAC clones, with known insert size and subchromosomal location, covering the entire genome, a set of chromosomes, a single chromosome, a particular subchromosomal region, or any other desired portion of the genome is selected from the DNA library. For example, the BAC clones may cover at least 100 kb of contiguous genomic DNA, at least 250 kb of contiguous genomic DNA, at least 500 kb of contiguous genomic DNA, at least 2 Mb of contiguous genomic DNA, at least 5 Mb of contiguous genomic DNA, at least 10 Mb of contiguous genomic DNA, or at least 20 Mb of contiguous genomic DNA.
Example 2
Screening BAC libraries with biallelic markers
Amplification primers enabling the specific amplification of DNA fragments carrying the biallelic markers, including the map-related biallelic markers of the invention, may be used to screen clones in any genomic DNA library, preferably the BAC libraries described above, for the presence of the biallelic markers.
Pairs of primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 were designed which allow the amplification of fragments carrying the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto. The amplification primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 may be used to screen clones in a genomic DNA library for the presence of the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.
It will be appreciated that amplification primers for the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 need not be identical to the primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. Rather, they can be any other primers allowing the specific amplification of any DNA fragment carrying the markers and may be designed using techniques familiar to those skilled in the art. The amplification primers may be oligonucleotides of 8, 10, 15, 20 or more bases in length which enable the amplification of any fragment carrying the polymorphic site in the markers. The polymorphic base may be in the center of the amplification product or, alternatively, it may be located off-center. For example, in some embodiments, the amplification product produced using these primers may be at least 100 bases in length (i.e. 50 nucleotides on each side of the polymorphic base in amplification products in which the polymorphic base is centrally located). In other embodiments, the amplification product produced using these primers may be at least 500 bases in length (i.e. 250 nucleotides on each side of the polymorphic base in amplification products in which the polymorphic base is centrally located). In still further embodiments, the amplification product produced using these primers may be at least 1000 bases in length (i.e. 500 nucleotides on each side of the polymorphic base in amplification products in which the polymorphic base is centrally located). Amplification primers such as those described above are included within the scope of the present invention.
The localization of biallelic markers on BAC clones is performed essentially as described in Example 1.
The BAC clones to be screened are distributed in three dimensional pools as described in Example 1.
Amplification reactions are conducted on the pooled BAC clones using primers specific for the biallelic markers to identify BAC clones which contain the biallelic markers, using procedures essentially similar to those described in Example 1.
Amplification products resulting from the amplification reactions are detected by conventional agarose gel electrophoresis combined with automatic image capturing and processing. PCR screening for a biallelic marker involves three steps: (1) identifying the positive primary pools; (2) for each positive primary pool, identifying the positive plate, row and column 'subpools' to obtain the address of the positive clone; (3) directly confirming the PCR assay on the identified clone. PCR assays are performed with primers defining the biallelic marker.
Screening is conducted as follows. First, BAC DNA is isolated as follows. Bacteria containing the genomic inserts are grown overnight at 37°C in 120 μl of LB containing chloramphenicol (12 μg/ml). DNA is extracted by the following protocol:
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and resuspend pellet in 120 μl TE 10-2 (Tris HCl 10 mM, EDTA 2 mM)
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and incubate pellet with 20 μl lysozyme 1 mg/ml for 15 min at room temperature
Add 20 μl proteinase K 100 μg/ml and incubate 15 min at 60°C
Add 8 μl DNAse 2U/μl and incubate 1 hr at room temperature
Add 100 μl TE 10-2 and keep at -80°C
PCR assays are performed using the following protocol:
Final volume                                          15 μl
BAC DNA                                               1.7 ng/μl
MgCl2                                                 2 mM
dNTP (each)                                           200 μM
primer (each)                                         2.9 ng/μl
Ampli Taq Gold DNA polymerase                         0.05 unit/μl
PCR buffer (10x = 0.1 M TrisHCl pH 8.3, 0.5 M KCl)    1x
The amplification is performed on a Genius II thermocycler. After heating at 95°C for 10 min, 40 cycles are performed. Each cycle comprises: 30 sec at 95°C, 54°C for 1 min, and 30 sec at 72°C. For final elongation, 10 min at 72°C end the amplification. PCR products are analyzed on 1% agarose gel with 0.1 mg/ml ethidium bromide.
Example 3
Assignment of Biallelic Markers to Subchromosomal Regions
Metaphase chromosomes are prepared from phytohemagglutinin (PHA)-stimulated blood cell donors. PHA-stimulated lymphocytes from healthy males are cultured for 72 h in RPMI-1640 medium. For synchronization, methotrexate (10 mM) is added for 17 h, followed by addition of 5-bromodeoxyuridine (5-BudR, 0.1 mM) for 6 h. Colcemid (1 mg/ml) is added for the last 15 min before harvesting the cells. Cells are collected, washed in RPMI, incubated with a hypotonic solution of KCl (75 mM) at 37°C for 15 min and fixed in three changes of methanol:acetic acid (3:1). The cell suspension is dropped onto a glass slide and air-dried.
BAC clones carrying the biallelic markers used to construct the maps of the present invention (including the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto) can be isolated as described above. These BACs or portions thereof, including fragments carrying said biallelic markers, obtained for example from amplification reactions using pairs of primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, can be used as probes to be hybridized with metaphasic chromosomes. It will be appreciated that the hybridization probes to be used in the contemplated method may be generated using alternative methods well known to those skilled in the art. Hybridization probes may have any length suitable for this intended purpose.
Probes are then labeled with biotin-16 dUTP by nick translation according to the manufacturer's instructions (Bethesda Research Laboratories, Bethesda, MD), purified using a Sephadex G-50 column (Pharmacia, Uppsala, Sweden) and precipitated. Just prior to hybridization, the DNA pellet is dissolved in hybridization buffer (50% formamide, 2 X SSC, 10% dextran sulfate, 1 mg/ml sonicated salmon sperm DNA, pH 7) and the probe is denatured at 70°C for 5-10 min. Slides kept at -20°C are treated for 1 h at 37°C with RNase A (100 mg/ml), rinsed three times in 2 X SSC and dehydrated in an ethanol series. Chromosome preparations are denatured in 70% formamide, 2 X SSC for 2 min at 70°C, then dehydrated at 4°C. The slides are treated with proteinase K (10 mg/100 ml in 20 mM Tris-HCl, 2 mM CaCl2) at 37°C for 8 min and dehydrated. The hybridization mixture containing the probe is placed on the slide, covered with a coverslip, sealed with rubber cement and incubated overnight in a humid chamber at 37°C.
After hybridization and post-hybridization washes, the biotinylated probe is detected by avidin-FITC and amplified with additional layers of biotinylated goat anti-avidin and avidin-FITC. For chromosomal localization, fluorescent R-bands are obtained as previously described (Cherif et al. (1990), supra). The slides are observed under a LEICA fluorescence microscope (DMRXA). Chromosomes are counterstained with propidium iodide and the fluorescent signal of the probe appears as two symmetrical yellow-green spots on both chromatids of the fluorescent R-band chromosome (red). Thus, a particular biallelic marker may be localized to a particular cytogenetic R-band on a given chromosome.
The above procedure was used to confirm the subchromosomal location of many of the BAC clones harboring the markers obtained above. In particular, several of the markers were assigned to subchromosomal regions of chromosome 21. Simple identification numbers were attributed to each BAC from which the markers are derived. Figure 1 is a cytogenetic map of chromosome 21 indicating the subchromosomal regions therein. Amplification primers for generating amplification products containing the polymorphic bases of these markers are also provided in the accompanying Sequence Listing. In addition, microsequencing primers for use in determining the identities of the polymorphic bases of these biallelic markers are provided in the accompanying Sequence Listing.
The rate at which biallelic markers may be assigned to subchromosomal regions may be enhanced through automation. For example, probe preparation may be performed in a microtiter plate format, using adequate robots. The rate at which biallelic markers may be assigned to subchromosomal regions may be enhanced using techniques which permit the in situ hybridization of multiple probes on a single microscope slide, such as those disclosed in Larin et al., Nucleic Acids Research 22: 3689-3692 (1994). In the largest test format described, different probes were hybridized simultaneously by applying them directly from a 96-well microtiter dish which was inverted on a glass plate. Software for image data acquisition and analysis that is adapted to each optical system, test format, and fluorescent probe used, can be derived from the system described in Lichter et al. Science 247: 64-69 (1990). Such software measures the relative distance between the center of the fluorescent spot corresponding to the hybridized probe and the telomeric end of the short arm of the corresponding chromosome, as compared to the total length of the chromosome. The rate at which biallelic markers are assigned to subchromosomal locations may be further enhanced by simultaneously applying probes labeled with different fluorescent tags to each well of the 96 well dish. A further benefit of conducting the analysis on one slide is that it facilitates automation, since a microscope having a moving stage and the capability of detecting fluorescent signals in different metaphase chromosomes could provide the coordinates of each probe on the metaphase chromosomes distributed on the 96 well dish.
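The relative-distance measurement described above reduces to a simple ratio. The following sketch is not the software of Lichter et al.; it merely illustrates the computation, with hypothetical function names and with measurements assumed to be in any consistent unit (for example pixels).

```python
from statistics import mean


def fractional_position(spot_to_short_arm_end: float,
                        chromosome_length: float) -> float:
    """Distance from the short-arm telomeric end to the probe spot,
    expressed as a fraction of total chromosome length (0 to 1)."""
    if chromosome_length <= 0:
        raise ValueError("chromosome length must be positive")
    if not 0 <= spot_to_short_arm_end <= chromosome_length:
        raise ValueError("spot must lie on the chromosome")
    return spot_to_short_arm_end / chromosome_length


def mean_fractional_position(measurements) -> float:
    """Average the ratio over several metaphases.

    measurements: iterable of (spot_to_short_arm_end, chromosome_length)
    pairs. Averaging over metaphases is an assumption of this sketch.
    """
    return mean(fractional_position(d, total) for d, total in measurements)


if __name__ == "__main__":
    # Hypothetical pixel measurements from three metaphase spreads.
    print(mean_fractional_position([(41.0, 160.0), (38.5, 154.0), (45.0, 180.0)]))
```

Because the ratio is dimensionless, it can be compared across metaphases with different degrees of chromosome condensation.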
Example 4 below describes an alternative method to position biallelic markers which allows their assignment to human chromosomes.
Example 4
Assignment of Biallelic Markers to Human Chromosomes
The biallelic markers used to construct the maps of the present invention, including the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto, may be assigned to a human chromosome using monosomal analysis as described below.
The chromosomal localization of a biallelic marker can be performed through the use of somatic cell hybrid panels. For example, 24 panels, each panel containing a different human chromosome, may be used (Russell et al., Somat Cell Mol. Genet 22:425-431 (1996); Drwinga et al., Genomics 16:311-314 (1993)).
The biallelic markers are localized as follows. The DNA of each somatic cell hybrid is extracted and purified. Genomic DNA samples from a somatic cell hybrid panel are prepared as follows. Cells are lysed overnight at 42°C with 3.7 ml of lysis solution composed of:
3 ml TE 10-2 (Tris HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
200 μl SDS 10%
500 μl proteinase K (2 mg proteinase K in TE 10-2 / NaCl 0.4 M)
For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) is added. After vigorous agitation, the solution is centrifuged for 20 min at 10,000 rpm. For the precipitation of DNA, 2 to 3 volumes of 100 % ethanol are added to the previous supernatant, and the solution is centrifuged for 30 min at 2,000 rpm. The DNA solution is rinsed three times with 70 % ethanol to eliminate salts, and centrifuged for 20 min at 2,000 rpm. The pellet is dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml water. The DNA concentration is evaluated by measuring the OD at 260 nm (1 unit OD = 50 μg/ml DNA). To determine the presence of proteins in the DNA solution, the OD260/OD280 ratio is determined. Only DNA preparations having an OD260/OD280 ratio between 1.8 and 2 are used in the PCR assay.
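The two spectrophotometric checks described above can be expressed as a short sketch. The conversion factor (1 OD unit at 260 nm = 50 μg/ml DNA) and the 1.8 to 2 acceptance window come from the text; the `dilution_factor` parameter and the function names are assumptions added for illustration.

```python
UG_PER_ML_PER_OD260 = 50.0   # 1 OD unit at 260 nm = 50 ug/ml DNA (from the text)


def dna_concentration_ug_per_ml(od260: float, dilution_factor: float = 1.0) -> float:
    """DNA concentration of the stock, given the OD260 of a (possibly diluted) sample.

    dilution_factor is hypothetical: use 1.0 if the stock was read undiluted.
    """
    if od260 < 0 or dilution_factor <= 0:
        raise ValueError("OD260 must be >= 0 and dilution factor > 0")
    return od260 * UG_PER_ML_PER_OD260 * dilution_factor


def passes_purity_check(od260: float, od280: float,
                        low: float = 1.8, high: float = 2.0) -> bool:
    """True when the OD260/OD280 ratio falls within the accepted window."""
    if od280 <= 0:
        raise ValueError("OD280 must be positive")
    return low <= od260 / od280 <= high


if __name__ == "__main__":
    od260, od280 = 0.38, 0.20          # hypothetical readings on a 1:20 dilution
    print(dna_concentration_ug_per_ml(od260, dilution_factor=20))
    print(passes_purity_check(od260, od280))
```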
Then, a PCR assay is performed on genomic DNA with primers defining the biallelic marker. The PCR assay is performed as described above for BAC screening. The PCR products are analyzed on a 1% agarose gel containing 0.2 mg/ml ethidium bromide.
Example 5
Measurement of Linkage Disequilibrium
As originally reported by Strittmatter et al. and by Saunders et al. in 1993, the Apo E ε4 allele is strongly associated with both late-onset familial and sporadic Alzheimer's disease (Saunders, A.M. Lancet 342: 710-711 (1993) and Strittmatter, W.J. et al., Proc. Natl. Acad. Sci. U.S.A. 90: 1977-1981 (1993)). The 3 major isoforms of human Apolipoprotein E (apoE2, -E3, and -E4), as identified by isoelectric focusing, are coded for by 3 alleles (ε2, ε3, and ε4). The ε2, ε3, and ε4 isoforms differ in amino acid sequence at 2 sites, residue 112 (called site A) and residue 158 (called site B). The ancestral isoform of the protein is Apo E3, which at sites A/B contains cysteine/arginine, while ApoE2 and -E4 contain cysteine/cysteine and arginine/arginine, respectively (Weisgraber, K.H. et al., J. Biol. Chem. 256: 9077-9083 (1981); Rall, S.C. et al., Proc. Natl. Acad. Sci. U.S.A. 79: 4696-4700 (1982)).
Apo E ε4 is currently considered a major susceptibility risk factor for Alzheimer's disease development in individuals of different ethnic groups (especially in Caucasians and Japanese compared to Hispanics or African Americans), across all ages between 40 and 90 years, and in both men and women, as reported recently in a study performed on 5930 Alzheimer's disease patients and 8607 controls (Farrer et al., JAMA 278:1349-1356 (1997)). More specifically, the frequency of a C base coding for arginine 112 at site A is significantly increased in Alzheimer's disease patients.
Although the mechanistic link between Apo E ε4 and neuronal degeneration characteristic of Alzheimer's disease remains to be established, current hypotheses suggest that the Apo E genotype may influence neuronal vulnerability by increasing the deposition and/or aggregation of the amyloid beta peptide in the brain or by indirectly reducing energy availability to neurons by promoting atherosclerosis.
Using the methods of the present invention, biallelic markers that are in the vicinity of the Apo E site A were generated and the association of one of their alleles with Alzheimer's disease was analyzed. An Apo E public marker (stSG94) was used to screen a human genome BAC library as previously described. A BAC, which gave a unique FISH hybridization signal on chromosomal region 19q13.2.3, the chromosomal region harboring the Apo E gene, was selected for finding biallelic markers in linkage disequilibrium with the Apo E gene as follows.
This BAC contained an insert of 205 kb that was subcloned as previously described. Fifty BAC subclones were randomly selected and sequenced. Twenty five subclone sequences were selected and used to design twenty five pairs of PCR primers allowing 500 bp-amplicons to be generated. These PCR primers were then used to amplify the corresponding genomic sequences in a pool of DNA from 100 unrelated individuals (blood donors of French origin) as already described.
Amplification products from pooled DNA were sequenced and analyzed for the presence of biallelic polymorphisms, as already described. Five amplicons were shown to contain a polymorphic base in the pool of 100 unrelated individuals, and therefore these polymorphisms were selected as random biallelic markers in the vicinity of the Apo E gene.
The sequences of both alleles of these biallelic markers (99-344-439; 99-366-274; 99-359-308; 99-355-219; 99-365-344) correspond to SEQ ID Nos: 3909 to 3913. Corresponding pairs of amplification primers for generating amplicons containing these biallelic markers can be chosen from those listed as SEQ ID Nos: 7843 to 7847 and 11774 to 11778.
An additional pair of primers (SEQ ID Nos: 3124 and 4169) was designed that allows amplification of the genomic fragment carrying the biallelic polymorphism corresponding to the ApoE marker (99-2452-54; C/T; designated SEQ ID NO: 3914 in the accompanying Sequence Listing; publicly known as Apo E site A (Weisgraber et al. (1981), supra; Rall et al. (1982), supra)).
The five random biallelic markers plus the Apo E site A marker were physically ordered by PCR screening of the corresponding amplicons using all available BACs originally selected from the genomic DNA libraries, as previously described, using the public Apo E marker stSG94. The order of the amplicons derived from this BAC screening is as follows: (99-344-439/99-366-274) - (99-365-344/99-2452-54) - 99-359-308 - 99-355-219, where parentheses indicate that the exact order of the respective amplicons could not be established.
Linkage disequilibrium among the six biallelic markers (five random markers plus the Apo E site A) was determined by genotyping the same 100 unrelated individuals from whom the random biallelic markers were identified.
DNA samples and amplification products from genomic PCR were obtained under conditions similar to those described above for the generation of biallelic markers, and subjected to automated microsequencing reactions using fluorescent ddNTPs (specific fluorescence for each ddNTP) and the appropriate microsequencing primers having a 3' end immediately upstream of the polymorphic base in the biallelic markers. Once specifically extended at the 3' end by a DNA polymerase using the complementary fluorescent dideoxynucleotide analog (thermal cycling), the microsequencing primer was precipitated to remove the unincorporated fluorescent ddNTPs. The reaction products were analyzed by electrophoresis on ABI 377 sequencing machines. Results were automatically analyzed by appropriate software further described in Example 8.
Linkage disequilibrium (LD) between all pairs of biallelic markers (Mi, Mj) was calculated for every allele combination (Mi1,Mj1; Mi1,Mj2; Mi2,Mj1; Mi2,Mj2) according to the maximum likelihood estimate (MLE) for delta (the composite linkage disequilibrium coefficient). The results of the linkage disequilibrium analysis between the Apo E Site A marker and the five new biallelic markers (99-344-439; 99-355-219; 99-359-308; 99-365-344; 99-366-274) are summarized in Table 2 below:
Table 2
Markers          d x 100    SEQ ID Nos of the       SEQ ID Nos of the
                            biallelic markers       amplification primers
ApoE Site A                 1028                    3124
99-2452-54                  2076                    4169
99-344-439       1          1023                    3119
                            2071                    4164
99-366-274       1          1024                    3120
                            2072                    4165
99-365-344       8          1027                    3123
                            2075                    4168
99-359-308       2          1025                    3121
                            2073                    4166
99-355-219       1          1026                    3122
                            2074                    4167
The above linkage disequilibrium results indicate that among the five biallelic markers randomly selected in a region of about 200 kb containing the Apo E gene, marker 99-365-344T is in relatively strong linkage disequilibrium with the Apo E site A allele (99-2452-54C).
Therefore, since the Apo E site A allele is associated with Alzheimer's disease, one can predict that the T allele of marker 99-365-344 will probably be found associated with Alzheimer's disease. In order to test this hypothesis, the biallelic markers of SEQ ID Nos: 3909 to 3913 were used in association studies as described below.
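The composite linkage disequilibrium coefficient referred to above can be estimated directly from unphased genotypes. The sketch below uses the composite coefficient in the form commonly attributed to Weir; that this is the exact estimator behind Table 2 is an assumption, and the genotype coding and function name are hypothetical. Each individual is coded as a pair (gi, gj) giving the number of copies (0, 1 or 2) of the tracked allele at markers Mi and Mj.

```python
from collections import Counter


def composite_ld(genotypes) -> float:
    """Composite LD coefficient for two biallelic markers.

    delta = n_AB / n - 2 * p_i * p_j, where
    n_AB = 2*n(2,2) + n(2,1) + n(1,2) + 0.5*n(1,1)
    and p_i, p_j are the frequencies of the tracked alleles.
    """
    genotypes = list(genotypes)
    n = len(genotypes)
    if n == 0:
        raise ValueError("at least one genotyped individual is required")
    if any(g not in (0, 1, 2) for pair in genotypes for g in pair):
        raise ValueError("genotypes must be coded 0, 1 or 2")
    counts = Counter(genotypes)
    n_ab = (2 * counts[(2, 2)] + counts[(2, 1)]
            + counts[(1, 2)] + 0.5 * counts[(1, 1)])
    p_i = sum(gi for gi, _ in genotypes) / (2 * n)
    p_j = sum(gj for _, gj in genotypes) / (2 * n)
    return n_ab / n - 2 * p_i * p_j


if __name__ == "__main__":
    # Hypothetical genotypes for 10 individuals at two markers.
    sample = [(2, 2)] * 2 + [(1, 1)] * 4 + [(0, 0)] * 3 + [(1, 0)]
    print(round(100 * composite_ld(sample), 1))   # reported as d x 100
```

The coefficient is zero when the two markers vary independently and grows in magnitude as particular alleles co-occur more often than their frequencies predict.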
225 Alzheimer's disease patients were recruited according to clinical inclusion criteria based on the MMSE test. The 248 control cases included in this study were both ethnically- and age-matched to the affected cases. Both affected and control individuals corresponded to unrelated cases. The identities of the polymorphic bases of each of the biallelic markers were determined in each of these individuals using the methods described above. Techniques for conducting association studies are further described below.
The results of this study are summarized in Table 3 below:
Table 3
ASSOCIATION DATA
MARKER                      Difference in allele frequency      Corresponding p-value
                            between individuals with
                            Alzheimer's and control
                            individuals
99-344-439                  3.3%                                9.54 E-02
99-366-274                  1.6%                                2.09 E-01
99-365-344                  17.7%                               6.9 E-10
99-2452-54 (ApoE Site A)    23.8%                               3.95 E-21
99-359-308                  0.4%                                9.2 E-01
99-355-219                  2.5%                                2.54 E-01
The frequency of the Apo E site A allele in both Alzheimer's disease cases and controls was found to be in agreement with that previously reported (ca. 10% in controls and ca. 34% in Alzheimer's disease cases, leading to a 24% difference in allele frequency), thus validating the Apo E ε4 association in the populations used for this study.
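A case-control comparison of allele frequencies of the kind summarized in Table 3 can be illustrated with a Pearson chi-square test on a 2 x 2 table of allele counts. The specification does not state here which test produced the tabulated p-values, so this choice is an assumption, and the allele counts in the usage example are merely back-calculated from the approximate frequencies quoted above (225 cases and 248 controls, two chromosomes each). The sketch is therefore not expected to reproduce the tabulated p-values.

```python
import math


def allelic_association(case_allele: int, case_chromosomes: int,
                        ctrl_allele: int, ctrl_chromosomes: int):
    """Return (frequency difference, chi-square, p-value) for a 2 x 2 allele table.

    Pearson chi-square with 1 degree of freedom and no continuity correction;
    the survival function of chi-square(1) equals erfc(sqrt(x / 2)).
    """
    a, b = case_allele, case_chromosomes - case_allele
    c, d = ctrl_allele, ctrl_chromosomes - ctrl_allele
    if min(a, b, c, d) < 0:
        raise ValueError("allele counts cannot exceed chromosome counts")
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("degenerate table")
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    diff = a / (a + b) - c / (c + d)
    return diff, chi2, p_value


if __name__ == "__main__":
    # Illustrative counts only: ca. 34% of 450 case chromosomes and
    # ca. 10% of 496 control chromosomes carry the allele.
    diff, chi2, p = allelic_association(153, 450, 50, 496)
    print(f"difference = {diff:.1%}, chi2 = {chi2:.1f}, p = {p:.2e}")
```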
Moreover, as predicted from the linkage disequilibrium analysis (Table 2), a significant association of the T allele of marker 99-365-344 with Alzheimer's disease cases (18% increase in the T allele frequency in Alzheimer's disease cases compared to controls, p value for this difference = 6.9 E-10) was observed (Table 3).
The above results indicate that any marker in linkage disequilibrium with one given marker associated with a trait will be associated with the trait. It will be appreciated that, though in this case the ApoE Site A marker is the trait-causing allele (TCA) itself, the same conclusion could be drawn with any other non trait-causing allele marker associated with the studied trait.
These results further indicate that conducting association studies with a set of biallelic markers randomly generated within a candidate region at a sufficient density (here about one biallelic marker every 40 kb on average) allows the identification of at least one marker associated with the trait. In addition, these results correlate with the physical order of the six biallelic markers contemplated within the present example (see above): marker 99-365-344, which had been found to be the closest in terms of physical distance to the ApoE Site A marker, also shows the strongest linkage disequilibrium with the Apo E site A marker.
In order to further refine the relationship between physical distance and linkage disequilibrium between biallelic markers, a ca. 450 kb fragment from a genomic region on chromosome 8 was fully sequenced. LD within ca. 230 pairs of biallelic markers derived therefrom was measured in a random French population and analyzed as a function of the known physical inter-marker spacing. This analysis confirmed that, on average, linkage disequilibrium between 2 biallelic markers correlates with the physical distance that separates them. It further indicated that linkage disequilibrium between 2 biallelic markers tends to decrease when their spacing increases. More particularly, linkage disequilibrium between 2 biallelic markers tends to decrease when their inter-marker distance is greater than 50 kb, and is further decreased when the inter-marker distance is greater than 75 kb. It was further observed that when 2 biallelic markers were more than 150 kb apart, most often no significant linkage disequilibrium between them could be evidenced. It will be appreciated that the size and history of the sample population used to measure linkage disequilibrium between markers may influence the distance beyond which linkage disequilibrium tends not to be detectable. Assuming that linkage disequilibrium can be measured between markers spanning regions up to an average of 150 kb long, biallelic marker maps will allow genome-wide linkage disequilibrium mapping, provided they have an average inter-marker distance lower than 150 kb.
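The 150 kb threshold discussed above translates into a minimum number of evenly spaced markers once a genome size is fixed. The sketch below assumes a haploid genome of about 3 x 10^9 bp, a figure not stated in this passage, and uses hypothetical function names; the marker counts in the usage loop are illustrative.

```python
import math

GENOME_BP = 3.0e9   # assumed haploid human genome size, not taken from the text


def mean_spacing_kb(n_markers: int, genome_bp: float = GENOME_BP) -> float:
    """Average inter-marker distance in kb for n evenly distributed markers."""
    if n_markers <= 0:
        raise ValueError("number of markers must be positive")
    return genome_bp / n_markers / 1e3


def markers_needed(max_spacing_kb: float, genome_bp: float = GENOME_BP) -> int:
    """Smallest marker count whose average spacing does not exceed the limit."""
    if max_spacing_kb <= 0:
        raise ValueError("spacing must be positive")
    return math.ceil(genome_bp / (max_spacing_kb * 1e3))


if __name__ == "__main__":
    print(markers_needed(150))
    for n in (20_000, 40_000, 60_000, 120_000):   # illustrative map sizes
        print(n, f"{mean_spacing_kb(n):.0f} kb")
```

Real markers are not evenly spaced, so an average spacing at the threshold would still leave some intervals wider than 150 kb; denser maps reduce the number of such gaps.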
Example 6
Identification of a Candidate Region Harboring a Gene Associated with a Detectable Trait
The initial identification of a candidate genomic region harboring a gene associated with a detectable trait may be conducted using a genome-wide map comprising about 20,000 biallelic markers. The candidate genomic region may be further defined using a map having a higher marker density, such as a map comprising about 40,000 markers, about 60,000 markers, about 80,000 markers, about 100,000 markers, or about 120,000 markers. The use of high density maps such as those described above allows the identification of genes which are truly associated with detectable traits, since the coincidental associations will be randomly distributed along the genome while the true associations will map within one or more discrete genomic regions. Accordingly, biallelic markers located in the vicinity of a gene associated with a detectable trait will give rise to broad peaks in graphs plotting the frequencies of the biallelic markers in trait-positive individuals versus control individuals. In contrast, biallelic markers which are not in the vicinity of the gene associated with the detectable trait will produce unique points in such a plot. By determining the association of several markers within the region containing the gene associated with the detectable trait, the gene associated with the detectable trait can be identified using an association curve which reflects the difference between the allele frequencies within the trait-positive and control populations for each studied marker. The gene associated with the detectable trait will be found in the vicinity of the marker showing the highest association with the trait.
Figures 4, 5, and 6 provide a simulated illustration of the above principles. As illustrated in Figure 4, an association analysis conducted with a map comprising about 3,000 biallelic markers yields a group of points. However, when an association analysis is performed using a denser map which includes additional biallelic markers, the points become broad peaks indicative of the location of a gene associated with a detectable trait. For example, the biallelic markers used in the initial association analysis may be obtained from a map comprising about 20,000 biallelic markers, as illustrated by the simulation results shown in Figure 5. In some embodiments, one or more of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto are used in the association analysis.
In the simulated results of Figure 4, the association analysis with 3,000 markers suggests peaks near markers 9 and 17. Next, a second analysis is performed using additional markers in the vicinity of markers 9 and 17, as illustrated in the simulated results of Figure 5, using a map of about 20,000 markers. This step again indicates an association in the close vicinity of marker 17, since more markers in this region show an association with the trait. However, none of the additional markers around marker 9 shows a significant association with the trait, which makes marker 9 a potential false positive. In some embodiments, one or more of the biallelic markers selected from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto are used in the second analysis. In order to further test the validity of these two suspected associations, a third analysis may be obtained with a map comprising about 60,000 biallelic markers. In some embodiments, one or more of the biallelic markers selected from the group consisting of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto are used in the third association analysis. In the simulated results of Figure 6, more markers lying around marker 17 exhibit a high degree of association with the detectable trait. Conversely, no association is confirmed in the vicinity of marker 9. The genomic region surrounding marker 17 can thus be considered a candidate region for the potential trait of this simulation.
Example 7
Haplotype Analysis: Identification of biallelic markers delineating a genomic region associated with Alzheimer's Disease (AD)
As shown in Table 3 within Example 5, at an average map density of one marker per 40 kb only one marker (99-365-344) out of five random biallelic markers from a ca. 200 kb genomic region around the Apo E gene showed a clear association to Alzheimer's disease (delta allelic frequency in cases and controls = 18%; p value = 6.9 E-10). The allelic frequencies of the other four random markers were not significantly different between Alzheimer's disease cases and controls (p-values > E-01). However, since linkage disequilibrium can usually be detected between markers located further apart than an average 40 kb as previously discussed, one should expect that performing an association study with a local excerpt of a biallelic marker map covering ca. 200 kb with an average inter-marker distance of ca. 40 kb should allow the identification of more than one biallelic marker associated with Alzheimer's disease.
A haplotype analysis was thus performed using the biallelic markers 99-344-439; 99-355-219; 99-359-308; 99-365-344; and 99-366-274 (of SEQ ID Nos: 3909 to 3919).
In a first step, marker 99-365-344, which was already found to be associated with Alzheimer's disease, was not included in the haplotype study. Only biallelic markers 99-344-439, 99-355-219, 99-359-308, and 99-366-274, which did not show any significant association with Alzheimer's disease when taken individually, were used. This first haplotype analysis measured frequencies of all possible two-, three-, or four-marker haplotypes in the Alzheimer's disease case and control populations. As shown in Figure 7, there was one haplotype among all the potential different haplotypes based on the four individually nonsignificant markers ("haplotype 8", TAGG, comprising SEQ ID No. 3910 with the T allele of marker 99-366-274, SEQ ID No. 3909 with the A allele of marker 99-344-439, SEQ ID No. 3911 with the G allele of marker 99-359-308 and SEQ ID No. 3912 with the G allele of marker 99-355-219) that was present at statistically significantly different frequencies in the Alzheimer's disease case and control populations (D = 12%; p value = 2.05 E-06). Moreover, a significant difference was already observed for a three-marker haplotype included in the above mentioned "haplotype 8" ("haplotype 7", TGG, D = 10%; p value = 4.76 E-05). Haplotype 7 comprises SEQ ID No. 3910 with the T allele of marker 99-366-274, SEQ ID No. 3911 with the G allele of marker 99-359-308 and SEQ ID No. 3912 with the G allele of marker 99-355-219. The haplotype association analysis thus clearly increased the statistical power of the individual marker association studies by more than four orders of magnitude when compared to single-marker analysis, from p values > E-01 for the individual markers to a p value of about 2 E-06 for the four-marker "haplotype 8". See Table 3.
The significance of the values obtained for this haplotype association analysis was evaluated by the following computer simulation. The genotype data from the Alzheimer's disease cases and the unaffected controls were pooled and randomly allocated to two groups which contained the same number of individuals as the case/control groups used to produce the data summarized in Figure 7. A four-marker haplotype analysis (99-344-439; 99-355-219; 99-359-308; and 99-366-274) was run on these artificial groups. This experiment was reiterated 100 times and the results are shown in Figure 8. No haplotype among those generated was found for which the p-value of the frequency difference between both populations was more significant than 1 E-05. In addition, only 4% of the generated haplotypes showed p-values lower than 1 E-04. Since both these p-value thresholds are less significant than the 2 E-06 p-value shown by "haplotype 8", this haplotype can be considered significantly associated with Alzheimer's disease.
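By way of illustration only, the pooling and random re-allocation described above may be sketched as a label-permutation procedure. The sketch below assumes that each individual has already been reduced to a count (0, 1 or 2) of copies of the haplotype under test, which is a simplification, since haplotype frequencies would in practice be estimated from unphased genotypes. The function names, the allelic chi-square statistic and the fixed random seed are assumptions made for the example and are not taken from the procedure actually used.

```python
import math
import random


def allelic_p_value(case_counts, control_counts):
    """Chi-square (1 df) p-value comparing haplotype copy counts in two groups.

    Each entry is the number of copies (0, 1 or 2) carried by one individual.
    """
    a = sum(case_counts)                    # copies of the haplotype in cases
    b = 2 * len(case_counts) - a            # other chromosomes in cases
    c = sum(control_counts)                 # copies of the haplotype in controls
    d = 2 * len(control_counts) - c         # other chromosomes in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0                          # no variation: nothing to test
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def permutation_fraction(case_counts, control_counts, iterations=100, seed=0):
    """Return (observed p-value, fraction of random re-allocations that
    reach a p-value at least as significant as the observed one)."""
    rng = random.Random(seed)
    observed = allelic_p_value(case_counts, control_counts)
    pooled = list(case_counts) + list(control_counts)
    n_cases = len(case_counts)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)                 # random allocation to two groups
        p = allelic_p_value(pooled[:n_cases], pooled[n_cases:])
        if p <= observed:
            hits += 1
    return observed, hits / iterations
```

A small fraction returned by `permutation_fraction` would suggest, as in the simulation above, that the observed p-value is unlikely to arise from a random split of the pooled individuals.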
In a second step, marker 99-365-344 was included in the haplotype analyses. The frequency differences between the affected and non-affected populations were calculated for all two-, three-, four- or five-marker haplotypes involving markers 99-344-439; 99-355-219; 99-359-308; 99-366-274; and 99-365-344. The most significant p-values obtained in each category of haplotype (involving two, three, four or five markers) were examined depending on which markers were involved or not within the haplotype. This showed that all haplotypes which included marker 99-365-344 showed a significant association with Alzheimer's disease (p-values in the range of E-04 to E-11).
An additional way of evaluating the significance of the values obtained in the haplotype association analysis was to perform a similar Alzheimer's disease case-control study on biallelic markers generated from BACs containing inserts corresponding to genomic regions derived from chromosomes 13 or 21 and not known to be involved in Alzheimer's disease. Performing haplotype and individual association analyses similar to those described above and in Example 10 did not generate any significant association results (all p-values for haplotype analyses were less significant than E-03; all p-values for single marker association studies were less significant than E-02).
Example 8
Genotyping of biallelic markers using microsequencing procedures
Several microsequencing protocols conducted in liquid phase are well known to those skilled in the art. A first possible detection analysis allowing the allele characterization of the microsequencing reaction products relies on detecting fluorescent ddNTP-extended microsequencing primers after gel electrophoresis. A first alternative to this approach consists of performing a liquid phase microsequencing reaction, the analysis of which may be carried out in solid phase.
For example, the microsequencing reaction may be performed using 5'-biotinylated oligonucleotide primers and fluorescein-dideoxynucleotides. The biotinylated oligonucleotide is annealed to the target nucleic acid sequence immediately adjacent to the polymorphic nucleotide position of interest. It is then specifically extended at its 3'-end following a PCR cycle, wherein the labeled dideoxynucleotide analog complementary to the polymorphic base is incorporated. The biotinylated primer is then captured on a microtiter plate coated with streptavidin. The analysis is thus entirely carried out in a microtiter plate format. The incorporated ddNTP is detected by a fluorescein antibody - alkaline phosphatase conjugate.
In practice this microsequencing analysis is performed as follows. 20 μl of the microsequencing reaction is added to 80 μl of capture buffer (SSC 2X, 2.5% PEG 8000, 0.25 M Tris pH 7.5, 1.8% BSA, 0.05% Tween 20) and incubated for 20 minutes on a microtiter plate coated with streptavidin (Boehringer). The plate is rinsed once with washing buffer (0.1 M Tris pH 7.5, 0.1 M NaCl, 0.1% Tween 20). 100 μl of anti-fluorescein antibody conjugated with alkaline phosphatase, diluted 1/5000 in washing buffer containing 1.8% BSA, is added to the microtiter plate. The antibody is incubated on the microtiter plate for 20 minutes. After washing the microtiter plate four times, 100 μl of 4-methylumbelliferyl phosphate (Sigma) diluted to 0.4 mg/ml in 0.1 M diethanolamine pH 9.6, 10 mM MgCl2 are added. The detection of the microsequencing reaction is carried out on a fluorimeter (Dynatech) after 20 minutes of incubation.
As another alternative, solid phase microsequencing reactions have been developed, for which either the oligonucleotide microsequencing primers or the PCR-amplified products derived from the DNA fragment of interest are immobilized. For example, immobilization can be carried out via an interaction between biotinylated DNA and streptavidin-coated microtitration wells or avidin-coated polystyrene particles.
As a further alternative, the PCR reaction generating the amplicons to be genotyped can be performed directly in solid phase conditions, following procedures such as those described in WO 96/13609.
In such solid phase microsequencing reactions, incorporated ddNTPs can either be radiolabeled (see Syvanen, Clin. Chim. Acta 226:225-236 (1994)) or linked to fluorescein (see Livak and Hainer, Hum. Mutat. 3:379-385 (1994)). The detection of radiolabeled ddNTPs can be achieved through scintillation-based techniques. The detection of fluorescein-linked ddNTPs can be based on the binding of antifluorescein antibody conjugated with alkaline phosphatase, followed by incubation with a chromogenic substrate (such as p-nitrophenyl phosphate).
Other possible reporter-detection couples for use in the above microsequencing procedures include:
- ddNTP linked to dinitrophenyl (DNP) and anti-DNP alkaline phosphatase conjugate (see Harju et al., Clin. Chem. 39(11 Pt 1):2282-2287 (1993))
- biotinylated ddNTP and horseradish peroxidase-conjugated streptavidin with o-phenylenediamine as a substrate (see WO 92/15712).
A diagnosis kit based on fluorescein-linked ddNTP with antifluorescein antibody conjugated with alkaline phosphatase has been commercialized under the name PRONTO by GamidaGen Ltd.
As yet another alternative microsequencing procedure, Nyren et al. (Anal. Biochem. 208:171-175 (1993)) have described a solid-phase DNA sequencing procedure that relies on the detection of DNA polymerase activity by an enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA). In this procedure, the PCR-amplified products are biotinylated and immobilized on beads. The microsequencing primer is annealed and four aliquots of this mixture are separately incubated with DNA polymerase and one of the four different ddNTPs. After the reaction, the resulting fragments are washed and used as substrates in a primer extension reaction with all four dNTPs present. The progress of the DNA-directed polymerization reactions is monitored with the ELIDA. Incorporation of a ddNTP in the first reaction prevents the formation of pyrophosphate during the subsequent dNTP reaction. In contrast, no ddNTP incorporation in the first reaction gives extensive pyrophosphate release during the dNTP reaction and this leads to generation of light throughout the ELIDA reactions. From the ELIDA results, the identity of the first base after the primer is easily deduced.
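The deduction step described above may be sketched as follows, purely as an illustration. The sketch assumes one luminescence reading per aliquot, keyed by the ddNTP used in the first reaction, and assumes that the aliquot in which the ddNTP was incorporated is the one giving markedly less light than the others. The function name and the `ratio_threshold` value are hypothetical, and the sketch does not attempt to handle heterozygous samples, in which two aliquots would each be partially blocked.

```python
def deduce_first_base(light_by_ddntp, ratio_threshold=0.5):
    """Return the ddNTP whose aliquot shows suppressed light, or None.

    light_by_ddntp maps 'A', 'C', 'G', 'T' to the luminescence measured in
    the dNTP extension reaction of the aliquot pre-incubated with that ddNTP.
    An incorporated ddNTP blocks extension, so little pyrophosphate (and
    therefore little light) is produced in that aliquot.
    """
    lowest = min(light_by_ddntp, key=light_by_ddntp.get)
    others = [v for k, v in light_by_ddntp.items() if k != lowest]
    reference = sum(others) / len(others)   # mean light of unblocked aliquots
    if reference <= 0:
        return None                         # no usable signal at all
    if light_by_ddntp[lowest] > ratio_threshold * reference:
        return None                         # no aliquot is clearly suppressed
    return lowest
```

The returned letter is the base added to the primer, that is, the complement of the template base at the first position after the primer.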
It will be appreciated that several parameters of the above-described microsequencing procedures may be successfully modified by those skilled in the art without undue experimentation. In particular, high throughput improvements to these procedures may be elaborated, following principles such as those described further below.
Example 9
Sequence Analysis
DNA sequences, such as BAC inserts, containing the region carrying the candidate gene associated with the detectable trait are sequenced and their sequence is analyzed using automated software which eliminates repeat sequences while retaining potential gene sequences. The potential gene sequences are compared to numerous databases to identify potential exons using a set of scoring algorithms such as trained Hidden Markov Models, statistical analysis models (including promoter prediction tools) and the GRAIL neural network. Preferred databases for use in this analysis, the construction and use of which are further detailed in Example 17, include the following:
NRPU (Non-Redundant Protein-Unique) database: NRPU is a non-redundant merge of the publicly available NBRF/PIR, Genpept, and SwissProt databases. Homologies found with NRPU allow the identification of regions potentially coding for already known proteins or related to known proteins (translated exons).
NREST (Non-Redundant EST database): NREST is a merge of the EST subsection of the publicly available GenBank database. Homologies found with NREST allow the location of potentially transcribed regions (translated or non-translated exons).
NRN (Non-Redundant Nucleic acid database): NRN is a merge of GenBank, EMBL and their daily updates.
Any sequence giving a positive hit with NRPU or NREST, or an "excellent" score using GRAIL and/or other scoring algorithms, is considered a potential functional region, and is then considered a candidate for genomic analysis.
While this first screening allows the detection of the "strongest" exons, a semi-automatic scan is further applied to the remaining sequences in the context of the sequence assembly. That is, the sequences neighboring a 5' site or an exon are submitted to another round of bioinformatics analysis with modified parameters. In this way, new exon candidates are generated for genomic analysis.
Using the above procedures, genes associated with detectable traits may be identified.
Example 10
YAC Contig Construction in the Candidate Genomic Region
Substantial amounts of LOH data supported the hypothesis that genes associated with distinct cancer types are located within a particular region of the human genome. More specifically, this region was likely to harbor a gene associated with prostate cancer.
Association studies were performed as described below in order to identify this prostate cancer gene. First, a YAC contig which contains the candidate genomic region was constructed as follows. The CEPH-Genethon YAC map for the entire human genome (Chumakov et al. (1995), supra) was used for detailed contig building in the genomic region containing genetic markers known to map in the candidate genomic region. Screening data available for several publicly available genetic markers were used to select a set of CEPH YACs localized within the candidate region. This set of YACs was tested by PCR with the above mentioned genetic markers as well as with other publicly available markers supposedly located within the candidate region. As a result of these studies, a YAC STS contig map was generated around genetic markers known to map in this genomic region. Two CEPH YACs were found to constitute a minimal tiling path in this region, with an estimated size of ca. 2 Megabases.
During this mapping effort, several publicly known STS markers were precisely located within the contig. Example 11 below describes the identification of sets of biallelic markers within the candidate genomic region.
Example 11
BAC contig construction and Biallelic Markers isolation within the candidate chromosomal region
Next, a BAC contig covering the candidate genomic region was constructed as follows. BAC libraries were obtained as described in Woo et al., Nucleic Acids Res. 22:4922-4931 (1994). Briefly, the two whole human genome BamHI and HindIII libraries already described in related WIPO application No. PCT/IB98/00193 were constructed using the pBeloBAC11 vector (Kim et al. (1996), supra).
The BAC libraries were then screened with all of the above mentioned STSs, following the procedure described in Example 1 above.
The ordered BACs selected by STS screening and verified by FISH were assembled into contigs and new markers were generated by partial sequencing of insert ends from some of them. These markers were used to fill the gaps in the contig of BAC clones covering the candidate chromosomal region having an estimated size of 2 megabases. Figure 9 illustrates a minimal array of overlapping clones which was chosen for further studies, and the positions of the publicly known STS markers along said contig.
Selected BAC clones from the contig were subcloned and sequenced, essentially following the procedures described in related WIPO application No. PCT/IB98/00193. Biallelic markers lying along the contig were identified following the processes described in related WIPO application No. PCT/IB98/00193.
Figure 9 shows the locations of the biallelic markers along the BAC contig. This first set of markers corresponds to a medium density map of the candidate locus, with an intermarker distance averaging 50kb-150kb.
A second set of biallelic markers was then generated as described above in order to provide a very high-density map of the region identified using the first set of markers which can be used to conduct association studies, as explained below. This very high density map has markers spaced on average every 2-50kb.
The biallelic markers were then used in association studies. DNA samples were obtained from individuals suffering from prostate cancer and unaffected individuals as described in Example 12.
Example 12
Collection of DNA Samples from Affected and Non-affected Individuals
Prostate cancer patients were recruited according to clinical inclusion criteria based on pathological or radical prostatectomy records. Control cases included in this study were both ethnically- and age-matched to the affected cases; they were checked for both the absence of all clinical and biological criteria defining the presence or the risk of prostate cancer, and for the absence of related familial prostate cancer cases. Both affected and control individuals were all unrelated.
The two following groups of independent individuals were used in the association studies. The first group, comprising individuals suffering from prostate cancer, contained 185 individuals. Of these 185 cases of prostate cancer, 47 cases were sporadic and 138 cases were familial. The control group contained 104 non-diseased individuals.
Haplotype analysis was conducted using additional diseased (total samples: 281) and control samples (total samples: 130), from individuals recruited according to similar criteria. DNA was extracted from peripheral venous blood of all individuals as described in related WIPO application No. PCT/IB98/00193.
The frequencies of the biallelic markers in each population were determined as described in Example 13.
Example 13
Genotyping Affected and Control Individuals
Genotyping was performed using the following microsequencing procedure.
Amplification was performed on each DNA sample using primers designed as previously explained. The pairs of primers of SEQ ID Nos.: 7849 to 7860 and 11780 to 11791 were used to generate amplicons harboring the biallelic markers of SEQ ID Nos: 3915 to 3926 or the sequences complementary thereto (markers 99-123-381, 4-26-29, 4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, 99-135-196, 99-1482-32, 4-73-134, and 4-65-324) using the protocols described in related WIPO application No. PCT/IB98/00193.
Microsequencing primers were designed for each of the biallelic markers, as previously described. After purification of the amplification products, the microsequencing reaction mixture was prepared by adding, in a 20 μl final volume: 10 pmol microsequencing oligonucleotide, 1 U Thermosequenase (Amersham E79000G), 1.25 μl Thermosequenase buffer (260 mM Tris HCl pH 9.5, 65 mM MgCl2), and the two appropriate fluorescent ddNTPs (Perkin Elmer, Dye Terminator Set 401095) complementary to the nucleotides at the polymorphic site of each biallelic marker tested, following the manufacturer's recommendations. After 4 minutes at 94°C, 20 PCR cycles of 15 sec at 55°C, 5 sec at 72°C, and 10 sec at 94°C were carried out in a Tetrad PTC-225 thermocycler (MJ Research). The unincorporated dye terminators were then removed by ethanol precipitation. Samples were finally resuspended in formamide-EDTA loading buffer and heated for 2 min at 95°C before being loaded on a polyacrylamide sequencing gel. The data were collected by an ABI PRISM 377 DNA sequencer and processed using the GENESCAN software (Perkin Elmer).
Following gel analysis, data were automatically processed with software that allows the determination of the alleles of biallelic markers present in each amplified fragment. The software evaluates such factors as whether the intensities of the signals resulting from the above microsequencing procedures are weak, normal, or saturated, or whether the signals are ambiguous. In addition, the software identifies significant peaks (according to shape and height criteria). Among the significant peaks, peaks corresponding to the targeted site are identified based on their position. When two significant peaks are detected for the same position, each sample is categorized as homozygous or heterozygous based on the height ratio.
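The final classification step may be illustrated by the following minimal sketch. The software referred to above is not described in further detail, so the function name, the minimum peak height and the height-ratio cut-off used here are hypothetical values chosen only to show the principle of a ratio-based call.

```python
def call_genotype(height_1, height_2, allele_1, allele_2,
                  homozygous_ratio=0.2, min_height=50.0):
    """Classify a sample from the peak heights of its two possible alleles.

    height_1 and height_2 are the signal heights at the targeted position
    for allele_1 and allele_2. Both thresholds are illustrative assumptions.
    Returns a two-letter genotype string, or None when the signal is too weak.
    """
    top = max(height_1, height_2)
    if top < min_height:
        return None                          # weak signal: no call
    ratio = min(height_1, height_2) / top    # minor peak relative to major peak
    if ratio < homozygous_ratio:
        # The minor peak is treated as background: homozygous for the major allele.
        major = allele_1 if height_1 >= height_2 else allele_2
        return major + major
    return allele_1 + allele_2               # both peaks significant: heterozygous
```

For instance, under these assumed thresholds, heights of 1000 and 50 for alleles C and T would be called `CC`, while heights of 800 and 700 would be called `CT`.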
Association analyses were then performed using the biallelic markers as described below.
Example 14
Association Analysis
Association studies were run in two successive steps. In a first step, a rough localization of the candidate gene was achieved by determining the frequencies of the biallelic markers of Figure 9 in the affected and unaffected populations. The results of this rough localization are shown in Figure 10. This analysis indicated that a gene responsible for prostate cancer was located near the biallelic marker designated 4-67. In a second phase of the analysis, the position of the gene responsible for prostate cancer was further refined using the very high density set of markers including the markers of SEQ ID Nos: 3915 to 3926 or the sequences complementary thereto (markers 99-123-381, 4-26-29, 4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, 99-135-196, 99-1482-32, 4-73-134, and 4-65-324).
As shown in Figure 11, the second phase of the analysis confirmed that the gene responsible for prostate cancer was near the biallelic marker designated 4-67-40, most probably within a ca. 150kb region comprising the marker. A haplotype analysis was also performed as described in Example 15.
Example 15
Haplotype analysis
The allelic frequencies of each of the alleles of biallelic markers 99-123-381, 4-26-29, 4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, and 99-135-196 were determined in the affected and unaffected populations. Table 4 lists the internal identification numbers of the markers used in the haplotype analysis (SEQ ID Nos: 3915-3923), the alleles of each marker, the most frequent allele in both unaffected individuals and individuals suffering from prostate cancer, the least frequent allele in both unaffected individuals and individuals suffering from prostate cancer, and the frequencies of the least frequent alleles in each population.
Table 4

Markers        Polymorphic base*    Frequency of least frequent allele**
                                    Cases       Controls
99-123-381     C/T                  0.35        0.3
4-26-29        A/G                  0.39        0.45
4-14-240       C/T                  0.35        0.41
4-77-151       C/G                  0.33        0.24
99-217-277     C/T                  0.31        0.23
4-67-40        C/T                  0.26        0.16
99-213-164     T/C                  0.45        0.38
99-221-377     C/A                  0.43        0.43
99-135-196     A/G                  0.25        0.3

*most frequent allele/least frequent allele
**standard deviations: 0.023 to 0.031 for controls; 0.018 to 0.021 for cases
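As a rough illustration of how the frequencies of Table 4 may be compared between the two populations, the following sketch tabulates the case-control difference for each marker and an approximate allelic chi-square p-value. The group sizes of 185 cases and 104 controls are taken from Example 12 and are an assumption here, since the table does not state which sample set it summarizes; allele counts are reconstructed from the rounded frequencies, so the resulting p-values are indicative only. The function names are hypothetical.

```python
import math

# Table 4: marker -> (least frequent allele frequency in cases, in controls)
TABLE_4 = {
    "99-123-381": (0.35, 0.30),
    "4-26-29": (0.39, 0.45),
    "4-14-240": (0.35, 0.41),
    "4-77-151": (0.33, 0.24),
    "99-217-277": (0.31, 0.23),
    "4-67-40": (0.26, 0.16),
    "99-213-164": (0.45, 0.38),
    "99-221-377": (0.43, 0.43),
    "99-135-196": (0.25, 0.30),
}
N_CASES, N_CONTROLS = 185, 104  # assumed group sizes (Example 12)


def allelic_chi2(freq_cases, freq_controls, n_cases, n_controls):
    """Return (chi-square, p-value) for a 2x2 allele-count table, 1 df."""
    a = freq_cases * 2 * n_cases            # least frequent allele, cases
    b = 2 * n_cases - a                     # most frequent allele, cases
    c = freq_controls * 2 * n_controls      # least frequent allele, controls
    d = 2 * n_controls - c                  # most frequent allele, controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def summarize(table=TABLE_4, n_cases=N_CASES, n_controls=N_CONTROLS):
    """Return a list of (marker, frequency difference, chi-square, p-value)."""
    rows = []
    for marker, (f_cases, f_controls) in table.items():
        chi2, p = allelic_chi2(f_cases, f_controls, n_cases, n_controls)
        rows.append((marker, f_cases - f_controls, chi2, p))
    return rows


if __name__ == "__main__":
    for marker, delta, chi2, p in summarize():
        print(f"{marker:12s} delta={delta:+.2f} chi2={chi2:.2f} p={p:.3g}")
```

Under these assumptions the largest frequency difference is obtained for marker 4-67-40, which is consistent with the localization reported in Example 14.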
Among all the theoretically possible haplotypes based on 2 to 9 markers, 11 haplotypes showing a strong association with prostate cancer were selected. The results of these haplotype analyses are shown in Figure 12.
Figures 11 and 12 aggregate association analysis results with sequencing results generated following the procedures further described in Example 16, which permitted the physical order and the distance between markers to be estimated.
The significance of the values obtained in Figure 12 is underscored by the following results of computer simulations. For the computer simulations, the data from the affected individuals and the unaffected controls were pooled and randomly allocated to two groups which contained the same number of individuals as the affected and unaffected groups used to compile the data summarized in Figure 12. A haplotype analysis was run on these artificial groups for the six markers included in haplotype 5 of Figure 12. This experiment was reiterated 100 times and the results are shown in Figure 13. Among 100 iterations, only 5% of the obtained haplotypes had a p-value more significant than E-04, as compared to the p-value of 9E-07 for haplotype 5 of Figure 12. Furthermore, for haplotype 5 of Figure 12, only 6% of the obtained haplotypes have a significance level below 5E-03, while none of them show a significance level below 5E-03.
Thus, using the data of Figure 13 and evaluating the associations for single marker alleles or for haplotypes will permit estimation of the risk a corresponding carrier has to develop prostate cancer. It will be appreciated that significance thresholds of relative risks will be more finely assessed according to the population tested. Diagnostic techniques for determining an individual's risk of developing prostate cancer may be implemented as described below for the markers in the maps of the present invention, including the markers of SEQ ID Nos: 3915 to 3923 (markers 99-123-381, 4-26-29, 4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, and 99-135-196).
The above haplotype analysis indicated that 171 kb of genomic DNA between biallelic markers 4-14-240 and 99-221-377 totally or partially contains a gene responsible for prostate cancer. Therefore, the protein coding sequences lying within this region were characterized to locate the gene associated with prostate cancer. This analysis, described in further detail below, revealed a single protein coding sequence in the 171 kb genomic region, which was designated as the PG1 gene.
Example 16
Identification of the Genomic Sequence in the Candidate Region
Template DNA for sequencing the PG1 gene was obtained as follows. BACs E and F from Fig. 9 were subcloned as previously described. Plasmid inserts were first amplified by PCR on PE 9600 thermocyclers (Perkin-Elmer), using appropriate primers, AmpliTaqGold (Perkin-Elmer), dNTPs (Boehringer), buffer and cycling conditions as recommended by the Perkin-Elmer Corporation.
PCR products were then sequenced using automatic ABI Prism 377 sequencers (Perkin Elmer, Applied Biosystems Division, Foster City, CA). Sequencing reactions were performed using PE 9600 thermocyclers (Perkin Elmer) with standard dye-primer chemistry and ThermoSequenase (Amersham Life Science). The primers were labeled with the JOE, FAM, ROX and TAMRA dyes. The dNTPs and ddNTPs used in the sequencing reactions were purchased from Boehringer. Sequencing buffer, reagent concentrations and cycling conditions were as recommended by Amersham. Following the sequencing reaction, the samples were precipitated with EtOH, resuspended in formamide loading buffer, and loaded on a standard 4% acrylamide gel. Electrophoresis was performed for 2.5 hours at 3000V on an ABI 377 sequencer, and the sequence data were collected and analyzed using the ABI Prism DNA Sequencing Analysis Software, version 2.1.2.
The sequence data obtained as described above were transferred to a proprietary database, where quality control and validation steps were performed. A proprietary base-caller flagged suspect peaks, taking into account the shape of the peaks, the inter-peak resolution, and the noise level. The proprietary base-caller also performed an automatic trimming. Any stretch of 25 or fewer bases having more than 4 suspect peaks was considered unreliable and was discarded.
The sequence fragments from BAC subclones isolated as described above were assembled using Gap4 software from R. Staden (Bonfield et al. 1995). This software allows the reconstruction of a single sequence from sequence fragments. The sequence deduced from the alignment of different fragments is called the consensus sequence. Directed sequencing techniques (primer walking) were used to complete sequences and link contigs.
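The trimming rule stated above (any stretch of 25 or fewer bases having more than 4 suspect peaks is discarded) may be sketched as follows. The proprietary base-caller is not described, so this is only one possible reading of the rule: the stretch running from the first to the last of any five suspect peaks that fall within 25 bases is masked, and the remaining reliable segments are kept. The function names and the per-base boolean `suspect` flags are assumptions made for the example.

```python
def unreliable_mask(suspect, max_len=25, max_suspect=4):
    """Mark bases lying in a stretch of at most max_len bases that contains
    more than max_suspect suspect peaks.

    suspect is a sequence of booleans, one per base call.
    """
    positions = [i for i, flag in enumerate(suspect) if flag]
    mask = [False] * len(suspect)
    # Look at each run of (max_suspect + 1) consecutive suspect peaks.
    for i in range(len(positions) - max_suspect):
        start = positions[i]
        end = positions[i + max_suspect]
        if end - start + 1 <= max_len:
            for j in range(start, end + 1):
                mask[j] = True
    return mask


def reliable_segments(sequence, suspect, max_len=25, max_suspect=4):
    """Split a read into the segments that remain after discarding
    the unreliable stretches."""
    mask = unreliable_mask(suspect, max_len, max_suspect)
    segments, current = [], []
    for base, bad in zip(sequence, mask):
        if bad:
            if current:
                segments.append("".join(current))
                current = []
        else:
            current.append(base)
    if current:
        segments.append("".join(current))
    return segments
```

With this reading, five suspect peaks spread over more than 25 bases leave the read untouched, whereas five suspect peaks clustered within a few bases split the read into the flanking reliable segments.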
Potential functional sequences were then identified as described in Example 17.
Example 17
Identification of Functional Sequences
Potential exons in BAC-derived human genomic sequences were located by homology searches on protein, nucleic acid and EST (Expressed Sequence Tags) public databases. Main public databases were locally reconstructed as mentioned in Example 9. The protein database, NRPU (Non-redundant Protein Unique), is formed by a non-redundant fusion of the Genpept (Benson et al., Nucleic Acids Res. 24:1-5 (1996)), Swissprot (Bairoch, A. and Apweiler, R., Nucleic Acids Res. 24:21-25 (1996)) and PIR/NBRF (George et al., Nucleic Acids Res. 24:17-20 (1996)) databases. Redundant data were eliminated by using the NRDB software (Benson et al. (1996), supra) and internal repeats were masked with the XNU software (Benson et al., supra). Homologies found using the NRPU database allowed the identification of sequences corresponding to potential coding exons related to known proteins.
The EST local database is composed of the gbest section (1-9) of GenBank (Benson et al. (1996), supra), and thus contains all publicly available transcript fragments. Homologies found with this database allowed the localization of potentially transcribed regions.
The local nucleic acid database contained all sections of GenBank and EMBL (Rodriguez-Tome et al., Nucleic Acids Res. 24:6-12 (1996)) except the EST sections. Redundant data were eliminated as previously described.
Similarity searches in protein or nucleic acid databases were performed using the BLAST software (Altschul et al., J. Mol. Biol. 215:403-410 (1990)). Alignments were refined using the Fasta software, and multiple alignments used Clustal W. Homology thresholds were adjusted for each analysis based on the length and the complexity of the tested region, as well as on the size of the reference database.
Potential exon sequences identified as above were used as probes to screen cDNA libraries. Extremities of positive clones were sequenced and the sequence stretches were positioned on the genomic sequence determined above. Primers were then designed using the results from these alignments in order to enable the cloning of cDNAs derived from the gene associated with prostate cancer that was identified using the above procedures. The obtained cDNA molecules were then sequenced and results of Northern blot analysis of prostate mRNAs supported the existence of a major cDNA having a 5-6kb length. The structure of the gene associated with prostate cancer was evaluated as described in Example 18.
Example 18
Analysis of Gene Structure
The intron/exon structure of the gene was then fully deduced by aligning the mRNA sequence from the cDNA obtained as described above and the genomic DNA sequence obtained as described above. This alignment permitted the determination, in the genomic sequence, of the positions of the introns and exons, the positions of the start and end nucleotides defining each of the at least 8 exons, the locations and phases of the 5' and 3' splice sites, the position of the stop codon, and the position of the polyadenylation site.
This analysis also yielded the positions of the coding region in the mRNA, and the locations of the polyadenylation signal and polyA stretch in the mRNA.
The gene identified as described above comprises at least 8 exons and spans more than 52kb. A G/C rich putative promoter region was identified upstream of the coding sequence. A CCAAT in the putative promoter was also identified. The promoter region was identified as described in Prestridge, D.S., Predicting Pol II Promoter Sequences Using Transcription Factor Binding Sites, J. Mol. Biol. 249:923-932 (1995).
Additional analysis using conventional techniques, such as a 5' RACE reaction using the Marathon-Ready human prostate cDNA kit from Clontech (Catalog. No. PT1156-1), may be performed to confirm that the 5' end of the cDNA obtained above is the authentic 5' end in the mRNA.
Alternatively, the 5' sequence of the transcript can be determined by conducting a PCR amplification with a series of primers extending from the 5' end of the identified coding region.
Example 19
Detection of biallelic markers in the candidate gene: DNA extraction
Donors were unrelated and healthy. They presented sufficient diversity to be representative of a heterogeneous French population. The DNA from 100 individuals was extracted and tested for the detection of the biallelic markers.
30 ml of peripheral venous blood were taken from each donor in the presence of EDTA. Cells (pellet) were collected after centrifugation for 10 minutes at 2000 rpm. Red cells were lysed by a lysis solution (50 ml final volume: 10 mM Tris pH 7.6; 5 mM MgCl2; 10 mM NaCl). The solution was centrifuged (10 minutes, 2000 rpm) as many times as necessary to eliminate the residual red cells present in the supernatant, after resuspension of the pellet in the lysis solution. The pellet of white cells was lysed overnight at 42°C with 3.7 ml of lysis solution composed of:
- 3 ml TE 10-2 (Tris-HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
- 200 μl SDS 10%
- 500 μl proteinase K (2 mg proteinase K in TE 10-2 / NaCl 0.4 M).
For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) was added. After vigorous agitation, the solution was centrifuged for 20 minutes at 10000 rpm. For the precipitation of DNA, 2 to 3 volumes of 100% ethanol were added to the previous supernatant, and the solution was centrifuged for 30 minutes at 2000 rpm. The DNA solution was rinsed three times with 70% ethanol to eliminate salts, and centrifuged for 20 minutes at 2000 rpm. The pellet was dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml water. The DNA concentration was evaluated by measuring the OD at 260 nm (1 unit OD = 50 μg/ml DNA).
To determine the presence of proteins in the DNA solution, the OD 260 / OD 280 ratio was determined. Only DNA preparations having a OD 260 / OD 280 ratio between 1.8 and 2 were used in the subsequent examples described below.
The pool was constituted by mixing equivalent quantities of DNA from each individual.
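The quantification, purity check and pooling steps of this example can be sketched as follows. This is a minimal illustration only: the function names, the optional dilution factor and the sample values are assumptions made for the sketch, while the conversion factor (1 OD unit = 50 μg/ml) and the 1.8 to 2 acceptance window are taken from the text above.

```python
def dna_conc_ug_per_ml(od260, dilution_factor=1.0):
    """DNA concentration from absorbance, using 1 OD260 unit = 50 ug/ml."""
    return od260 * 50.0 * dilution_factor


def passes_purity(od260, od280, low=1.8, high=2.0):
    """True when the OD260/OD280 ratio lies in the accepted window."""
    return low <= od260 / od280 <= high


def pool_volumes_ul(conc_by_sample, mass_ug_each):
    """Volume (ul) of each sample that contributes the same DNA mass to the pool.

    conc_by_sample maps a sample name to its concentration in ug/ml.
    """
    return {name: 1000.0 * mass_ug_each / conc
            for name, conc in conc_by_sample.items()}


# Hypothetical readings for two donors (not data from this document).
readings = {"donor_1": (4.0, 2.1), "donor_2": (5.0, 2.6)}
accepted = {name: dna_conc_ug_per_ml(od260)
            for name, (od260, od280) in readings.items()
            if passes_purity(od260, od280)}
print(pool_volumes_ul(accepted, mass_ug_each=1.0))
```

Pooling by mass rather than by volume keeps each individual's contribution to the pool equal, which is what "equivalent quantities of DNA" requires when sample concentrations differ.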
Example 20
Detection of the biallelic markers: amplification of genomic DNA by PCR
The amplification of specific genomic sequences of the DNA samples of Example 19 was carried out on the pool of DNA obtained previously using the amplification primers of SEQ ID Nos: 7861 to 7865 and 11792 to 11796. In addition, 50 individual samples were similarly amplified.
PCR assays were performed using the following protocol:
Final volume: 25 μl
DNA: 2 ng/μl
MgCl2: 2 mM
dNTP (each): 200 μM
Primer (each): 2.9 ng/μl
Ampli Taq Gold DNA polymerase: 0.05 unit/μl
PCR buffer (10x = 0.1 M Tris-HCl pH 8.3, 0.5 M KCl): 1x
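The final concentrations listed above translate into per-reaction amounts and pipetting volumes by simple proportionality. The sketch below is illustrative: the helper names are hypothetical, and the stock concentrations used in the usage lines (25 mM MgCl2, 5 unit/μl polymerase) are assumptions that do not come from this document.

```python
FINAL_VOLUME_UL = 25.0  # final reaction volume from the protocol


def amount_per_reaction(final_conc_per_ul, final_volume_ul=FINAL_VOLUME_UL):
    """Total amount in one reaction (e.g. ng or units) from a per-ul concentration."""
    return final_conc_per_ul * final_volume_ul


def stock_volume_ul(final_conc, stock_conc, final_volume_ul=FINAL_VOLUME_UL):
    """Volume of stock needed so that C_stock * V_stock = C_final * V_final.

    final_conc and stock_conc must be expressed in the same unit.
    """
    return final_conc * final_volume_ul / stock_conc


print("template DNA (ng):", amount_per_reaction(2.0))
print("each primer (ng):", amount_per_reaction(2.9))
print("polymerase (units):", amount_per_reaction(0.05))
print("10x buffer (ul):", stock_volume_ul(1.0, 10.0))
# Assumed stock concentrations, for illustration only:
print("25 mM MgCl2 stock (ul):", stock_volume_ul(2.0, 25.0))
print("5 unit/ul polymerase stock (ul):", stock_volume_ul(0.05, 5.0))
```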
Pairs of first primers were designed to amplify the promoter region, exons, and 3' end of the candidate asthma-associated gene using the sequence information of the candidate gene and the OSP software (Hillier & Green, 1991). These first primers were about 20 nucleotides in length and contained a common oligonucleotide tail upstream of the specific bases targeted for amplification which was useful for sequencing. The synthesis of these primers was performed following the phosphoramidite method, on a GENSET UFPS 24.1 synthesizer.
DNA amplification was performed on a Genius II thermocycler. After heating at 94°C for 10 min, 40 cycles were performed. Each cycle comprised: 30 sec at 94 °C, 55°C for 1 min, and 30 sec at 72°C. For final elongation, 7 min at 72°C ended the amplification. The quantities of the amplification products obtained were determined on 96-well microtiter plates, using a fluorometer and Picogreen as intercalant agent (Molecular Probes).
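As a quick check on the cycling program described above, the programmed hold time can be added up as follows. The function name is hypothetical, and the result ignores temperature ramping between steps, which depends on the instrument.

```python
def program_minutes(initial_s, cycles, step_seconds, final_s):
    """Total programmed hold time in minutes, excluding ramp times."""
    return (initial_s + cycles * sum(step_seconds) + final_s) / 60.0


# 10 min at 94 C, then 40 cycles of 30 s / 1 min / 30 s, then 7 min at 72 C.
total = program_minutes(10 * 60, 40, (30, 60, 30), 7 * 60)
print(total)  # 97.0
```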
Example 21
Detection of the biallelic markers
Sequencing of amplified genomic DNA and identification of polymorphisms
The sequencing of the amplified DNA obtained in Example 20 was carried out on ABI 377 sequencers. The sequences of the amplification products were determined using automated dideoxy terminator sequencing reactions with a dye terminator cycle sequencing protocol. The products of the sequencing reactions were run on sequencing gels and the sequences were analyzed as formerly described.
The sequence data were further evaluated using the above mentioned polymorphism analysis software designed to detect the presence of biallelic markers among the pooled amplified fragments. The polymorphism search was based on the presence of superimposed peaks in the electrophoresis pattern resulting from different bases occurring at the same position as described previously.
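The idea of searching for superimposed peaks can be illustrated with a toy scan over per-position peak heights. This is not the polymorphism analysis software referred to above, whose internals are not described in this section; the data layout, the function name and the 0.2 ratio threshold are all assumptions made for the sketch.

```python
def find_candidate_sites(traces, min_ratio=0.2):
    """Flag positions where a second base gives a substantial peak.

    traces: list with one dict per position, mapping base -> peak height.
    Returns (position, strongest base, second strongest base) tuples for
    positions whose secondary/primary peak ratio is at least min_ratio.
    """
    sites = []
    for pos, peaks in enumerate(traces):
        ranked = sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)
        (base1, height1), (base2, height2) = ranked[0], ranked[1]
        if height1 > 0 and height2 / height1 >= min_ratio:
            sites.append((pos, base1, base2))
    return sites


# Hypothetical peak heights at two positions of a pooled trace.
example = [
    {"A": 100, "C": 2, "G": 1, "T": 3},   # single clean peak
    {"A": 60, "C": 40, "G": 1, "T": 2},   # superimposed A and C peaks
]
print(find_candidate_sites(example))  # [(1, 'A', 'C')]
```

In a pooled sample the height of the secondary peak depends on the minor allele frequency, so a real threshold would have to be tuned against background noise.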
Six amplification fragments were analyzed. In these fragments, 8 biallelic markers were detected (SEQ ID Nos: 3927 to 3934). The localization of the biallelic markers, the polymorphic bases of each allele, and the frequencies of the most frequent alleles are shown in Table 5.
Table 5
Amplicon  Marker name  Origin of DNA  Localization in gene  Polymorphism  Frequency (%)
1         10-204-326   Ind.           Promoter              A/G           96.2 (G)
2         10-32-357    Pool           Intron 1              A/C           67.7 (C)
3         10-33-175    Ind.           Exon 2                C/T           97.3 (C)
3         10-33-234    Pool           Intron 2              A/C           56.7 (C)
3         10-33-327    Ind.           Intron 2              C/T           75.3 (T)
5         10-35-358    Pool           Intron 4              C/G           67.9 (G)
5         10-35-390    Ind.           Intron 4              C/T           82 (C)
6         10-36-164    Ind.           Exon 5                A/G           99.5 (G)
Allelic frequencies were determined in a population of random blood donors of French Caucasian origin. Their wide range is due to the fact that, besides screening a pool of 100 individuals to generate biallelic markers as described above, polymorphism searches were also conducted in an individual testing format for 50 samples. This strategy was chosen here to provide a potential shortcut towards the identification of putative causal mutations in the association studies using them. As the 10-36-164 biallelic marker (SEQ ID No: 3933) was found in only one individual, this marker was not considered in the association studies.
The fourth fragment of amplification carrying exon 3 (not shown in the Table) was not polymorphic in the tested samples (1 pool + 50 individuals).
Example 22
Validation of the polymorphisms through microsequencing
The biallelic markers identified in Example 21 were further confirmed and their respective frequencies were determined through microsequencing. Microsequencing was carried out for each individual DNA sample described in Example 19.
Amplification from genomic DNA of individuals was performed by PCR as described above for the detection of the biallelic markers with the same set of PCR primers described above.
The preferred primers used in microsequencing were about 19 nucleotides in length and hybridized just upstream of the considered polymorphic base.
Five primers hybridized with the non-coding strand of the gene. For the biallelic markers 10-204-326, 10-35-358 and 10-36-164, primers hybridized with the coding strand of the gene.
The microsequencing reaction was performed as described in Example 13.
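The placement of a microsequencing primer relative to the polymorphic base can be sketched as below. The function names, the 0-based coordinate convention and the toy sequence are assumptions for illustration; the actual primers are those of the Sequence Listing. A primer that hybridizes with the non-coding strand has the sequence of the coding strand immediately 5' of the site, while a primer that hybridizes with the coding strand is the reverse complement of the bases immediately 3' of the site.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def microseq_primer(coding_seq, site, length=19, hybridizes_with="non-coding"):
    """Primer whose 3' end abuts the polymorphic base.

    coding_seq: coding-strand sequence, written 5' to 3'.
    site: 0-based index of the polymorphic base in coding_seq.
    """
    if hybridizes_with == "non-coding":
        if site < length:
            raise ValueError("not enough sequence upstream of the site")
        return coding_seq[site - length:site]
    if hybridizes_with == "coding":
        if site + 1 + length > len(coding_seq):
            raise ValueError("not enough sequence downstream of the site")
        return reverse_complement(coding_seq[site + 1:site + 1 + length])
    raise ValueError("hybridizes_with must be 'coding' or 'non-coding'")


# Toy 10-base sequence with the polymorphic base at index 5; 4-mer primers.
toy = "AACCGGTTAC"
print(microseq_primer(toy, 5, length=4))                            # ACCG
print(microseq_primer(toy, 5, length=4, hybridizes_with="coding"))  # GTAA
```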
Example 23
Association study between asthma and the biallelic markers of the candidate gene
Collection of DNA samples from affected and non-affected individuals
The asthmatic population used to perform association studies in order to establish whether the candidate gene was an asthma-causing gene consisted of 298 individuals. More than 90 % of these 298 asthmatic individuals had a Caucasian ethnic background.
The control population consisted of 373 unaffected individuals, of whom 279 were French (at least 70 % were of Caucasian origin) and 94 were American (at least 90 % were of Caucasian origin). DNA samples were obtained from asthmatic and non-asthmatic individuals as described above.
Example 24
Association study between asthma and the biallelic markers of the candidate gene
Genotyping of affected and control individuals
The general strategy to perform the association studies was to individually scan the DNA samples from all individuals in each of the populations described above in order to establish the allele frequencies of the above described biallelic markers in each of these populations.
Allelic frequencies of the above-described biallelic markers in each population were determined by performing microsequencing reactions on amplified fragments obtained by genomic PCR performed on the DNA samples from each individual. Genomic PCR and microsequencing were performed as detailed above in Examples 20 and 22 using the described amplification and microsequencing primers.
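Once each individual has been genotyped, allele frequencies follow from counting the two alleles carried by every individual. A minimal sketch, assuming genotypes are recorded as two-letter strings (a representation chosen here for illustration, with made-up example data):

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies from diploid genotypes such as 'AA', 'AC' or 'CC'.

    Each individual contributes two alleles, so the denominator is twice
    the number of genotyped individuals.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Four hypothetical individuals: six A alleles and two C alleles.
print(allele_frequencies(["AA", "AC", "AC", "AA"]))  # {'A': 0.75, 'C': 0.25}
```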
Example 25
Association study between asthma and the biallelic markers of the candidate gene
Table 6 shows the results of the association study between five biallelic markers in the candidate gene and asthma.
Table 6
Allelic frequencies (%)
Markers    Asthmatics (298 individuals)  Controls (373 individuals)  Frequency diff.  P value
10-32-357  A 38.6                        A 29.8                      8.8              7.34x10-4
10-33-234  A 49                          A 44.3                      4.7              8.86x10-2
10-33-327  T 78.5                        T 74.6                      3.9              1.0x10-1
10-35-358  G 72.3                        G 66.9                      5.4              3.59x10-2
10-35-390  T 30.4                        T 20.3                      10.1             2.33x10-5
As shown in Table 6, markers 10-32-357 and 10-35-390 presented a strong association with asthma, this association being highly significant (p value = 7.34x10-4 for marker 10-32-357 and 2.33x10-5 for marker 10-35-390).
Three markers showed moderate association when tested independently, namely 10-33-234, 10-33-327 and 10-35-358.
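A single-marker comparison of this kind can be approximated with a chi-square test on a 2x2 table of allele counts. The sketch below is an illustration under stated assumptions: this section does not say which test produced the Table 6 p values, allele counts are reconstructed here by rounding the published percentages over 2 x 298 and 2 x 373 chromosomes, and the function names are hypothetical. The result therefore lands near, but not exactly on, the tabulated value.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def allele_test(freq_cases, n_cases, freq_controls, n_controls):
    """Chi-square and p value from allele frequencies (fractions) and group sizes."""
    case_chromosomes, control_chromosomes = 2 * n_cases, 2 * n_controls
    a = round(freq_cases * case_chromosomes)        # allele copies in cases
    c = round(freq_controls * control_chromosomes)  # allele copies in controls
    chi2 = chi2_2x2(a, case_chromosomes - a, c, control_chromosomes - c)
    return chi2, p_value_1df(chi2)


# Marker 10-35-390, allele T: 30.4 % in asthmatics, 20.3 % in controls.
chi2, p = allele_test(0.304, 298, 0.203, 373)
print(round(chi2, 2), p)
```

With these reconstructed counts the p value is of the order of 2x10-5, the same order of magnitude as the 2.33x10-5 reported for this marker.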
It is worth mentioning that allelic frequencies for each of the biallelic markers of Table 7 were separately measured within the French control population (279 individuals) and the American control population (94 individuals). The differences in allele frequencies between the two populations were between 1% and 7%, with p values above 10-1. These data confirmed that the combined French and American control population (373 individuals) was homogeneous enough to be used as a control population for the present association study.
Example 26 Association studies: Haplotype frequency analysis
As already shown, one way of increasing the statistical power of individual markers is by performing haplotype association analysis. A haplotype analysis for association of markers in the candidate gene and asthma was performed by estimating the frequencies of all possible haplotypes for biallelic markers 10-32-357, 10-33-234, 10-33-327, 10-35-358 and 10-35-390 in the asthmatic and control populations described in Example 25 (Table 6), and comparing these frequencies by means of a chi square statistical test (one degree of freedom). Haplotype estimations were performed by applying the Expectation-Maximization (EM) algorithm (Excoffier L & Slatkin M, 1995, Mol. Biol. Evol. 12:921-927), using the EM-HAPLO program (Hawley ME, Pakstis AJ & Kidd KK, 1994, Am. J. Phys. Anthropol. 18:104).
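The principle of EM-based haplotype frequency estimation can be shown on the simplest case of two biallelic markers. This is a teaching sketch, not the EM-HAPLO program: the genotype encoding (count of one designated allele at each marker), the uniform starting frequencies, the fixed iteration count and the example data are all assumptions. Only double heterozygotes have ambiguous phase, and the E step splits each of them between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
def em_two_marker(genotypes, iterations=100):
    """Estimate two-marker haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2); g1 and g2 are the number of copies (0, 1, 2)
    of a designated allele at marker 1 and marker 2.
    Returns a dict mapping (a1, a2), with a1, a2 in {0, 1}, to a frequency.
    """
    haplotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haplotypes}
    n_chromosomes = 2 * len(genotypes)
    for _ in range(iterations):
        counts = {h: 0.0 for h in haplotypes}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E step: phase is either (0,0)+(1,1) or (0,1)+(1,0).
                w_cis = freq[(0, 0)] * freq[(1, 1)]
                w_trans = freq[(0, 1)] * freq[(1, 0)]
                total = w_cis + w_trans
                p_cis = w_cis / total if total > 0 else 0.5
                counts[(0, 0)] += p_cis
                counts[(1, 1)] += p_cis
                counts[(0, 1)] += 1.0 - p_cis
                counts[(1, 0)] += 1.0 - p_cis
            else:
                # Unambiguous phase: read the two haplotypes off directly.
                counts[(int(g1 == 2), int(g2 == 2))] += 1.0
                counts[(int(g1 >= 1), int(g2 >= 1))] += 1.0
        # M step: new frequencies are the expected haplotype proportions.
        freq = {h: counts[h] / n_chromosomes for h in haplotypes}
    return freq


# Hypothetical sample: 3 x (0,0), 1 x (2,2) and 2 double heterozygotes.
sample = [(0, 0)] * 3 + [(2, 2)] + [(1, 1)] * 2
print(em_two_marker(sample))
```

In this toy sample no individual carries an unambiguous (0,1) or (1,0) haplotype, so the iterations assign the double heterozygotes almost entirely to the (0,0)+(1,1) phase.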
The results of such haplotype analysis are shown in Table 7.
Table 7
Haplotype frequencies
Markers          10-32-357  10-33-234  10-33-327  10-35-358  10-35-390  Asthm.  Controls  Odds ratio  P value
Frequency diff.  8.8        4.7        3.9        5.4        10.1
P value          7.34x10-4  8.86x10-2  1.0x10-1   3.59x10-2  2.33x10-5
Haplotype 1      A          -          -          -          T          0.2     0.11      2.02        8.47x10-6
Haplotype 2      -          A          T          G          -          0.27    0.18      1.68        2.81x10-4
Haplotype 3      A          A          T          G          T          0.18    0.09      2.22        3.95x10-5
A two-marker haplotype covering markers 10-32-357 and 10-35-390 (haplotype 1, AT alleles respectively) presented a p value of 8.47x10-6, an odds ratio of 2.02 and haplotype frequencies of 0.2 for asthmatic and 0.11 for control populations respectively.
A three-marker haplotype covering markers 10-33-234, 10-33-327 and 10-35-358 (haplotype 2, ATG alleles respectively) presented a p value of 2.81x10-4, an odds ratio of 1.68 and haplotype frequencies of 0.27 for asthmatic and 0.18 for control populations respectively.
A five-marker haplotype covering markers 10-32-357, 10-33-234, 10-33-327, 10-35-358 and 10-35-390 (haplotype 3, AATGT alleles respectively) presented a p value of 3.95x10-5, an odds ratio of 2.22 and haplotype frequencies of 0.18 for asthmatic and 0.09 for control populations respectively. Haplotype association analysis thus increased the statistical power of the individual marker association studies when compared to single-marker analysis (from p values between 10-1 and 2x10-5 for the individual markers to p values between 3x10-4 and 8x10-6 for the haplotypes of Table 7).
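The odds ratios quoted for the three haplotypes are consistent with the usual definition applied to the haplotype frequencies in the two populations. The text does not state the formula used, so the function below is an assumption, but it gives back the tabulated values to two decimals.

```python
def odds_ratio(freq_cases, freq_controls):
    """Odds of carrying the haplotype in cases divided by the odds in controls."""
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# Haplotype frequencies (asthmatics, controls) as given in Table 7.
for name, cases, controls in [("haplotype 1", 0.20, 0.11),
                              ("haplotype 2", 0.27, 0.18),
                              ("haplotype 3", 0.18, 0.09)]:
    print(name, round(odds_ratio(cases, controls), 2))
```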
The significance of the values obtained for the haplotype association analysis was evaluated by the following computer simulation test. The genotype data from the asthmatic and control individuals were pooled and randomly allocated to two groups which contained the same number of individuals as the trait-positive and trait-negative groups used to produce the data summarized in Table 7. A haplotype analysis was then run on these artificial groups for the three haplotypes presented in Table 7. This experiment was reiterated 1000 times and the results are shown in Table 8.
Table 8
Permutation Test
Haplotype            Chi-Square  Average Chi-Square  Maximal Chi-Square  P value
Haplotype 1 (A-T)    19.70       1.2                 11.6                1.0x10-3
Haplotype 2 (-ATG-)  13.49       1.2                 10.5                1.0x10-3
Haplotype 3 (AATGT)  16.66       1.2                 9.3                 1.0x10-3
The results in Table 8 show that among 1000 iterations only 1‰ of the obtained haplotypes has a p value comparable to the one obtained in Table 4.
These results clearly validate the statistical significance of the haplotypes obtained (haplotypes 1, 2 and 3, Table 7).
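The permutation procedure described in this example can be sketched generically as follows. Several simplifications are assumed: each individual is reduced to a count (0, 1 or 2) of copies of the allele or haplotype of interest, a plain 2x2 chi-square stands in for the full haplotype analysis (which would re-run the frequency estimation in each artificial group), the empirical p value uses the common (exceedances + 1) / (iterations + 1) convention, and the data are invented. None of these choices is stated in the source.

```python
import random


def count_chi2(cases, controls):
    """2x2 chi-square on copy counts (0, 1 or 2 per individual) in two groups."""
    a, c = sum(cases), sum(controls)
    b, d = 2 * len(cases) - a, 2 * len(controls) - c
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def permutation_test(cases, controls, statistic, iterations=1000, seed=0):
    """Compare the observed statistic with its distribution under random relabeling."""
    rng = random.Random(seed)
    observed = statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    null = []
    for _ in range(iterations):
        rng.shuffle(pooled)  # random allocation to two groups of the original sizes
        null.append(statistic(pooled[:n_cases], pooled[n_cases:]))
    exceed = sum(1 for value in null if value >= observed)
    return {
        "observed": observed,
        "average": sum(null) / iterations,
        "maximal": max(null),
        "p_value": (exceed + 1) / (iterations + 1),
    }


# Invented data: the haplotype is enriched in the first group.
cases = [2] * 30 + [0] * 10
controls = [0] * 30 + [2] * 10
result = permutation_test(cases, controls, count_chi2)
print(result)
```

Keeping the group sizes fixed while shuffling the labels preserves the overall frequency of the haplotype, so any difference between the artificial groups reflects chance alone.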
Example 27
Extraction of DNA
30 ml of blood are taken from the individuals in the presence of EDTA. Cells (pellet) are collected after centrifugation for 10 minutes at 2000 rpm. Red cells are lysed by a lysis solution (50 ml final volume: 10 mM Tris pH 7.6; 5 mM MgCl2; 10 mM NaCl). The solution is centrifuged (10 minutes, 2000 rpm) as many times as necessary to eliminate the residual red cells present in the supernatant, after resuspension of the pellet in the lysis solution.
The pellet of white cells is lysed overnight at 42°C with 3.7 ml of lysis solution composed of:
- 3 ml TE 10-2 (Tris-HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
- 200 μl SDS 10%
- 500 μl K-proteinase (2 mg K-proteinase in TE 10-2 / NaCl 0.4 M).
For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) is added. After vigorous agitation, the solution is centrifuged for 20 minutes at 10000 rpm. For the precipitation of DNA, 2 to 3 volumes of 100% ethanol are added to the previous supernatant, and the solution is centrifuged for 30 minutes at 2000 rpm. The DNA solution is rinsed three times with 70% ethanol to eliminate salts, and centrifuged for 20 minutes at 2000 rpm. The pellet is dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml water. The DNA concentration is evaluated by measuring the OD at 260 nm (1 unit OD = 50 μg/ml DNA).
To evaluate the presence of proteins in the DNA solution, the OD 260 / OD 280 ratio is determined. Only DNA preparations having an OD 260 / OD 280 ratio between 1.8 and 2 are used in the subsequent steps described below.
Once genomic DNA from every individual in the given population has been extracted, it is preferred that a fraction of each DNA sample is separated, after which a pool of DNA is constituted by assembling equivalent DNA amounts of the separated fractions into a single one.
TABLE 1
[Table 1 is provided as page images in the original publication and is not reproduced in this text.]
Sequence listing free text
The following free text appears in the accompanying Sequence Listing: upstream downstream amplification primer polymorphic base in for SEQ complement sequence

Claims

WHAT IS CLAIMED IS:
1. An isolated or purified polynucleotide comprising a contiguous span of at least 12 nucleotides of a sequence selected from the group consisting of SEQ ID No. 1 to 2260, and the complements thereof.
2. A polynucleotide according to claim 1, wherein said span comprises a map-related biallelic marker.
3. An isolated or purified polynucleotide consisting essentially of a contiguous span of at least 8 to 43 nucleotides of a sequence selected from the group consisting of SEQ ID No. 2261 to 3734, 3735 to 3908, and the complements thereof.
4. A polynucleotide according to claim 3, wherein said span comprises a map-related biallelic marker.
5. An isolated or purified polynucleotide comprising a contiguous span of at least 12 nucleotides of a sequence selected from the group consisting of SEQ ID No. 2261 to 3734, and the complements thereof, wherein said span comprises a map-related biallelic marker and the 1st allele indicated in Table 1 is present at said map-related biallelic marker.
6. A polynucleotide according to any one of claims 2, 4, and 5, wherein said contiguous span is 18 to 35 nucleotides in length and said biallelic marker is within 4 nucleotides of the center of said polynucleotide.
7. A polynucleotide according to claim 6, wherein said polynucleotide consists essentially of said contiguous span and said contiguous span is 25 nucleotides in length and said biallelic marker is at the center of said polynucleotide.
8. An isolated or purified polynucleotide comprising a contiguous span of at least 12 nucleotides of a sequence selected from the group consisting of SEQ ID No. 3935 to 6194, 7866 to 10125, and the complements thereof.
9. An isolated or purified polynucleotide consisting essentially of a contiguous span of at least 8 to 43 nucleotides of a sequence selected from the group consisting of SEQ ID No. 6195 to 7668, 7669 to 7842, 10126 to 11599, 11600 to 11773, and the complements thereof.
10. A polynucleotide according to any one of claims 1, 3, 8, and 9, wherein the 3' end of said contiguous span is present at the 3' end of said polynucleotide.
11. A polynucleotide according to any one of claims 2, 3, and 5, wherein the 3' end of said contiguous span is located at the 3' end of said polynucleotide and said biallelic marker is present at the 3' end of said polynucleotide.
12. A polynucleotide according to either of claims 1 and 3, wherein the 3' end of said contiguous span is present at the 3' end of said polynucleotide and the 3' end of said polynucleotide is located within 10 nucleotides upstream of a map-related biallelic marker in said sequence.
13. A polynucleotide according to claim 12, wherein the 3' end of said polynucleotide is located 1 nucleotide upstream of a map-related biallelic marker in said sequence.
14. A polynucleotide according to claim 13, wherein said contiguous span is 19 nucleotides in length and said polynucleotide consists of said contiguous span.
15. A polynucleotide according to any one of claims 1, 3, 5, 8, and 9 wherein said contiguous span comprises at least 21 contiguous nucleotides.
16. A polynucleotide according to any one of claims 1, 3, and 5, wherein said contiguous span comprises at least 30 contiguous nucleotides.
17. A polynucleotide according to any one of claims 1, 3, and 5, wherein said contiguous span comprises at least 43 contiguous nucleotides.
18. A polynucleotide for use in determining the identity of nucleotides at a map-related biallelic marker, wherein said determining is performed in a hybridization assay, sequencing assay, microsequencing assay, or an enzyme-based mismatch detection assay.
19. A polynucleotide for use in amplifying a segment of nucleotides comprising a map-related biallelic marker.
20. A polynucleotide according to either of claims 18 and 19, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, and the complements thereto.
21. A polynucleotide according to either of claims 18 and 19, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 2260, 2261 to 3734, and the complements thereto.
22. A polynucleotide according to any one of claims 1, 3, 5, 8, 9, 18, and 19 attached to a solid support.
23. An array of polynucleotides comprising at least one polynucleotide according to claim 22.
24. An array according to claim 23, wherein said array is addressable.
25. A polynucleotide according to any one of claims 1, 3, 5, 7, 8, 9, 14, 18, and 19, further comprising a label.
26. A map of the human genome comprising an ordered array of biallelic markers, wherein at least 1 of said biallelic markers is a map-related biallelic marker.
27. A map according to claim 26, comprising all of the biallelic markers of SEQ ID Nos. 1 to 3908, and the complements thereto.
28. A method of genotyping comprising determining the identity of a nucleotide at a map-related biallelic marker in a biological sample.
29. A method according to claim 28, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, and the complements thereto.
30. A method according to claim 28, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 2260, 2261 to 3734, and the complements thereto.
31. A method according to claim 28, wherein said biological sample is derived from a single subject.
32. A method according to claim 31, wherein the identity of the nucleotides at said biallelic marker is determined for both copies of said biallelic marker present in said subject's genome.
33. A method according to claim 28, wherein said biological sample is derived from multiple subjects.
34. A method according to claim 28, further comprising amplifying a portion of said sequence comprising the biallelic marker prior to said determining step.
35. A method according to claim 34, wherein said amplifying is performed by PCR.
36. A method according to claim 28, wherein said determining is performed by a hybridization assay, a sequencing assay, a microsequencing assay, or an enzyme-based mismatch detection assay.
37. A method of determining the frequency in a population of an allele of a map-related biallelic marker, comprising: a) genotyping individuals from said population for said biallelic marker according to the method of claim 28; and b) determining the proportional representation of said biallelic marker in said population.
38. A method according to claim 37, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, and the complements thereto.
39. A method according to claim 37, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 2260, 2261 to 3734, and the complements thereto.
40. A method according to claim 37, wherein said genotyping of step a) is performed on each individual of said population.
41. A method according to claim 37, wherein said genotyping is performed on a single biological sample derived from said population.
42. A method of detecting an association between an allele and a phenotype, comprising the steps of: a) determining the frequency of at least one map-related biallelic marker allele in a trait positive population according to the method of claim 37; b) determining the frequency of said map-related biallelic marker allele in a control population according to the method of claim 37; and c) determining whether a statistically significant association exists between said allele and said phenotype.
43. A method of estimating the frequency of a haplotype for a set of biallelic markers in a population, comprising: a) genotyping each individual in said population for at least one map-related biallelic marker according to claim 31; b) genotyping each individual in said population for a second biallelic marker by determining the identity of the nucleotides at said second biallelic marker for both copies of said second biallelic marker present in the genome; and c) applying a haplotype determination method to the identities of the nucleotides determined in steps a) and b) to obtain an estimate of said frequency.
44. A method according to claim 43, wherein said haplotype determination method is selected from the group consisting of asymmetric PCR amplification, double PCR amplification of specific alleles, the Clark method, or an expectation maximization algorithm.
45. A method according to claim 43, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, and the complements thereto.
46. A method according to claim 43, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 2260, 2261 to 3734, and the complements thereto.
47. A method of detecting an association between a haplotype and a phenotype, comprising the steps of: a) estimating the frequency of at least one haplotype in a trait positive population according to the method of claim 43; b) estimating the frequency of said haplotype in a control population according to the method of claim 43; and c) determining whether a statistically significant association exists between said haplotype and said phenotype.
48. A method according to either claim 42 or 47, wherein said control population is a trait negative population.
49. A method according to either claim 42 or 47, wherein said case control population is a random population.
50. A method according to claim 42, wherein each of said genotyping of steps a) and b) is performed on a single pooled biological sample derived from each of said populations.
51. A method according to claim 42, wherein said genotyping of steps a) and b) is performed separately on biological samples derived from each individual in said populations.
52. A method according to either claim 42 or 47, wherein said phenotype is selected from the group consisting of disease, drug response, drug efficacy, treatment response, treatment efficacy, and drug toxicity.
53. A method according to claim 42, wherein the identity of the nucleotides at all of the biallelic markers of SEQ ID Nos. 1 to 3908 is determined in steps a) and b).
54. A computer readable medium having stored thereon the sequence of the polynucleotide according to any one of the claims selected from the group consisting of 1, 3, 5, 7, 8, 9 and 14.
55. A computer system comprising a processor and a data storage device wherein said data storage device has stored thereon the sequence of the polynucleotide according to any one of the claims selected from the group consisting of 1, 3, 5, 7, 8, 9 and 14.
56. The computer system of Claim 55, further comprising a sequence comparer and a data storage device having reference sequences stored thereon.
57. A method for comparing a first sequence to a reference sequence, comprising the steps of: a) reading said first sequence and said reference sequence through use of a computer program which compares sequences; and b) determining differences between said first sequence and said reference sequence with said computer program; wherein said first sequence is the sequence of the polynucleotide according to any one of the claims selected from the group consisting of 1, 3, 5, 7, 8, 9 and 14.
58. A diagnostic kit comprising a polynucleotide according to any one of claims 1, 3, 5, 7, 8, 9, 14, 18, and 19.
59. A method of identifying a gene associated with a detectable trait comprising the steps of: a) determining the frequency of each allele of at least one map-related biallelic marker in individuals having said detectable trait and individuals lacking said detectable trait according to the method of claim 41; b) identifying at least one allele of said biallelic marker having a statistically significant association with said detectable trait; and c) identifying a gene in linkage disequilibrium with said allele.
60. The method according to claim 59, further comprising the step of: d) identifying a mutation in gene which is associated with said detectable trait.
61. A method of identifying biallelic markers associated with a detectable trait comprising the steps of: a) determining the frequencies of a set of biallelic markers comprising at least one map-related biallelic marker in individuals who express said detectable trait and individuals who do not express said detectable trait; and b) identifying at least one biallelic marker in said set which are statistically associated with the expression of said detectable trait.
62. A method for determining whether an individual is at risk of developing a detectable trait or suffers from a detectable trait associated with said trait comprising the steps of: a) obtaining a nucleic acid sample from said individual; b) screening said nucleic acid sample with at least one map-related biallelic marker; and c) determining whether said nucleic acid sample contains at least one biallelic marker statistically associated with said detectable trait.
63. The method according to any one of claims 59, 61 and 62, wherein said detectable trait is selected from the group consisting of disease, drug response, drug efficacy, treatment response, treatment efficacy, and drug toxicity.
64. A method of administering a drug or treatment comprising: a) obtaining a nucleic acid sample from an individual; b) determining the identity of the polymorphic base of at least one map-related biallelic marker according to the method of claim 31 which is associated with a positive response to said drug or treatment, or at least one map-related biallelic marker which is associated with a negative response to said drug or treatment; and c) administering said drug or treatment to said individual if said nucleic acid sample contains at least one biallelic marker associated with a positive response to said drug or treatment, or if said nucleic acid sample lacks at least one biallelic markers associated with a negative response to said drug or treatment.
65. A method of selecting an individual for inclusion in a clinical trial of a drug or treatment comprising: a) obtaining a nucleic acid sample from an individual; b) determining the identity of the polymorphic base of at least one map-related biallelic marker according to the method of claim 31 which is associated with a positive response to said drug or treatment, or at least one biallelic marker associated with a negative response to said drug or treatment in said nucleic acid sample; and c) including said individual in said clinical trial if said nucleic acid sample contains at least one biallelic marker which is associated with a positive response to said drug or treatment, or if said nucleic acid sample lacks at least one biallelic markers associated with a negative response to said drug or treatment.
66. The method according to either of claims 64 and 65, wherein said administering step comprises administering said drug or treatment to said individual if said nucleic acid sample contains at least one biallelic marker associated with a positive response to said drug or treatment, and said nucleic acid sample lacks at least one biallelic marker associated with a negative response to said drug or treatment.
67. The method according to any one of claims 59, 61, 62, 64, and 65, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908.
68. The method according to any one of claims 59, 61, 62, 64, and 65, wherein said map-related biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3734.
PCT/IB1999/000822 1998-04-21 1999-04-21 Biallelic markers for use in constructing a high density disequilibrium map of the human genome WO1999054500A2 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
AU34386/99A AU3438699A (en) 1998-04-21 1999-04-21 Biallelic markers for use in constructing a high density disequilibrium map of the human genome
CA002324866A CA2324866A1 (en) 1998-04-21 1999-04-21 Biallelic markers for use in constructing a high density disequilibrium map of the human genome
EP99915988A EP1071817A2 (en) 1998-04-21 1999-04-21 Biallelic markers for use in constructing a high density disequilibrium map of the human genome
US09/422,978 US6537751B1 (en) 1998-04-21 1999-10-20 Biallelic markers for use in constructing a high density disequilibrium map of the human genome
US10/349,143 US20040005584A1 (en) 1998-04-21 2003-01-21 Biallelic markers for use in constructing a high density disequilibrium map of the human genome
US11/370,584 US20060177863A1 (en) 1998-04-21 2006-03-08 Biallelic markers for use in constructing a high density disequilibrium map of the human genome

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US8261498P 1998-04-21 1998-04-21
US60/082,614 1998-04-21
US10973298P 1998-11-23 1998-11-23
US60/109,732 1998-11-23

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US29885099A Continuation-In-Part 1998-04-21 1999-04-21
US09/422,978 Continuation-In-Part US6537751B1 (en) 1998-04-21 1999-10-20 Biallelic markers for use in constructing a high density disequilibrium map of the human genome

Publications (3)

Publication Number Publication Date
WO1999054500A2 WO1999054500A2 (en) 1999-10-28
WO1999054500A9 WO1999054500A9 (en) 2000-02-10
WO1999054500A3 WO1999054500A3 (en) 2000-03-16

Family

ID=26767666

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB1999/000822 WO1999054500A2 (en) 1998-04-21 1999-04-21 Biallelic markers for use in constructing a high density disequilibrium map of the human genome

Country Status (4)

Country Link
EP (1) EP1071817A2 (en)
AU (1) AU3438699A (en)
CA (1) CA2324866A1 (en)
WO (1) WO1999054500A2 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7058517B1 (en) 1999-06-25 2006-06-06 Genaissance Pharmaceuticals, Inc. Methods for obtaining and using haplotype data
EP1285088A2 (en) * 2000-01-13 2003-02-26 Genset Biallelic markers derived from genomic regions carrying genes involved in central nervous system disorders
US6562955B2 (en) * 2000-03-17 2003-05-13 Tosoh Corporation Oligonucleotides for detection of Vibrio parahaemolyticus and detection method for Vibrio parahaemolyticus using the same oligonucleotides
US6931326B1 (en) 2000-06-26 2005-08-16 Genaissance Pharmaceuticals, Inc. Methods for obtaining and using haplotype data
JP2004504037A (en) * 2000-07-18 2004-02-12 ジェンセット ソシエテ アノニム Obesity-related biallelic marker map
GB0021667D0 (en) * 2000-09-04 2000-10-18 Glaxo Group Ltd Genetic study
DE10050361A1 (en) * 2000-10-11 2002-04-18 Genprofile Ag Statistical processing of gene sequences to determine possible haplotypes and their probability, useful e.g. for identifying genetic origins of complex diseases
WO2002035442A2 (en) * 2000-10-23 2002-05-02 Glaxo Group Limited Composite haplotype counts for multiple loci and alleles and association tests with continuous or discrete phenotypes
JP4757402B2 (en) * 2001-05-23 2011-08-24 キリンホールディングス株式会社 Polynucleotide probes and primers for detection of beer-turbid lactic acid bacteria and detection methods for beer-turbid lactic acid bacteria
AUPR823601A0 (en) * 2001-10-12 2001-11-08 University Of Queensland, The Automated genotyping
US7144999B2 (en) 2002-11-23 2006-12-05 Isis Pharmaceuticals, Inc. Modulation of hypoxia-inducible factor 1 alpha expression
WO2005048012A2 (en) * 2003-08-05 2005-05-26 Genaissance Pharmaceuticals, Inc. Methods for haplotype assignment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5858659A (en) * 1995-11-29 1999-01-12 Affymetrix, Inc. Polymorphism detection
ATE291583T1 (en) * 1993-11-03 2005-04-15 Orchid Biosciences Inc POLYMORPHISM OF MONONUCLEOTIDES AND THEIR USE IN GENETIC ANALYSIS
WO1998020165A2 (en) * 1996-11-06 1998-05-14 Whitehead Institute For Biomedical Research Biallelic markers
EP0892068A1 (en) * 1997-07-18 1999-01-20 Genset Sa Method for generating a high density linkage disequilibrium-based map of the human genome
DK1052292T3 (en) * 1997-12-22 2003-07-28 Genset Sa Prostate cancer gene

Also Published As

Publication number Publication date
CA2324866A1 (en) 1999-10-28
EP1071817A2 (en) 2001-01-31
WO1999054500A3 (en) 2000-03-16
AU3438699A (en) 1999-11-08
WO1999054500A2 (en) 1999-10-28

Similar Documents

Publication Publication Date Title
US20060177863A1 (en) Biallelic markers for use in constructing a high density disequilibrium map of the human genome
EP1129216B1 (en) Methods, software and apparati for identifying genomic regions harboring a gene associated with a detectable trait
US6703228B1 (en) Methods and products related to genotyping and DNA analysis
AU746682B2 (en) Biallelic markers for use in constructing a high density disequilibrium map of the human genome
EP1056889B1 (en) Methods related to genotyping and dna analysis
WO1999054500A9 (en) Biallelic markers for use in constructing a high density disequilibrium map of the human genome
US20060234221A1 (en) Biallelic markers of d-amino acid oxidase and uses thereof
US20040023275A1 (en) Methods for genomic analysis
US7105353B2 (en) Methods of identifying individuals for inclusion in drug studies
US20040029161A1 (en) Methods for genomic analysis
US20040048265A1 (en) Obesity associated biallelic marker maps
EP1423536A2 (en) Single nucleotide polymorphisms diagnostic for schizophrenia
WO2003025198A2 (en) Regulatory single nucleotide polymorphisms and methods therefor
JP2004512842A (en) Method for assessing risk of non-insulin dependent diabetes based on allyl mutation and body fat in the 5' flanking region of the insulin gene
EP1546398A4 (en) Single nucleotide polymorphisms diagnostic for schizophrenia
WO2003012139A2 (en) Methods for assessing the risk of obesity based on allelic variations in the 5'-flanking region of the insulin gene

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1-2516, SEQUENCE LISTING, ADDED

AK Designated states

Kind code of ref document: A3

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

ENP Entry into the national phase

Ref document number: 2324866

Country of ref document: CA

Ref document number: 2324866

Country of ref document: CA

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: KR

WWE Wipo information: entry into national phase

Ref document number: 1999915988

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1999915988

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)