WO1999036567A2

WO1999036567A2 - Enhanced discrimination of perfect matches from mismatches using a modified DNA ligase

Info

Publication number: WO1999036567A2
Application number: PCT/US1999/000176
Authority: WO
Inventors: Narayan Baidya
Original assignee: Hyseq, Inc.
Priority date: 1998-01-14
Filing date: 1999-01-14
Publication date: 1999-07-22
Also published as: AU2557799A

Abstract

The invention relates to methods using a modified DNA ligase which increases the discrimination of perfect matches from mismatches for complementary polynucleotides. The modified ligase enhances discrimination in a number of ways, for example, the ligase may increase the difference in the on rates and/or the off rates between a perfect match product and a mismatch product (a kinetic effect); or the ligase may increase the binding energy difference between a perfect match and a mismatch (a free energy [ΔG] effect); or the ligase may itself discriminate between perfect matches and mismatches (ΔG or kinetic effect); or some combination of these and other factors.

Description

ENHANCED DISCRIMINATION OF PERFECT MATCHES FROM MISMATCHES USING A MODIFIED DNA LIGASE

FIELD OF THE INVENTION

This invention relates in general to methods and apparatus for nucleic acid analysis, and, in particular, to methods and apparatus for nucleic acid analysis.

BACKGROUND

The rate of determining the sequence of the four nucleotides in nucleic acid samples is a major technical obstacle for further advancement of molecular biology, medicine, and biotechnology. Nucleic acid sequencing methods which involve separation of nucleic acid molecules in a gel have been in use since 1978. The other proven method for sequencing nucleic acids is sequencing by hybridization (SBH).

The traditional method of determining a sequence of nucleotides (i.e., the order of the A, G, C and T nucleotides in a sample) is performed by preparing a mixture of randomly-terminated, differentially labelled nucleic acid fragments by degradation at specific nucleotides, or by dideoxy chain termination of replicating strands. Resulting nucleic acid fragments in the range of 1 to 500 bp are then separated on a gel to produce a ladder of bands wherein the adjacent samples differ in length by one nucleotide.

The array-based approach of SBH does not require single base resolution in separation, degradation, synthesis or imaging of a nucleic acid molecule. Using mismatch discriminative hybridization of short oligonucleotides K bases in length, lists of constituent K-mer oligonucleotides may be determined for target nucleic acid. Sequence for the target nucleic acid may be assembled by uniquely overlapping scored oligonucleotides.

There are several approaches available to achieve sequencing by hybridization. In a process called SBH Format 1, nucleic acid samples are arrayed, and labeled probes are hybridized with the samples. Replica membranes with the same sets of sample nucleic acids may be used for parallel scoring of several probes and/or probes may be multiplexed. Nucleic acid samples may be arrayed and hybridized on nylon membranes or other suitable supports. Each membrane array may be reused many times. Format 1 is especially efficient for batch processing large numbers of samples. In SBH Format 2, probes are arrayed at locations on a substrate which correspond to their respective sequences, and a labelled nucleic acid sample fragment is hybridized to the arrayed probes. In this case, sequence information about a fragment may be determined in a simultaneous hybridization reaction with all of the arrayed probes. For sequencing other nucleic acid fragments, the same oligonucleotide array may be reused. The arrays may be produced by spotting or by in situ synthesis of probes.

In Format 3 SBH, two sets of probes are used. In one embodiment, a set may be in the form of arrays of probes with known positions, and another, labelled set may be stored in multiwell plates. In this case, target nucleic acid need not be labelled. Target nucleic acid and one or more labelled probes are added to the arrayed sets of probes. If one attached probe and one labelled probe both hybridize contiguously on the target nucleic acid, they are covalently ligated, producing a detected sequence equal to the sum of the length of the ligated probes. The process allows for sequencing long nucleic acid fragments, e.g. a complete bacterial genome, without nucleic acid subcloning in smaller pieces.
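The contiguity requirement of Format 3 may be illustrated with the following minimal sketch. It models hybridization as exact reverse-complement matching on a single target strand, which is a simplifying assumption; the function names, the example target, and the probe sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def ligation_product(target: str, fixed_probe: str, labeled_probe: str):
    """Return the joined probe if both probes match adjacent target sites.

    Both probes are written 5' to 3'. The joined probe can only be
    complementary to the target if the two probes hybridize immediately
    next to each other with no mismatch, so a single substring test on
    the reverse complement covers both conditions. Returns None otherwise.
    """
    for first, second in ((fixed_probe, labeled_probe),
                          (labeled_probe, fixed_probe)):
        joined = first + second
        if reverse_complement(joined) in target:
            return joined
    return None


target = "GGATCCGTTAACGG"                                 # hypothetical target strand
product = ligation_product(target, "GTTAA", "CGGAT")      # adjacent perfect matches
mismatch = ligation_product(target, "GTTAA", "CGGAA")     # one mismatched base
```

In this sketch the detected sequence has a length equal to the sum of the two probe lengths, here 5 + 5 = 10 bases.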

In the present invention, SBH is applied to the efficient identification and sequencing of one or more nucleic acid samples. The procedure has many applications in nucleic acid diagnostics, forensics, and gene mapping. It also may be used to identify mutations responsible for genetic disorders and other traits, to assess biodiversity and to produce many other types of data dependent on nucleic acid sequence.

SUMMARY OF THE INVENTION

The present invention provides a method for detecting a target nucleic acid species including the steps of providing an array of probes affixed to a substrate and a plurality of labeled probes wherein each labeled probe is selected to have a first nucleic acid sequence which is complementary to a first portion of a target nucleic acid and wherein the nucleic acid sequence of at least one probe affixed to the substrate is complementary to a second portion of the nucleic acid sequence of the target, the second portion being adjacent to the first portion; applying a target nucleic acid to the array under suitable conditions for hybridization of probe sequences to complementary sequences; introducing a labeled probe to the array; hybridizing a probe affixed to the substrate to the target nucleic acid; hybridizing the labeled probe to the target nucleic acid; affixing the labeled probe to an adjacently hybridized probe in the array; and detecting the labeled probe affixed to the probe in the array. According to preferred methods of the invention the array of probes affixed to the substrate comprises a universal set of probes. According to other preferred aspects of the invention at least two of the probes affixed to the substrate define overlapping sequences of the target nucleic acid sequence and more preferably at least two of the labelled probes define overlapping sequences of the target nucleic acid sequences. 
Still further, according to another aspect of the invention a method is provided for detecting a target nucleic acid of known sequence comprising the steps of: contacting a nucleic acid sample with a set of immobilized oligonucleotide probes attached to a solid substrate under hybridizing conditions wherein the immobilized probes are capable of specific hybridization with different portions of said target nucleic acid sequence; contacting the target nucleic acid with a set of labelled oligonucleotide probes in solution under hybridizing conditions wherein the labeled probes are capable of specific hybridization with different portions of said target nucleic acid sequence adjacent to the immobilized probes; covalently joining the immobilized probes to labelled probes that are immediately adjacent to the immobilized probe on the target sequence (e.g., with ligase); removing any non-ligated labelled probes; and detecting the presence of the target nucleic acid by detecting the presence of said labelled probe attached to the immobilized probes.

The invention also provides a method of determining expression of a member of a set of partially or completely sequenced genes in a cell type, a tissue or a tissue mixture comprising the steps of: defining pairs of fixed and labeled probes specific for the sequenced gene; hybridizing unlabeled nucleic acid sample and corresponding labeled probes to one or more arrays of fixed probes; forming covalent bonds between adjacent hybridized labeled and fixed probes; removing unligated probes; and determining the presence of the sequenced gene by detection of labeled probes bound to prespecified locations in the array. In a preferred embodiment of this aspect of the invention, the target nucleic acid will identify the presence of an infectious agent.

Further, the present invention provides for an array of oligonucleotide probes comprising a nylon membrane; a plurality of subarrays of oligonucleotide probes on the nylon membrane, the subarrays comprising a plurality of individual spots wherein each spot is comprised of a plurality of oligonucleotide probes of the same sequence; and a plurality of hydrophobic barriers located between the subarrays on the nylon membrane, whereby the plurality of hydrophobic barriers prevents cross contamination between adjacent subarrays.

Still further, the present invention provides a method for sequencing a repetitive sequence, having a first end and a second end, in a target nucleic acid comprising the steps of: (a) providing a plurality of spacer oligonucleotides of varying lengths wherein the spacer oligonucleotides comprise the repetitive sequence; (b) providing a first oligonucleotide that is known to be adjacent to the first end of the repetitive sequence; (c) providing a plurality of second oligonucleotides one of which is adjacent to the second end of the repetitive sequence, wherein the plurality of second oligonucleotides is labeled; (d) hybridizing the first and the plurality of second oligonucleotides, and one of the plurality of spacer oligonucleotides to the target nucleic acid; (e) ligating the hybridized oligonucleotides; (f) separating ligated oligonucleotides from unligated oligonucleotides; and (g) detecting label in the ligated oligonucleotides.

Still further, the present invention provides a method for sequencing a branch point sequence, having a first end and a second end, in a target nucleic acid comprising the steps of: (a) providing a first oligonucleotide that is complementary to a first portion of the branch point sequence wherein the first oligonucleotide extends from the first end of the branch point sequence by at least one nucleotide; (b) providing a plurality of second oligonucleotides that are labeled, and are complementary to a second portion of the branch point sequence wherein the plurality of second oligonucleotides extend from the second end of the branch point sequence by at least one nucleotide, and wherein the portion of the second oligonucleotides that extend from the second end of the branch point sequence comprise sequences that are complementary to a plurality of sequences that arise from the branch point sequence; (c) hybridizing the first oligonucleotide, and one of the plurality of second oligonucleotides to the target DNA; (d) ligating the hybridized oligonucleotides; (e) separating ligated oligonucleotides from unligated oligonucleotides; and (f) detecting label in the ligated oligonucleotides.

Still further, the present invention provides a method for confirming a sequence by using probes that are predicted to be negative for the target nucleic acid. The sequence of a target is then confirmed by hybridizing the target nucleic acid to the "negative" probes to confirm that these probes do not form perfect matches with the target nucleic acid.

Still further, the present invention provides a method for analyzing a nucleic acid using oligonucleotide probes that are complexed with different labels so that the probes may be multiplexed in a hybridization reaction without a loss of sequence information (i.e., different probes have different labels so that hybridization of the different probes to the target can be distinguished). In a preferred embodiment, the labels are radioisotopes, or fluorescent molecules, or enzymes, or electrophore mass labels. In a more preferred embodiment, the differently labeled oligonucleotide probes are used in format III SBH, and multiple probes (more than two, with one probe being the immobilized probe) are ligated together.

Still further, the present invention provides a method for detecting the presence of a target nucleic acid having a known sequence when the target is present in very small amounts compared to homologous nucleic acids in a sample. In a preferred embodiment, the target nucleic acid is an allele present at very low frequency in a sample that has nucleic acids from a large number of sources. In an alternative preferred embodiment, the target nucleic acid has a mutated sequence, and is present at very low frequency within a sample of nucleic acids.

Still further, the present invention provides a method for confirming the sequence of a target nucleic acid by using single pass gel sequencing. Primers for single pass gel sequencing are derived from the sequence obtained by SBH, and these primers are used in standard Sanger sequencing reactions to provide gel sequence information for the target nucleic acid. The sequence obtained by single pass gel sequencing is then compared to the SBH derived sequence to confirm the sequence.

Still further, the present invention provides a method for solving branch points by using single pass gel sequencing. Primers for the single pass gel sequencing reactions are identified from the ends of the Sfs obtained after a first round of SBH sequencing, and these primers are used in standard Sanger-sequencing reactions to provide gel sequencing information through the branch points of the Sfs. Sfs are then aligned by comparing the Sanger-sequencing results through the branch points to the Sfs to identify adjoining Sfs.

Still further, the present invention provides for a method of preparing a sample containing target nucleic acids by PCR, without purifying the PCR products prior to the SBH reactions. In Format I SBH, crude PCR products are applied to a substrate without prior purification, and the substrate may be washed prior to introduction of the labeled probes.

Still further, the present invention provides a method and an apparatus for analyzing a target nucleic acid. The apparatus comprises two arrays of nucleic acids that are mixed together at the desired time. In a preferred embodiment, the nucleic acids in one of the arrays are labeled. In a more preferred embodiment, a material is disposed between the two arrays and this material prevents the mixing of nucleic acids in the arrays. When this material is removed, or rendered permeable, the nucleic acids in the two arrays are mixed together. In an alternative preferred embodiment, the nucleic acids in one array are target nucleic acids and the nucleic acids in the other are oligonucleotide probes. In another preferred embodiment, the nucleic acids in both arrays are oligonucleotide probes. In another preferred embodiment, the nucleic acids in one array are oligonucleotide probes and target nucleic acids, and nucleic acids in the other array are oligonucleotide probes. In another preferred embodiment, the nucleic acids in both arrays are oligonucleotide probes and target nucleic acids.

One method of the present invention using the apparatus described above comprises the steps of providing an array of nucleic acids fixed to a substrate, providing a second array of nucleic acids, providing conditions that allow the nucleic acids in the second array to come into contact with the nucleic acids of the fixed array, wherein the nucleic acids of one array are target nucleic acids and those of the other array are oligonucleotide probes, and analyzing the hybridization results. In a preferred embodiment, the fixed array is target nucleic acid and the second array is labeled oligonucleotide probes. In a more preferred embodiment, there is a material disposed between the two arrays that prevents mixing of the nucleic acids until the material is removed or rendered permeable to the nucleic acids.

A second method of the present invention using the apparatus described above comprises the steps of providing two arrays of nucleic acid probes, providing conditions that allow the two arrays of probes to come into contact with each other and a target nucleic acid, ligating together probes that are adjacent on the target nucleic acid, and analyzing the results. In a preferred embodiment, the probes in one array are fixed and the probes in the other array are labeled. In a more preferred embodiment, there is a material disposed between the two arrays that prevents mixing of the probes until the material is removed or rendered permeable to the probes.

Still further, the present invention provides substrates on which arrays of oligonucleotide probes are fixed, wherein each probe is separated from its neighboring probes by a physical barrier that is resistant to the flow of the sample solution. In a preferred embodiment, the physical barrier is made of a hydrophobic material.

Still further, the present invention provides a method for making the arrays of oligonucleotide probes that are separated by physical barriers. In a preferred embodiment, a grid is applied to the substrate using an ink-jet head that applies a material which reduces the reaction volume of the array.

Still further, the present invention provides substrates on which oligonucleotides are fixed to form a three-dimensional array. The three-dimensional array combines high resolution for reading probe results (each level has a relatively low density of probes per cm²), with high information content in three dimensional space (multiple levels or probes).

Still further, the present invention provides a substrate to which oligonucleotide probes are fixed, wherein the oligonucleotide probes have spacers, and wherein the spacers increase the distance between the substrate and the informational portion of the oligonucleotide probe (e.g., the portion of the oligonucleotide probe which binds to the target and gives sequence information). In a preferred embodiment, the spacer comprises ribose sugars and phosphates, wherein the phosphates covalently bind the ribose sugars into a polymer by forming esters with the ribose sugars through their 5' and 3' hydroxyl groups.

Still further, the present invention provides a method for clustering cDNA clones into groups of similar or identical sequences, so that single representative clones may be selected from each group for sequencing. In a preferred embodiment, the method for clustering is used in the sequencing of a plurality of clones, comprising the steps of: interrogating each clone with a plurality of oligonucleotide probes; determining which probes bind to each clone and the signal intensity for each probe; clustering clones into a plurality of groups by identifying clones that bind to similar probes with similar intensities; and sequencing at least one clone from each group. In a more preferred embodiment, the plurality of probes comprises from about 50 to about 500 different probes. In another more preferred embodiment, the plurality of probes comprises about 300 different probes. In a most preferred embodiment, the plurality of clones are a plurality of cDNA clones.
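The clustering step may be illustrated by the following minimal sketch, which is not the method of the invention but one simple way to group clones that bind similar probes with similar intensities. It assumes that each clone is represented by a non-zero vector of probe signal intensities, uses cosine similarity with a greedy assignment, and takes the first member of each group as its representative. The function names, the threshold of 0.95, and the intensity values are hypothetical.

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two intensity vectors (assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def cluster_clones(signals: dict[str, list[float]],
                   threshold: float = 0.95) -> list[list[str]]:
    """Greedily group clones whose probe signals resemble a group representative."""
    clusters: list[list[str]] = []
    for name, vector in signals.items():
        for cluster in clusters:
            representative = signals[cluster[0]]
            if cosine(vector, representative) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing group is similar enough, so start a new one.
            clusters.append([name])
    return clusters


# Hypothetical intensities for four clones scored with four probes.
signals = {
    "clone_a": [9, 1, 8, 0],
    "clone_b": [8, 1, 9, 1],
    "clone_c": [0, 9, 1, 8],
    "clone_d": [1, 8, 0, 9],
}
groups = cluster_clones(signals)
to_sequence = [group[0] for group in groups]  # one representative per group
```

With a few hundred probes per clone, as contemplated above, the same comparison applies to longer vectors.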

Still further, the invention relates to oligonucleotide probes complexed (covalent or noncovalent) to discrete particles wherein the particles can be grouped into a plurality of sets based on a physical property. In a preferred embodiment, a different probe is attached to the discrete particles of each set, and the identity of the probe is determined by identifying the physical property of the discrete particles. In an alternative embodiment, the probe is identified on the basis of a physical property of the probe. The physical property includes any that can be used to differentiate the discrete particles, and includes, for example, size, fluorescence, radioactivity, electromagnetic charge, or absorbance, or label(s) may be attached to the particle such as a dye, a radionuclide, or an EML. In a preferred embodiment, discrete particles are separated by a flow cytometer which detects the size, charge, fluorescence, or absorbance of the particle.

The invention also relates to methods using the probes complexed with the discrete particles to analyze target nucleic acids. These probes may be used in any of the methods described above, with the modification of identifying the probe by the physical property of the discrete particle. These probes may also be used in a format III approach where the "free" probe is identified by a label, and the probe complexed to the discrete particle is identified by the physical property. In a preferred embodiment, the probes are used to sequence a target nucleic acid using SBH.

The invention also relates to methods using agents which destabilize the binding of complementary polynucleotide strands (decrease the binding energy), or increase stability of binding between complementary polynucleotide strands (increase the binding energy). In preferred embodiments, the agent is a tetraalkyl ammonium salt; sodium chloride; a phosphate salt; a borate salt; an organic solvent such as formamide, glycol, dimethylsulfoxide, or dimethylformamide; urea; guanidinium; an amino acid analog such as betaine; a polyamine such as spermidine or spermine, or another positively charged molecule which neutralizes the negative charge of the phosphate backbone; a detergent such as sodium dodecyl sulfate or sodium lauryl sarcosinate; a minor/major groove binding agent; a positively charged polypeptide; an intercalating agent such as acridine, ethidium bromide, or anthracene; or a polyanion such as an alkyl polysulphonic acid. In a preferred embodiment, an agent is used to reduce or increase the T_m of a pair of complementary polynucleotides. In a more preferred embodiment, a mixture of the agents is used to reduce or increase the T_m of a pair of complementary polynucleotides. In a preferred embodiment, the agent or agents are added so that the binding energy from an AT base pair is approximately equivalent to the binding energy of a GC base pair. The energy of binding of these complementary polynucleotides may be increased by adding an agent that neutralizes or shields the negative charges of the phosphate groups in the polynucleotide backbone. In a most preferred embodiment, the agent or agents are used to enhance the discrimination of perfect matches from mismatches for complementary polynucleotides.

The invention also relates to methods of increasing the discrimination of perfect matches from mismatches for complementary polynucleotides. In preferred embodiments, this discrimination is increased by changing a physical property in the method, e.g., the temperature, and/or adding an agent which increases discrimination, e.g., spermidine or formamide. In a more preferred embodiment, a mixture of agents and/or physical conditions is used to increase the discrimination of perfect matches from mismatches between a probe and a target nucleic acid. In a most preferred embodiment, the change in physical condition or addition of an agent enhances discrimination in a number of ways, for example, the physical condition or agent may increase the difference in the on rates or off rates between a perfect match product and a mismatch product (a kinetic effect); or the reaction time may be decreased so that binding of the probe to a perfect match site and/or a mismatch site does not reach equilibrium; or the physical condition or agent may increase the binding energy difference between a perfect match and a mismatch (a free energy [ΔG] effect); or the physical condition or agent may enhance the discrimination effect of another agent or physical condition (ΔG or kinetic effect); or the physical condition or agent may preferentially modify the perfect match or mismatch complexes formed between complementary polynucleotides; or the physical condition or agent may enhance the discrimination of the physical condition or agent which physically modifies the complexed polynucleotides (ΔG, kinetic, or conformational effect); or some combination of these and other factors. In a preferred embodiment, the agent, agents or physical condition(s) modify the activity of a protein which binds to and/or modifies the complexed or uncomplexed nucleic acids. In a preferred embodiment, the agent is one of those recited supra. In a preferred embodiment, the physical condition is selected from the group comprising temperature, pH, ionic strength, time, and/or others such as, e.g., those listed in The Handbook of Chemistry and Physics, CRC Press.

The invention also relates to methods for enhancing the activity of a nucleic acid modifying polypeptide on a target nucleic acid, comprising the steps of contacting the target nucleic acid with at least one polynucleotide under conditions which allow a perfect match to be discriminated from a mismatch, wherein an agent is added to enhance the discrimination of the perfect match from the mismatch; and contacting the complex formed between the polynucleotide and the target nucleic acid with the nucleic acid modifying polypeptide, wherein the activity of the nucleic acid modifying polypeptide is enhanced by the enhanced discrimination. In preferred embodiments, the nucleic acid modifying polypeptide is selected from the group comprising a ligase, a nucleic acid polymerase, an integrase, a gyrase, a nuclease, a helicase, a methylase, and a capping enzyme. In an alternative preferred embodiment, the methods are used to enhance the binding of nucleic acid binding proteins, such as, for example, transcription factors, repressors, and structural polypeptides such as, for example, histones. In a most preferred embodiment, the nucleic acid modifying polypeptide is a ligase that has been modified to enhance its discrimination of perfect matches from mismatches.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Format 1 SBH is appropriate for the simultaneous analysis of a large set of samples. Parallel scoring of thousands of samples on large arrays may be performed in thousands of independent hybridization reactions using small pieces of membranes. The identification of DNA may involve 1-20 probes per reaction and the identification of mutations may in some cases involve more than 1000 probes specifically selected or designed for each sample. For identification of the nature of the mutated DNA segments, specific probes may be synthesized or selected for each mutation detected in the first round of hybridizations.

DNA samples may be prepared in small arrays which may be separated by appropriate spacers, and which may be simultaneously tested with probes selected from a set of oligonucleotides which may be arrayed in multiwell plates. Small arrays may consist of one or more samples. DNA samples in each small array may include mutants or individual samples of a sequence. Consecutive small arrays may be organized into larger arrays. Such larger arrays may include replication of the same small array or may include arrays of samples of different DNA fragments.

A universal set of probes includes sufficient probes to analyze a DNA fragment with prespecified precision, e.g. with respect to the redundancy of reading each base pair ("bp"). These sets may include more probes than are necessary for one specific fragment, but may include fewer probes than are necessary for testing thousands of DNA samples of different sequence.

DNA or allele identification and a diagnostic sequencing process may include the steps of:

1) Selection of a subset of probes from a dedicated, representative or universal set to be hybridized with each of a plurality of small arrays;

2) Adding a first probe to each subarray on each of the arrays to be analyzed in parallel;

3) Performing hybridization and scoring of the hybridization results;

4) Stripping off previously used probes;

5) Repeating hybridization, scoring and stripping steps for the remaining probes which are to be scored;

6) Processing the obtained results to obtain a final analysis or to determine additional probes to be hybridized;

7) Performing additional hybridizations for certain subarrays; and

8) Processing complete sets of data and obtaining a final analysis.

This approach provides fast identification and sequencing of a small number of nucleic acid samples of one type (e.g. DNA, RNA), and also provides parallel analysis of many sample types in the form of subarrays by using a presynthesized set of probes of manageable size. Two approaches have been combined to produce an efficient and versatile process for the determination of DNA identity, for DNA diagnostics, and for identification of mutations.

For the identification of known sequences, a small set of shorter probes may be used in place of a longer unique probe. In this approach, although there may be more probes to be scored, a universal set of probes may be synthesized to cover any type of sequence. For example, a full set of 6-mers includes only 4,096 probes, and a complete set of 7-mers includes only 16,384 probes.
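The sizes quoted above follow from the four possible bases at each of K positions, i.e. 4^K probes in a complete set of K-mers. A minimal sketch, in which the function names are illustrative only:

```python
from itertools import product


def universal_probe_count(k: int) -> int:
    """Number of distinct K-mers over the four bases A, C, G and T."""
    return 4 ** k


def universal_probes(k: int):
    """Lazily enumerate every K-mer, e.g. for ordering a complete probe set."""
    return ("".join(bases) for bases in product("ACGT", repeat=k))


six_mers = universal_probe_count(6)
seven_mers = universal_probe_count(7)
```

Here `six_mers` evaluates to 4,096 and `seven_mers` to 16,384, matching the figures given for complete sets of 6-mers and 7-mers.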

Full sequencing of a DNA fragment may be performed with two levels of hybridization. One level is hybridization of a sufficient set of probes that cover every base at least once. For this purpose, a specific set of probes may be synthesized for a standard sample. The results of hybridization with such a set of probes reveal whether and where mutations (differences) occur in non-standard samples. Further, this set of probes may include "negative" probes to confirm the hybridization results of the "positive" probes. To determine the identity of the changes, additional specific probes may be hybridized to the sample. This additional set of probes will have both "positive" (the mutant sequence) and "negative" probes, and the sequence changes will be identified by the positive probes and confirmed by the negative probes. In another embodiment, all probes from a universal set may be scored. A universal set of probes allows scoring of a relatively small number of probes per sample in a two step process without an undesirable expenditure of time. The hybridization process may involve successive probings, in a first step of computing an optimal subset of probes to be hybridized first and, then, on the basis of the obtained results, a second step of determining additional probes to be scored from among those in a universal set. Both sets of probes have "negative" probes that confirm the positive probes in the set. Further, the sequence that is obtained may then be confirmed in a separate step by hybridizing the sample with a set of "negative" probes identified from the SBH results.

In SBH sequence assembly, K-1 oligonucleotides which occur repeatedly in analyzed DNA fragments due to chance or biological reasons may be subject to special consideration. If there is no additional information, relatively small fragments of DNA may be fully assembled inasmuch as every base pair is read several times.

In the assembly of relatively longer fragments, ambiguities may arise due to the repeated occurrence in a set of positively-scored probes of a K-1 sequence (i.e., a sequence shorter than the length of the probe). This problem does not exist if mutated or similar sequences have to be determined (i.e., the K-1 sequence is not identically repeated). Knowledge of one sequence may be used as a template to correctly assemble a sequence known to be similar (e.g. by its presence in a database) by arraying the positive probes for the unknown sequence to display the best fit on the template.
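The source of such ambiguities may be illustrated with the following minimal sketch, which lists the K-1 sequences occurring more than once in a known fragment. It is offered only as an aid to understanding; the function name and the example sequences are hypothetical.

```python
from collections import Counter


def repeated_overlaps(target: str, k: int) -> set[str]:
    """Return the (K-1)-mers that occur more than once in the target.

    Each such repeat is a potential branch point when the target is
    assembled from K-mer probes, because more than one positive probe
    may extend the same K-1 overlap.
    """
    m = k - 1
    counts = Counter(target[i:i + m] for i in range(len(target) - m + 1))
    return {overlap for overlap, n in counts.items() if n > 1}


ambiguous = repeated_overlaps("ACGTTACGTG", 5)   # hypothetical fragment, 5-mer probes
unambiguous = repeated_overlaps("ATGCGTAC", 4)   # hypothetical fragment, 4-mer probes
```

In the first fragment the 4-base sequence ACGT occurs twice and is followed once by T and once by G, so the 5-mer probes ACGTT and ACGTG both extend the same overlap.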

The use of an array of samples avoids consecutive scoring of many oligonucleotides on a single sample or on a small set of samples. This approach allows the scoring of more probes in parallel by manipulation of only one physical object. Subarrays of DNA samples 1000 bp in length may be sequenced in a relatively short period of time. If the samples are spotted at 50 subarrays in an array and the array is reprobed 10 times, 500 probes may be scored.

In screening for the occurrence of a mutation, enough probes may be used to cover each base three times. If a mutation is present, several covering probes will be affected. The use of information about the identity of negative probes may map the mutation with a two base precision. To solve a single base mutation mapped in this way, an additional 15 probes may be employed. These probes cover any base combination for two questionable positions (assuming that deletions and insertions are not involved). These probes may be scored in one cycle on 50 subarrays which contain a given sample. In the implementation of a multiple label color scheme (i.e., multiplexing), two to six probes, each having a different label such as a different fluorescent dye, may be used as a pool, thereby reducing the number of hybridization cycles and shortening the sequencing process.

In more complicated cases, there may be two close mutations or insertions. They may be handled with more probes. For example, a three base insertion may be solved with 64 probes. The most complicated cases may be approached by several steps of hybridization, and the selecting of a new set of probes on the basis of results of previous hybridizations.
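The probe counts given above may be reproduced with simple arithmetic. The sketch below rests on one stated assumption: that the 15 additional probes correspond to the 4 × 4 = 16 base combinations at two positions less the one combination already known from the reference sequence. That reading is an interpretation and is not stated in the text; the function names are illustrative.

```python
def probes_scored(subarrays: int, reprobe_cycles: int) -> int:
    """One probe per subarray per cycle, so the counts multiply."""
    return subarrays * reprobe_cycles


def probes_for_two_positions(positions: int = 2) -> int:
    """All base combinations at the questionable positions, minus the
    reference combination (assumed reading of the figure of 15)."""
    return 4 ** positions - 1


def probes_for_insertion(inserted_bases: int) -> int:
    """Every possible inserted sequence of the given length."""
    return 4 ** inserted_bases


scored = probes_scored(50, 10)
substitution_probes = probes_for_two_positions()
insertion_probes = probes_for_insertion(3)
```

These evaluate to 500, 15 and 64 respectively, in agreement with the figures in the text.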

If subarrays to be analyzed include tens or hundreds of samples of one type, then several of them may be found to contain one or more changes (mutations, insertions, or deletions). For each segment where mutation occurs, a specific set of probes may be scored. The total number of probes to be scored for a type of sample may be several hundreds. The scoring of replica arrays in parallel facilitates scoring of hundreds of probes in a relatively small number of cycles. In addition, compatible probes may be pooled. Positive hybridizations may be assigned to the probes selected to check particular DNA segments because these segments usually differ in 75% of their constituent bases.

By using a larger set of longer probes, longer targets may be analyzed. These targets may represent pools of fragments such as pools of exon clones.

A specific hybridization scoring method may be employed to define the presence of mutants in a genomic segment to be sequenced from a diploid chromosomal set. Two variations are where: i) the sequence from one chromosome represents a known allele and the sequence from the other represents a new mutant; or, ii) both chromosomes contain new, but different mutants. In both cases, the scanning step designed to map changes gives a maximal signal difference of two-fold at the mutant position. Further, the method can be used to identify which alleles of a gene are carried by an individual and whether the individual is homozygous or heterozygous for that gene.

Scoring two-fold signal differences required in the first case may be achieved efficiently by comparing corresponding signals with homozygous and heterozygous controls. This approach allows determination of a relative reduction in the hybridization signal for each particular probe in a given sample. This is significant because hybridization efficiency may vary more than two-fold for a particular probe hybridized with different nucleic acid fragments having the same full match target. In addition, different mutant sites may affect more than one probe depending upon the number of oligonucleotide probes. Decrease of the signal for two to four consecutive probes produces a more significant indication of a mutant site. Results may be checked by testing with small sets of selected probes, among which one or a few probes are selected to give a full match signal which is on average eight-fold stronger than the signals coming from mismatch-containing duplexes.

Partitioned membranes allow a very flexible organization of experiments to accommodate relatively larger numbers of samples representing a given sequence type, or many different types of samples represented with relatively small numbers of samples. A range of 4-256 samples can be handled with particular efficiency. Subarrays within this range of numbers of dots may be designed to match the configuration and size of standard multiwell plates used for storing and labeling oligonucleotides. The size of the subarrays may be adjusted for different numbers of samples, or a few standard subarray sizes may be used. If all samples of a type do not fit in one subarray, additional subarrays or membranes may be used and processed with the same probes. In addition, by adjusting the number of replicas for each subarray, the time for completion of the identification or sequencing process may be varied.

As used herein, "intermediate fragment" means an oligonucleotide between 5 and 1000 bases in length, and preferably between 10 and 40 bp in length.

In Format 3, a first set of oligonucleotide probes of known sequence is immobilized on a solid support under conditions which permit them to hybridize with nucleic acids having respectively complementary sequences. A labeled, second set of oligonucleotide probes is provided in solution. Both within the sets and between the sets the probes may be of the same length or of different lengths. A nucleic acid to be sequenced or intermediate fragments thereof may be applied to the first set of probes in double-stranded form (especially where a recA protein is present to permit hybridization under non-denaturing conditions), or in single-stranded form and under conditions which permit hybrids of different degrees of complementarity (for example, under conditions which allow discrimination between full match and one base pair mismatch hybrids). The nucleic acid to be sequenced or intermediate fragments thereof may be applied to the first set of probes before, after or simultaneously with the second set of probes. Probes that bind to adjacent sites on the target are bound together (e.g., by stacking interactions or by a ligase or other means of causing chemical bond formation between the adjacent probes). After permitting adjacent probes to be bound, fragments and probes which are not immobilized to the surface by chemical bonding to a member of the first set of probes are washed away, for example, using a high temperature (up to 100 degrees C) wash solution which melts hybrids. The bound probes from the second set may then be detected using means appropriate to the label employed (which may, for example, be chemiluminescent, fluorescent, radioactive, enzymatic, densitometric, or electrophore mass labels).

Herein, nucleotide bases "match" or are "complementary" if they form a stable duplex by hydrogen bonding under specified conditions. For example, under conditions commonly employed in hybridization assays, adenine ("A") matches thymine ("T"), but not guanine ("G") or cytosine ("C"). Similarly, G matches C, but not A or T. Other bases which will hydrogen bond in less specific fashion, such as inosine or the Universal Base ("M" base, Nichols et al., 1994), or other modified bases, such as methylated bases, for example, are complementary to those bases for which they form a stable duplex under specified conditions. A probe is said to be "perfectly complementary" or is said to be a "perfect match" if each base in the probe forms a duplex by hydrogen bonding to a base in the nucleic acid to be sequenced according to the Watson and Crick base pairing rules (i.e., absent any surrounding sequence effects, the duplex formed has the maximal binding energy for a particular probe). "Perfectly complementary" and "perfect match" are also meant to encompass probes which have analogs or modified nucleotides. A "perfect match" for an analog or modified nucleotide is judged according to a "perfect match rule" selected for that analog or modified nucleotide (e.g., the binding pair that has maximal binding energy for a particular analog or modified nucleotide). Each base in a probe that does not form a binding pair according to the "rules" is said to be a "mismatch" under the specified hybridization conditions.
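
The definitions of "perfect match" and "mismatch" given above may be sketched, for the four standard bases only, as follows. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, both sequences are written 5' to 3', and universal or modified bases (which would need their own "perfect match rule") are not modeled.

```python
# Watson-Crick pairing rule from the text: A matches T, G matches C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the strand that pairs with `seq`, written 5'->3'."""
    return "".join(PAIRS[b] for b in reversed(seq))

def count_mismatches(probe, target_site):
    """Count probe bases that do not pair with the target site.

    The probe binds antiparallel to the target, so it is compared
    against the reverse complement of the site.
    """
    if len(probe) != len(target_site):
        raise ValueError("probe and target site must have equal length")
    expected = reverse_complement(target_site)
    return sum(1 for p, e in zip(probe, expected) if p != e)

def is_perfect_match(probe, target_site):
    """True when every probe base pairs according to the rule."""
    return count_mismatches(probe, target_site) == 0

site = "GATTACA"  # hypothetical 7-base target site
print(is_perfect_match("TGTAATC", site))   # perfectly complementary probe
print(count_mismatches("TGTAATG", site))   # probe with one mismatched base
```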

A list of probes may be assembled wherein each probe is a perfect match to the nucleic acid to be sequenced. The probes on this list may then be analyzed to order them in maximal overlap fashion. Such ordering may be accomplished by comparing a first probe to each of the other probes on the list to determine which probe has a 3' end which has the longest sequence of bases identical to the sequence of bases at the 5' end of a second probe. The first and second probes may then be overlapped, and the process may be repeated by comparing the 5' end of the second probe to the 3' end of all of the remaining probes and by comparing the 3' end of the first probe with the 5' end of all of the remaining probes. The process may be continued until there are no probes on the list which have not been overlapped with other probes. Alternatively, more than one probe may be selected from the list of positive probes, and more than one set of overlapped probes ("sequence nucleus") may be generated in parallel. The list of probes for either such process of sequence assembly may be the list of all probes which are perfectly complementary to the nucleic acid to be sequenced or may be any subset thereof.
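
The maximal overlap ordering described above may be sketched as a simple greedy procedure. The Python below is an illustration only, not a definitive implementation: the function names are hypothetical, probes are written as the target-strand sequence they read, a single sequence nucleus is grown from the first probe on the list, and branch points or repeats are not resolved.

```python
def overlap(left, right):
    """Length of the longest end of `left` identical to the start of
    `right`, considering only overlaps shorter than a full probe."""
    for n in range(min(len(left), len(right)) - 1, 0, -1):
        if left.endswith(right[:n]):
            return n
    return 0

def assemble(probes):
    """Greedy maximal-overlap assembly of positive probes.

    Starts from the first probe and repeatedly extends the growing
    sequence at either end with whichever remaining probe overlaps
    it most. Stops when no remaining probe overlaps.
    """
    remaining = list(dict.fromkeys(probes))  # drop duplicates, keep order
    contig = remaining.pop(0)
    while remaining:
        best = None  # (overlap length, probe, side extended)
        for p in remaining:
            for side, n in (("right", overlap(contig, p)),
                            ("left", overlap(p, contig))):
                if n and (best is None or n > best[0]):
                    best = (n, p, side)
        if best is None:
            break  # nothing overlaps; other data would be needed
        n, p, side = best
        contig = contig + p[n:] if side == "right" else p[:-n] + contig
        remaining.remove(p)
    return contig, remaining

# Hypothetical positive 4-mers, listed in arbitrary order.
positives = ["GCCG", "ATGC", "CGTA", "TGCC", "CCGT"]
print(assemble(positives))
```

In this small example the five probes assemble into a single eight-base sequence with no probes left over. A list in which several probes share the same overlap would require the additional steps discussed in the text.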

The 5' and 3' ends of the probes may be overlapped to generate longer stretches of sequence. This process of assembling probes continues until an ambiguity arises because of a branch point (a probe is repeated in the fragment), repetitive sequences longer than the probes, or an uncloned segment. The stretches of sequence between any two ambiguities are referred to as fragments of a subclone sequence (Sfs). Where ambiguities arise in sequence assembly due to the availability of alternative proper overlaps with probes, hybridization with longer probes spanning the site of overlap alternatives, competitive hybridization, ligation of alternative end to end pairs of probes spanning the site of ambiguity or single pass gel analysis (to provide an unambiguous ordering of Sfs) may be used.

By employing the above procedures, one may obtain any desired level of sequence, from a pattern of hybridization (which may be correlated with the identity of a nucleic acid sample to serve as a signature for identifying the nucleic acid sample) to overlapping or non-overlapping probes up through assembled Sfs and on to complete sequence for an intermediate fragment or an entire source DNA molecule (e.g. a chromosome).

Sequencing may generally comprise the following steps:

(a) contacting an array of immobilized oligonucleotide probes with a nucleic acid fragment under conditions effective to allow the fragment to form a primary complex with an immobilized probe having a complementary sequence;

(b) contacting this primary complex with a set of labeled oligonucleotide probes in solution under conditions effective to allow the primary complex to hybridize to the labeled probe, thereby forming secondary complexes wherein the fragment is hybridized with both an immobilized probe and a labeled probe;

(c) removing from a secondary complex any labeled probe that has not hybridized adjacent to an immobilized probe;

(d) detecting the presence of adjacent labeled and unlabeled probes by detecting the presence of the label; and

(e) determining a nucleotide sequence of the fragment by connecting the known sequence of the immobilized and labeled probes.

Hybridization and washing conditions may be selected to detect substantially perfect match hybrids (such as those wherein the fragment and probe hybridize at six out of seven positions), may be selected to allow differentiation of perfect matches and one base pair mismatches, or may be selected to permit detection only of perfect match hybrids.

Suitable hybridization conditions may be routinely determined by optimization procedures or pilot studies. Such procedures and studies are routinely conducted by those skilled in the art to establish protocols for use in a laboratory. See e.g., Ausubel et al., Current Protocols in Molecular Biology, Vol. 1-2, John Wiley & Sons (1989); Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., Vols. 1-3, Cold Spring Harbor Press (1989); and Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1982), all of which are incorporated by reference herein. For example, conditions such as temperature, concentration of components, hybridization and washing times, buffer components, and their pH and ionic strength may be varied.

In embodiments wherein the labeled and immobilized probes are not physically or chemically linked, detection may rely solely on washing steps of controlled stringency. Under such conditions, adjacent probes have increased binding affinity because of stacking interactions between the adjacent probes. Conditions may be varied to optimize the process as described above.

In embodiments wherein the immobilized and labeled probes are ligated, ligation may be implemented by a chemical ligating agent (e.g. water-soluble carbodiimide or cyanogen bromide), or a ligase enzyme, such as the commercially available T4 DNA ligase, may be employed. The washing conditions may be selected to distinguish between adjacent versus nonadjacent labeled and immobilized probes, exploiting the difference in stability for adjacent probes versus nonadjacent probes.

Oligonucleotide probes may be labeled with fluorescent dyes, chemiluminescent systems, radioactive labels (e.g., ³⁵S, ³H, ³²P or ³³P) or with isotopes detectable by mass spectrometry. Where a nucleic acid molecule of unknown sequence is longer than about 45 or 50 bp, the molecule may be fragmented and the sequences of the fragments determined. Fragmentation may be accomplished by restriction enzyme digestion, shearing or NaOH. Fragments may be separated by size (e.g. by gel electrophoresis) to obtain a preferred fragment length of about ten to forty bps. Oligonucleotides may be immobilized by a number of methods known to those skilled in the art, such as laser-activated photodeprotection attachment through a phosphate group using reagents such as a nucleoside phosphoramidite or a nucleoside hydrogen phosphorate. Glass, nylon, silicon and fluorocarbon supports may be used.

In a preferred embodiment, oligonucleotides are attached to a glass surface using a modified protocol from Zhen Guo et al., Nucl. Acids Res. (1994) 22:5456-5465. In this protocol, the glass surface is activated by adding an amino-silane functional group that is coupled with a phenyldiisothiocyanate (DITC). 5'-amino oligonucleotides are attached to this glass substrate by spotting onto the DITC activated glass surface and incubating for one hour at 37 °C in a humid chamber. Oligonucleotides may be organized into arrays, and these arrays may include all or a subset of all probes of a given length, or sets of probes of selected lengths. Hydrophobic partitions may be used to separate probes or subarrays of probes. Arrays may be designed for various applications (e.g. mapping, partial sequencing, sequencing of targeted regions for diagnostic purposes, mRNA sequencing and large scale sequencing). A specific chip may be designed to be dedicated to a particular application by selecting a combination and arrangement of probes on a substrate.

For example, 1024 immobilized probe arrays of all oligonucleotide probes 5 bases in length (each array containing 1024 distinct probes) may be constructed. The probes in this example are 5-mers in an informational sense (they may actually be longer probes). A second set of 1024 5-mer probes may be labeled, and one of each labeled probe may be applied to an array of immobilized probes along with a fragment to be sequenced. In this example, 1024 arrays would be combined in a large superarray, or "superchip." In those instances where an immobilized probe and one of the labeled probes hybridize end-to-end along a nucleic acid fragment, the two probes are joined, for example by ligation, and, after removing unbound label, 10-mers complementary to the sample fragment are detected by the correlation of the presence of a label at a point in an array having an immobilized probe of known sequence to which was applied a labeled probe of known sequence. The sequence of the sample fragment is simply the sequence of the immobilized probe continued in the sequence of the labeled probe. In this way, all one million possible 10-mers may be tested by a combinatorial process which employs only 5-mers and which thus involves one thousandth of the amount of effort for oligonucleotide synthesis. In a preferred embodiment, the substrate which supports the array of oligonucleotide probes is partitioned into sections so that each probe in the array is separated from adjacent probes by a physical barrier which may be, for example, a hydrophobic material. In a preferred embodiment, the physical barrier has a width of from 100 μm to 30 μm. In a more preferred embodiment, the distance from the center of each probe to the center of any adjacent probes is 325 μm.
These arrays of probes may be "mass-produced" using a nonmoving, fixed substrate or a substrate fixed to a rotating drum or plate with an ink-jet deposition apparatus, for example, a microdrop dosing head; and a suitable robotic system, for example, an Anorad gantry.
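
The combinatorial economy of the 5-mer by 5-mer superarray described above may be illustrated with a short Python sketch. The names `FIVE_MERS` and `positive_pairs` and the sample fragment are hypothetical, and for simplicity each probe is written as the target-strand sequence it reads rather than as its complement.

```python
from itertools import product

BASES = "ACGT"
FIVE_MERS = ["".join(p) for p in product(BASES, repeat=5)]

def positive_pairs(fragment):
    """Return the (immobilized 5-mer, labeled 5-mer) pairs that would
    hybridize end-to-end on `fragment`: every 10-base window of the
    fragment, split into its first and second halves."""
    pairs = set()
    for i in range(len(fragment) - 9):
        window = fragment[i:i + 10]
        pairs.add((window[:5], window[5:]))
    return pairs

fragment = "ACGTTGCAAGGCTTAC"  # hypothetical 16-base sample fragment
pairs = positive_pairs(fragment)

print(len(FIVE_MERS))       # 1024 distinct 5-mers in each set
print(len(FIVE_MERS) ** 2)  # 1048576 10-mers addressable by pairing
print(len(pairs))           # array positions expected to carry label
```

Each positive pair reads as one 10-mer, the immobilized probe continued in the labeled probe, so only 2 × 1024 oligonucleotides are synthesized in this sketch to interrogate roughly one million 10-mers.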

In an alternative preferred embodiment, the oligonucleotide probes are fixed to a three-dimensional array. The three-dimensional array is comprised of multiple layers, and each layer may be analyzed separate and apart from the other layers. The three-dimensional array may take a number of forms, including, for example, the array may be disposed on a substrate having multiple depressions with probes located at different depths within the depressions (each level is made up of probes at similar depths within the depression); or the array may be disposed on a substrate having depressions of different depths with the probes located at the bottom of the depression, or at the peaks separating the depressions, or some combination of peaks and depressions may be used (each level is made up of all the probes at a certain depth); or the array may be disposed on a substrate comprised of multiple sheets that are layered to form a three-dimensional array.

The probes in these arrays may include spacers that increase the distance between the surface of the substrate and the informational portion of the probes. The spacers may be comprised of atoms capable of forming at least two covalent bonds such as carbon, silicon, oxygen, sulfur, phosphorous, and the like, or may be comprised of molecules capable of forming at least two covalent bonds such as sugar-phosphate groups, amino acids, peptides, nucleosides, nucleotides, sugars, carbohydrates, aromatic rings, hydrocarbon rings, linear and branched hydrocarbons, and the like. A nucleic acid sample to be sequenced may be fragmented or otherwise treated (for example, by the use of recA) to avoid hindrance to hybridization from secondary structure in the sample. The sample may be fragmented by, for example, digestion with a restriction enzyme such as Cvi JI, physical shearing (e.g. by ultrasound), or by NaOH treatment. The resulting fragments may be separated by gel electrophoresis and fragments of an appropriate length, such as between about 10 bp and about 40 bp, may be extracted from the gel. In a preferred embodiment, the "fragments" of the nucleic acid sample cannot be ligated to other fragments in the pool. Such a pool of fragments may be obtained by treating the fragmented nucleic acids with a phosphatase (e.g., calf intestinal phosphatase). Alternatively, nonligatable fragments of the sample nucleic acid may be obtained by using random primers (e.g., N₅-N₉, where N = A, G, T, or C) in a Sanger-dideoxy sequencing reaction with the sample nucleic acid. This will produce fragments of DNA that have a complementary sequence to the target nucleic acid and that are terminated in a dideoxy residue that cannot be ligated to other fragments.

A reusable Format 3 SBH array may be produced by introducing a cleavable bond between the fixed and labeled probes and then cleaving this bond after a round of Format 3 analysis is finished. The labeled probes may be ribonucleotides or a ribonucleotide may be used as the joining base in the labeled probe so that this probe may subsequently be removed, e.g., by RNAse or uracil-DNA glycosylase treatment, or NaOH treatment. In addition, bonds produced by chemical ligation may be selectively cleaved. Other variations include the use of modified oligonucleotides to increase specificity or efficiency, and cycling hybridizations to increase the hybridization signal, for example by performing a hybridization cycle under conditions (e.g. temperature) optimally selected for a first set of labeled probes followed by hybridization under conditions optimally selected for a second set of labeled probes. Shifts in reading frame may be determined by using mixtures (preferably mixtures of equimolar amounts) of probes ending in each of the four nucleotide bases A, T, C and G.

Branch points produce ambiguities as to the ordered sequence of a fragment. Although the sequence information is determined by SBH, either: (i) long read length, single-pass gel sequencing at a fraction of the cost of complete gel sequencing; or (ii) comparison to related sequences, may be used to order hybridization data where such ambiguities ("branch points") occur. Primers for single pass gel sequencing through the branch points are identified from the SBH sequence information or from known vector sequences, e.g., the flanking sequences to the vector insert site, and standard Sanger-sequencing reactions are performed on the sample nucleic acid. The sequence obtained from this single pass gel sequencing is compared to the Sfs that read into and out of the branch points to identify the order of the Sfs. Alternatively, the Sfs may be ordered by comparing the sequence of the Sfs to related sequences and ordering the Sfs to produce a sequence that is closest to the related sequence.

In addition, the number of tandem repetitive nucleic acid segments in a target fragment may be determined by single-pass gel sequencing. As tandem repeats occur rarely in protein-encoding portions of a gene, the gel-sequencing step will be performed only when one of these noncoding regions is identified as being of particular interest (e.g., if it is an important regulatory region). Obtaining information about the degree of hybridization exhibited for a set of only about 200 oligonucleotide probes (about 5% of the effort required for complete sequencing) defines a unique signature of each gene and may be used for sorting the cDNAs from a library to determine if the library contains multiple copies of the same gene. By such signatures, identical, similar and different cDNAs can be distinguished and inventoried.

Nucleic acids and methods for isolating, cloning and sequencing nucleic acids are well known to those of skill in the art. See e.g., Ausubel et al., Current Protocols in Molecular Biology, Vol. 1-2, John Wiley & Sons (1989); and Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., Vols. 1-3, Cold Spring Harbor Press (1989), both of which are incorporated by reference herein.

SBH is a well developed technology that may be practiced by a number of methods known to those skilled in the art. Specifically, techniques related to sequencing by hybridization of the following documents are incorporated by reference herein: Drmanac et al., U.S. Patent No. 5,202,231 (hereby incorporated by reference herein) - Issued April 13, 1993; Drmanac et al., Genomics, 4, 114-128 (1989); Drmanac et al., Proceedings of the First Int'l Conf. Electrophoresis Supercomputing Human Genome, Cantor et al. eds, World Scientific Pub. Co., Singapore, 47-59 (1991); Drmanac et al., Science, 260, 1649-1652 (1993); Lehrach et al., Genome Analysis: Genetic and Physical Mapping, 1, 39-81 (1990), Cold Spring Harbor Laboratory Press; Drmanac et al., Nucl. Acids Res., 4691 (1986); Stevanovic et al., Gene, 79, 139 (1989); Paunesku et al., Mol. Biol. Evol., 1, 607 (1990); Nizetic et al., Nucl. Acids Res., 19, 182 (1991); Drmanac et al., J. Biomol. Struct. Dyn., 5, 1085 (1991); Hoheisel et al., Mol. Gen., A, 125-132 (1991); Strezoska et al., Proc. Natl. Acad. Sci. (USA), 88, 10089 (1991); Drmanac et al., Nucl. Acids Res., 19, 5839 (1991); and Drmanac et al., Int. J. Genome Res., 1, 59-79 (1992).

The term "expression modulating fragment," EMF, means a series of nucleotide molecules which modulates the expression of an operably linked ORF or EMF.

As used herein, a sequence is said to "modulate the expression of an operably linked sequence" when the expression of the sequence is altered by the presence of the EMF. EMFs include, but are not limited to, promoters, and promoter modulating sequences (inducible elements). One class of EMFs are fragments which induce the expression of an operably linked ORF in response to a specific regulatory factor or physiological event.

As used herein, an "uptake modulating fragment," UMF, means a series of nucleotide molecules which mediate the uptake of a linked DNA fragment into a cell. UMFs can be readily identified using known UMFs as a target sequence or target motif with the computer-based systems described above.

The present invention is illustrated in the following examples. Upon consideration of the present disclosure, one of skill in the art will appreciate that many other embodiments and variations may be made in the scope of the present invention. Accordingly, it is intended that the broader aspects of the present invention not be limited to the disclosure of the following examples.

EXAMPLE 1

Preparation of Sets of Probes

Two types of universal sets of probes may be prepared. The first is a complete set (or at least a noncomplementary subset) of relatively short probes, for example all 4096 (or about 2000 non-complementary) 6-mers, or all 16,384 (or about 8,000 non-complementary) 7-mers. Full noncomplementary subsets of 8-mers and longer probes are less convenient inasmuch as they include 32,000 or more probes.
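
The sizes of the noncomplementary subsets quoted above may be checked by enumeration. The sketch below is illustrative only and assumes that "noncomplementary" means keeping one probe from each pair consisting of a probe and its reverse complement; the function names are hypothetical.

```python
from itertools import product

PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    return "".join(PAIRS[b] for b in reversed(seq))

def noncomplementary_subset(k):
    """Keep one probe from each {probe, reverse complement} pair.

    A probe that is its own reverse complement forms a pair of one
    and is kept once.
    """
    kept = set()
    for p in product("ACGT", repeat=k):
        probe = "".join(p)
        kept.add(min(probe, reverse_complement(probe)))
    return kept

for k in (6, 7, 8):
    print(k, 4 ** k, len(noncomplementary_subset(k)))
# prints:
# 6 4096 2080
# 7 16384 8192
# 8 65536 32896
```

Under this assumption the counts agree with the approximate figures in the text: about 2000 for 6-mers, about 8,000 for 7-mers, and somewhat more than 32,000 for 8-mers.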

A second type of probe set is selected as a small subset of probes still sufficient for reading every bp in any sequence with at least one probe. For example, 12 of 16 dimers are sufficient. A small subset of 7-mers, 8-mers and 9-mers for sequencing double stranded DNA may be about 3000, 10,000 and 30,000 probes, respectively. Sets of probes may also be selected to identify a target nucleic acid of known sequence, and/or to identify alleles or mutants of a target nucleic acid with a known sequence. Such a set of probes contains sufficient probes so that every nucleotide position of the target nucleic acid is read at least once. Alleles or mutants are identified by the loss of binding of one of the "positive" probes. The specific sequence of these alleles or mutants is then determined by interrogating the target nucleic acid with sets of probes that contain every possible nucleotide change and combination of changes at these probe positions.

Sets of probes may also be comprised of from 50 probes to a universal set of probes (all probes of a certain length), more preferably the set is comprised of 100-500 probes, and in a most preferred embodiment, the probe set contains 300 probes. In a preferred embodiment, the set of probes are 6-9 nucleotides in length, and are used to cluster cDNA clones into groups of similar or identical sequences, so that single representative clones may be selected from each group for sequencing.

Probes may be prepared using standard chemistry with one to three non-specified (mixed A, T, C and G) or universal (e.g. M base or inosine) bases at the ends. If radiolabelling is used, probes may have an OH group at the 5' end for kinasing by radiolabelled phosphorus groups.

Alternatively, probes labelled with any compatible system, such as fluorescent dyes, may be employed. Other types of probes, such as PNA (Protein Nucleic Acids) or probes containing modified bases which change duplex stability also may be used. Probes may be stored in bar-coded multiwell plates. For small numbers of probes, 96-well plates may be used; for 10,000 or more probes, storage in 384- or 864-well plates is preferred.

Stacks of 5 to 50 plates are enough to store all probes. Approximately 5 pg of a probe may be sufficient for hybridization with one DNA sample. Thus, from a small synthesis of about 50 μg per probe, ten million samples may be analyzed. If each probe is used for every third sample, and if each sample is 1000 bp in length, then over 30 billion bases (10 human genomes) may be sequenced by a set of 5,000 probes.
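
The throughput figures above may be retraced as a short worked calculation. The Python below is a sketch under stated assumptions: the synthesis scale is taken to be about 50 micrograms per probe (the scale at which ten million samples follows from 5 pg per sample), a human genome is taken to be about 3 billion bases, and the variable names are illustrative.

```python
PG_PER_SAMPLE = 5            # picograms of probe used per DNA sample
SYNTHESIS_PG = 50 * 10**6    # assumed 50 micrograms, expressed in picograms
SAMPLE_LENGTH_BP = 1000      # length of each sample
USE_EVERY_NTH_SAMPLE = 3     # each probe is used on every third sample
HUMAN_GENOME_BP = 3 * 10**9  # assumed approximate genome size

# Number of hybridizations one synthesis of a probe can supply.
hybridizations_per_probe = SYNTHESIS_PG // PG_PER_SAMPLE

# If a probe is needed only on every third sample, the same supply
# stretches across three times as many samples.
samples_covered = hybridizations_per_probe * USE_EVERY_NTH_SAMPLE

bases_sequenced = samples_covered * SAMPLE_LENGTH_BP
genomes = bases_sequenced // HUMAN_GENOME_BP

print(hybridizations_per_probe, bases_sequenced, genomes)
# prints: 10000000 30000000000 10
```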

EXAMPLE 2

Probes Having Modified Oligonucleotides

Modified oligonucleotides may be introduced into hybridization probes and used under appropriate conditions therefor. For example, pyrimidines with a halogen at the C5-position may be used to improve duplex stability by influencing base stacking. 2,6-diaminopurine may be used to provide a third hydrogen bond in base pairing with thymine, thereby thermally stabilizing DNA-duplexes. Using 2,6-diaminopurine may increase duplex stability to allow more stringent conditions for annealing, thereby improving the specificity of duplex formation, suppressing background problems and permitting the use of shorter oligomers.

The synthesis of the triphosphate versions of these modified nucleotides is disclosed by Hoheisel & Lehrach (1990).

One may also use the non-discriminatory base analogue, or universal base, as designed by Nichols et al. (1994). This new analogue, 1-(2'-deoxy-β-D-ribofuranosyl)-3-nitropyrrole (designated M), was generated for use in oligonucleotide probes and primers for solving the design problems that arise as a result of the degeneracy of the genetic code, or when only fragmentary peptide sequence data are available. This analogue maximizes stacking while minimizing hydrogen-bonding interactions without sterically disrupting a DNA duplex.

The M nucleoside analogue was designed to maximize stacking interactions using aprotic polar substituents linked to heteroaromatic rings, enhancing intra- and inter-strand stacking interactions to lessen the role of hydrogen bonding in base-pairing specificity. Nichols et al. (1994) favored 3-nitropyrrole 2'-deoxyribonucleoside because of its structural and electronic resemblance to p-nitroaniline, whose derivatives are among the smallest known intercalators of double-stranded DNA. The dimethoxytrityl-protected phosphoramidite of nucleoside M is also available for incorporation into nucleotides used as primers for sequencing and polymerase chain reaction (PCR). Nichols et al. (1994) showed that a substantial number of nucleotides can be replaced by M without loss of primer specificity.

A unique property of M is its ability to replace long strings of contiguous nucleosides and still yield functional sequencing primers. Sequences with three, six and nine M substitutions have all been reported to give readable sequencing ladders, and PCR with three different M-containing primers all resulted in amplification of the correct product (Nichols et al., 1994).

The ability of 3-nitropyrrole-containing oligonucleotides to function as primers strongly suggests that a duplex structure must form with complementary strands. Optical thermal profiles obtained for the oligonucleotide pairs d(5'-C₂T₅XT₅G₂-3') and d(5'-C₂A₅YA₅G₂-3') (where X and Y can be A, C, G, T or M) were reported to fit the normal sigmoidal pattern observed for the DNA double-to-single strand transition. The Tm values of the oligonucleotides containing X·M base pairs (where X was A, C, G or T, and Y was M) were reported to all fall within a 3°C range (Nichols et al., 1994).

EXAMPLE 3

Selection and Labeling of Probes

When an array of subarrays is produced, the sets of probes to be hybridized in each of the hybridization cycles on each of the subarrays are defined. For example, a set of 384 probes may be selected from the universal set, and 96 probings may be performed in each of 4 cycles. Probes selected to be hybridized in one cycle preferably have similar G+C contents.
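
One possible way of assigning probes of similar G+C content to the same cycle is sketched below. This is an assumption offered for illustration, not the selection procedure of the disclosure: probes are sorted by G+C fraction and the sorted list is cut into cycles of equal size. The function names and example probes are hypothetical.

```python
def gc_content(probe):
    """Fraction of bases in the probe that are G or C."""
    return sum(probe.count(b) for b in "GC") / len(probe)

def plan_cycles(probes, probes_per_cycle):
    """Sort probes by G+C fraction and cut the sorted list into
    cycles, so that probes sharing a cycle have similar G+C content.
    Ties are broken alphabetically to keep the plan reproducible."""
    ordered = sorted(probes, key=lambda p: (gc_content(p), p))
    return [ordered[i:i + probes_per_cycle]
            for i in range(0, len(ordered), probes_per_cycle)]

# Hypothetical 6-mer probes, planned as two cycles of four probings.
probes = ["AATTAT", "GGCCGC", "ATGCAT", "GCGCAT",
          "AAATTT", "GGGCCC", "ATATGC", "GCGCGA"]
for number, cycle in enumerate(plan_cycles(probes, 4), start=1):
    print(number, cycle)
```

With 384 selected probes and 96 probings per cycle, the same function yields the four cycles mentioned in the text.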

Selected probes for each cycle are transferred to a 96-well plate and then are labelled by kinasing or by other labeling procedures if they are not labelled (e.g. with stable fluorescent dyes) before they are stored.

On the basis of the first round of hybridizations, a new set of probes may be defined for each of the subarrays for additional cycles. Some of the arrays may not be used in some of the cycles. For example, if only 8 of 64 patient samples exhibit a mutation and 8 probes are scored first for each mutation, then all 64 probes may be scored in one cycle and 32 subarrays are not used. These unused subarrays may then be treated with hybridization buffer to prevent drying of the filters. Probes may be retrieved from the storing plates by any convenient approach, such as a single channel pipetting device, or a robotic station, such as a Beckman Biomek 1000 (Beckman Instruments, Fullerton, California) or a Mega Two robot (Megamation, Lawrenceville, New Jersey). A robotic station may be integrated with data analysis programs and probe managing programs. Outputs of these programs may be inputs for one or more robotic stations. Probes may be retrieved one by one and added to subarrays covered by hybridization buffer.

It is preferred that retrieved probes be placed in a new plate and labeled or mixed with hybridization buffer. The preferred method of retrieval is by accessing stored plates one by one and pipetting (or transferring by metal pins) a sufficient amount of each selected probe from each plate to specific wells in an intermediary plate. An array of individually addressable pipettes or pins may be used to speed up the retrieval process.

EXAMPLE 4

Preparation of Labeled Probes

The oligonucleotide probes may be prepared by automated synthesis, which is routine to those of skill in the art, for example, using an Applied Biosystems system. Alternatively, probes may be prepared using Genosys Biotechnologies Inc. methods using stacks of porous Teflon wafers.

Oligonucleotide probes may be labeled with, for example, radioactive labels (³⁵S, ³²P, ³³P, and preferably, ³³P) for arrays with 100-200 μm or 100-400 μm spots; non-radioactive isotopes (Jacobsen et al., 1990); or fluorophores (Brumbaugh et al., 1988). All such labeling methods are routine in the art, as exemplified by the relevant sections in Sambrook et al. (1989) and by further references such as Schubert et al. (1990), Murakami et al. (1991) and Cate et al. (1991), all articles being specifically incorporated herein by reference.

In regard to radiolabelling, the common methods are end-labeling using T4 polynucleotide kinase or high specific activity labeling using Klenow or even T7 polymerase. These are described as follows.

Synthetic oligonucleotides are synthesized without a phosphate group at their 5' termini and are therefore easily labeled by transfer of the γ-³²P or γ-³³P from [γ-³²P]ATP or [γ-³³P]ATP using the enzyme bacteriophage T4 polynucleotide kinase. If the reaction is carried out efficiently, the specific activity of such probes can be as high as the specific activity of the [γ-³²P]ATP or [γ-³³P]ATP itself. The reaction described below is designed to label 10 pmoles of an oligonucleotide to high specific activity. Labeling of different amounts of oligonucleotide can easily be achieved by increasing or decreasing the size of the reaction, keeping the concentrations of all components constant.

A reaction mixture would be created using 1.0 μl of oligonucleotide (10 pmoles/μl); 2.0 μl of 10 x bacteriophage T4 polynucleotide kinase buffer; 5.0 μl of [γ-³²P]ATP or [γ-³³P]ATP (sp. act. 5000 Ci/mmole; 10 mCi/ml in aqueous solution) (10 pmoles) and 11.4 μl of water. Eight (8) units (~1 μl) of bacteriophage T4 polynucleotide kinase is added to the reaction mixture, and incubated for 45 minutes at 37°C. The reaction is heated for 10 minutes at 68°C to inactivate the bacteriophage T4 polynucleotide kinase. The efficiency of transfer of ³²P or ³³P to the oligonucleotide and its specific activity is then determined. If the specific activity of the probe is acceptable, it is purified. If the specific activity is too low, an additional 8 units of enzyme is added and incubated for a further 30 minutes at 37°C before heating the reaction for 10 minutes at 68°C to inactivate the enzyme. Purification of radiolabeled oligonucleotides can be achieved by, e.g., precipitation with ethanol; precipitation with cetylpyridinium bromide; chromatography through Bio-Gel P-60; chromatography on a Sep-Pak C18 column; or polyacrylamide gel electrophoresis.
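The statement that other amounts of oligonucleotide can be labeled by changing the reaction size at constant concentrations can be sketched as a simple proportional scaling. The following is a minimal illustration only: the function name `scale_kinase_reaction` and the component labels are hypothetical, and the kinase volume is taken as the approximate 1 μl quoted for 8 units.

```python
# Component volumes (microlitres) of the 10 pmol kinase reaction described in the text.
# The kinase volume is the approximate figure quoted for 8 units (an assumption here).
BASE_REACTION_UL = {
    "oligonucleotide (10 pmoles/ul)": 1.0,
    "10x T4 polynucleotide kinase buffer": 2.0,
    "[gamma-P]ATP (10 mCi/ml)": 5.0,
    "water": 11.4,
    "T4 polynucleotide kinase (8 units)": 1.0,
}
BASE_PMOL = 10.0  # amount of oligonucleotide labeled by the base reaction


def scale_kinase_reaction(target_pmol):
    """Scale every component volume by the same factor, so that all
    concentrations stay constant while the reaction size changes."""
    if target_pmol <= 0:
        raise ValueError("target_pmol must be positive")
    factor = target_pmol / BASE_PMOL
    return {name: round(vol * factor, 3) for name, vol in BASE_REACTION_UL.items()}


if __name__ == "__main__":
    for name, vol in scale_kinase_reaction(25).items():
        print(f"{vol:7.2f} ul  {name}")
```

Because every volume is multiplied by the same factor, the ratio of each component to the total volume is unchanged, which is the condition the text requires.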

Probes of higher specific activities can be obtained using the Klenow fragment of E. coli DNA polymerase I to synthesize a strand of DNA complementary to the synthetic oligonucleotide. A short primer is hybridized to an oligonucleotide template whose sequence is the complement of the desired radiolabeled probe. The primer is then extended using the Klenow fragment of E. coli DNA polymerase I to incorporate [α-³²P]dNTPs or [α-³³P]dNTPs in a template-directed manner. After the reaction, the template and product are separated by denaturation followed by electrophoresis through a polyacrylamide gel under denaturing conditions. With this method, it is possible to generate oligonucleotide probes that contain several radioactive atoms per molecule of oligonucleotide.

To use this method, one would mix in a microfuge tube the calculated amounts of [α-³²P]dNTPs or [α-³³P]dNTPs necessary to achieve the desired specific activity and sufficient to allow complete synthesis of all template strands. Then add to the tube the appropriate amounts of primer and template DNAs, with the primer being in three- to tenfold molar excess over the template.

0.1 volume of 10 x Klenow buffer would then be added and mixed well. 2-4 units of the Klenow fragment of E. coli DNA polymerase I would then be added per 5 μl of reaction volume, mixed and incubated for 2-3 hours at 4°C. If desired, the progress of the reaction may be monitored by removing small (0.1 μl) aliquots and measuring the proportion of radioactivity that has become precipitable with 10% trichloroacetic acid (TCA).

The reaction would be diluted with an equal volume of gel-loading buffer, heated to 80°C for 3 minutes, and then the entire sample loaded on a denaturing polyacrylamide gel. Following electrophoresis, the gel is autoradiographed, allowing the probe to be localized and removed from the gel.

Various methods for fluorescent probe labeling are also available, e.g., Brumbaugh et al. (1988) describe the synthesis of fluorescently labeled primers. A deoxyuridine analog with a primary amine "linker arm" of 12 atoms attached at C-5 is synthesized. Synthesis of the analog consists of derivatizing 2'-deoxyuridine through organometallic intermediates to give 5-(methyl propenoyl)-2'-deoxyuridine. Reaction with dimethoxytrityl-chloride produces the corresponding 5'-dimethoxytrityl adduct. The methyl ester is hydrolyzed, activated, and reacted with an appropriately monoacylated alkyl diamine. After purification, the resultant linker arm nucleosides are converted to nucleoside analogs suitable for chemical oligonucleotide synthesis.

Oligonucleotides would then be made that include one or two linker arm bases by using modified phosphoramidite chemistry. To a solution of 50 nmol of the linker arm oligonucleotide in 25 μl of 500 mM sodium bicarbonate (pH 9.4) is added 20 μl of 300 mM FITC in dimethyl sulfoxide. The mixture is agitated at room temperature for 6 hrs. The oligonucleotide is separated from free FITC by elution from a 1 x 30 cm Sephadex G-25 column with 20 mM ammonium acetate (pH 6), combining fractions in the first UV-absorbing peak.

In general, fluorescent labeling of an oligonucleotide at its 5'-end initially involved two steps. First, an N-protected aminoalkyl phosphoramidite derivative is added to the 5'-end of an oligonucleotide during automated nucleic acid synthesis. After removal of all protecting groups, the NHS ester of an appropriate fluorescent dye is coupled to the 5'-amino group overnight, followed by purification of the labeled oligonucleotide from the excess of dye using reverse phase HPLC or PAGE. Schubert et al. (1990) described the synthesis of a phosphoramidite that enables oligonucleotides labeled with fluorescein to be produced during automated DNA synthesis.

Murakami et al. also described the preparation of fluorescein-labeled oligonucleotides.

Cate et al. (1991) describe the use of oligonucleotide probes directly conjugated to alkaline phosphatase in combination with a direct chemiluminescent substrate (AMPPD) to allow probe detection.

Labeled probes could readily be purchased from a variety of commercial sources, including GENSET, rather than synthesized.

Other labels include ligands which can serve as specific binding members to a labeled antibody, chemiluminescers, enzymes, antibodies which can serve as a specific binding pair member for a labeled ligand, and the like. A wide variety of labels have been employed in immunoassays which can readily be employed. Still other labels include antigens, groups with specific reactivity, and electrochemically detectable moieties.

In general, labeling of nucleic acids with electrophore mass labels ("EML") is described, for example, in Xu et al., J. Chromatography 764:95-102 (1997). Electrophores are compounds that can be detected with high sensitivity by electron capture mass spectrometry (EC-MS). EMLs can be attached to a probe using chemistry that is well known in the art for reversibly modifying a nucleotide (e.g., well known nucleotide synthesis chemistry teaches a variety of methods for attaching molecules to nucleotides as protecting groups). EMLs are detected using a variety of well known electron capture mass spectrometry devices (e.g., devices sold by Finnigan Corporation). Further, techniques that may be used in the detection of EMLs include, for example, fast atomic bombardment mass spectrometry (see, e.g., Koster et al., Biomedical Environ. Mass Spec. 14:111-116 (1987)); plasma desorption mass spectrometry; electrospray/ionspray (see, e.g., Fenn et al., J. Phys. Chem. 88:4451-59 (1984), PCT Appln. No. WO 90/14148, Smith et al., Anal. Chem. 62:882-89 (1990)); and matrix-assisted laser desorption/ionization (Hillenkamp et al., "Matrix Assisted UV-Laser Desorption/Ionization: A New Approach to Mass Spectrometry of Large Biomolecules," Biological Mass Spectrometry (Burlingame and McCloskey, eds.), Elsevier Science Publishers, Amsterdam, pp. 49-60, 1990; Huth-Fehre et al., "Matrix Assisted Laser Desorption Mass Spectrometry of Oligodeoxythymidylic Acids," Rapid Communications in Mass Spectrometry, 6:209-13 (1992)). In preferred embodiments, the EMLs are attached to a probe by a covalent bond that is light sensitive. The EML is released from the probe after hybridization with a target nucleic acid by a laser or other light source emitting the desired wavelength of light. The EML is then fed into a GC-MS (gas chromatograph-mass spectrometer) or other appropriate device, and identified by its mass.

EXAMPLE 5

Preparation of Sequencing Chips and Arrays

A basic example is using 6-mers attached to 50 micron surfaces to give a chip with dimensions of 3 x 3 mm which can be combined to give an array of 20 x 20 cm. Another example is using 9-mer oligonucleotides attached to 10 x 10 micron surfaces to create a 9-mer chip, with dimensions of 5 x 5 mm. 4000 units of such chips may be used to create a 30 x 30 cm array. In another example, 4,000 to 16,000 oligochips are arranged into a square array. A plate, or collection of tubes, as also depicted, may be packaged with the array as part of the sequencing kit.

The arrays may be separated physically from each other or by hydrophobic surfaces. One possible way to utilize the hydrophobic strip separation is to use technology such as the Iso-Grid Microbiology System produced by QA Laboratories, Toronto, Canada.

Hydrophobic grid membrane filters (HGMF) have been in use in analytical food microbiology for about a decade where they exhibit unique attractions of extended numerical range and automated counting of colonies. One commercially-available grid is ISO-GRID™ from QA Laboratories Ltd. (Toronto, Canada) which consists of a square (60 x 60 cm) of polysulfone polymer (Gelman Tuffryn HT-450, 0.45 μ pore size) on which is printed a black hydrophobic ink grid consisting of 1600 (40 x 40) square cells. HGMF have previously been inoculated with bacterial suspensions by vacuum filtration and incubated on the differential or selective media of choice.

Because the microbial growth is confined to grid cells of known position and size on the membrane, the HGMF functions more like an MPN apparatus than a conventional plate or membrane filter. Peterkin et al. (1987) reported that these HGMFs can be used to propagate and store genomic libraries when used with a HGMF replicator. One such instrument replicates growth from each of the 1600 cells of the ISO-GRID and enables many copies of the master HGMF to be made (Peterkin et al., 1987). Sharpe et al. (1989) also used ISO-GRID HGMF from QA Laboratories and an automated HGMF counter (MI-100 Interpreter) and RP-100 Replicator. They reported a technique for maintaining and screening many microbial cultures.

Peterkin and colleagues later described a method for screening DNA probes using the hydrophobic grid-membrane filter (Peterkin et al., 1989). These authors reported methods for effective colony hybridization directly on HGMFs. Previously, poor results had been obtained due to the low DNA binding capacity of the epoxysulfone polymer on which the HGMFs are printed. However, Peterkin et al. (1989) reported that the binding of DNA to the surface of the membrane was improved by treating the replicated and incubated HGMF with polyethyleneimine, a polycation, prior to contact with DNA. Although this early work uses cellular DNA attachment, and has a different objective to the present invention, the methodology described may be readily adapted for Format 3 SBH.

In order to identify useful sequences rapidly, Peterkin et al. (1989) used radiolabeled plasmid DNA from various clones and tested its specificity against the DNA on the prepared HGMFs. In this way, DNA from recombinant plasmids was rapidly screened by colony hybridization against 100 organisms on HGMF replicates which can be easily and reproducibly prepared.

Two difficulties are manipulation of small (2-3 mm) chips and parallel execution of thousands of reactions. The solution of the invention is to keep the chips and the probes in the corresponding arrays. In one example, chips containing 250,000 9-mers are synthesized on a silicon wafer in the form of 8 x 8 mm plates (15 μm/oligonucleotide, Pease et al., 1994) arrayed in 8 x 12 format (96 chips) with a 1 mm groove in between. Probes are added either by multichannel pipette or pin array, one probe on one chip. To score all 4000 6-mers, 42 chip arrays have to be used, either using different ones, or by reusing one set of chip arrays several times. In the above case, using the earlier nomenclature of the application, F=9; P=6; and F + P = 15. Chips may have probes of formula BxNn, where x is a number of specified bases B; and n is a number of non-specified bases, so that x = 4 to 10 and n = 1 to 4. To achieve more efficient hybridization, and to avoid potential influence of any support oligonucleotides, the specified bases can be surrounded by unspecified bases, thus represented by a formula such as (N)nBx(N)m (FIG. 4).
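The counting behind these figures can be checked with a few lines of arithmetic. This is a rough sketch with hypothetical helper names; it assumes one probe per chip and 96 chips per array, as in the example above. Note that the complete set of 6-mers is 4⁶ = 4096, which the text rounds to 4000; with the exact figure one more array would be needed.

```python
import math


def specified_sequences(x):
    """Number of distinct probes with x specified bases (B)x over A, C, G, T."""
    return 4 ** x


def degenerate_pool_size(n, m=0):
    """Number of individual sequences mixed in one (N)n(B)x(N)m probe,
    since each unspecified position N may be any of the four bases."""
    return 4 ** (n + m)


def chip_arrays_needed(n_probes, chips_per_array=96):
    """Arrays required when each chip receives exactly one probe."""
    return math.ceil(n_probes / chips_per_array)


if __name__ == "__main__":
    print(specified_sequences(9))      # complete set of 9-mers on a chip
    print(specified_sequences(6))      # complete set of 6-mer probes
    print(chip_arrays_needed(4000))    # rounded figure used in the text
    print(chip_arrays_needed(4096))    # exact number of 6-mers
    print(degenerate_pool_size(2, 2))  # e.g. (N)2(B)x(N)2
```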

In another embodiment of the chips, the substrate which supports the array of oligonucleotide probes is partitioned into sections so that each probe in the array is separated from adjacent probes by a physical barrier which may be, for example, a hydrophobic material. In a preferred embodiment, the physical barrier has a width of from 300 μm to 30 μm, and the distance between the center of each physical barrier to the center of adjacent physical barriers is at least 325 μm.

In a preferred embodiment, a hydrophobic material is deposited onto the substrate to form barriers of the desired width using an ink-jet head, coupled to an appropriate robotic system. For example, a microdrop dosing head that has been adapted to apply a suspension or solution of a desired hydrophobic material (e.g., an oil based material that forms a barrier after the solvent has evaporated) may be coupled with an Anorad gantry system and fitted to an appropriate housing and dispensing system so that a grid of the hydrophobic material may be applied onto the desired substrate, forming a plurality of wells on the substrate. After the grid of hydrophobic material has been formed, different probes are spotted onto each well (or mixtures of probes may be applied to each well) using a robotic system similar to that used to form the grid, but that has been adapted to apply solutions or suspensions of probes. In one embodiment, the same robotic system is used to apply the hydrophobic grid and the probes. In this embodiment, the dispensing system is flushed after the hydrophobic grid is applied and then primed for delivery of probe.

EXAMPLE 6

Preparation of Support Bound Oligonucleotides

Oligonucleotides, i.e., small nucleic acid segments, may be readily prepared by, for example, directly synthesizing the oligonucleotide by chemical means, as is commonly practiced using an automated oligonucleotide synthesizer. In general, oligonucleotides may be bound to a support through appropriate reactive groups.

Such groups are well known in the art and include, for example, amino (-NH₂); hydroxyl (-OH); or carboxyl (-CO₂H) groups. Support bound oligonucleotides may be prepared by any of the methods known to those of skill in the art using any suitable support such as glass, polystyrene or Teflon. One strategy is to precisely spot oligonucleotides synthesized by standard synthesizers. Immobilization can be achieved by many methods, including, for example, using passive adsorption (Inouye & Hondo, 1990); using UV light (Nagata et al., 1985; Dahlen et al., 1987; Morrissey & Collins, 1989); or by covalent binding of base modified DNA (Keller et al., 1988; 1989); or by formation of amide groups between the probe and the support (Wall et al., 1995; Chebab et al., 1992; and Zhang et al., 1991); all references being specifically incorporated herein.

Another strategy that may be employed is the use of the strong biotin-streptavidin interaction as a linker. For example, Broude et al. (1994) describe the use of biotinylated probes, although these are duplex probes, that are immobilized on streptavidin-coated magnetic beads. Streptavidin-coated beads may be purchased from Dynal, Oslo. Of course, this same linking chemistry is applicable to coating any surface with streptavidin. Biotinylated probes may be purchased from various sources, such as, e.g., Operon Technologies (Alameda, CA). Nunc Laboratories (Naperville, IL) is also selling suitable material that could be used. Nunc Laboratories have developed a method by which DNA can be covalently bound to the microwell surface termed CovaLink NH. CovaLink NH is a polystyrene surface grafted with secondary amino groups (>NH) that serve as bridge-heads for further covalent coupling. CovaLink Modules may be purchased from Nunc Laboratories. DNA molecules may be bound to CovaLink exclusively at the 5'-end by a phosphoramidate bond, allowing immobilization of more than 1 pmol of DNA (Rasmussen et al., 1991).

The use of CovaLink NH strips for covalent binding of DNA molecules at the 5'-end has been described (Rasmussen et al., 1991). In this technology, a phosphoramidate bond is employed (Chu et al., 1983). This is beneficial as immobilization using only a single covalent bond is preferred. The phosphoramidate bond joins the DNA to the CovaLink NH secondary amino groups that are positioned at the end of spacer arms covalently grafted onto the polystyrene surface through a 2 nm long spacer arm. To link an oligonucleotide to CovaLink NH via a phosphoramidate bond, the oligonucleotide terminus must have a 5'-end phosphate group. It is, perhaps, even possible for biotin to be covalently bound to CovaLink and then streptavidin used to bind the probes.

More specifically, the linkage method includes dissolving DNA in water (7.5 ng/μl) and denaturing for 10 min. at 95°C and cooling on ice for 10 min. Ice-cold 0.1 M 1-methylimidazole, pH 7.0 (1-MeIm₇), is then added to a final concentration of 10 mM 1-MeIm₇. The ss DNA solution is then dispensed into CovaLink NH strips (75 μl/well) standing on ice. Carbodiimide, 0.2 M 1-ethyl-3-(3-dimethylaminopropyl)-carbodiimide (EDC), dissolved in 10 mM 1-MeIm₇, is made fresh and 25 μl added per well. The strips are incubated for 5 hours at 50°C. After incubation the strips are washed using, e.g., Nunc-Immuno Wash; first the wells are washed 3 times, then they are soaked with washing solution for 5 min., and finally they are washed 3 times (wherein the washing solution is 0.4 N NaOH, 0.25% SDS heated to 50°C).

It is contemplated that a further suitable method for use with the present invention is that described in PCT Patent Application WO 90/03382 (Southern & Maskos), incorporated herein by reference. This method of preparing an oligonucleotide bound to a support involves attaching a nucleoside 3'-reagent through the phosphate group by a covalent phosphodiester link to aliphatic hydroxyl groups carried by the support. The oligonucleotide is then synthesized on the supported nucleoside and protecting groups removed from the synthetic oligonucleotide chain under standard conditions that do not cleave the oligonucleotide from the support. Suitable reagents include nucleoside phosphoramidite and nucleoside hydrogen phosphorate.

An on-chip strategy for the preparation of DNA probe arrays may be employed. For example, addressable laser-activated photodeprotection may be employed in the chemical synthesis of oligonucleotides directly on a glass surface, as described by Fodor et al. (1991), incorporated herein by reference. Probes may also be immobilized on nylon supports as described by Van Ness et al. (1991); or linked to Teflon using the method of Duncan & Cavalier (1988); all references being specifically incorporated herein.

To link an oligonucleotide to a nylon support, as described by Van Ness et al. (1991), requires activation of the nylon surface via alkylation and selective activation of the 5'-amine of oligonucleotides with cyanuric chloride.

One particular way to prepare support bound oligonucleotides is to utilize the light-generated synthesis described by Pease et al. (1994, incorporated herein by reference). These authors used current photolithographic techniques to generate arrays of immobilized oligonucleotide probes (DNA chips). These methods, in which light is used to direct the synthesis of oligonucleotide probes in high-density, miniaturized arrays, utilize photolabile 5'-protected N-acyl-deoxynucleoside phosphoramidites, surface linker chemistry and versatile combinatorial synthesis strategies. A matrix of 256 spatially defined oligonucleotide probes may be generated in this manner and then used in the advantageous Format 3 sequencing, as described herein. Of course, one could easily purchase a DNA chip, such as one of the light-activated chips described above, from a commercial source. In this regard, one may contact Affymetrix of Santa Clara, CA 95051, and Beckman.

In a preferred embodiment, the probes of the invention include an informational portion (the portion which hybridizes to the target nucleic acid and gives sequence information), a reactive group to be attached to the substrate (solid support), and randomized positions, i.e., any of the four bases may be found at these positions. A preferred probe has the sequence 5'-(T)₆-(N)₃-(B)₅, where T = thymine (binds to solid support), N = A, C, G, or T (randomized positions), and B = the five information positions of the probe (informational portion). In a most preferred embodiment, the probe may be bound to the support and a spacer moiety is found at the end of the probe or internal to the probe and 5' of (N)₃. The spacers may be comprised of atoms capable of forming at least two covalent bonds such as carbon, silicon, oxygen, sulfur, phosphorous, and the like, or may be comprised of molecules capable of forming at least two covalent bonds such as sugar-phosphate groups, amino acids, peptides, nucleosides, nucleotides, sugars, carbohydrates, aromatic rings, hydrocarbon rings, linear and branched hydrocarbons, and the like.

EXAMPLE 7

Preparation of Nucleic Acid Fragments

The nucleic acids to be sequenced may be obtained from any appropriate source, such as cDNAs, genomic DNA, chromosomal DNA, microdissected chromosome bands, cosmid or YAC inserts, and RNA, including mRNA without any amplification steps. For example, Sambrook et al. (1989) describes three protocols for the isolation of high molecular weight DNA from mammalian cells (p. 9.14-9.23).

Target nucleic acid fragments may be prepared as clones in M13, plasmid or lambda vectors and/or prepared directly from genomic DNA or cDNA by PCR or other amplification methods. Samples may be prepared or dispensed in multiwell plates. About 100-1000 ng of DNA samples may be prepared in 2-500 μl of final volume. Target nucleic acids prepared by PCR may be directly applied to a substrate for Format 1 SBH without purification. Once the target nucleic acids are fixed to the substrate, the substrate may be washed or directly annealed with probes.

The nucleic acids would then be fragmented by any of the methods known to those of skill in the art including, for example, using restriction enzymes as described at 9.24-9.28 of Sambrook et al. (1989), shearing by ultrasound, and NaOH treatment.

Low pressure shearing is also appropriate, as described by Schriefer et al. (1990, incorporated herein by reference). In this method, DNA samples are passed through a small French pressure cell at a variety of low to intermediate pressures. A lever device allows controlled application of low to intermediate pressures to the cell. The results of these studies indicate that low-pressure shearing is a useful alternative to sonic and enzymatic DNA fragmentation methods. One particularly suitable way for fragmenting DNA is contemplated to be that using the two base recognition endonuclease, CviJI, described by Fitzgerald et al. (1992). These authors described an approach for the rapid fragmentation and fractionation of DNA into particular sizes that they contemplated to be suitable for shotgun cloning and sequencing. The present inventor envisions that this will also be particularly useful for generating random, but relatively small, fragments of DNA for use in the present sequencing technology.

The restriction endonuclease CviJI normally cleaves the recognition sequence PuGCPy between the G and C to leave blunt ends. Atypical reaction conditions, which alter the specificity of this enzyme (CviJI**), yield a quasi-random distribution of DNA fragments from the small molecule pUC19 (2688 base pairs). Fitzgerald et al. (1992) quantitatively evaluated the randomness of this fragmentation strategy, using a CviJI** digest of pUC19 that was size fractionated by a rapid gel filtration method and directly ligated, without end repair, to a lacZ minus M13 cloning vector. Sequence analysis of 76 clones showed that CviJI** restricts PyGCPy and PuGCPu, in addition to PuGCPy sites, and that new sequence data is accumulated at a rate consistent with random fragmentation.
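The difference between the normal and relaxed (CviJI**) specificities described above can be illustrated with a small in silico digest. This is a sketch under simplifying assumptions: the sequence is treated as linear (pUC19 itself is circular), only one strand is scanned, and the function names and the test sequence are hypothetical.

```python
import re

# Pu = A or G, Py = C or T. Lookaheads are used so overlapping sites are all found.
NORMAL_SITE = re.compile(r"(?=[AG]GC[CT])")                   # PuGCPy
RELAXED_SITE = re.compile(r"(?=[ACGT]GC[CT]|[AG]GC[AG])")     # PuGCPy, PyGCPy, PuGCPu


def cut_positions(seq, relaxed=False):
    """Return 0-based cut indices; cleavage is between the G and the C,
    i.e. two bases after the start of each four-base site."""
    pattern = RELAXED_SITE if relaxed else NORMAL_SITE
    return [m.start() + 2 for m in pattern.finditer(seq.upper())]


def digest(seq, relaxed=False):
    """Split a linear sequence into the blunt-ended fragments of a complete digest."""
    bounds = [0] + cut_positions(seq, relaxed) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    example = "TTAGCTAACGCCTTTGCAAA"  # made-up sequence for illustration
    print(digest(example))                 # normal specificity
    print(digest(example, relaxed=True))   # relaxed (CviJI**) specificity
```

In this example the PyGCPu site (TGCA) is left uncut under both settings, reflecting that the text lists only PuGCPy, PyGCPy and PuGCPu as CviJI** sites.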

As reported in the literature, advantages of this approach compared to sonication and agarose gel fractionation include: smaller amounts of DNA are required (0.2-0.5 μg instead of 2-5 μg); and fewer steps are involved (no preligation, end repair, chemical extraction, or agarose gel electrophoresis and elution are needed). These advantages are also proposed to be of use when preparing DNA for sequencing by Format 3.

In a preferred embodiment, the "fragments" of the nucleic acid sample are prepared so that they cannot be ligated to each other. Such a pool of fragments may be obtained by treating the fragmented nucleic acids obtained by enzyme digestion or physical shearing with a phosphatase (e.g., calf intestinal phosphatase). Alternatively, nonligatable fragments of the sample nucleic acid may be obtained by using random primers (e.g., N₅-N₉, where N = A, G, T, or C), which have no phosphate at their 5'-ends, in a Sanger-dideoxy sequencing reaction with the sample nucleic acid. This will produce fragments of DNA that have a complementary sequence to the target nucleic acid and that are terminated in a dideoxy residue and which cannot be ligated to other fragments.

Irrespective of the manner in which the nucleic acid fragments are obtained or prepared, it is important to denature the DNA to give single stranded pieces available for hybridization. This is achieved by incubating the DNA solution for 2-5 minutes at 80-90°C. The solution is then cooled quickly to 2°C to prevent renaturation of the DNA fragments before they are contacted with the chip.

EXAMPLE 8

Preparation of DNA Arrays

Arrays may be prepared by spotting DNA samples on a support such as a nylon membrane. Spotting may be performed by using arrays of metal pins (the positions of which correspond to an array of wells in a microtiter plate) to repeatedly transfer about 20 nl of a DNA solution to a nylon membrane. By offset printing, a density of dots higher than the density of the wells is achieved. One to 25 dots may be accommodated in 1 mm², depending on the type of label used. By avoiding spotting in some preselected number of rows and columns, separate subsets (subarrays) may be formed. Samples in one subarray may be the same genomic segment of DNA (or the same gene) from different individuals, or may be different, overlapped genomic clones. Each of the subarrays may represent replica spotting of the same samples. In one example, a selected gene segment may be amplified from 64 patients. For each patient, the amplified gene segment may be in one 96-well plate (all 96 wells containing the same sample). A plate for each of the 64 patients is prepared. By using a 96-pin device, all samples may be spotted on one 8 x 12 cm membrane. Subarrays may contain 64 samples, one from each patient. Where the 96 subarrays are identical, the dot span may be 1 mm and there may be a 1 mm space between subarrays.

Another approach is to use membranes or plates (available from NUNC, Naperville, Illinois) which may be partitioned by physical spacers, e.g., a plastic grid molded over the membrane, the grid being similar to the sort of membrane applied to the bottom of multiwell plates, or hydrophobic strips. A fixed physical spacer is not preferred for imaging by exposure to flat phosphor-storage screens or x-ray films.

EXAMPLE 9

Hybridization and Scoring Process

Labeled probes may be mixed with hybridization buffer and pipetted, preferably by multichannel pipettes, to the subarrays. To prevent mixing of the probes between subarrays (if there are no hydrophobic strips or physical barriers imprinted in the membrane), a corresponding plastic, metal or ceramic grid may be firmly pressed to the membrane. Also, the volume of the buffer may be reduced to about 1 μl or less per mm². The concentration of the probes and hybridization conditions used may be as described previously except that the washing buffer may be quickly poured over the array of subarrays to allow fast dilution of probes and thus prevent significant cross-hybridization. For the same reason, a minimal concentration of the probes may be used and hybridization time extended to the maximal practical level.

For DNA detection and sequencing, knowledge of a "normal" sequence allows the use of the continuous stacking interaction phenomenon to increase the signal. In addition to the labeled probe, additional unlabeled probes which hybridize back to back with a labeled one may be added in the hybridization reaction. The amount of the hybrid may be increased several times. The probes may be connected by ligation. This approach may be important for resolving DNA regions forming "compressions".

In the case of radiolabeled probes, images of the filters may be obtained, preferably by phosphor-storage technology. Fluorescent labels may be scored by CCD cameras, confocal microscopy or otherwise. In order to properly scale and integrate data from different hybridization experiments, raw signals are normalized based on the amount of target in each dot. Differences in the amount of target DNA per dot may be corrected for by dividing signals of each probe by an average signal for all probes scored on one dot. The normalized signals may be scaled, usually from 1-100, to compare data from different experiments. Also, in each subarray, several control DNAs may be used to determine an average background signal in those samples which do not contain a full match target. For samples obtained from diploid (polyploid) sources, homozygotic controls may be used to allow recognition of heterozygotes in the samples.
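The normalization and scaling steps described above can be sketched as follows. This is only an illustration: the data layout (a dictionary of dots, each holding raw signals per probe) and the function names are assumptions, and the text does not specify how the 1-100 scaling is computed, so a linear min-max rescaling is assumed here.

```python
def normalize_by_dot(raw):
    """Divide each probe signal by the average signal of all probes scored
    on the same dot, correcting for the amount of target DNA per dot.

    raw: dict mapping dot id -> dict mapping probe id -> raw signal.
    """
    normalized = {}
    for dot, probe_signals in raw.items():
        mean = sum(probe_signals.values()) / len(probe_signals)
        if mean == 0:
            raise ValueError(f"dot {dot!r} has zero average signal")
        normalized[dot] = {p: s / mean for p, s in probe_signals.items()}
    return normalized


def scale_to_range(values, low=1.0, high=100.0):
    """Linearly rescale values onto [low, high] (assumed scaling scheme)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


if __name__ == "__main__":
    # Two dots carrying a tenfold different amount of the same target.
    raw = {"dot1": {"p1": 100.0, "p2": 300.0}, "dot2": {"p1": 10.0, "p2": 30.0}}
    norm = normalize_by_dot(raw)
    print(norm)  # both dots give the same normalized profile
    print(scale_to_range([0.5, 1.0, 1.5]))
```

After normalization the two dots give identical profiles even though one carries ten times more target, which is the purpose of the per-dot correction.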

EXAMPLE 10

Hybridization With Oligonucleotides

Oligonucleotides were either purchased from Genosys Inc., Houston, Texas or made on an Applied Biosystems 381A DNA synthesizer. Most of the probes used were not purified by HPLC or gel electrophoresis. For example, probes were designed to have both a single perfectly complementary target in interferon, an M13 clone containing a 921 bp Eco RI-Bgl II human β1-interferon fragment [Ohno and Taniguchi, Proc. Natl. Acad. Sci. 74: 4370-4374 (1981)], and at least one target with an end base mismatch in M13 vector itself.

End labeling of oligonucleotides was performed as described [Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1982)] in 10 µl containing T4 polynucleotide kinase (5 units, Amersham), γ-³²P-ATP (3.3 pM, 10 µCi, Amersham, 3000 Ci/mM) and oligonucleotide (4 pM, 10 ng). Specific activities of the probes were 2.5-5 X 10⁹ cpm/nM.

Single stranded DNA (2 to 4 µl in 0.5 M NaOH, 1.5 M NaCl) was spotted on a Gene Screen membrane wetted with the same solution, the filters were neutralized in 0.05 M Na₂HPO₄, pH 6.5, baked in an oven at 80°C for 60 min. and UV irradiated for 1 min. Then, the filters were incubated in hybridization solution (0.5 M Na₂HPO₄ pH 7.2, 7% sodium lauroyl sarcosine) for 5 min at room temperature and placed on the surface of a plastic Petri dish. A drop of hybridization solution (10 µl, 0.5 M Na₂HPO₄ pH 7.2, 7% sodium lauroyl sarcosine) with a ³²P end-labeled oligomer probe at 4 nM concentration was placed over 1-6 dots per filter, overlaid with a square piece of polyethylene (approximately 1 X 1 cm), and incubated in a moist chamber at the indicated temperatures for 3 hr. Hybridization was stopped by placing the filter in 6X SSC washing solution for 3 X 5 minutes at 0°C to remove unhybridized probe. The filter was either dried, or further washed for the indicated times and temperatures, and autoradiographed. For discrimination measurements, the dots were excised from the dried filters after autoradiography [a phosphoimager (Molecular Dynamics, Sunnyvale, California) may be used], placed in liquid scintillation cocktail and counted. The uncorrected ratio of cpms for IF and M13 dots is given as D.

The conditions reported herein allow hybridization with very short oligonucleotides but ensure discrimination between matched and mismatched oligonucleotides that are complementary to and therefore bind to a target nucleic acid. Factors which influence the efficient detection of hybridization of specific short sequences based on the degree of discrimination (D) between a perfectly complementary target and an imperfectly complementary target with a single mismatch in the hybrid are defined. In experimental tests, dot blot hybridization of twenty-eight probes that were 6 to 8 nucleotides in length to two M13 clones or to model oligonucleotides bound to membrane filters was accomplished. The principles guiding the experimental procedures are given below.

Oligonucleotide hybridization to filter bound target nucleic acids only a few nucleotides longer than the probe in conditions of probe excess is a pseudo-first order reaction with respect to target concentration. This reaction is defined by:

$$S_t / S_0 = e^{-k_h (OP)\, t}$$

Wherein S_t and S_0 are target sequence concentrations at time t and t_0, respectively. (OP) is probe concentration and t is time. The rate constant for hybrid formation, k_h, increases only slightly in the 0°C to 30°C range [Porschke and Eigen, J. Mol. Biol. 62: 361 (1971); Craig et al., J. Mol. Biol. 62: 383 (1971)]. Hybrid melting is a first order reaction with respect to hybrid concentration (here replaced by mass due to filter bound state) as shown in:

$$H_t / H_0 = e^{-k_m t}$$

In this equation, H_t and H_0 are hybrid concentrations at times t and t_0, respectively; k_m is a rate constant for hybrid melting which is dependent on temperature and salt concentration [Ikuta et al., Nucl. Acids Res. 15: 797 (1987); Porschke and Eigen, J. Mol. Biol. 62: 361 (1971); Craig et al., J. Mol. Biol. 62: 383 (1971)]. During hybridization, which is a strand association process, the back, melting, or strand dissociation, reaction takes place as well. Thus, the amount of hybrid formed in time is the result of forward and back reactions. The equilibrium may be moved towards hybrid formation by increasing probe concentration and/or decreasing temperature. However, during washing cycles in large volumes of buffer, the melting reaction is dominant and the back reaction hybridization is insignificant, since the probe is absent. This analysis indicates workable Short Oligonucleotide Hybridization (SOH) conditions can be varied for probe concentration or temperature.
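The two rate laws may be illustrated numerically with a minimal Python sketch, reading them as S_t/S_0 = exp(-k_h (OP) t) for hybrid formation and H_t/H_0 = exp(-k_m t) for hybrid melting. The function names and the rate constants used in the loop are hypothetical values chosen only to show the shape of the decay; they are not measurements from this work.

```python
import math


def target_fraction_unreacted(k_h, probe_conc, t):
    """Pseudo-first-order hybrid formation: S_t / S_0 = exp(-k_h * (OP) * t)."""
    return math.exp(-k_h * probe_conc * t)


def hybrid_fraction_remaining(k_m, t):
    """First-order hybrid melting during a wash: H_t / H_0 = exp(-k_m * t)."""
    return math.exp(-k_m * t)


def melting_half_time(k_m):
    """Wash time after which half of the hybrid has melted."""
    return math.log(2) / k_m


# Hypothetical melting rate constants (per minute) for a perfect and a mismatched hybrid
k_m_perfect, k_m_mismatch = 0.01, 0.10
for minutes in (0, 15, 30, 60):
    print(minutes,
          round(hybrid_fraction_remaining(k_m_perfect, minutes), 3),
          round(hybrid_fraction_remaining(k_m_mismatch, minutes), 3))
```

Because the probe is absent during washing, only the melting law applies there, and a mismatched hybrid with a larger k_m is depleted faster than the perfect hybrid at every wash time.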

D or discrimination is defined in equation four:

$$D = H_p(t_w) / H_i(t_w)$$

H_p(t_w) and H_i(t_w) are the amounts of hybrids remaining after a washing time, t_w, for the identical amounts of perfectly and imperfectly complementary duplex, respectively. For a given temperature, the discrimination D changes with the length of washing time and reaches the maximal value when H_i(t_w) = B, which is equation five.

The background, B, represents the lowest hybridization signal detectable in the system. Since any further decrease of H_i may not be examined, D increases upon continued washing. Washing past t_w just decreases H_p relative to B, and is seen as a decrease in D. The optimal washing time, t_w, for imperfect hybrids, from equation three and equation five is:

$$t_w = -\ln\left(B / H_i(t_0)\right) / k_{m,i}$$

Since H_p is being washed for the same t_w, combining equations, one obtains the optimal discrimination function:

$$D = \frac{H_p(t_0)}{B} \left(\frac{B}{H_i(t_0)}\right)^{k_{m,p}/k_{m,i}}$$

The change of D as a function of T is important because of the choice of an optimal washing temperature. It is obtained by substituting the Arrhenius equation which is:

$$k_m = A\, e^{-E_a / RT}$$

into the previous equation to form the final equation:

$$D = \frac{H_p(t_0)}{B} \left(\frac{B}{H_i(t_0)}\right)^{(A_p/A_i)\, e^{(E_{a,i} - E_{a,p})/RT}}$$

Wherein B is less than H_i(t_0).

Since the activation energy for perfect hybrids, E_{a,p}, and the activation energy for imperfect hybrids, E_{a,i}, can be either equal, or E_{a,i} less than E_{a,p}, D is temperature independent, or decreases with increasing temperature, respectively. This result implies that the search for stringent temperature conditions for good discrimination in SOH is unjustified. By washing at lower temperatures, one obtains equal or better discrimination, but the time of washing exponentially increases with the decrease of temperature. Discrimination more strongly decreases with T, if H_i(t_0) increases relative to H_p(t_0). D at lower temperatures depends to a higher degree on the H_p(t_0)/B ratio than on the H_p(t_0)/H_i(t_0) ratio. This result indicates that it is better to obtain a sufficient quantity of H_p in the hybridization regardless of the discrimination that can be achieved in this step. Better discrimination can then be obtained by washing, since the higher amounts of perfect hybrid allow more time for differential melting to show an effect. Similarly, using larger amounts of target nucleic acid a necessary discrimination can be obtained even with small differences between k_{m,p} and k_{m,i}.
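The temperature dependence discussed above may be explored with a minimal Python sketch. It assumes first-order melting of both hybrids, takes the optimal wash time as the time at which the imperfect hybrid reaches the background B, and uses the Arrhenius form for both melting rate constants. The pre-exponential factor, the activation energies, the starting hybrid amounts and the function names are all hypothetical and serve only to show the trend.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def melt_rate(A, Ea, T_kelvin):
    """Arrhenius melting rate constant k_m = A * exp(-Ea / (R*T))."""
    return A * math.exp(-Ea / (R_KCAL * T_kelvin))


def optimal_wash(Hp0, Hi0, B, k_mp, k_mi):
    """Return (t_w, D): t_w is the wash time at which the imperfect hybrid
    falls to background B, and D = H_p(t_w) / B at that time."""
    if not 0 < B < Hi0:
        raise ValueError("background B must be positive and below H_i(t_0)")
    t_w = -math.log(B / Hi0) / k_mi
    D = (Hp0 / B) * (B / Hi0) ** (k_mp / k_mi)
    return t_w, D


def wash_at(celsius, Ea_p, Ea_i, A=1e33, Hp0=100.0, Hi0=50.0, B=1.0):
    """Optimal wash time (minutes) and D at one temperature.
    Every default value here is hypothetical."""
    T = celsius + 273.15
    return optimal_wash(Hp0, Hi0, B, melt_rate(A, Ea_p, T), melt_rate(A, Ea_i, T))


for celsius in (7, 13, 25):
    t_w, D = wash_at(celsius, Ea_p=45.0, Ea_i=43.0)
    print(f"{celsius:>2} C: t_w = {t_w:.2f} min, D = {D:.1f}")
```

With the imperfect hybrid given the lower activation energy, the sketch reproduces both statements in the text: D falls as the wash temperature rises, and the optimal wash time grows steeply as the temperature is lowered. Setting the two activation energies equal makes D independent of temperature.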

Extrapolated to a more complex situation than covered in this simple model, the result is that washing at lower temperatures is even more important for obtaining discrimination in the case of hybridization of a probe having many end-mismatches within a given nucleic acid target. Using the described theoretical principles as a guide for experiments, reliable hybridizations have been obtained with probes six to eight nucleotides in length. All experiments were performed with a floating plastic sheet providing a film of hybridization solution above the filter. This procedure allows maximal reduction in the amount of probe, and thus reduced label costs in dot blot hybridizations. The high concentration of sodium lauroyl sarcosine instead of sodium lauroyl sulfate in the phosphate hybridization buffer allows dropping the reaction from room temperature down to 12°C. Similarly, the 4-6 X SSC, 10% sodium lauroyl sarcosine buffer allows hybridization at temperatures as low as 2°C. The detergent in these buffers is for obtaining tolerable background with up to 40 nM concentrations of labelled probe. Preliminary characterization of the thermal stability of short oligonucleotide hybrids was determined on a prototype octamer with 50% G+C content, i.e. probe of sequence TGCTCATG. The theoretical expectation is that this probe is among the less stable octamers. Its transition enthalpy is similar to those of more stable heptamers or even to probes 6 nucleotides in length [Breslauer et al., Proc. Natl. Acad. Sci. U.S.A. 83: 3746 (1986)]. Parameter T_d, the temperature at which 50% of the hybrid is melted in unit time of a minute, is 18°C. The result shows that T_d is 15°C lower for the 8 bp hybrid than for an 11 bp duplex [Wallace et al., Nucleic Acids Res. 6: 3543 (1979)].

In addition to experiments with model oligonucleotides, an M13 vector was chosen as a system for a practical demonstration of short oligonucleotide hybridization. The main aim was to show useful end-mismatch discrimination with a target similar to the ones which will be used in various applications of the method of the invention. Oligonucleotide probes for the M13 model were chosen in such a way that the M13 vector itself contains the end mismatched base. Vector IF, an M13 recombinant containing a 921 bp human interferon gene insert, carries a single perfectly matched target. Thus, IF has either the identical or a higher number of mismatched targets in comparison to the M13 vector itself.

Using low temperature conditions and dot blots, sufficient differences in hybridization signals were obtained between the dot containing the perfect and the mismatched targets and the dot containing the mismatched targets only. This was true for the 6-mer oligonucleotides and was also true for the 7- and 8-mer oligonucleotides hybridized to the large IF-M13 pair of nucleic acids.

The hybridization signal depends on the amount of target available on the filter for reaction with the probe. A necessary control is to show that the difference in signal intensity is not a reflection of varying amounts of nucleic acid in the two dots. Hybridization with a probe that has the same number and kind of targets in both IF and M13 shows that there is an equal amount of DNA in the dots. Since the efficiency of hybrid formation increases with hybrid length, the signal for a duplex having six nucleotides was best detected with a high mass of oligonucleotide target bound to the filter. Due to their lower molecular weight, a larger number of oligonucleotide target molecules can be bound to a given surface area when compared to large molecules of nucleic acid that serve as target.

To measure the sensitivity of detection with unpurified DNA, various amounts of phage supernatants were spotted on the filter and hybridized with a ³²P-labelled octamer. As little as 50 million unpurified phage containing no more than 0.5 ng of DNA gave a detectable signal, indicating that sensitivity of the short oligonucleotide hybridization method is sufficient. Reaction time is short, adding to the practicality.

As mentioned in the theoretical section above, the equilibrium yield of hybrid depends on probe concentration and/or temperature of reaction. For instance, the signal level for the same amount of target with 4 nM octamer at 13°C is 3 times lower than with a probe concentration of 40 nM, and is decreased 4.5-times by raising the hybridization temperature to 25°C.

The utility of the low temperature wash for achieving maximal discrimination is demonstrated. To make the phenomenon visually obvious, 50 times more DNA was put in the M13 dot than in the IF dot using hybridization with a vector specific probe. In this way, the signal after the hybridization step with the actual probe was made stronger in the mismatched than in the matched case. The H_p/H_i ratio was 1:4. Inversion of signal intensities after prolonged washing at 7°C was achieved without a massive loss of perfect hybrid, resulting in a ratio of 2:1. In contrast, it is impossible to achieve any discrimination at 25°C, since the matched target signal is already brought down to the background level with 2 minute washing; at the same time, the signal from the mismatched hybrid is still detectable. The loss of discrimination at 13°C compared to 7°C is not so great but is clearly visible. If one considers the 90 minute point at 7°C and the 15 minute point at 13°C, when the mismatched hybrid signal is near the background level and which represent optimal washing times for the respective conditions, it is obvious that the amount of perfect hybrid is several times greater at 7°C than at 13°C. To illustrate this further, the time course of the change in discrimination with washing of the same amount of starting hybrid at the two temperatures shows the higher maximal D at the lower temperature. These results confirm the trend in the change of D with temperature and the ratio of amounts of the two types of hybrid at the start of the washing step.

In order to show the general utility of the short oligonucleotide hybridization conditions, we have looked at hybridization of 4 heptamers, 10 octamers and an additional 14 probes up to 12 nucleotides in length in our simple M13 system. These include the nonamer GTTTTTTAA and octamer GGCAGGCG representing the two extremes of GC content. Although GC content and sequence are expected to influence the stability of short hybrids [Breslauer et al., Proc. Natl. Acad. Sci. U.S.A. 83: 3746 (1986)], the low temperature short oligonucleotide conditions were applicable to all tested probes in achieving sufficient discrimination. Since the best discrimination value obtained with probes 13 nucleotides in length was 20, a several fold drop due to sequence variation is easily tolerated.

The M13 system has the advantage of showing the effects of target DNA complexity on the levels of discrimination. For two octamers having either none or five mismatched targets and differing in only one GC pair the observed discriminations were 18.3 and 1.7, respectively. In order to show the utility of this method, three probes 8 nucleotides in length were tested on a collection of 51 plasmid DNA dots made from a library in Bluescript vector. One probe had a target that was present in and specific for the Bluescript vector but absent in M13, while the other two probes had targets that were inserts of known sequence. This system allowed the use of hybridization negative or positive control DNAs with each probe. This probe sequence (CTCCCTTT) also had a complementary target in the interferon insert. Since the M13 dot is negative while the interferon insert in either M13 or Bluescript was positive, the hybridization is sequence specific. Similarly, there were probes that detected the target sequence in only one of 51 inserts, or in none of the examined inserts, along with controls that confirm that hybridization would have occurred if the appropriate targets were present in the clones.

Thermal stability curves for very short oligonucleotide hybrids that are 6-8 nucleotides in length are at least 15°C lower than for hybrids 11-12 nucleotides in length [Fig. 1 and Wallace et al., Nucleic Acids Res. 6: 3543-3557 (1979)]. However, performing the hybridization reaction at a low temperature and with a very practical 0.4-40 nM concentration of oligonucleotide probe allows the detection of complementary sequence in a known or unknown nucleic acid target. To determine an unknown nucleic acid sequence completely, an entire set containing 65,536 8-mer probes may be used. Sufficient amounts of nucleic acid for this purpose are present in convenient biological samples such as a few microliters of M13 culture, a plasmid prep from 10 ml of bacterial culture or a single colony of bacteria, or less than 1 µl of a standard PCR reaction.
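The size of a complete probe set follows directly from the four-letter alphabet, and may be checked with a short Python sketch. The helper names are hypothetical; the only fact relied upon is that a complete set of probes of length k over A, C, G and T has 4^k members, which gives 65,536 for 8-mers.

```python
from itertools import product

BASES = "ACGT"


def probe_set_size(k):
    """Number of distinct probes of length k over the four bases."""
    return len(BASES) ** k


def all_probes(k):
    """Lazily generate every k-mer over A, C, G, T."""
    return ("".join(p) for p in product(BASES, repeat=k))


for k in (6, 7, 8):
    print(k, probe_set_size(k))
```

Generating the probes lazily avoids holding the whole set in memory when only a subset is to be selected for a given experiment.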

Short oligonucleotides 6-10 nucleotides long give excellent discrimination. The relative decrease in hybrid stability with a single end mismatch is greater than for longer probes. Results with the octamer TGCTCATG support this conclusion. In the experiments, the target had a G/T end mismatch, and hybridization to a target with this type of mismatch is the most stable of all mismatch types. The discrimination achieved is the same as or greater than that for an internal G/T mismatch in a 19 base paired duplex [Ikuta et al., Nucl. Acids Res. 15: 797 (1987)]. Exploiting these discrimination properties using the described hybridization conditions for short oligonucleotide hybridization allows a very precise determination of oligonucleotide targets. In contrast to the ease of detecting discrimination between perfect and imperfect hybrids, a problem that may exist with using very short oligonucleotides is the preparation of sufficient amounts of hybrids. In practice, the need to discriminate H_p and H_i is aided by increasing the amount of DNA in the dot and/or the probe concentration, or by decreasing the hybridization temperature. However, higher probe concentrations usually increase background. Moreover, there are limits to the amounts of target nucleic acid that are practical to use. This problem was solved by the higher concentration of the detergent Sarcosyl which gave an effective background with 4 nM of probe. Further improvements may be effected either in the use of competitors for unspecific binding of probe to filter, or by changing the hybridization support material. Moreover, for probes having E_a less than 45 Kcal/mol (e.g. for many heptamers and a majority of hexamers), modified oligonucleotides give a more stable hybrid [Asseline et al., Proc. Nat'l Acad. Sci. 81: 3297 (1984)] than their unmodified counterparts.

The hybridization conditions described in this invention for short oligonucleotide hybridization using low temperatures give better discrimination for all sequences and duplex hybrid inputs. The only price paid in achieving uniformity in hybridization conditions for different sequences is an increase in washing time from minutes to up to 24 hours depending on the sequence. Moreover, the washing time can be further reduced by decreasing the salt concentration.

Although there is excellent discrimination of one matched hybrid over mismatched hybrids in short oligonucleotide hybridization, signals from mismatched hybrids exist, with the majority of the mismatch hybrids resulting from end mismatch. This may limit insert sizes that may be effectively examined by a probe of a certain length.
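The dependence of end-mismatch background on insert size may be estimated with a rough Python sketch. It rests on assumptions that are not part of the experiments: the insert is treated as a random sequence with equal base frequencies, only one strand is counted, and only sites differing from the probe target at a single terminal base are considered (three alternative bases at each of the two ends). The function name is hypothetical.

```python
def expected_end_mismatch_sites(probe_len, insert_len):
    """Expected number of sites in a random insert that match a probe target at
    every position except one terminal base (3 alternatives at each of 2 ends)."""
    positions = insert_len - probe_len + 1   # windows of probe length in the insert
    variants = 2 * 3                         # single end-mismatch sequences per probe
    return positions * variants / 4 ** probe_len


# Probe lengths paired with the insert sizes suggested in the following paragraph
for probe_len, insert_len in ((6, 600), (7, 2500), (8, 10000)):
    print(probe_len, insert_len,
          round(expected_end_mismatch_sites(probe_len, insert_len), 2))
```

Under these assumptions, inserts of 0.6, 2.5 and 10 kb for probes of 6, 7 and 8 nucleotides each give an expectation of slightly less than one end-mismatched site per probe, which is one way to read the probe to target length ratios suggested in the following paragraph.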

The influence of sequence complexity on discrimination cannot be ignored. However, the complexity effects are more significant when defining sequence information by short oligonucleotide hybridization for specific, nonrandom sequences, and can be overcome by using an appropriate probe to target length ratio. The length ratio is chosen to make unlikely, on statistical grounds, the occurrence of specific sequences which have a number of end-mismatches which would be able to eliminate or falsely invert discrimination. Results suggest the use of oligonucleotides 6, 7, and 8 nucleotides in length on target nucleic acid inserts shorter than 0.6, 2.5, and 10 kb, respectively.

EXAMPLE 11

DNA Sequencing

An array of subarrays allows for efficient sequencing of a small set of samples arrayed in the form of replicated subarrays. For example, 64 samples may be arrayed on a 8 X 8 mm subarray and 16 X 24 subarrays may be replicated on a 15 X 23 cm membrane with 1 mm wide spacers between the subarrays. Several replica membranes may be made. For example, probes from a universal set of three thousand seventy-two 7-mers may be divided in thirty-two 96-well plates and labelled by kinasing. Four membranes may be processed in parallel during one hybridization cycle. On each membrane, 384 probes may be scored. All probes may be scored in two hybridization cycles. Hybridization intensities may be scored and the sequence assembled as described below. If a single sample subarray or subarrays contains several unknowns, especially when similar samples are used, a smaller number of probes may be sufficient if they are intelligently selected on the basis of results of previously scored probes. For example, if probe AAAAAAA is not positive, there is a small chance that any of 8 overlapping probes are positive. If AAAAAAA is positive, then two probes are usually positive. The sequencing process in this case consists of first hybridizing a subset of minimally overlapped probes to define positive anchors and then successively selecting probes which confirm one of the most likely hypotheses about the order of anchors and size and type of gaps between them. In this second phase, pools of 2-10 probes may be used where each probe is selected to be positive in only one DNA sample which is different from the samples expected to be positive with other probes from the pool.
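The layout and throughput figures in this example may be checked with a short Python sketch. The function names are hypothetical, and the calculation assumes that each subarray on a membrane receives one probe per hybridization cycle, which is how the figure of 384 probes per membrane is read here.

```python
import math


def subarray_footprint_mm(n_rows, n_cols, subarray_mm=8, spacer_mm=1):
    """Membrane area (mm) taken up by a grid of subarrays separated by spacers."""
    pitch = subarray_mm + spacer_mm
    return n_rows * pitch, n_cols * pitch


def hybridization_cycles(n_probes, subarrays_per_membrane, membranes_per_cycle):
    """Cycles needed when each subarray is hybridized with one probe per cycle."""
    per_cycle = subarrays_per_membrane * membranes_per_cycle
    return math.ceil(n_probes / per_cycle)


subarrays = 16 * 24                      # replicated subarrays on one membrane
n_probes = 32 * 96                       # thirty-two 96-well plates of 7-mers
print(subarray_footprint_mm(16, 24))     # to be compared with a 150 x 230 mm membrane
print(subarrays, n_probes, hybridization_cycles(n_probes, subarrays, 4))
```

The grid of 16 X 24 subarrays at a 9 mm pitch fits within the 15 X 23 cm membrane, and four membranes of 384 subarrays score the 3072 probes in two cycles, in agreement with the figures given.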

The subarray approach allows efficient implementation of probe competition (overlapped probes) or probe cooperation (continuous stacking of probes) in solving branching problems. After hybridization of a universal set of probes the sequence assembly program determines candidate sequence subfragments (SFs). For the further assembly of SFs, additional information has to be provided (from overlapped sequences of DNA fragments, similar sequences, single pass gel sequences, or from other hybridization or restriction mapping data). Primers for single pass gel sequencing through the branch points are identified from the SBH sequence information or from known vector sequences, e.g., the flanking sequences to the vector insert site, and standard Sanger sequencing reactions are performed on the sample DNA. The sequence obtained from this single pass gel sequencing is compared to the SFs that read into and out of the branch points to identify the order of the SFs. Further, single pass gel sequencing may be combined with SBH to de novo sequence or re-sequence a nucleic acid.

Competitive hybridization and continuous stacking interactions can also be used to assemble SFs. These approaches are of limited commercial value for sequencing of large numbers of samples by SBH wherein a labelled probe is applied to a sample affixed to an array if a uniform array is used. Fortunately, analysis of small numbers of samples using replica subarrays allows efficient implementation of both approaches. On each of the replica subarrays, one branching point may be tested for one or more DNA samples using pools of probes similarly as in solving mutated sequences in different samples spotted in the same subarray (see above). If in each of 64 samples described in this example, there are about 100 branching points, and if 8 samples are analyzed in parallel in each subarray, then at least 800 subarray probings solve all branches. This means that for the 3072 basic probings an additional 800 probings (25%) are employed. More preferably, two probings are used for one branching point. If the subarrays are smaller, fewer additional probings are used. For example, if subarrays consist of 16 samples, 200 additional probings may be scored (6%). By using 7-mer probes (N^^ ^) and competitive or collaborative branching solving approaches or both, fragments of about 1000 bp may be assembled by about 4000 probings. Furthermore, using 8-mer probes (NB₈N) 4 kb or longer fragments may be assembled with 12,000 probings. Gapped probes, for example, NB₄NB₃N or NB₄NB₄N may be used to reduce the number of branching points.

EXAMPLE 12

DNA Analysis by Transient Attachment to Subarrays of Probes and Ligation of Labelled Probes

Oligonucleotide probes having an informative length of four to 40 bases are synthesized by standard chemistry and stored in tubes or in multiwell plates. Specific sets of probes comprising one to 10,000 probes are arrayed by deposition or in situ synthesis on separate supports or distinct sections of a larger support. In the last case, sections or subarrays may be separated by physical or hydrophobic barriers. The probe arrays may be prepared by in situ synthesis. A sample DNA of appropriate size is hybridized with one or more specific arrays. Many samples may be interrogated as pools at the same subarrays or independently with different subarrays within one support. Simultaneously with the sample or subsequently, a single labelled probe or a pool of labelled probes is added on each of the subarrays. If attached and labelled probes hybridize back to back on the complementary target in the sample DNA they are ligated. Occurrence of ligation will be measured by detecting a label from the probe.

This procedure is a variant of the described DNA analysis process in which DNA samples are not permanently attached to the support. Transient attachment is provided by probes fixed to the support. In this case there is no need for a target DNA arraying process. In addition, ligation allows detection of longer oligonucleotide sequences by combining short labelled probes with short fixed probes.

The process has several unique features. Basically, the transient attachment of the target allows its reuse. After ligation occurs the target may be released and the label will stay covalently attached to the support. This feature allows cycling the target and production of detectable signal with a small quantity of the target. Under optimal conditions, targets do not need to be amplified, e.g. natural sources of the DNA samples may be directly used for diagnostics and sequencing purposes. Targets may be released by cycling the temperature between efficient hybridization and efficient melting of duplexes. More preferably, there is no cycling. The temperature and concentrations of components may be defined to have an equilibrium between free targets and targets entered in hybrids at about 50:50% level. In this case there is a continuous production of ligated products. For different purposes different equilibrium ratios are optimal.

An electric field may be used to enhance target use. At the beginning, a horizontal field pulsing within each subarray may be employed to provide for faster target sorting. In this phase, the equilibrium is moved toward hybrid formation, and unlabelled probes may be used. After a target sorting phase, an appropriate washing (which may be helped by a vertical electric field for restricting movement of the samples) may be performed. Several cycles of discriminative hybrid melting, target harvesting by hybridization and ligation and removing of unused targets may be introduced to increase specificity. In the next step, labelled probes are added and vertical electrical pulses may be applied. By increasing temperature, an optimal free and hybridized target ratio may be achieved. The vertical electric field prevents diffusion of the sorted targets.

The subarrays of fixed probes and sets of labelled probes (specially designed or selected from a universal probe set) may be arranged in various ways to allow an efficient and flexible sequencing and diagnostics process. For example, if a short fragment (about 100-500 bp) of a bacterial genome is to be partially or completely sequenced, small arrays of probes (5-30 bases in length) designed on the basis of known sequence may be used. If interrogated with a different pool of 10 labelled probes per subarray, an array of 10 subarrays each having 10 probes allows checking of 200 bases, assuming that only two bases connected by ligation are scored. Under the conditions where mismatches are discriminated throughout the hybrid, probes may be displaced by more than one base to cover the longer target with the same number of probes. By using long probes, the target may be interrogated directly without amplification or isolation from the rest of DNA in the sample. Also, several targets may be analyzed (screened for) in one sample simultaneously. If the obtained results indicate occurrence of a mutation (or a pathogen), additional pools of probes may be used to detect type of the mutation or subtype of pathogen. This is a desirable feature of the process which may be very cost effective in preventive diagnosis where only a small fraction of patients is expected to have an infection or mutation.

In the processes described in the examples, various detection methods may be used, for example, radiolabels, fluorescent labels, enzymes or antibodies (chemiluminescence), large molecules or particles detectable by light scattering or interferometric procedures.

EXAMPLE 13

Sequencing a Target Using Octamers and Nonamers

Data resulting from the hybridization of octamer and nonamer oligonucleotides shows that sequencing by hybridization provides an extremely high degree of accuracy. In this experiment, a known sequence was used to predict a series of contiguous overlapping component octamer and nonamer oligonucleotides.

In addition to the perfectly matching oligonucleotides, mismatch oligonucleotides wherein internal or end mismatches occur in the duplex formed by the oligonucleotide and the target were examined. In these analyses, the lowest practical temperature was used to maximize hybridization formation. Washes were accomplished at the same or lower temperatures to ensure maximal discrimination by utilizing the greater dissociation rate of mismatch versus matched oligonucleotide/target hybridization. These conditions are shown to be applicable to all sequences although the absolute hybridization yield is shown to be sequence dependent. The least destabilizing mismatch that can be postulated is a simple end mismatch, so that the test of sequencing by hybridization is the ability to discriminate perfectly matched oligonucleotide/target duplexes from end-mismatched oligonucleotide/target duplexes.

The discriminative values for 102 of 105 hybridizing oligonucleotides in a dot blot format were greater than 2, allowing a highly accurate generation of the sequence. This system also allowed an analysis of the effect of sequence on hybridization formation and hybridization instability.

One hundred base pairs of a known portion of a human interferon gene prepared by PCR, i.e. a 100 bp target sequence, was generated with data resulting from the hybridization of 105 oligonucleotide probes of known sequence to the target nucleic acid. The oligonucleotide probes used included 72 octamer and 21 nonamer oligonucleotides whose sequence was perfectly complementary to the target. The set of 93 probes provided consecutive overlapping frames of the target sequence displaced by one or two bases.
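The idea of consecutive overlapping frames may be illustrated by a minimal Python sketch that lists the windows of a target and the probes complementary to them. The function names and the 20-base example sequence are hypothetical; the sketch does not reproduce the actual probe set, which mixed octamers and nonamers displaced by one or two bases.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Sequence of the strand that pairs with `seq`, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def overlapping_frames(target, k, step=1):
    """All k-base windows of `target`, each displaced by `step` bases."""
    return [target[i:i + k] for i in range(0, len(target) - k + 1, step)]


def probes_for(target, k, step=1):
    """Probes perfectly complementary to each frame of the target."""
    return [reverse_complement(frame) for frame in overlapping_frames(target, k, step)]


# Hypothetical 20-base target, not the interferon sequence used in this example
target = "ATGACCAACAAGTGTCTCCT"
print(overlapping_frames(target, 8)[:3])
print(probes_for(target, 8)[:3])
```

Adjacent frames displaced by one base share all but one position, which is what allows the target sequence to be read out from the set of positive probes.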

To evaluate the effect of mismatches, hybridization was examined for 12 additional probes that contained at least one end mismatch when hybridized to the 100 bp test target sequence. Also tested was the hybridization of twelve probes with target end-mismatched to four other control nucleic acid sequences chosen so that the 12 oligonucleotides formed perfectly matched duplex hybrids with the four control DNAs. Thus, the hybridization of internal mismatched, end-mismatched and perfectly matched duplex pairs of oligonucleotide and target were evaluated for each oligonucleotide used in the experiment. The effect of absolute DNA target concentration on the hybridization with the test octamer and nonamer oligonucleotides was determined by defining target DNA concentration by detecting hybridization of a different oligonucleotide probe to a single occurrence non-target site within the co-amplified plasmid DNA.

The results of this experiment showed that all oligonucleotides containing perfect matching complementary sequence to the target or control DNA hybridized more strongly than those oligonucleotides having mismatches. To come to this conclusion, we examined H_p and D values for each probe. H_p defines the amount of hybrid duplex formed between a test target and an oligonucleotide probe. By assigning values of between 0 and 10 to the hybridization obtained for the 105 probes, it was apparent that 68.5% of the 105 probes had an H_p greater than 2.

Discrimination (D) values were obtained where D was defined as the ratio of signal intensities between 1) the dot containing a perfect matched duplex formed between test oligonucleotide and target or control nucleic acid and 2) the dot containing a mismatch duplex formed between the same oligonucleotide and a different site within the target or control nucleic acid. Variations in the value of D result from either 1) perturbations in the hybridization efficiency which allows visualization of signal over background, or 2) the type of mismatch found between the test oligonucleotide and the target. The D values obtained in this experiment were between 2 and 40 for 102 of the 105 oligonucleotide probes examined. Calculations of D for the group of 102 oligonucleotides as a whole showed the average D was 10.6.
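The calculation of D values and of their average may be sketched in Python as follows. The function names, the threshold argument and the signal values are hypothetical and are given only to show the arithmetic; they are not the data of this experiment.

```python
def discrimination(perfect_signal, mismatch_signal):
    """D as the ratio of the perfect-match dot signal to the mismatch dot signal."""
    return perfect_signal / mismatch_signal


def summarize(d_values, threshold=2.0):
    """Count the probes with D above the threshold and average D over that group."""
    above = [d for d in d_values if d > threshold]
    if not above:
        return 0, None
    return len(above), sum(above) / len(above)


# Hypothetical (perfect, mismatch) signal pairs for four probes
pairs = [(900.0, 100.0), (400.0, 100.0), (150.0, 100.0), (2000.0, 100.0)]
d_values = [discrimination(p, m) for p, m in pairs]
print(d_values)
print(summarize(d_values))
```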

There were 20 cases where oligonucleotide target duplexes exhibited an end-mismatch. In five of these, D was greater than 10. The large D value in these cases is most likely due to hybridization destabilization caused by other than the most stable (G/T and G/A) end mismatches. The other possibility is there was an error in the sequence of either the oligonucleotides or the target.

Error in the target for probes with low H_p was excluded as a possibility because such an error would have affected the hybridization of each of the other right overlapping oligonucleotides. There was no apparent instability due to sequence mismatch for the other overlapping oligonucleotides, indicating the target sequence was correct. Error in the oligonucleotide sequence was excluded as a possibility after the hybridization of seven newly synthesized oligonucleotides was re-examined. Only 1 of the seven oligonucleotides resulted in a better D value. Low hybrid formation values may result from hybrid instability or from an inability to form hybrid duplex. An inability to form hybrid duplexes would result from either 1) self complementarity of the chosen probe or 2) target/target self hybridization. Oligonucleotide/oligonucleotide duplex formation may be favored over oligonucleotide/target hybrid duplex formation if the probe was self-complementary. Similarly, target/target association may be favored if the target was self-complementary or may form internal palindromes. In evaluating these possibilities, it was apparent from probe analysis that the questionable probes did not form hybrids with themselves. Moreover, in examining the contribution of target/target hybridization, it was determined that one of the questionable oligonucleotide probes hybridized inefficiently with two different DNAs containing the same target. The low probability that two different DNAs have a self-complementary region for the same target sequence leads to the conclusion that target/target hybridization did not contribute to low hybridization formation. Thus, these results indicate that hybrid instability and not the inability to form hybrids was the cause of the low hybrid formation observed for specific oligonucleotides. The results also indicate that low hybrid formation is due to the specific sequences of certain oligonucleotides.

Moreover, the results indicate that reliable results may be obtained to generate sequences if octamer and nonamer oligonucleotides are used. These results show that, using the methods described, long sequences of any specific target nucleic acid may be generated by maximal and unique overlap of constituent oligonucleotides. Such sequencing methods are dependent on the content of the individual component oligomers regardless of their frequency and their position.

The sequence which is generated using the algorithm described below is of high fidelity. The algorithm tolerates false positive signals from the hybridization dots, as is indicated from the fact that the sequence generated from the 105 hybridization values, which included four less reliable values, was correct. This fidelity in sequencing by hybridization is due to the "all or none" kinetics of short oligonucleotide hybridization and the difference in duplex stability that exists between perfectly matched duplexes and mismatched duplexes. The ratio of duplex stability of matched and end-mismatched duplexes increases with decreasing duplex length. Moreover, binding energy decreases with decreasing duplex length, resulting in a lower hybridization efficiency. However, the results provided show that octamer hybridization allows the balancing of the factors affecting duplex stability and discrimination to produce a highly accurate method of sequencing by hybridization. Results presented in other examples show that oligonucleotides that are 6, 7, or 8 nucleotides can be effectively used to generate reliable sequence on targets that are 0.5 kb (for hexamers), 2 kb (for septamers) and 6 kb (for octamers). The sequence of long fragments may be overlapped to generate a complete genome sequence.

EXAMPLE 14

Analyzing the Data Obtained

Image files are analyzed by an image analysis program, like DOTS program (Drmanac et al., 1993), and scaled and evaluated by statistical functions included, e.g., in SCORES program (Drmanac et al., 1994). From the distribution of the signals an optimal threshold is determined for transforming signal into +/- output. From the position of the label detected, F + P nucleotide sequences from the fragments would be determined by combining the known sequences of the immobilized and labeled probes corresponding to the labeled positions. The complete nucleic acid sequence or sequence subfragments of the original molecule, such as a human chromosome, would then be assembled from the overlapping F + P sequence determined by computational deduction. One option is to transform hybridization signals, e.g., scores, into +/- output during the sequence assembly process. In this case, assembly will start with a F + P sequence with a very high score, for example F + P sequence AAAAAATTTTTT. Scores of all four possible overlapping probes AAAAATTTTTTA, AAAAATTTTTTT, AAAAATTTTTTC and AAAAATTTTTTG and three additional probes that are different at the beginning (TAAAAATTTTTT, CAAAAATTTTTT, GAAAAATTTTTT) are compared and three outcomes defined: (i) only the starting probe and only one of the four overlapping probes have scores that are significantly positive relative to the other six probes; in this case the AAAAAATTTTTT sequence will be extended for one nucleotide to the right; (ii) no one probe except the starting probe has a significantly positive score; assembly will stop, e.g., the AAAAAATTTTTT sequence is at the end of the DNA molecule that is sequenced; (iii) more than one significantly positive probe among the overlapped and/or other three probes is found; assembly is stopped because of the error or branching (Drmanac et al., 1989).
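The three outcomes of one assembly step can be sketched as follows. This is an illustrative sketch only: the function name, the score dictionary and the use of a single fixed threshold for "significantly positive" are assumptions, and the case in which only one of the three alternative probes is positive is grouped here with outcome (iii), which the text does not state explicitly.

```python
BASES = "ACGT"


def extension_step(current, scores, threshold):
    """Classify one assembly step for the F + P sequence `current`.

    scores: dict mapping probe sequence -> hybridization score
            (missing probes are treated as score 0.0, an assumption).
    Returns (outcome, base) where outcome is "extend", "end" or
    "branch_or_error" and base is the added nucleotide or None.
    """
    tail = current[1:]
    # Four probes overlapping the current one, shifted right by one base.
    overlapping = [tail + b for b in BASES]
    # Three probes that differ from the current one only at the first base.
    alternatives = [b + tail for b in BASES if b != current[0]]

    pos_over = [p for p in overlapping if scores.get(p, 0.0) >= threshold]
    pos_alt = [p for p in alternatives if scores.get(p, 0.0) >= threshold]

    if len(pos_over) == 1 and not pos_alt:
        return "extend", pos_over[0][-1]      # outcome (i)
    if not pos_over and not pos_alt:
        return "end", None                    # outcome (ii)
    return "branch_or_error", None            # outcome (iii)


# Hypothetical scores for the example sequence used in the text.
scores = {"AAAAAATTTTTT": 9.5, "AAAAATTTTTTC": 8.7, "AAAAATTTTTTG": 0.4}
print(extension_step("AAAAAATTTTTT", scores, threshold=5.0))
```

With these hypothetical scores only AAAAATTTTTTC passes the threshold, so the sequence is extended by C.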

The processes of computational deduction would employ computer programs using existing algorithms (see, e.g., Pevzner, 1989; Drmanac et al., 1991; Labat and Drmanac, 1993; each incorporated herein by reference).

If, in addition to F + P, F(space 1)P, F(space 2)P, F(space 3)P or F(space 4)P are determined, algorithms will be used to match all data sets to correct potential errors or to solve the situation where there is a branching problem (see, e.g., Drmanac et al., 1989; Bains et al., 1988; each incorporated herein by reference).

EXAMPLE 15

Conducting Sequencing by Two Step Hybridization

Following are certain examples to describe the execution of the sequencing methodology contemplated by the inventor. First, the whole chip would be hybridized with a mixture of DNA as complex as 100 million bp (one human chromosome). Guidelines for conducting hybridization can be found in papers such as Drmanac et al. (1990); Khrapko et al. (1991); and Broude et al. (1994). These articles teach the ranges of hybridization temperatures, buffers and washing steps that are appropriate for use in the initial steps of Format 3 SBH.

The present inventor particularly contemplates that hybridization is to be carried out for up to several hours in high salt concentrations at a low temperature (-2°C to 5°C) because of a relatively low concentration of target DNA that can be provided. For this purpose, SSC buffer is used instead of sodium phosphate buffer (Drmanac et al., 1990), which precipitates at 10°C. Washing does not have to be extensive (a few minutes) because of the second step, and can be completely eliminated when the hybridization cycling is used for the sequencing of highly complex DNA samples. The same buffer is used for hybridization and washing steps to be able to continue with the second hybridization step with labeled probes.

After proper washing using a simple robotic device on each array, e.g., an 8 x 8 mm array, one labeled probe, e.g., a 6-mer, would be added. A 96-tip or 96-pin device would be used, performing this in 42 operations. Again, a range of discriminatory conditions could be employed, as previously described in the scientific literature.

The present inventor particularly contemplates the use of the following conditions. First, after adding labeled probes and incubating for several minutes only (because of the high concentration of added oligonucleotides) at a low temperature (0-5°C), the temperature is increased to 3-10°C, depending on F + P length, and the washing buffer is added. At this time, the washing buffer used is one compatible with any ligation reaction (e.g., 100 mM salt concentration range). After adding ligase, the temperature is increased again to 15-37°C to allow fast ligation (less than 30 min) and further discrimination of full match and mismatch hybrids.

The use of cationic detergents is also contemplated for use in Format 3 SBH, as described by Pontius & Berg (1991, incorporated herein by reference). These authors describe the use of two simple cationic detergents, dodecyl- and cetyltrimethylammonium bromide (DTAB and CTAB) in DNA renaturation.

DTAB and CTAB are variants of the quaternary amine tetramethylammonium bromide (TMAB) in which one of the methyl groups is replaced by either a 12-carbon (DTAB) or a 16-carbon (CTAB) alkyl group. TMAB is the bromide salt of the tetramethylammonium ion, a reagent used in nucleic acid renaturation experiments to decrease the G-C content bias of the melting temperature. DTAB and CTAB are similar in structure to sodium dodecyl sulfate (SDS), with the replacement of the negatively charged sulfate of SDS by a positively charged quaternary amine. While SDS is commonly used in hybridization buffers to reduce nonspecific binding and inhibit nucleases, it does not greatly affect the rate of renaturation.

When using a ligation process, the enzyme could be added with the labeled probes or after the proper washing step to reduce the background. Although not previously proposed for use in any SBH method, ligase technology is well established within the field of molecular biology. For example, Hood and colleagues described a ligase-mediated gene detection technique (Landegren et al., 1988), the methodology of which can be readily adapted for use in Format 3 SBH. Wu & Wallace also describe the use of bacteriophage T4 DNA ligase to join two adjacent, short synthetic oligonucleotides. Their oligo ligation reactions were carried out in 50 mM Tris-HCl pH 7.6, 10 mM MgCl₂, 1 mM ATP, 1 mM DTT, and 5% PEG. Ligation reactions were heated to 100°C for 5-10 min followed by cooling to 0°C prior to the addition of T4 DNA ligase (1 unit; Bethesda Research Laboratory). Most ligation reactions were carried out at 30°C and terminated by heating to 100°C for 5 min.

Final washing appropriate for discriminating detection of hybridized adjacent, or ligated, oligonucleotides of length (F + P), is then performed. This washing step is done in water for several minutes at 40-60°C to wash out all the non-ligated labeled probes, and all other compounds, to maximally reduce background. Because of the covalently bound labeled oligonucleotides, detection is simplified (it does not have time and low temperature constraints).

Depending on the label used, imaging of the chips is done with different apparati. For radioactive labels, phosphor storage screen technology and PhosphorImager as a scanner may be used (Molecular Dynamics, Sunnyvale, CA). Chips are put in a cassette and covered by a phosphor screen. After 1-4 hours of exposure, the screen is scanned and the image file stored on a computer hard disc. For the detection of fluorescent labels, CCD cameras and epifluorescent or confocal microscopy are used. For the chips generated directly on the pixels of a CCD camera, detection can be performed as described by Eggers et al. (1994, incorporated herein by reference). Charge-coupled device (CCD) detectors serve as active solid supports that quantitatively detect and image the distribution of labeled target molecules in probe-based assays. These devices use the inherent characteristics of microelectronics that accommodate highly parallel assays, ultrasensitive detection, high throughput, integrated data acquisition and computation. Eggers et al. (1994) describe CCDs for use with probe-based assays, such as Format 3 SBH of the present invention, that allow quantitative assessment within seconds due to the high sensitivity and direct coupling employed.

The integrated CCD detection approach enables the detection of molecular binding events on chips. The detector rapidly generates a two-dimensional pattern that uniquely characterizes the sample. In the specific operation of the CCD-based molecular detector, distinct biological probes are immobilized directly on the pixels of a CCD or can be attached to a disposable cover slip placed on the CCD surface. The sample molecules can be labeled with radioisotope, chemiluminescent or fluorescent tags.

Upon exposure of the sample to the CCD-based probe array, photons or radioisotope decay products are emitted at the pixel locations where the sample has bound, in the case of Format 3, to two complementary probes. In turn, electron-hole pairs are generated in the silicon when the charged particles, or radiation from the labeled sample, are incident on the CCD gates. Electrons are then collected beneath adjacent CCD gates and sequentially read out on a display module. The number of photoelectrons generated at each pixel is directly proportional to the number of molecular binding events in such proximity. Consequently, molecular binding can be quantitatively determined (Eggers et al., 1994). By placing the imaging array in proximity to the sample, the collection efficiency is improved by a factor of at least 10 over lens-based techniques such as those found in conventional CCD cameras. That is, the sample (emitter) is in near contact with the detector (imaging array), and this eliminates conventional imaging optics such as lenses and mirrors.

When radioisotopes are attached as reporter groups to the target molecules, energetic particles are detected. Several reporter groups that emit particles of varying energies have been successfully utilized with the micro-fabricated detectors, including ³²P, ³³P, ³⁵S, ¹⁴C and ¹²⁵I. The higher energy particles, such as from ³²P, provide the highest molecular detection sensitivity, whereas the lower energy particles, such as from ³⁵S, provide better resolution. Hence the choice of the radioisotope reporter can be tailored as required. Once the particular radioisotope label is selected, the detection performance can be predicted by calculating the signal-to-noise ratio (SNR), as described by Eggers et al. (1994).

An alternative luminescent detection procedure involves the use of fluorescent or chemiluminescent reporter groups attached to the target molecules. The fluorescent labels can be attached covalently or through interaction. Fluorescent dyes, such as ethidium bromide, with intense absorption bands in the near UV (300-350 nm) range and principal emission bands in the visible (500-650 nm) range, are most suited for the CCD devices employed since the quantum efficiency is several orders of magnitude lower at the excitation wavelength than at the fluorescent signal wavelength.

From the perspective of detecting luminescence, the polysilicon CCD gates have the built-in capacity to filter away the contribution of incident light in the UV range, yet are very sensitive to the visible luminescence generated by the fluorescent reporter groups. Such inherently large discrimination against UV excitation enables large SNRs (greater than 100) to be achieved by the CCDs as formulated in the incorporated paper by Eggers et al. (1994).

For probe immobilization on the detector, hybridization matrices may be produced on inexpensive SiO₂ wafers, which are subsequently placed on the surface of the CCD following hybridization and drying. This format is economically efficient since the hybridization of the DNA is conducted on inexpensive disposable SiO₂ wafers, thus allowing reuse of the more expensive CCD detector. Alternatively, the probes can be immobilized directly on the CCD to create a dedicated probe matrix. To immobilize probes upon the SiO₂ coating, a uniform epoxide layer is linked to the film surface, employing an epoxy-silane reagent and standard SiO₂ modification chemistry. Amine-modified oligonucleotide probes are then linked to the SiO₂ surface by means of secondary amine formation with the epoxide ring. The resulting linkage provides 17 rotatable bonds of separation between the 3' base of the oligonucleotide and the SiO₂ surface. To ensure complete amine deprotonation and to minimize secondary structure formation during coupling, the reaction is performed in 0.1 M KOH and incubated at 37°C for 6 hours.

In Format 3 SBH in general, signals are scored for each of a billion points. It would not be necessary to hybridize all arrays, e.g., 4000 5 x 5 mm arrays, at a time, and the successive use of a smaller number of arrays is possible.

Cycling hybridizations are one possible method for increasing the hybridization signal. In one cycle, most of the fixed probes will hybridize with DNA fragments with tail sequences non-complementary for labeled probes. By increasing the temperature, those hybrids will be melted. In the next cycle, some of them (~0.1%) will hybridize with an appropriate DNA fragment and additional labeled probes will be ligated. In this case, there occurs a discriminative melting of DNA hybrids with mismatches for both probe sets simultaneously.

In the cycle hybridization, all components are added before the cycling starts, at 37°C for T4, or a higher temperature for a thermostable ligase. Then the temperature is decreased to 15-37°C and the chip is incubated for up to 10 minutes, and then the temperature is increased to 37°C or higher for a few minutes and then again reduced. Cycles can be repeated up to 10 times. In one variant, an optimal higher temperature (10-50°C) can be used without cycling and a longer ligation reaction can be performed (1-3 hours).

The procedure described herein allows complex chip manufacturing using standard synthesis and precise spotting of oligonucleotides because a relatively small number of oligonucleotides are necessary. For example, if all 7-mer oligos are synthesized (16384 probes), lists of 256 million 14-mers can be determined.
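The probe counts quoted above follow from the four-letter alphabet: there are 4^7 = 16384 possible 7-mers, and pairing two 7-mers covers 4^14 = 268,435,456 possible 14-mers, which equals 256 x 2^20 and is the "256 million" of the text when a million is counted in binary units. A short check, with a hypothetical helper name:

```python
def count_oligomers(length, alphabet_size=4):
    """Number of distinct oligonucleotides of the given length."""
    return alphabet_size ** length


seven_mers = count_oligomers(7)
fourteen_mers = count_oligomers(14)

print(seven_mers)                         # 16384
print(fourteen_mers)                      # 268435456
print(fourteen_mers == seven_mers ** 2)   # True: every pair of 7-mers
print(fourteen_mers == 256 * 2 ** 20)     # True
```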

One important variant of the invented method is to use more than one differently labeled probe per base array. This can be executed with two purposes in mind: multiplexing to reduce the number of separately hybridized arrays; or to determine a list of even longer oligosequences such as 3 x 6 or 3 x 7. In this case, if two labels are used, the specificity of the 3 consecutive oligonucleotides can be almost absolute because positive sites must have enough signals of both labels.

A further and additional variant is to use chips containing BxNy probes with y being from 1 to 4. Those chips allow sequence reading in different frames. This can also be achieved by using appropriate sets of labeled probes, or both F and P probes could have some unspecified end positions (i.e., some element of terminal degeneracy). Universal bases may also be employed as part of a linker to join the probes of defined sequence to the solid support. This makes the probe more available to hybridization and makes the construct more stable. If a probe has 5 bases, one may, e.g., use 3 universal bases as a linker (FIG. 4).

EXAMPLE 16

Determining Sequence from Hybridization Data

Sequence assembly may be interrupted wherever a given overlapping (N-1)mer is duplicated two or more times. Then either of the two N-mers differing in the last nucleotide may be used in extending the sequence. This branching point limits unambiguous assembly of sequence. Reassembling the sequence of known oligonucleotides that hybridize to the target nucleic acid to generate the complete sequence of the target nucleic acid may not be accomplished in some cases. This is because some information may be lost if the target nucleic acid is not in fragments of appropriate size in relation to the size of oligonucleotide that is used for hybridizing. The quantity of information lost is proportional to the length of a target being sequenced. However, if sufficiently short targets are used, their sequence may be unambiguously determined.

The probable frequency of duplicated sequences that would interfere with sequence assembly which is distributed along a certain length of DNA may be calculated. This derivation requires the introduction of the definition of a parameter having to do with sequence organization: the sequence subfragment (SF). A sequence subfragment results if any part of the sequence of a target nucleic acid starts and ends with an (N-1)mer that is repeated two or more times within the target sequence. Thus, subfragments are sequences generated between two points of branching in the process of assembly of the sequences in the method of the invention. The sum of all subfragments is longer than the actual target nucleic acid because of overlapping short ends. Generally, subfragments may not be assembled in a linear order without additional information since they have shared (N-1)mers at their ends and starts. Different numbers of subfragments are obtained for each nucleic acid target depending on the number of its repeated (N-1)mers. The number depends on the value of N-1 and the length of the target.
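The notion of a sequence subfragment can be illustrated on a known sequence. The following is a minimal sketch, not the program referred to in this description: the function name and the example sequence are hypothetical, and the sketch simply cuts a known sequence at every position where a repeated (N-1)mer starts, keeping the shared (N-1)mer on both neighbouring pieces.

```python
from collections import Counter


def subfragments(seq, n):
    """Split seq at repeated (n-1)-mers; neighbours share the repeated word."""
    m = n - 1
    words = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    counts = Counter(words)
    # Start positions of every (n-1)-mer that occurs two or more times.
    branch_points = [i for i, w in enumerate(words) if counts[w] >= 2]
    starts = sorted(set([0] + branch_points))

    pieces = []
    for j, start in enumerate(starts):
        if j + 1 < len(starts):
            piece = seq[start:starts[j + 1] + m]
        else:
            piece = seq[start:]
        if len(piece) >= n:          # keep pieces holding at least one N-mer
            pieces.append(piece)
    return pieces


# Hypothetical 12-base target read with 4-mers (N = 4); ACG occurs twice.
target = "TTACGGAACGCC"
pieces = subfragments(target, 4)
print(pieces)                                     # ['TTACG', 'ACGGAACG', 'ACGCC']
print(sum(len(p) for p in pieces), len(target))   # 18 12
```

The summed length of the pieces (18) exceeds the target length (12) because each repeated 3-mer is counted on both of its neighbouring subfragments, as stated above.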

Probability calculations can estimate the interrelationship of the two factors. If the ordering of positive N-mers is accomplished by using overlapping sequences of length N-1 or at an average distance of A_0, the N_sf of a fragment L_f bases long is given by equation one:

$$N_{sf} = 1 + A_0 \sum_{K \ge 2} K \, P(K, L_f)$$

Where K ≥ 2, and P(K, L_f) represents the probability of an N-mer occurring K times on a fragment L_f bases long. Also, a computer program that is able to form subfragments from the content of N-mers for any given sequence is described below in Example 18. The number of subfragments increases with the increase of lengths of fragments for a given length of probe. Obtained subfragments may not be uniquely ordered among themselves. Although not complete, this information is very useful for comparative sequence analysis and the recognition of functional sequence characteristics. This type of information may be called partial sequence. Another way of obtaining partial sequence is the use of only a subset of oligonucleotide probes of a given length.

There may be relatively good agreement between predicted sequence according to theory and a computer simulation for a random DNA sequence. For instance, for N-1 = 7 [using an 8-mer or groups of sixteen 10-mers of type 5' (A,T,C,G) B₈ (A,T,C,G) 3'], a target nucleic acid of 200 bases will have an average of three subfragments. However, because of the dispersion around the mean, a library of target nucleic acid should have inserts of 500 bp so that less than 1 in 2000 targets have more than three subfragments. Thus, in an ideal case of sequence determination of a long nucleic acid of random sequence, a representative library with sufficiently short inserts of target nucleic acid may be used. For such inserts, it is possible to reconstruct the individual target by the method of the invention. The entire sequence of a large nucleic acid is then obtained by overlapping of the defined individual insert sequences.

To reduce the need for very short fragments, e.g. 50 bases for 8-mer probes, the information contained in the overlapped fragments present in every random DNA fragmentation process, like cloning or random PCR, is used. It is also possible to use pools of short physical nucleic acid fragments. Using 8-mers or 11-mers like 5' (A,T,C,G) N₉ (A,T,C,G) 3' for sequencing 1 megabase, instead of needing 20,000 50 bp fragments only 2,100 samples are sufficient. This number consists of 700 random 7 kb clones (basic library), 1250 pools of 20 clones of 500 bp (subfragments ordering library) and 150 clones from jumping (or similar) library. The developed algorithm (see Example 18) regenerates sequence using hybridization data of these described samples.

EXAMPLE 17

Algorithm

This example describes an algorithm for generation of a long sequence written in a four letter alphabet from constituent k-tuple words in a minimal number of separate, randomly defined fragments of a starting nucleic acid sequence, where K is the length of an oligonucleotide probe. The algorithm is primarily intended for use in the sequencing by hybridization (SBH) process. The algorithm is based on subfragments (SF), informative fragments (IF) and the possibility of using pools of physical nucleic sequences for defining informative fragments.

As described, subfragments may be caused by branch points in the assembly process resulting from the repetition of a K-1 oligomer sequence in a target nucleic acid. Subfragments are sequence fragments found between any two repetitive words of the length K-1 that occur in a sequence. Multiple occurrences of K-1 words are the cause of interruption of ordering the overlap of K-words in the process of sequence generation. Interruption leads to a sequence remaining in the form of subfragments. Thus, the unambiguous segments between branching points whose order is not uniquely determined are called sequence subfragments.

Informative fragments are defined as fragments of a sequence that are determined by the nearest ends of overlapped physical sequence fragments.

A certain number of physical fragments may be pooled without losing the possibility of defining informative fragments. The total length of randomly pooled fragments depends on the length of k-tuples that are used in the sequencing process.

The algorithm consists of two main units. The first part is used for generation of subfragments from the set of k-tuples contained in a sequence. Subfragments may be generated within the coding region of physical nucleic acid sequence of certain sizes, or within the informative fragments defined within long nucleic acid sequences. Both types of fragments are members of the basic library. This algorithm does not describe the determination of the content of the k-tuples of the informative fragments of the basic library, i.e. the step of preparation of informative fragments to be used in the sequence generation process.

The second part of the algorithm determines the linear order of obtained subfragments with the purpose of regenerating the complete sequence of the nucleic acid fragments of the basic library. For this purpose a second, ordering library is used, made of randomly pooled fragments of the starting sequence. The algorithm does not include the step of combining sequences of basic fragments to regenerate an entire, megabase plus sequence. This may be accomplished using the link-up of fragments of the basic library which is a prerequisite for informative fragment generation. Alternatively, it may be accomplished after generation of sequences of fragments of the basic library by this algorithm, using search for their overlap, based on the presence of common end-sequences.

The algorithm requires neither knowledge of the number of appearances of a given k-tuple in a nucleic acid sequence of the basic and ordering libraries, nor does it require the information of which k-tuple words are present on the ends of a fragment. The algorithm operates with the mixed content of k-tuples of various length. The concept of the algorithm enables operations with the k-tuple sets that contain false positive and false negative k-tuples. Only in specific cases does the content of the false k-tuples primarily influence the completeness and correctness of the generated sequence. The algorithm may be used for optimization of parameters in simulation experiments, as well as for sequence generation in the actual SBH experiments, e.g. generation of the genomic DNA sequence. In optimization of parameters, the choice of the oligonucleotide probes (k-tuples) for practical and convenient fragments and/or the choice of the optimal lengths and the number of fragments for the defined probes are especially important.

This part of the algorithm has a central role in the process of the generation of the sequence from the content of k-tuples. It is based on the unique ordering of k-tuples by means of maximal overlap. The main obstacles in sequence generation are specific repeated sequences and false positive and/or negative k-tuples. The aim of this part of the algorithm is to obtain the minimal number of the longest possible subfragments, with correct sequence. This part of the algorithm consists of one basic, and several control steps. A two-stage process is necessary since certain information can be used only after generation of all primary subfragments (pSFs). The main problem of sequence generation is obtaining a repeated sequence from word contents that by definition do not carry information on the number of occurrences of the particular k-tuples. The concept of the entire algorithm depends on the basis on which this problem is solved. In principle, there are two opposite approaches: 1) repeated sequences may be obtained at the beginning, in the process of generation of pSFs, or 2) repeated sequences can be obtained later, in the process of the final ordering of the subfragments. In the first case, pSFs contain an excess of sequences and in the second case, they contain a deficit of sequences. The first approach requires elimination of the excess sequences generated, and the second requires permitting multiple use of some of the subfragments in the process of the final assembling of the sequence.

The difference between the two approaches is in the degree of strictness of the rule of unique overlap of k-tuples. The less severe rule is: k-tuple X is unambiguously maximally overlapped with k-tuple Y if and only if the rightmost k-1 end of k-tuple X is present only on the leftmost end of k-tuple Y. This rule allows the generation of repetitive sequences and the formation of surplus sequences.

A stricter rule which is used in the second approach has an additional caveat: k-tuple X is unambiguously maximally overlapped with k-tuple Y if and only if the rightmost k-1 end of k-tuple X is present only on the leftmost end of k-tuple Y and if the leftmost k-1 end of k-tuple Y is not present on the rightmost end of any other k-tuple. The algorithm based on the stricter rule is simpler, and is described herein.
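The two overlap rules can be contrasted in a short sketch. The function names are hypothetical, and the sketch assumes all k-tuples have the same length; it reads "present only on the leftmost end of k-tuple Y" as meaning that Y is the only k-tuple in the set beginning with the right k-1 end of X.

```python
def right_end(word):
    """Rightmost k-1 letters of a k-tuple."""
    return word[1:]


def left_end(word):
    """Leftmost k-1 letters of a k-tuple."""
    return word[:-1]


def overlaps_lenient(x, y, ktuples):
    """Less severe rule: y is the only k-tuple starting with x's right end."""
    followers = [w for w in set(ktuples) if left_end(w) == right_end(x)]
    return followers == [y]


def overlaps_strict(x, y, ktuples):
    """Stricter rule: additionally, no other k-tuple ends with y's left end."""
    if not overlaps_lenient(x, y, ktuples):
        return False
    rivals = [w for w in set(ktuples)
              if w != x and right_end(w) == left_end(y)]
    return not rivals


# Hypothetical set: both TACG and AACG end in ACG, which starts ACGT.
words = ["TACG", "AACG", "ACGT"]
print(overlaps_lenient("TACG", "ACGT", words))  # True
print(overlaps_strict("TACG", "ACGT", words))   # False
```

Under the less severe rule both TACG and AACG would be extended with ACGT, producing the surplus sequence mentioned above; under the stricter rule neither is extended.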

The process of elongation of a given subfragment is stopped when the right k-1 end of the last k-tuple included is not present on the left end of any k-tuple or is present on two or more k-tuples. If it is present on only one k-tuple, the second part of the rule is tested. If in addition there is a k-tuple which differs from the previously included one, the assembly of the given subfragment is terminated only on the first leftmost position. If this additional k-tuple does not exist, the conditions are met for unique k-1 overlap and a given subfragment is extended to the right by one element.

Beside the basic rule, a supplementary one is used to allow the usage of k-tuples of different lengths. The maximal overlap is the length of k-1 of the shorter k-tuple of the overlapping pair. Generation of the pSFs is performed starting from the first k-tuple from the file in which k-tuples are displayed randomly and independently from their order in a nucleic acid sequence. Thus, the first k-tuple in the file is not necessarily on the beginning of the sequence, nor on the start of the particular subfragment. The process of subfragment generation is performed by ordering the k-tuples by means of unique overlap, which is defined by the described rule. Each used k-tuple is erased from the file. At the point when there are no further k-tuples unambiguously overlapping with the last one included, the building of subfragment is terminated and the buildup of another pSF is started. Since generation of a majority of subfragments does not begin from their actual starts, the formed pSFs are added to the k-tuple file and are considered as a longer k-tuple. Another possibility is to form subfragments going in both directions from the starting k-tuple. The process ends when further overlap, i.e. the extension of any of the subfragments, is not possible.
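The generation of pSFs by unique overlap can be sketched as follows. This is a simplified illustration under stated assumptions, not the program of this description: the function name is hypothetical, all k-tuples are assumed to have the same length (the supplementary rule for mixed lengths is omitted), and the stricter overlap rule is evaluated once against the whole k-tuple set, which has the same effect as extending each subfragment in both directions.

```python
from collections import defaultdict


def build_psfs(ktuples):
    """Assemble primary subfragments from equal-length k-tuples."""
    words = list(dict.fromkeys(ktuples))     # drop duplicates, keep order
    if not words:
        return []
    k = len(words[0])
    if any(len(w) != k for w in words):
        raise ValueError("this sketch assumes k-tuples of equal length")

    by_left = defaultdict(list)    # left k-1 end  -> k-tuples starting with it
    by_right = defaultdict(list)   # right k-1 end -> k-tuples ending with it
    for w in words:
        by_left[w[:-1]].append(w)
        by_right[w[1:]].append(w)

    # Stricter rule: X -> Y only if Y is the sole follower of X's right end
    # and X is the sole k-tuple ending with that same k-1 word.
    successor = {}
    for x in words:
        followers = by_left.get(x[1:], [])
        if len(followers) == 1 and len(by_right[x[1:]]) == 1:
            successor[x] = followers[0]

    has_predecessor = set(successor.values())
    # True starts first; the second pass over all words covers circular chains.
    starts = [w for w in words if w not in has_predecessor] + words

    used, psfs = set(), []
    for start in starts:
        if start in used:
            continue
        used.add(start)
        fragment, current = start, start
        while current in successor and successor[current] not in used:
            current = successor[current]
            used.add(current)
            fragment += current[-1]          # extend to the right by one element
        psfs.append(fragment)
    return psfs


# Hypothetical exact 4-tuple set of TTACGGAACGCC, in arbitrary file order.
ktuple_file = ["GGAA", "ACGC", "TTAC", "CGGA", "AACG",
               "TACG", "GAAC", "CGCC", "ACGG"]
print(sorted(build_psfs(ktuple_file)))   # ['ACGCC', 'ACGGAACG', 'TTACG']
```

The repeated word ACG interrupts the ordering twice, so three pSFs are obtained whose linear order is not determined by the k-tuple content alone.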

The pSFs can be divided into three groups: 1) subfragments of the maximal length and correct sequence in cases of an exact k-tuple set; 2) short subfragments, formed due to the use of the maximal and unambiguous overlap rule on the incomplete set, and/or the set with some false positive k-tuples; and 3) pSFs of an incorrect sequence. The incompleteness of the set in 2) is caused by false negative results of a hybridization experiment, as well as by using an incorrect set of k-tuples. The pSFs of incorrect sequence are formed due to the false positive and false negative k-tuples and can be: a) misconnected subfragments; b) subfragments with the wrong end; and c) false positive k-tuples which appear as false minimal subfragments.

Considering false positive k-tuples, there is the possibility for the presence of a k-tuple containing more than one wrong base or containing one wrong base somewhere in the middle, as well as the possibility for a k-tuple with a wrong base on the end. Generation of short, erroneous or misconnected subfragments is caused by the latter k-tuples. The k-tuples of the former two kinds represent wrong pSFs with length equal to k-tuple length.

In the case of one false negative k-tuple, pSFs are generated because of the impossibility of maximal overlapping. In the case of the presence of one false positive k-tuple with the wrong base on its leftmost or rightmost end, pSFs are generated because of the impossibility of unambiguous overlapping. When both false positive and false negative k-tuples with a common k-1 sequence are present in the file, pSFs are generated, and one of these pSFs contains the wrong k-tuple at the relevant end.

The process of correcting subfragments with errors in sequence and the linking of unambiguously connected pSFs is performed after subfragment generation and in the process of subfragment ordering. The first step, which consists of cutting the misconnected pSFs and obtaining the final subfragments by unambiguous connection of pSFs, is described below.

There are two ways in which misconnected subfragments are formed. In the first, a mistake occurs when an erroneous k-tuple appears on the points of assembly of the repeated sequences of length k-1. In the second, the repeated sequences are shorter than k-1. These situations can occur in two variants each. In the first variant, one of the repeated sequences represents the end of a fragment. In the second variant, the repeated sequence occurs at any position within the fragment. For the first possibility, the absence of some k-tuples from the file (false negatives) is required to generate a misconnection. The second possibility requires the presence of both false negative and false positive k-tuples in the file. Considering the repetitions of a k-1 sequence, the lack of only one k-tuple is sufficient when either end is repeated internally. The lack of two is needed for strictly internal repetition. The reason is that the end of a sequence can be considered informatically as an endless linear array of false negative k-tuples. From the "smaller than k-1" case, only the repeated sequence of the length of k-2, which requires two or three specific erroneous k-tuples, will be considered. It is very likely that these will be the only cases which will be detected in a real experiment, the others being much less frequent.

Recognition of the misconnected subfragments is more strictly defined when a repeated sequence does not appear at the end of the fragment. In this situation, one can detect two further subfragments, one of which contains on its leftmost end, and the other on its rightmost end, k-2 sequences which are also present in the misconnected subfragment. When the repeated sequence is on the end of the fragment, there is only one subfragment which contains, on its leftmost or rightmost end, the k-2 sequence causing the mistake in subfragment formation.

The removal of misconnected subfragments by cutting them is performed according to the common rule: if the leftmost or rightmost sequence of length k-2 of any subfragment is present in any other subfragment, the subfragment is to be cut into two subfragments, each of them containing the k-2 sequence. This rule does not cover rarer situations of a repeated end when there is more than one false negative k-tuple on the point of the repeated k-1 sequence. Misconnected subfragments of this kind can be recognized by using the information from the overlapped fragments, or informative fragments of both the basic and ordering libraries. In addition, the misconnected subfragment will remain when two or more false negative k-tuples occur on both positions which contain the identical k-1 sequence. This is a very rare situation since it requires at least 4 specific false k-tuples. An additional rule can be introduced to cut these subfragments on sequences of length k if the given sequence can be obtained by combination of sequences shorter than k-2 from the end of one subfragment and the start of another.
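A minimal Python sketch of the common cutting rule follows. It rests on assumptions not stated in the text: the function name cut_misconnected is hypothetical, only strictly internal occurrences of an end k-2 sequence trigger a cut, and a subfragment is cut at the first such occurrence only.

```python
def cut_misconnected(subfragment, others, k):
    """Cut `subfragment` in two if it contains, at a strictly internal
    position, the leftmost or rightmost (k-2)-sequence of another
    subfragment. Both pieces keep the shared (k-2)-sequence."""
    m = k - 2
    ends = set()
    for other in others:
        if len(other) >= m:
            ends.add(other[:m])   # leftmost k-2 sequence
            ends.add(other[-m:])  # rightmost k-2 sequence
    # Scan internal positions only; the subfragment's own two ends are
    # ordinary overlap sites and are skipped.
    for pos in range(1, len(subfragment) - m):
        if subfragment[pos:pos + m] in ends:
            return [subfragment[:pos + m], subfragment[pos:]]
    return [subfragment]
```

With k = 6, the subfragment AAACCGGTTT and another subfragment CCGGAAAT (whose leftmost k-2 sequence is CCGG), the cut yields AAACCGG and CCGGTTT.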

By strict application of the described rule, some completeness is lost to ensure the accuracy of the output. Some of the subfragments will be cut although they are not misconnected, since they fit into the pattern of a misconnected subfragment. There are several situations of this kind. For example, a fragment, besides at least two identical k-1 sequences, contains any k-2 sequence from k-1, or a fragment contains a k-2 sequence repeated at least twice and at least one false negative k-tuple containing the given k-2 sequence in the middle, etc. The aim of this part of the algorithm is to reduce the number of pSFs to a minimal number of longer subfragments with correct sequence. The generation of unique longer subfragments or a complete sequence is possible in two situations. The first situation concerns the specific order of repeated k-1 words. There are cases in which some or all maximally extended pSFs (the first group of pSFs) can be uniquely ordered. For example, in fragment S-R1-a-R2-b-R1-c-R2-E, where S and E are the start and end of a fragment, a, b, and c are different sequences specific to respective subfragments and R1 and R2 are two k-1 sequences that are tandemly repeated, five subfragments are generated (S-R1, R1-a-R2, R2-b-R1, R1-c-R2, and R2-E). They may be ordered in two ways: the original sequence above or S-R1-c-R2-b-R1-a-R2-E. In contrast, in a fragment with the same number and types of repeated sequences but ordered differently, i.e. S-R1-a-R1-b-R2-c-R2-E, there is no other sequence which includes all subfragments. Examples of this type can be recognized only after the process of generation of pSFs. They represent the necessity for two steps in the process of pSF generation. The second situation, the generation of false short subfragments on positions of nonrepeated k-1 sequences when the files contain false negative and/or false positive k-tuples, is more important.
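The two repeat arrangements in the example above can be checked by brute force. The sketch below is only an illustration: it represents each subfragment as a tuple of symbolic tokens (an assumption of this sketch, not a representation used in the text) and counts the orderings that use every subfragment exactly once.

```python
from itertools import permutations


def orderings(subfragments, start="S", end="E"):
    """Return every ordering that uses all subfragments once, begins at
    `start`, finishes at `end`, and chains matching end/start tokens."""
    found = set()
    for perm in permutations(subfragments):
        if perm[0][0] != start or perm[-1][-1] != end:
            continue
        if all(a[-1] == b[0] for a, b in zip(perm, perm[1:])):
            found.add(perm)
    return found


# S-R1-a-R2-b-R1-c-R2-E: the repeats alternate.
alternating = [("S", "R1"), ("R1", "a", "R2"), ("R2", "b", "R1"),
               ("R1", "c", "R2"), ("R2", "E")]
# S-R1-a-R1-b-R2-c-R2-E: the same repeats, ordered differently.
grouped = [("S", "R1"), ("R1", "a", "R1"), ("R1", "b", "R2"),
           ("R2", "c", "R2"), ("R2", "E")]

print(len(orderings(alternating)))  # two orderings, as stated in the text
print(len(orderings(grouped)))      # a single ordering
```

Enumerating all permutations is only practical for a handful of subfragments; it is used here because it makes the uniqueness claim easy to verify by eye.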
The solution for both pSF groups consists of two parts. First, the false positive k-tuples appearing as the nonexisting minimal subfragments are eliminated. All k-tuple subfragments of length k which do not have an overlap on either end, of the length of longer than k-a on one end and longer than k-b on the other end, are eliminated to enable formation of the maximal number of connections. In our experiments, the values for a and b of 2 and 3, respectively, appeared to be adequate to eliminate a sufficient number of false positive k-tuples.
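One possible reading of this elimination step is sketched below in Python. The interpretation is an assumption: a minimal (length-k) subfragment is kept only if it overlaps other subfragments by more than k-a on one end and more than k-b on the other, in either assignment of ends. The function names longest_overlap and drop_isolated are hypothetical.

```python
def longest_overlap(left, right):
    """Length of the longest proper suffix of `left` that is also a
    prefix of `right` (0 if there is none)."""
    for n in range(min(len(left), len(right)) - 1, 0, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


def drop_isolated(minimal, subfragments, a=2, b=3):
    """Keep only those minimal subfragments that are well connected on
    both ends; the rest are treated as false positive k-tuples."""
    kept = []
    for t in minimal:
        k = len(t)
        others = [s for s in subfragments if s != t]
        left = max((longest_overlap(s, t) for s in others), default=0)
        right = max((longest_overlap(t, s) for s in others), default=0)
        if (left > k - a and right > k - b) or (left > k - b and right > k - a):
            kept.append(t)
    return kept
```

With the defaults a = 2 and b = 3 and k = 8, a k-tuple survives only if it overlaps a neighbor by 7 bases on one end and by at least 6 bases on the other.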

The merging of subfragments that can be uniquely connected is accomplished in the second step. The rule for connection is: two subfragments may be unambiguously connected if, and only if, the overlapping sequence at the relevant end or start of two subfragments is not present at the start and/or end of any other subfragment. The exception is if one subfragment from the considered pair has the identical beginning and end. In that case connection is permitted, even if there is another subfragment with the same end present in the file. The main problem here is the precise definition of overlapping sequence. The connection is not permitted if the overlapping sequence unique for only one pair of subfragments is shorter than k-2, or if it is k-2 or longer but an additional subfragment exists with the overlapping sequence of any length longer than k-4. Also, both the canonical ends of pSFs and the ends after omitting one (or a few) last bases are considered as the overlapping sequences. After this step some false positive k-tuples (as minimal subfragments) and some subfragments with a wrong end may survive. In addition, on very rare occasions where a certain number of some specific false k-tuples are simultaneously present, an erroneous connection may take place. These cases will be detected and solved in the subfragment ordering process, and in the additional control steps along with the handling of uncut "misconnected" subfragments.

The short subfragments that are obtained are of two kinds. In the common case, these subfragments may be unambiguously connected among themselves because of the distribution of repeated k-1 sequences. This may be done after the process of generation of pSFs and is a good example of the necessity for two steps in the process of pSF generation. In the case of using the file containing false positive and/or false negative k-tuples, short pSFs are obtained on the sites of nonrepeated k-1 sequences. Considering false positive k-tuples, a k-tuple may contain more than one wrong base (or one wrong base somewhere in the middle), or may have a wrong base on the end. Generation of short and erroneous (or misconnected) subfragments is caused by the latter k-tuples. The k-tuples of the former kind represent wrong pSFs with length equal to k-tuple length.

The aim of the pSF-merging part of the algorithm is the reduction of the number of pSFs to the minimal number of longer subfragments with the correct sequence. All k-tuple subfragments that do not have an overlap on either end, of the length of longer than k-a on one end and longer than k-b on the other end, are eliminated to enable the maximal number of connections. In this way, the majority of false positive k-tuples are discarded. The rule for connection is: two subfragments can be unambiguously connected if, and only if, the overlapping sequence of the relevant end or start of two subfragments is not present on the start and/or end of any other subfragment. The exception is a subfragment with the identical beginning and end. In that case connection is permitted, provided that there is another subfragment with the same end present in the file. The main problem here is the precise definition of overlapping sequence. The presence of at least two specific false negative k-tuples on the points of repetition of k-1 or k-2 sequences, as well as the combining of false positive and false negative k-tuples, may destroy or "mask" some overlapping sequences and can produce an unambiguous, but wrong, connection of pSFs. To prevent this, completeness must be sacrificed on account of exactness: the connection is not permitted on end-sequences shorter than k-2, or in the presence of an extra overlapping sequence longer than k-4. The overlapping sequences are defined from the end of the pSFs, or omitting one or a few last bases.

In very rare situations, with the presence of a certain number of some specific false positive and false negative k-tuples, some subfragments with the wrong end can survive, some false positive k-tuples (as minimal subfragments) can remain, or an erroneous connection can take place. These cases are detected and solved in the subfragment ordering process, and in the additional control steps along with the handling of uncut, misconnected subfragments.

The process of ordering of subfragments is similar to the process of their generation. If one considers subfragments as longer k-tuples, ordering is performed by their unambiguous connection via overlapping ends. The informational basis for unambiguous connection is the division of subfragments generated in fragments of the basic library into groups representing segments of those fragments. The method is analogous to the biochemical solution of this problem based on hybridization with longer oligonucleotides with relevant connecting sequence. The connecting sequences are generated as subfragments using the k-tuple sets of the appropriate segments of basic library fragments. Relevant segments are defined by the fragments of the ordering library that overlap with the respective fragments of the basic library. The shortest segments are informative fragments of the ordering library. The longer ones are several neighboring informative fragments or total overlapping portions of the corresponding fragments of the ordering and basic libraries. In order to decrease the number of separate samples, fragments of the ordering library are randomly pooled, and the unique k-tuple content is determined.

By using the large number of fragments in the ordering library, very short segments are generated, thus reducing the chance of the multiple appearance of the k-1 sequences which are the reasons for generation of the subfragments. Furthermore, longer segments, consisting of the various regions of the given fragment of the basic library, do not contain some of the repeated k-1 sequences. In every segment a connecting sequence (a connecting subfragment) is generated for a certain pair of the subfragments from the given fragment. The process of ordering consists of three steps: (1) generation of the k-tuple contents of each segment; (2) generation of subfragments in each segment; and (3) connection of the subfragments of the segments. Primary segments are defined as significant intersections and differences of k-tuple contents of a given fragment of the basic library with the k-tuple contents of the pools of the ordering library. Secondary (shorter) segments are defined as intersections and differences of the k-tuple contents of the primary segments.

There is a problem of accumulating both false positive and false negative k-tuples in both the differences and intersections. The false negative k-tuples from starting sequences accumulate in the intersections (overlapping parts), as well as false positive k-tuples occurring randomly in both sequences, but not in the relevant overlapping region. On the other hand, the majority of false positives from either of the starting sequences are not taken up into intersections. This is an example of the reduction of experimental errors from individual fragments by using information from fragments overlapping with them. The false k-tuples accumulate in the differences for another reason. The set of false negatives from the original sequences is enlarged by false positives from intersections, and the set of false positives by those k-tuples which are not included in the intersection by error, i.e. are false negatives in the intersection. If the starting sequences contain 10% false negative data, the primary and secondary intersections will contain 19% and 28% false negative k-tuples, respectively. On the other hand, a mathematical expectation of 77 false positives may be predicted if the basic fragment and the pools have lengths of 500 bp and 10,000 bp, respectively. However, there is a possibility of recovering most of the "lost" k-tuples and of eliminating most of the false positive k-tuples.

First, one has to determine a basic content of the k-tuples for a given segment as the intersection of a given pair of the k-tuple contents. This is followed by including all k-tuples of the starting k-tuple contents in the intersection, which contain at one end k-1 and at the other end k-+ sequences which occur at the ends of two k-tuples of the basic set. This is done before generation of the differences, thus preventing the accumulation of false positives in that process. Following that, the same type of enlargement of the k-tuple set is applied to differences, with the distinction that the borrowing is from the intersections. All borrowed k-tuples are eliminated from the intersection files as false positives.

The intersection, i.e. a set of common k-tuples, is defined for each pair (a basic fragment) X (a pool of ordering library). If the number of k-tuples in the set is significant, it is enlarged with the false negatives according to the described rule. The primary difference set is obtained by subtracting from a given basic fragment the obtained intersection set. The false negative k-tuples are appended to the difference set by borrowing from the intersection set according to the described rule and, at the same time, removed from the intersection set as false positive k-tuples. When the basic fragment is longer than the pooled fragments, this difference can represent two separate segments, which somewhat reduces its utility in further steps. The primary segments are all generated intersections and differences of pairs (a basic fragment) X (a pool of ordering library) containing the significant number of k-tuples. K-tuple sets of secondary segments are obtained by comparison of k-tuple sets of all possible pairs of primary segments. The two differences are defined from each pair which produces the intersection with the significant number of k-tuples. The majority of available information from overlapped fragments is recovered in this step, so that there is little to be gained from a third round of forming intersections and differences.
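The core of step (1) can be sketched with plain set operations in Python. This is a simplified illustration: the borrowing of false negative k-tuples between the intersection and the difference is deliberately omitted, and the names ktuple_content, primary_segments and the threshold min_common (standing in for a "significant" number of common k-tuples) are hypothetical.

```python
def ktuple_content(sequence, k):
    """Set of all overlapping k-tuples contained in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def primary_segments(basic, pool, min_common):
    """Return (intersection, difference) for one (basic fragment) X (pool)
    pair, or None when the intersection is not significant."""
    common = basic & pool
    if len(common) < min_common:
        return None
    return common, basic - common


basic = ktuple_content("AACCGGTTAC", 4)
pool = ktuple_content("GGTTACGATC", 4)
segments = primary_segments(basic, pool, min_common=3)
```

Here the intersection holds the three 4-tuples of the shared region GGTTAC, and the difference holds the four 4-tuples that occur only in the basic fragment. Secondary segments would be obtained by applying the same function to pairs of primary segments.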

(2) Generation of the subfragments of the segments is performed identically as described for the fragments of the basic library.

(3) The method of connection of subfragments consists of sequentially determining the correctly linked pairs of subfragments among the subfragments from a given basic library fragment which have some overlapped ends. In the case of 4 relevant subfragments, two of which contain the same beginning and two having the same end, there are 4 different pairs of subfragments that can be connected. In general 2 are correct and 2 are wrong. To find the correct ones, the presence of the connecting sequences of each pair is tested in the subfragments generated from all primary and secondary segments for a given basic fragment. The length and the position of the connecting sequence are chosen to avoid interference with sequences which occur by chance. They are k+2 or longer, and include at least one element beside the overlapping sequence in both subfragments of a given pair. The connection is permitted only if the two connecting sequences are found and the remaining two do not exist. The two linked subfragments replace the former subfragments in the file and the process is cyclically repeated.

Repeated sequences are generated in this step. This means that some subfragments are included in linked subfragments more than once. They will be recognized by finding the relevant connecting sequence which engages one subfragment in connection with two different subfragments. The recognition of misconnected subfragments generated in the processes of building pSFs and merging pSFs into longer subfragments is based on testing whether the sequences of subfragments from a given basic fragment exist in the sequences of subfragments generated in the segments for the fragment. The sequences from an incorrectly connected position will not be found, indicating the misconnected subfragments. Besides the described three steps in ordering of subfragments, some additional control steps or steps applicable to specific sequences will be necessary for the generation of a more complete sequence without mistakes.

The determination of which subfragment belongs to which segment is performed by comparison of contents of k-tuples in segments and subfragments. Because of the errors in the k-tuple contents (due to the primary error in pools and statistical errors due to the frequency of occurrences of k-tuples) the exact partitioning of subfragments is impossible. Thus, instead of an "all or none" partition, the chance of coming from the given segment (P(sf,s)) is determined for each subfragment. This probability is a function of the lengths of k-tuples, the lengths of subfragments, the lengths of fragments of the ordering library, the size of the pool, and of the percentage of false k-tuples in the file: P(sf,s) = (Ck - F)/Lsf, where Lsf is the length of the subfragment, Ck is the number of common k-tuples for a given subfragment/segment pair, and F is the parameter that includes relations between lengths of k-tuples, fragments of basic library, the size of the pool, and the error percentage. Subfragments attributed to a particular segment are treated as redundant short pSFs and are submitted to a process of unambiguous connection. The definition of unambiguous connection is slightly different in this case, since it is based on a probability that subfragments with overlapping end(s) belong to the segment considered. Besides, the accuracy of unambiguous connection is controlled by following the connection of these subfragments in other segments. After the connection in different segments, all of the obtained subfragments are merged together, shorter subfragments included within longer ones are eliminated, and the remaining ones are submitted to the ordinary connecting process. If the sequence is not regenerated completely, the process of partition and connection of subfragments is repeated with the same or less severe criteria of probability of belonging to the particular segment, followed by unambiguous connection. Using severe criteria for defining unambiguous overlap, some information is not used.

Instead of a complete sequence, several subfragments that define a number of possibilities for a given fragment are obtained. Using less severe criteria, an accurate and complete sequence is generated. In a certain number of situations, e.g. an erroneous connection, it is possible to generate a complete, but incorrect, sequence, or to generate "monster" subfragments with no connection among them. Thus, for each fragment of the basic library one obtains: a) several possible solutions where one is correct and b) the most probable correct solution. Also, in a very small number of cases, due to a mistake in the subfragment generation process or due to the specific ratio of the probabilities of belonging, no unambiguous solution is generated, or only one, the most probable solution. These cases remain as incomplete sequences, or the unambiguous solution is obtained by comparing these data with other, overlapped fragments of the basic library.

The described algorithm was tested on a randomly generated 50 kb sequence containing 40% GC, to simulate the GC content of the human genome. In the middle part of this sequence were inserted various Alu, and some other repetitive sequences, of a total length of about 4 kb. To simulate an in vitro SBH experiment, the following operations were performed to prepare appropriate data.

- Positions of sixty 5 kb overlapping "clones" were randomly defined, to simulate preparation of a basic library.

- Positions of one thousand 500 bp "clones" were randomly determined to simulate making the ordering library. These fragments were extracted from the sequence. Random pools of 20 fragments were made, and k-tuple sets of pools were determined and stored on the hard disk. These data are used in the subfragment ordering phase. For the same density of clones, 4 million clones in the basic library and 3 million clones in the ordering library are used for the entire human genome. The total number of 7 million clones is several fold smaller than the number of clones a few kb long for random cloning of almost all of genomic DNA and sequencing by a gel-based method.

From the data on the starts and ends of 5 kb fragments, 117 "informative fragments" were determined to be in the sequence. This was followed by determination of the sets of overlapping k-tuples of which each single "informative fragment" consists. Only the subset of k-tuples matching a predetermined list was used. The list contained 65% 8-mers, 30% 9-mers, and 5% 10-12-mers. Processes of generation and ordering of subfragments were performed on these data.

The testing of the algorithm was performed on the simulated data in two experiments. The sequence of 50 informative fragments was regenerated with the 100% correct data set (over 20,000 bp), and 26 informative fragments (about 10,000 bp) with 10% false k-tuples (5% positive and 5% negative ones). In the first experiment, all subfragments were correct, and in only one out of 50 informative fragments the sequence was not completely regenerated but remained in the form of 5 subfragments. The analysis of positions of overlapped fragments of the ordering library has shown that they lack the information for the unique ordering of the 5 subfragments. The subfragments may be connected in two ways based on overlapping ends, 1-2-3-4-5 and 1-4-3-2-5. The only difference is the exchange of positions of subfragments 2 and 4. Since subfragments 2, 3, and 4 are relatively short (total of about 100 bp), a relatively greater chance existed, and occurred in this case, that none of the fragments of the ordering library started or ended in the subfragment 3 region.

To simulate real sequencing, some false ("hybridization") data was included as input in a number of experiments. In oligomer hybridization experiments, under proposed conditions, the only situation producing unreliable data is the end mismatch versus full match hybridization.

Therefore, in simulation only those k-tuples differing in a single element on either end from the real one were considered to be false positives. These "false" sets are made as follows. To the original set of k-tuples of the informative fragment, a subset of 5% false positive k-tuples is added. False positive k-tuples are made by randomly picking a k-tuple from the set, copying it and altering a nucleotide on its beginning or end. This is followed by subtraction of a subset of 5% randomly chosen k-tuples. In this way the statistically expected number of the most complicated cases is generated, in which the correct k-tuple is replaced with a k-tuple with the wrong base on the end. Production of k-tuple sets as described leads to up to 10% of false data. This value varies from case to case, due to the randomness of choice of k-tuples to be copied, altered, and erased. Nevertheless, this percentage exceeds by 3-4 times the amount of unreliable data in real hybridization experiments. The introduced error of 10% leads to a two fold increase in the number of subfragments both in fragments of the basic library (basic library informative fragments) and in segments. About 10% of the final subfragments have a wrong base at the end, as expected for a k-tuple set which contains false positives (see generation of primary subfragments). Neither cases of misconnection of subfragments nor subfragments with the wrong sequence were observed. In 4 informative fragments out of 26 examined in the ordering process the complete sequence was not regenerated. In all 4 cases the sequence was obtained in the form of several longer subfragments and several shorter subfragments contained in the same segment. This result shows that the algorithmic principles allow working with a large percentage of false data. The success of the generation of the sequence from its k-tuple content may be described in terms of completeness and accuracy. In the process of generation, two particular situations can be defined: 1) some part of the information is missing in the generated sequence, but one knows where the ambiguities are and to which type they belong, and 2) the regenerated sequence that is obtained does not match the sequence from which the k-tuple content is generated, but the mistake cannot be detected. Assuming the algorithm is developed to its theoretical limits, as in the use of the exact k-tuple sets, only the first situation can take place. There the incompleteness results in a certain number of subfragments that may not be ordered unambiguously and in the problem of determination of the exact length of monotonous sequences, i.e. the number of perfect tandem repeats. With false k-tuples, incorrect sequence may be generated. The reason for mistakes does not lie in the shortcomings of the algorithm, but in the fact that a given content of k-tuples unambiguously represents a sequence that differs from the original one.
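The construction of the "false" k-tuple sets described above (5% end-altered false positives added, 5% of the k-tuples erased) can be sketched in Python as follows. The function name make_false_set, the seed parameter, and the choice to erase only original (true) k-tuples are assumptions of this sketch rather than details given in the text.

```python
import random


def make_false_set(ktuples, fraction=0.05, seed=0):
    """Return (noisy_set, false_positives, removed) built from a true
    k-tuple set by end-base alteration and random erasure."""
    rng = random.Random(seed)
    true_set = set(ktuples)
    pool = sorted(true_set)  # sorted for reproducible sampling
    n = max(1, round(fraction * len(true_set)))

    false_pos = set()
    while len(false_pos) < n:
        t = rng.choice(pool)  # pick a k-tuple and copy it
        if rng.random() < 0.5:  # alter the first base ...
            new = rng.choice([b for b in "ACGT" if b != t[0]])
            candidate = new + t[1:]
        else:  # ... or the last base
            new = rng.choice([b for b in "ACGT" if b != t[-1]])
            candidate = t[:-1] + new
        if candidate not in true_set:  # must be a genuinely false k-tuple
            false_pos.add(candidate)

    removed = set(rng.sample(pool, n))  # false negatives
    return (true_set | false_pos) - removed, false_pos, removed
```

The returned set has the same size as the input, with up to 10% of its content differing from the true set, which matches the "up to 10% of false data" figure.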
One may define three classes of error, depending on the kind of the false k-tuples present in the file. False negative k-tuples (which are not accompanied by false positives) produce "deletions". False positive k-tuples produce "elongations (unequal crossing over)". False positives accompanied by false negatives are the reason for generation of "insertions", alone or combined with "deletions". The deletions are produced when all of the k-tuples (or their majority) between two possible starts of the subfragments are false negatives. Since every position in the sequence is defined by k k-tuples, the occurrence of the deletions in a common case requires k consecutive false negatives. (With 10% of false negatives and k=8, this situation takes place after every 10^8 elements.) This situation is extremely infrequent even in mammalian genome sequencing using random libraries containing ten genome equivalents.
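The parenthetical estimate can be reproduced with a one-line calculation, under the simplifying assumption (not stated in the text) that false negatives occur independently with probability p = 0.1 per k-tuple:

```latex
P(\text{$k$ consecutive false negatives}) = p^{k} = (0.1)^{8} = 10^{-8},
\qquad \text{i.e. about one such event per } 10^{8} \text{ elements.}
```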

Elongation of the end of the sequence caused by false positive k-tuples is the special case of "insertions", since the end of the sequence can be considered as an endless linear array of false negative k-tuples. One may consider a group of false positive k-tuples producing subfragments longer than one k-tuple. Situations of this kind may be detected if subfragments are generated in overlapped fragments, like random physical fragments of the ordering library. An insertion, or insertion in place of a deletion, can arise as a result of specific combinations of false positive and false negative k-tuples. In the first case, the number of consecutive false negatives is smaller than k. Both cases require several overlapping false positive k-tuples. The insertions and deletions are mostly theoretical possibilities without sizable practical repercussions, since the requirements in the number and specificity of false k-tuples are simply too high.

In every other situation in which the theoretical requirement of the minimal number and the kind of the false positives and/or negatives is not met, mistakes in the k-tuple content may produce only lesser completeness of a generated sequence.

In Format 3 SBH, a sample nucleic acid is sequenced by exposing the sample to a support-bound probe of known sequence and a labeled probe or probes in solution. A ligase is introduced into the mixture of probes and sample, such that, wherever a support has a bound probe and a labeled probe hybridized back to back along the sample, the two probes will be chemically linked by the action of the ligase. After washing, only chemically linked support-bound and labeled probes are detected by the presence of the labeled probe. By knowing the identity of the support-bound probe at a particular location in an array, and the identity of the labeled probe, a portion of the sequence of the sample may be determined by the presence of a label at a point in an array on a Format 3 substrate. By maximally overlapping the sequences of all of the ligated probe pairs, the sequence of the sample may be reconstructed. The sample to be sequenced may be a nucleic acid fragment or oligonucleotide of ten base pairs ("bp"). The sample is preferably four to one thousand bases in length.

The probe is a fragment less than ten bases in length, and, preferably, is between four and nine bases in length. In this way, arrays of support-bound probes may include all oligonucleotides of a given length or may include only oligonucleotides selected for a particular test. Where all oligonucleotides of a given length are used, the number of central oligonucleotides may be calculated by 4^N, where N is the length of the probe.
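As a quick illustration of the 4^N count, the complete probe set for a given length can be enumerated in Python. The function name all_probes is hypothetical.

```python
from itertools import product


def all_probes(n):
    """All oligonucleotides of length n over the alphabet A, C, G, T."""
    return ["".join(p) for p in product("ACGT", repeat=n)]


for n in (4, 9):
    # The enumerated count agrees with 4**n for each probe length.
    print(n, len(all_probes(n)), 4 ** n)
```

For the preferred probe lengths of four to nine bases, this gives between 256 and 262,144 distinct probes.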

EXAMPLE 18 Re-Using Sequencing Chips

When ligation is employed in the sequencing process, then the ordinary oligonucleotide chip cannot be immediately reused. The inventor contemplates that this may be overcome in various ways.

One may employ ribonucleotides for the second probe, probe P, so that this probe may subsequently be removed by RNAse treatment. RNAse treatment may utilize RNAse A, an endoribonuclease that specifically attacks single-stranded RNA 3' to pyrimidine residues and cleaves the phosphate linkage to the adjacent nucleotide. The end products are pyrimidine 3' phosphates and oligonucleotides with terminal pyrimidine 3' phosphates. RNAse A works in the absence of cofactors and divalent cations. To utilize an RNAse, one would generally incubate the chip in any appropriate RNAse-containing buffer, as described by Sambrook et al. (1989; incorporated herein by reference). The use of 30-50 µl of RNAse-containing buffer per 8 x 8 mm or 9 x 9 mm array at 37°C for between 10 and 60 minutes is appropriate. One would then wash with hybridization buffer.

Although not widely applicable, one could also use the uracil base, as described by Craig et al. (1989), incorporated herein by reference, in specific embodiments. Destruction of the ligated probe combination, to yield a re-usable chip, would be achieved by digestion with the E. coli repair enzyme, uracil-DNA glycosylase, which removes uracil from DNA.

One could also generate a specifically cleavable bond between the probes and then cleave the bond after detection. For example, this may be achieved by chemical ligation as described by Shabarova et al. (1991) and Dolinnaya et al. (1988), both references being specifically incorporated herein by reference.

Shabarova et al. (1991) describe the condensation of oligodeoxyribonucleotides with cyanogen bromide as a condensing agent. In their one step chemical ligation reaction, the oligonucleotides are heated to 97°C, slowly cooled to 0°C, then 1 µl 10 mM BrCN in acetonitrile is added.

Dolinnaya et al. (1988) show how to incorporate phosphoramidate and pyrophosphate internucleotide bonds in DNA duplexes. They also use a chemical ligation method for modification of the sugar phosphate backbone of DNA, with a water-soluble carbodiimide (CDI) as a coupling agent. The selective cleavage of a phosphoamide bond involves contact with 15% CH₃COOH for 5 min at 95°C. The selective cleavage of a pyrophosphate bond involves contact with a pyridine-water mixture (9:1) and freshly distilled (CF₃CO)₂O.

EXAMPLE 19

In a simple case, the goal may be to discover whether selected, known mutations occur in a DNA segment. Fewer than 12 probes may suffice for this purpose, for example, 5 probes positive for one allele, 5 positive for the other, and 2 negative for both. Because of the small number of probes to be scored per sample, large numbers of samples may be analyzed in parallel. For example, with 12 probes in 3 hybridization cycles, 96 different genomic loci or gene segments from 64 patients may be analyzed on one 6 x 9 in membrane containing 12 x 24 subarrays each with 64 dots representing the same DNA segment from 64 patients. In this example, samples may be prepared in sixty-four 96-well plates. Each plate may represent one patient, and each well may represent one of the DNA segments to be analyzed. The samples from 64 plates may be spotted in four replicas as four quarters of the same membrane.

A set of 12 probes may be selected by single channel pipetting or by a single pin transferring device (or by an array of individually-controlled pipets or pins) for each of the 96 segments, and the selected probes may be arrayed in twelve 96-well plates. Probes may be labelled, if they are not prelabelled, and then probes from four plates may be mixed with hybridization buffer and added to the subarrays preferentially by a 96-channel pipetting device. After one hybridization cycle it is possible to strip off previously-applied probes by incubating the membrane at 37° to 55°C in the preferably undiluted hybridization or washing buffer.

The likelihood that probes positive for one allele are positive and probes positive for the other allele are negative may be used to determine which of the two alleles is present. In this redundant scoring scheme, some level (about 10%) of errors in hybridization of each probe may be tolerated.
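The redundant scoring scheme may be sketched as a simple likelihood comparison. The following Python is a minimal illustration, not the disclosed scoring procedure: it assumes each probe is miscalled independently with a fixed probability (10% here), considers only a single allele per sample (heterozygotes are not modelled), and uses hypothetical function names.

```python
def allele_likelihoods(scores_a, scores_b, scores_neg, error=0.10):
    """Return (L_A, L_B), the likelihood of the observed scores under each allele.

    scores_a   -- booleans for probes designed to be positive for allele A
    scores_b   -- booleans for probes designed to be positive for allele B
    scores_neg -- booleans for probes expected to be negative for both alleles
    error      -- assumed independent per-probe miscall probability
    """
    def likelihood(expected_positive, expected_negative):
        p = 1.0
        for observed in expected_positive:
            p *= (1 - error) if observed else error
        for observed in expected_negative:
            p *= error if observed else (1 - error)
        return p

    l_a = likelihood(scores_a, list(scores_b) + list(scores_neg))
    l_b = likelihood(scores_b, list(scores_a) + list(scores_neg))
    return l_a, l_b


def call_allele(scores_a, scores_b, scores_neg, error=0.10):
    """Call the allele with the higher likelihood, or report a tie."""
    l_a, l_b = allele_likelihoods(scores_a, scores_b, scores_neg, error)
    if l_a == l_b:
        return "undetermined"
    return "A" if l_a > l_b else "B"


if __name__ == "__main__":
    # 5 probes for allele A (one false negative), 5 for allele B (one false
    # positive), 2 negative controls: the call is still allele A.
    print(call_allele([True, True, True, False, True],
                      [False, True, False, False, False],
                      [False, False]))
```

With five probes per allele, one or two miscalled probes leave the likelihoods far apart, which is the sense in which a per-probe error level of about 10% may be tolerated.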

An incomplete set of probes may be used for scoring most of the alleles, especially if the smaller redundancy is sufficient, e.g. one or two probes which prove the presence or absence in a sample of one of the two alleles. For example, with a set of four thousand 8-mers there is a 91% chance of finding at least one positive probe for one of the two alleles for a randomly selected locus. The incomplete set of probes may be optimized to reflect G+C content and other biases in the analyzed samples.

For full gene sequencing, genes may be amplified in an appropriate number of segments. For each segment, a set of probes (about one probe per 2-4 bases) may be selected and hybridized. These probes may identify whether there is a mutation anywhere in the analyzed segments. Segments (i.e., subarrays which contain these segments) where one or more mutated sites are detected may be hybridized with additional probes to find the exact sequence at the mutated sites. If a DNA sample is tested by every second 6-mer, and a mutation is localized at the position that is surrounded by positively hybridized probes TGCAAA and TATTCC and covered by three negative probes: CAAAAC, AAACTA and ACTATT, the mutated nucleotides must be A and/or C occurring in the normal sequence at that position. They may be changed by a single base mutation, or by a one or two nucleotide deletion and/or insertion between bases AA, AC or CT.

One approach is to select a probe that extends the positively hybridized probe TGCAAA for one nucleotide to the right, and which extends the probe TATTCC one nucleotide to the left. With these 8 probes (GCAAAA, GCAAAT, GCAAAC, GCAAAG and ATATTC, TTATTC, CTATTC, GTATTC) two questionable nucleotides are determined.
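The eight extension probes listed above can be generated mechanically. The Python below is a small sketch with hypothetical helper names; the base order "ATCG" is chosen only so that the output matches the order in which the probes are listed in the text.

```python
BASES = "ATCG"  # order chosen to match the probe lists in the text


def extend_right(probe: str):
    """Probes shifted one base to the right of `probe`, one per candidate base."""
    return [probe[1:] + base for base in BASES]


def extend_left(probe: str):
    """Probes shifted one base to the left of `probe`, one per candidate base."""
    return [base + probe[:-1] for base in BASES]


if __name__ == "__main__":
    print(extend_right("TGCAAA"))  # probes reading into the first questionable base
    print(extend_left("TATTCC"))   # probes reading into the second questionable base
```

Whichever probe of each group of four scores positive identifies the nucleotide adjacent to the corresponding flanking probe.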

The most likely hypothesis about the mutation may be determined. For example, A is found to be mutated to G. There are two solutions satisfied by these results. Either replacement of A with G is the only change or there is in addition to that change an insertion of some number of bases between newly determined G and the following C. If the result with bridging probes is negative these options may then be checked first by at least one bridging probe comprising the mutated position (AAGCTA) and with an additional 8 probes: CAAAGA, CAAAGT, CAAAGC, CAAAGG and ACTATT, TCTATT, CCTATT, GCTATT. There are many other ways to select mutation-solving probes.

In the case of diploid, particular comparisons of scores for the test samples and homozygotic control may be performed to identify heterozygotes (see above). A few consecutive probes are expected to have roughly twice smaller signals if the segment covered by these probes is mutated on one of the two chromosomes.

EXAMPLE 20

Identification of Genes (Mutations) Responsible for Genetic Disorders and Other Traits

Using universal sets of longer probes (8-mers or 9-mers) on immobilized arrays of samples, DNA fragments as long as 5-20 kb may be sequenced without subcloning. Furthermore, the speed of sequencing readily may be about 10 million bp/day/hybridization instrument. This performance allows for resequencing a large fraction of human genes or the human genome repeatedly from scientifically or medically interesting individuals. To resequence 50% of the human genes, about 100 million bp is checked. That may be done in a relatively short period of time at an affordable cost.

This enormous resequencing capability may be used in several ways to identify mutations and/or genes that encode for disorders or any other traits. Basically, mRNAs (which may be converted into cDNAs) from particular tissues or genomic DNA of patients with particular disorders may be used as starting materials. From both sources of DNA, separate genes or genomic fragments of appropriate length may be prepared either by cloning procedures or by in vitro amplification procedures (for example by PCR). If cloning is used, the minimal set of clones to be analyzed may be selected from the libraries before sequencing. That may be done efficiently by hybridization of a small number of probes, especially if a small number of clones longer than 5 kb is to be sorted. Cloning may increase the amount of hybridization data about two times, but does not require tens of thousands of PCR primers.

In one variant of the procedure, gene or genomic fragments may be prepared by restriction cutting with enzymes like Hga I, which cuts DNA in the following way: GACGC(N5)/CTGCG(N10). Protruding ends of five bases are different for different fragments. One enzyme produces appropriate fragments for a certain number of genes. By cutting cDNA or genomic DNA with several enzymes in separate reactions, every gene of interest may be excised appropriately. In one approach, the cut DNA is fractionated by size. DNA fragments prepared in this way (and optionally treated with Exonuclease III, which individually removes nucleotides from the 3' end and increases length and specificity of the ends) may be dispensed in the tubes or in multiwell plates. From a relatively small set of DNA adapters with a common portion and a variable protruding end of appropriate length, a pair of adapters may be selected for every gene fragment that needs to be amplified. These adapters are ligated and then PCR is performed by universal primers. From 1000 adapters, a million pairs may be generated, thus a million different fragments may be specifically amplified in identical conditions with a universal pair of primers complementary to the common end of the adapters.
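The adapter arithmetic can be made explicit with a short Python sketch. The helper names are hypothetical, and the connection drawn between the five-base protruding ends and the figure of about 1000 adapters is an assumption made for illustration, not a statement from the text.

```python
def overhang_count(length: int) -> int:
    """Number of distinct single-stranded protruding ends of a given length."""
    return 4 ** length


def adapter_pairs(n_adapters: int) -> int:
    """Ordered (one end, other end) adapter pairs; the same adapter may serve both ends."""
    return n_adapters ** 2


if __name__ == "__main__":
    print(overhang_count(5))    # distinct five-base ends
    print(adapter_pairs(1000))  # pairs available from 1000 adapters
```

Under this reading, a set of adapters covering every possible five-base end would number 1024, close to the 1000 adapters cited, and pairing them yields on the order of a million distinguishable fragment-end combinations.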

If a DNA difference is found to be repeated in several patients, and that sequence change is nonsense or can change function of the corresponding protein, then the mutated gene may be responsible for the disorder. By analyzing a significant number of individuals with particular traits, functional allelic variations of particular genes could be associated with specific traits.

This approach may be used to eliminate the need for very expensive genetic mapping on extensive pedigrees and has special value when there is no such genetic data or material.

EXAMPLE 21

Scoring Single Nucleotide Polymorphisms in Genetic Mapping

Techniques disclosed in this application are appropriate for an efficient identification of genomic fragments with single nucleotide polymorphisms (SNUPs). In 10 individuals, by applying the described sequencing process on a large number of genomic fragments of known sequence that may be amplified by cloning or by in vitro amplification, a sufficient number of DNA segments with SNUPs may be identified. The polymorphic fragments are further used as SNUP markers. These markers are either mapped previously (for example they represent mapped STSs) or they may be mapped through the screening procedure described below.

SNUPs may be scored in every individual from relevant families or populations by amplifying markers and arraying them in the form of the array of subarrays. Subarrays contain the same marker amplified from the analyzed individuals. For each marker, as in the diagnostics of known mutations, a set of 6 or fewer probes positive for one allele and 6 or fewer probes positive for the other allele may be selected and scored. From the significant association of one or a group of the markers with the disorder, chromosomal position of the responsible gene(s) may be determined. Because of the high throughput and low cost, thousands of markers may be scored for thousands of individuals. This amount of data allows localization of a gene at a resolution level of less than one million bp as well as localization of genes involved in polygenic diseases. Localized genes may be identified by sequencing particular regions from relevant normal and affected individuals to score a mutation(s). PCR is preferred for amplification of markers from genomic DNA. Each of the markers requires a specific pair of primers. The existing markers may be convertible or new markers may be defined which may be prepared by cutting genomic DNA by Hga I type restriction enzymes, and by ligation with a pair of adapters.

SNUP markers can be amplified or spotted as pools to reduce the number of independent amplification reactions. In this case, more probes are scored per one sample. When 4 markers are pooled and spotted on 12 replica membranes, then 48 probes (12 per marker) may be scored in 4 cycles.
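The pooling arithmetic may be sketched as follows. This Python is illustrative only; the helper names are hypothetical, and it assumes that each replica membrane is hybridized with one probe per cycle.

```python
import math


def probes_per_sample(markers_pooled: int, probes_per_marker: int) -> int:
    """Total probes to score on a spot that holds several pooled markers."""
    return markers_pooled * probes_per_marker


def cycles_needed(total_probes: int, replica_membranes: int) -> int:
    """Hybridization cycles required, assuming one probe per replica per cycle."""
    return math.ceil(total_probes / replica_membranes)


if __name__ == "__main__":
    total = probes_per_sample(markers_pooled=4, probes_per_marker=12)
    print(total, cycles_needed(total, replica_membranes=12))
```

Pooling therefore trades fewer amplification reactions for more hybridization cycles per membrane.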

EXAMPLE 22

Detection and Verification of Identity of DNA Fragments

DNA fragments generated by restriction cutting, cloning or in vitro amplification (e.g. PCR) frequently may be identified in an experiment. Identification may be performed by verifying the presence of a DNA band of specific size on gel electrophoresis. Alternatively, a specific oligonucleotide may be prepared and used to verify a DNA sample in question by hybridization. The procedure developed here allows for more efficient identification of a large number of samples without preparing a specific oligonucleotide for each fragment. A set of positive and negative probes may be selected from the universal set for each fragment on the basis of the known sequences. Probes that are selected to be positive usually are able to form one or a few overlapping groups and negative probes are spread over the whole insert.

This technology may be used for identification of STSs in the process of their mapping on the YAC clones. Each of the STSs may be tested on about 100 YAC clones or pools of YAC clones. DNAs from these 100 reactions possibly are spotted in one subarray. Different STSs may represent consecutive subarrays. In several hybridization cycles, a signature may be generated for each of the DNA samples, which signature proves or disproves existence of the particular STS in the given YAC clone with necessary confidence. To reduce the number of independent PCR reactions or the number of independent samples for spotting, several STSs may be amplified simultaneously in a reaction or PCR samples may be mixed, respectively. In this case more probes have to be scored per one dot. The pooling of STSs is independent of pooling YACs and may be used on single YACs or pools of YACs. This scheme is especially attractive when several probes labelled with different colors are hybridized together.

In addition to confirmation of the existence of a DNA fragment in a sample, the amount of DNA may be estimated using intensities of the hybridization of several separate probes or one or more pools of probes. By comparing obtained intensities with intensities for control samples having a known amount of DNA, the quantity of DNA in all spotted samples is determined simultaneously. Because only a few probes are necessary for identification of a DNA fragment, and there are N possible probes that may be used for DNA N bases long, this application does not require a large set of probes to be sufficient for identification of any DNA segment. From one thousand 8-mers, on average about 30 full matching probes may be selected for a 1000 bp fragment.

EXAMPLE 23

Identification of Infectious Disease Organisms and Their Variants

DNA-based tests for the detection of viral, bacterial, fungal and other parasitic organisms in patients are usually more reliable and less expensive than alternatives. The major advantage of DNA tests is to be able to identify specific strains and mutants, and eventually be able to apply more effective treatment. Two applications are described below.

The presence of 12 known antibiotic resistance genes in bacterial infections may be tested by amplifying these genes. The amplified products from 128 patients may be spotted in two subarrays and 24 subarrays for 12 genes may then be repeated four times on an 8 x 12 cm membrane. For each gene, 12 probes may be selected for positive and negative scoring. Hybridizations may be performed in 3 cycles. For these tests, a much smaller set of probes is most likely to be universal. For example, from a set of one thousand 8-mers, on average 30 probes are positive in 1000 bp fragments, and 10 positive probes are usually sufficient for a highly reliable identification. As described in Example 9, several genes may be amplified and/or spotted together and the amount of the given DNA may be determined. The amount of amplified gene may be used as an indicator of the level of infection.

Another example involves possible sequencing of one gene or the whole genome of an HIV virus. Because of rapid diversification, the virus poses many difficulties for selection of an optimal therapy. DNA fragments may be amplified from isolated viruses from up to 64 patients and resequenced by the described procedure. On the basis of the obtained sequence the optimal therapy may be selected. If there is a mixture of two virus types of which one has the basic sequence (similar to the case of heterozygotes), the mutant may be identified by quantitative comparisons of its hybridization scores with scores of other samples, especially control samples containing the basic virus type only. Scores twice as small may be obtained for three to four probes that cover the site mutated in one of the two virus types present in the sample (see above).

EXAMPLE 24

Forensic and Parental Identification

Sequence polymorphisms make an individual genomic DNA unique. This permits analysis of blood or other body fluids or tissues from a crime scene and comparison with samples from criminal suspects. A sufficient number of polymorphic sites are scored to produce a unique signature of a sample. SBH may easily score single nucleotide polymorphisms to produce such signatures.

A set of DNA fragments (10-1000) may be amplified from samples and suspects. DNAs from samples and suspects representing one fragment are spotted in one or several subarrays and each subarray may be replicated 4 times. In three cycles, 12 probes may determine the presence of allele A or B in each of the samples, including suspects, for each DNA locus. Matching the patterns of samples and suspects may lead to discovery of the suspect responsible for the crime.

The same procedure may be applicable to prove or disprove the identity of parents of a child. DNA may be prepared and polymorphic loci amplified from the child and adults; patterns of A or B alleles may be determined by hybridization for each. Comparisons of the obtained patterns, along with positive and negative controls, aid in the determination of familial relationships. In this case, only a significant portion of the alleles need match with one parent for identification. Large numbers of scored loci allow for the avoidance of statistical errors in the procedure or of masking effects of de novo mutations.

EXAMPLE 25

Assessing Genetic Diversity of Populations or Species and Biological Diversity of Ecological Niches

Measuring the frequency of allelic variations on a significant number of loci (for example, several genes or entire mitochondrial DNA) permits development of different types of conclusions, such as conclusions regarding the impact of the environment on the genotypes, history and evolution of a population or its susceptibility to diseases or extinction, and others. These assessments may be performed by testing specific known alleles or by full resequencing of some loci to be able to define de novo mutations which may reveal fine variations or presence of mutagens in the environment.

Additionally, biodiversity in the microbial world may be surveyed by resequencing evolutionarily conserved DNA sequences, such as the genes for ribosomal RNAs or genes for highly conservative proteins. DNA may be prepared from the environment and particular genes amplified using primers corresponding to conservative sequences. DNA fragments may be cloned preferentially in a plasmid vector (or diluted to the level of one molecule per well in multiwell plates and then amplified in vitro). Clones prepared this way may be resequenced as described above. Two types of information are obtained. First of all, a catalogue of different species may be defined as well as the density of the individuals for each species. Another segment of information may be used to measure the influence of ecological factors or pollution on the ecosystem. It may reveal whether some species are eradicated or whether the abundance ratios among species are altered due to the pollution. The method also is applicable for sequencing DNAs from fossils.

EXAMPLE 26

Detection or Quantification of Nucleic Acid Species

DNA or RNA species may be detected and quantified by employing a probe pair including an unlabeled probe fixed to a substrate and a labeled probe in a solution. The species may be detected and quantified by exposure to the unlabeled probe in the presence of the labeled probe and ligase. Specifically, the formation of an extended probe by ligation of the labeled and unlabeled probe on the sample nucleic acid backbone is indicative of the presence of the species to be detected. Thus, the presence of label at a specific point in the array on the substrate after removing unligated labeled probe indicates the presence of a sample species while the quantity of label indicates the expression level of the species.

Alternatively, one or more unlabeled probes may be arrayed on a substrate as first members of pairs with one or more labeled probes to be introduced in solution. According to one method, multiplexing of the label on the array may be carried out by using dyes which fluoresce at distinguishable wavelengths. In this manner, a mixture of cDNAs applied to an array with pairs of labeled and unlabeled probes specific for species to be identified may be examined for the presence of and expression level of cDNA species. According to a preferred embodiment this approach may be carried out to sequence portions of cDNAs by selecting pairs of unlabeled and labeled probes comprising sequences which overlap along the sequence of a cDNA to be detected. Probes may be selected to detect the presence and quantity of particular pathogenic organism genomes by including in the composition selected probe pairs which appear in combination only in target pathogenic organism genomes. Thus, while no single probe pair may necessarily be specific for the pathogenic organism genome, the combination of pairs is. Similarly, in detecting or sequencing cDNAs, it might occur that a particular probe is not specific for a cDNA or other type of species. Nevertheless, the presence and quantity of a particular species may be determined by a result wherein a combination of selected probes situated at distinct array locations is indicative of the presence of a particular species.

An infectious agent with about 10 kb or more of DNA may be detected using a support-bound detection chip without the use of polymerase chain reaction (PCR) or other target amplification procedures. According to other methods, the genomes of infectious agents including bacteria and viruses are assayed by amplification of a single target nucleotide sequence through PCR and detection of the presence of target by hybridization of a labelled probe specific for the target sequence. Because such an assay is specific for only a single target sequence it therefore is necessary to amplify the gene by methods such as PCR to provide sufficient target to provide a detectable signal.

According to this example, an improved method of detecting nucleotide sequences characteristic of infectious agents through a Format 3-type reaction is provided wherein a solid phase detection chip is prepared which comprises an array of multiple different immobilized oligonucleotide probes specific for the infectious agent of interest. A single dot comprising a mixture of many unlabeled probes complementary to the target nucleic acid concentrates the label specific to a species at one location thereby improving sensitivity over diffuse or single probe labeling. Such multiple probes may be of overlapping sequences of the target nucleotide sequence but may also be non-overlapping sequences as well as non-adjacent. Such probes preferably have a length of about 5 to 12 nucleotides.

A nucleic acid sample is exposed to the probe array and target sequences present in the sample will hybridize with the multiple immobilized probes. A pool of multiple labeled probes selected to specifically bind to the target sequences adjacent to the immobilized probes is then applied with the sample to an array of unlabeled oligonucleotide probe mixtures. Ligase enzyme is then applied to the chip to ligate the adjacent probes on the sample. The detection chip is then washed to remove unhybridized and unligated probe and sample nucleic acids and the presence of sample nucleic acid may be determined by the presence or absence of label. This method provides reliable sample detection with about a 1000-fold reduction of molarity of the sample agent.

As a further aspect of the invention, the signal of the labelled probes may be amplified by means such as providing a common tail to the free probe which itself comprises multiple chromogenic, enzymatic or radioactive labels or which is itself susceptible to specific binding by a further probe agent which is multiply labelled. In this way, a second round of signal amplification may be carried out. Labeled or unlabeled probes may be used in a second round of amplification. In this second round of amplification, a lengthy DNA sample with multiple labels may result in an increased amplification intensity signal of between 10 and 100 fold which may result in a total signal amplification of 100,000 fold. Through the use of both aspects of this example, an intensity signal approximately 100,000 fold may give a positive result of probe-DNA ligation without having to employ PCR or other amplification procedures.

According to a further aspect of the invention an array or super array may be prepared which consists of a complete set of probes, for example 4096 6-mer probes. Arrays of this type are universal in a sense that they can be used for detection or partial to complete sequencing of any nucleic acid species. Individual spots in an array may contain single probe species or mixtures of probes, for example N(1-3) B(4-6) N(1-3) type of mixtures that are synthesized in the single reaction (N represents all four nucleotides, B one specific nucleotide, and the associated numbers are a range of numbers of bases, i.e., 1-3 means "from one to three bases"). These mixtures provide stronger signal for a nucleic acid species present at low concentration by collecting signal from different parts of the same long nucleic acid species molecule. The universal set of probes may be subdivided in many subsets which are spotted as unit arrays separated by barriers that prevent spreading of hybridization buffer with sample and labeled probe(s). For detection of a nucleic acid species with a known sequence one or more oligonucleotide sequences comprising both unlabelled fixed and labeled probes in solution may be selected. Labeled probes are synthesized or selected from the presynthesized complete sets of, for example, 7-mers. The labeled probes are added to corresponding unit arrays of fixed probes such that a pair of fixed and labeled probes will adjacently hybridize to the target sequence such that upon administration of ligase the probes will be covalently bound.

If a unit array contains more than one fixed probe (as separated spots or within the same spot) that are positive in a given nucleic acid species, all corresponding labeled probes may be mixed and added to the same unit array. The mixtures of labeled probes are even more important when mixtures of nucleic acid species are tested. One example of a complex mixture of nucleic acid species is the mRNAs in one cell or tissue.

According to one embodiment of the invention unit arrays of fixed probes allow use of every possible immobilized probe with cocktails of a relatively small number of labeled probes. More complex cocktails of labeled probes may be used if a multiplex labeling scheme is implemented. Preferred multiplexing methods may use different fluorescent dyes or molecular tags that may be separated by mass spectroscopy.

Alternatively, according to a preferred embodiment of the invention, relatively short fixed probes may be selected which frequently hybridize to many nucleic acid sequences. Such short probes are used in combination with a cocktail of labeled probes which may be prepared such that at least one labeled probe corresponds to each of the fixed probes. Preferred cocktails are those in which none of the labeled probes corresponds to more than one fixed probe.

EXAMPLE 27

Interrogation of Segments of the HIV Virus with All Possible 10-mers

In this example of Format III SBH, an array was generated on nylon membranes (e.g., Gene Screen) of all possible bound 5-mers (1024 possible pentamers). The bound 5-mer oligonucleotides were synthesized with 5' tails of 5'-TTTTTT-NNN-3' (N = all four bases A, C, G, T; at this step in the synthesis equal molar amounts of all four bases are added). These oligonucleotides were precisely spotted onto the nylon membrane, the spots were allowed to dry, and the oligonucleotides were immobilized by treating the dried spots with UV light. Oligonucleotide densities of up to 18 oligonucleotides per square nanometer were obtained using this method. After the UV treatment, the nylon membranes were treated with a detergent containing buffer at 60-80°C. The spots of oligonucleotides were gridded in subarrays of 10 by 10 spots, and each subarray has 64 5-mer spots and 36 control spots. 16 subarrays give 1024 5-mers, which encompasses all possible 5-mers.

The subarrays in the array were partitioned from each other by physical barriers, e.g., a hydrophobic strip, that allowed each subarray to be hybridized to a sample without cross-contamination from adjacent subarrays. In a preferred embodiment, the hydrophobic strip is made from a solution of silicone (e.g., household silicone glue and seal paste) in an appropriate solvent (such solvents are well known in the art). This solution of silicone grease is applied between the subarrays to form lines which, after the solvent evaporates, act as hydrophobic strips separating the cells.

In this Format III example, the free or solution (nonbound) 5-mers were synthesized with 3' tails of 5'-NN-3' (N = all four bases A, C, G, T). In this embodiment, the free 5-mers and the bound 5-mers are combined to produce all possible 10-mers for sequencing a known DNA sequence of less than 20 kb. 20 kb of double stranded DNA is denatured into 40 kb of single-stranded DNA. This 40 kb of ss DNA hybridizes to about 4% of all possible 10-mers. This low frequency of 10-mer binding and the known target sequence allow the pooling of free or solution (nonbound) 5-mers for treatment of each subarray, without a loss of sequence information. In a preferred embodiment, 16 probes are pooled for each subarray, and all possible 5-mers are represented in 64 total pools of free 5-mers. Thus, all possible 10-mers may be probed against a DNA sample using 1024 subarrays (16 subarrays for each pool of free 5-mers).
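The figures in this design can be checked with a few lines of Python. This is a sketch under simple assumptions: the helper names are hypothetical, and the fraction is an upper bound that treats every 10-mer start position in the target as a distinct 10-mer.

```python
def fraction_of_kmers(target_bases: int, k: int) -> float:
    """Upper bound on the fraction of all k-mers present in a target of this length."""
    positions = target_bases - k + 1
    return positions / 4 ** k


def array_layout(probe_length=5, spots_per_subarray=64, pool_size=16):
    """Subarray and pool counts for bound and free probes of the same length."""
    n_probes = 4 ** probe_length                      # all bound (or free) probes
    subarrays_per_set = n_probes // spots_per_subarray
    pools = n_probes // pool_size                     # pools of free probes
    return {
        "probes": n_probes,
        "subarrays_per_set": subarrays_per_set,
        "pools": pools,
        "total_subarrays": subarrays_per_set * pools,
    }


if __name__ == "__main__":
    print(round(100 * fraction_of_kmers(40_000, 10), 1), "% of all 10-mers")
    print(array_layout())
```

Because only a few percent of 10-mers are expected to be positive, most bound 5-mers in a subarray will ligate to at most one free 5-mer of a 16-probe pool, which is why pooling need not cost sequence information for a known target.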

The target DNA in this embodiment represents two 600 bp segments of the HIV virus. These 600 bp segments are represented by pools of 60 overlapping 30-mers (the 30-mers overlap each adjacent 30-mer by 20 nucleotides). The pools of 30-mers mimic a target DNA that has been treated using techniques well known in the art to shear, digest and/or random PCR the target DNA to produce a random pool of very small fragments.

As described above in the previous Format III examples, the free 5-mers are labeled with radioactive isotopes, biotin, fluorescent dyes, etc. The labeled free 5-mers are then hybridized along with the bound 5-mers to the target DNA and ligated. In a preferred embodiment, 300-1000 units of ligase are added to the reaction. The hybridization conditions were worked out following the teachings of the previous examples. Following ligation and removal of the target DNA and excess free probe, the array is assayed to determine the location of labeled probes (using the techniques described in the examples above).

The known DNA sequence of the target, and the known free and bound 5-mers in each subarray, predict which bound 5-mers will be ligated to a labeled free 5-mer in each subarray. The signals from 20 of these predicted dots were lost and 20 new signals were gained for each change in the target DNA from the predicted sequence. The overlapping sequence of the bound 5-mers in these ten new dots identifies which free, labeled 5-mer is bound in each new dot.
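
One way to see where a figure of 20 changed dots can come from: a single base substitution falls inside ten overlapping 10-mer windows on each strand. The sketch below counts those windows. The example sequence, the position of the change, and the function names are hypothetical, and reading the "20" as ten windows on each of two strands is an interpretation, not something the text states.

```python
# Count the 10-mer windows affected by one base substitution, on both strands.
# Illustrative sketch; the sequence and all names are hypothetical.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def changed_windows(reference: str, variant: str, k: int = 10) -> list:
    """Start positions of k-mer windows that differ between two equal-length sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    return [i for i in range(len(reference) - k + 1)
            if reference[i:i + k] != variant[i:i + k]]


wild_type = "ACGTTGCAAGCTTAGGCATCGATCCGTAAGCTTGGACCTA"  # arbitrary 40-nt example
pos = 20                                                 # interior position to change
new_base = "A" if wild_type[pos] != "A" else "C"
mutant = wild_type[:pos] + new_base + wild_type[pos + 1:]

forward = changed_windows(wild_type, mutant)
reverse = changed_windows(reverse_complement(wild_type), reverse_complement(mutant))
print(len(forward), len(reverse))
```

Each changed window corresponds to one predicted 10-mer that no longer matches and one new 10-mer that does, provided the change lies at least nine bases from either end of the target.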

Using the described methods, arrays and pools of free, labeled 5-mers, the test HIV DNA sequence was probed with all possible 10-mers. Using this Format III approach, we properly identified the "wild-type" sequence of the segments tested, as well as several sequence "mutants" that were introduced into these segments.

EXAMPLE 28

Sequencing of Repetitive DNA Sequences

In one embodiment, repetitive DNA sequences in the target DNA are sequenced with "spacer oligonucleotides" in a modified Format III approach. Spacer oligonucleotides of varying lengths of the repetitive DNA sequence (the repeating sequence is identified on a first SBH run) are hybridized to the target DNA along with a first known adjoining oligonucleotide and a second known, or group of possible, oligonucleotides adjoining the other side of the spacer (known from the first SBH run). When a spacer matching the length of the repetitive DNA segment is hybridized to the target, the two adjacent oligonucleotides can be ligated to the spacer. If the first known oligonucleotide is fixed to a substrate, and the second known or possible oligonucleotide(s) is labeled, a bound ligation product including the labeled second known or possible oligonucleotide(s) is formed when a spacer of the proper length is hybridized to the target DNA.

EXAMPLE 29

Sequencing Through Branch Points with Format III SBH

In one embodiment, branch points in the target DNA are sequenced using a third set of oligonucleotides and a modified Format III approach. After a first SBH run, several branch points may be identified when the sequence is compiled. These can be solved by hybridizing oligonucleotide(s) that overlap partially with one of the known sequences leading into the branch point and then hybridizing to the target an additional oligonucleotide that is labeled and corresponds to one of the sequences that comes out of the branch point. When the proper oligonucleotides are hybridized to the target DNA, the labeled oligonucleotide can be ligated to the other(s). In a preferred embodiment, a first oligonucleotide that is offset by one to several nucleotides from the branch point is selected (so that it reads into one of the branch sequences), a second oligonucleotide reading from the first and into the branch point sequence is also selected, and a set of third oligonucleotides that correspond to all the possible branch sequences with an overlap of the branch point sequence by one or a few nucleotides (corresponding to the first oligonucleotide) is selected. These oligonucleotides are hybridized to the target DNA, and only the third oligonucleotide with the proper branch sequence (that matches the branch sequence of the first oligonucleotide) will produce a ligation product with the first and second oligonucleotides.
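
How branch points show up when a sequence is compiled from a first SBH run can be sketched as follows. This is a minimal illustration that assumes the usual overlap rule for assembling k-mers (two k-mers are joined when they share k-1 bases); the function names and the toy sequence are hypothetical and are not taken from the disclosure.

```python
# Find branch points in a set of positively hybridizing k-mers: a branch point
# is a (k-1)-base overlap that can be extended by more than one k-mer.
from collections import defaultdict


def kmers_of(seq: str, k: int) -> set:
    """All k-mers contained in a sequence (stands in for a first SBH run)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_branch_points(kmers: set) -> dict:
    """Map each ambiguous (k-1)-mer prefix to the sorted k-mers that extend it."""
    successors = defaultdict(set)
    for kmer in kmers:
        successors[kmer[:-1]].add(kmer)
    return {overlap: sorted(nexts)
            for overlap, nexts in successors.items()
            if len(nexts) > 1}


# Toy target in which the 3-mer "ACG" occurs twice with different next bases.
branches = find_branch_points(kmers_of("ACGTACGAC", 4))
print(branches)
```

In this toy case the overlap "ACG" is followed by either "ACGA" or "ACGT", so assembly cannot decide which path to take without the additional oligonucleotides described above.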

EXAMPLE 30

Multiplexing Probes for Analyzing a Target Nucleic Acid

In this Example, sets of probes are labeled with different labels so that each probe of a set can be differentiated from the other probes in the set. Thus, the set of probes may be contacted with target nucleic acid in a single hybridization reaction without the loss of any probe information. In preferred embodiments, the different labels are different radioisotopes, or different fluorescent labels, or different EMLs. These sets of probes may be used in either Format I, Format II or Format III SBH.

In Format I SBH, the set of differently labeled probes is hybridized to target nucleic acid which is fixed to a substrate under conditions that allow differentiation between perfect matches and one base-pair mismatches. Specific probes which bind to the target nucleic acid are identified by their different labels and perfect matches are determined, at least in part, from this binding information.

In Format II SBH, the target nucleic acids are labeled with different labels and hybridized to arrays of probes. Specific target nucleic acids which bind to the probes are identified by their different labels and perfect matches are determined, at least in part, from this binding information.

In Format III SBH, the set of differently labeled probes and fixed probes are hybridized to a target nucleic acid under conditions that allow perfect matches to be differentiated from one base-pair mismatches. Labeled probes that are adjacent, on the target, to a fixed probe are bound to the fixed probe, and these products are detected and differentiated by their different labels.

In a preferred embodiment, the different labels are EMLs, which can be detected by electron capture mass spectrometry (EC-MS). EMLs may be prepared from a variety of backbone molecules, with certain aromatic backbones being particularly preferred, e.g., see Xu et al., J. Chromatog. 764:95-102 (1997). The EML is attached to a probe in a reversible and stable manner, and after the probe is hybridized to target nucleic acid, the EML is removed from the probe and identified by standard EC-MS (e.g., the EC-MS may be done by a gas chromatograph-mass spectrometer).

EXAMPLE 31

Detection of Low Frequency Target Nucleic Acids

Format III SBH has sufficient discrimination power to identify a sequence that is present in a sample at 1 part to 99 parts of a similar sequence that differs by a single nucleotide. Thus, Format III can be used to identify a nucleic acid present at a very low concentration in a sample of nucleic acids, e.g., a sample derived from blood.

In one embodiment, the two sequences are for cystic fibrosis and the sequences differ from each other by a deletion of three nucleotides. Probes for the two sequences were as follows: probes distinguishing the deletion from wild type were fixed to a substrate, and a labeled contiguous probe was common to both. Using these targets and probes, the deletion mutant could be detected with Format III SBH when it was present at one part to ninety-nine parts of the wild-type.

EXAMPLE 32

Polaroid Apparatus and Method for Analyzing a Target Nucleic Acid

An apparatus for analyzing a nucleic acid can be constructed with two arrays of nucleic acids, and an optional material that prevents the nucleic acids of the two arrays from mixing until such mixing is desired. The arrays of the apparatus may be supported by a variety of substrates, including but not limited to, nylon membranes, nitrocellulose membranes, or other materials disclosed above. In preferred embodiments, one of the substrates is a membrane separated into sectors by hydrophobic strips, or a suitable support material with wells which may contain a gel or sponge. In this embodiment, probes are placed on a sector of the membrane, or in the well, the gel, or sponge, and a solution (with or without target nucleic acids) is added to the membrane or well so that the probes are solubilized. The solution with the solubilized probes is then allowed to contact the second array of nucleic acids. The nucleic acids may be, but are not limited to, oligonucleotide probes, or target nucleic acids, and the probes or target nucleic acids may be labeled. The nucleic acids may be labeled with any labels conventionally used in the art, including but not limited to radioisotopes, fluorescent labels or electrophore mass labels.

The material which prevents mixing of the nucleic acids may be disposed between the two arrays in such a way that when the material is removed the nucleic acids of the two arrays mix together. This material may be in the form of a sheet, membrane, or other barrier, and this material may be comprised of any material that prevents the mixing of the nucleic acids.

This apparatus may be used in Format I SBH as follows: a first array of the apparatus has target nucleic acids that are fixed to the substrate, and a second array of the apparatus has nucleic acid probes that are labeled and can be removed to interrogate the target nucleic acid of the first array. The two arrays are optionally separated by a sheet of material that prevents the probes from contacting the target nucleic acid, and when this sheet is removed the probes can interrogate the target. After appropriate incubation and (optionally) washing steps, the array of targets may be "read" to determine which probes formed perfect matches with the target. This reading may be automated or can be done manually (e.g., by eye with an autoradiogram). In Format II SBH, the procedure followed would be similar to that described above except that the target is labeled and the probes are fixed.

Alternatively, the apparatus may be used in Format III SBH as follows: two arrays of nucleic acid probes are formed, the nucleic acid probes of either or both arrays may be labeled, and one of the arrays may be fixed to its substrate. The two arrays are separated by a sheet of material that prevents the probes from mixing. A Format III reaction is initiated by adding target nucleic acid and removing the sheet, allowing the probes to mix with each other and the target. Probes which bind to adjacent sites on the target are bound together (e.g., by base-stacking interactions or by covalently joining the backbones), and the results are read to determine which probes bound to the target at adjacent sites. When one set of probes is fixed to the substrate, the fixed array can be read to determine which probes from the other array are bound together with the fixed probes. As with the above method, this reading may be automated (e.g., with an ELISA reader) or can be done manually (e.g., by eye with an autoradiogram).

EXAMPLE 33

Three Dimensional Arrays Of Probes

In a preferred embodiment, the oligonucleotide probes are fixed in a three-dimensional array. The three-dimensional array is comprised of multiple layers, such that each layer may be analyzed separate and apart from the other layers, or all the layers of the three-dimensional array may be simultaneously analyzed. Three-dimensional arrays include, for example, an array disposed on a substrate having multiple depressions with probes located at different depths within the depressions (each level is made up of probes at similar depths within the depression); or an array disposed on a substrate having depressions of different depths with the probes located at the bottom of the depression, at the peaks separating the depressions or some combination of peaks and depressions (each level is made up of all probes at a certain depth); or an array disposed on a substrate comprised of multiple sheets that are layered to form a three-dimensional array.

Materials for synthesizing these three-dimensional arrays are well known in the art, and include the materials previously recited in this specification as suitable as supports for probe arrays. In addition, other suitable materials which can support oligonucleotide probes, and which, preferably, are flexible may be used as substrates.

EXAMPLE 34

Signature Processing For Clustering cDNA Clones

A plurality of distinct nucleic acid sequences were obtained from a cDNA library, using standard PCR, SBH sequence signature analysis and Sanger sequencing techniques. The inserts of the library were amplified with PCR using primers specific for vector sequences which flank the inserts. These samples were spotted onto nylon membranes and interrogated with a suitable number of oligonucleotide probes, and the intensity of positive binding probes was measured, giving sequence signatures. The clones were clustered into groups of similar or identical sequence signatures, and single representative clones were selected from each group for gel sequencing. The 5' sequence of the amplified inserts was then deduced using the reverse M13 sequencing primer in a typical Sanger sequencing protocol. PCR products were purified and subjected to fluorescent dye terminator cycle sequencing. Single pass gel sequencing was done using a 377 Applied Biosystems (ABI) sequencer. The majority of clones which were selected and sequenced by this method had sequences which differed from each other, and a very small number had the same sequence.

EXAMPLE 35

High-Throughput Production Of Chips

In a preferred embodiment, an apparatus for mass producing arrays of probes may comprise a rotating drum or plate coupled with an ink-jet deposition apparatus, for example, a microdrop dosing head; and a suitable robotics system, for example, an Anorad gantry. A particularly preferred embodiment of the apparatus will be described referring to Figs. 1-3.

The apparatus comprises a cylinder (1) to which a suitable substrate is fixed. The substrate may be any of the materials previously described as suitable for an array of probes. In a preferred embodiment, the substrate is a flexible material, and the arrays are made directly on the substrate. In alternative embodiments, a flexible substrate is fixed to the cylinder and individual chips are fixed on the substrate. The arrays are then made on each individual chip.

In a preferred embodiment, physical barriers are applied to the substrate or chip and define an array of wells. The physical barriers may be applied to the substrate or chip by the apparatus, or alternatively, the physical barriers are applied to the chips or substrate before they are fixed to the cylinder (1). A single spot of oligonucleotide probes is then placed into each well, wherein the probes placed into an individual well may all have the same sequence, or the probes spotted into an individual well may have different sequences. In a more preferred embodiment, the probe or probes spotted into each individual well in an array are different from the probe or probes spotted in the other wells of the array. Sequencing chips comprising multiple arrays can then be assembled from these arrays.

After the substrate or substrate and chips are fixed to the cylinder (1), a motor (not shown) rotates the cylinder. The cylinder's rotation speed is precisely determined by any of the ways well known in the art, including, for example, using a fixed optical sensor and light source that rotates with the cylinder. A dispensing apparatus (3) moves along an arm (2) and can deliver probes or other reagents through a dispensing tip (8) to precise locations on the substrate or chips using the precise rotation speed calculated above, by methods well known in the art. The dispensing apparatus receives probes or reagents from the reservoir (6) through the feeding line (7). The reservoir (6) holds all the necessary probes and other reagents for making the arrays.

The dispensing apparatus is depicted in Figure 3. The dispensing apparatus may have one or multiple dispensing tips (14 & 8). Each dispensing tip has a sample well (13) in a body (12) that receives probes or other reagents through a sample line (10). The pressure line (11) pressurizes the chamber (9) to a psi sufficient to force probes or reagents through the dispensing tip (14 & 8). The sample line (10), well (13) and dispensing tip (14 & 8) must be flushed between each change in probe or reagent. An appropriate washing buffer is supplied through sample line (10) or through an optional dedicated washing line (not shown) to the sample well (13), or optionally a portion or all of the chamber (9) may be filled with washing buffer. The washing buffer is then removed from the sample well (13) and chamber (9), if necessary, by an evacuation line (not shown) or through the sample line (10) and dispensing tip (14 & 8).

When the dispensing means has applied probes to all the appropriate sites in each array or chip, the substrate (with or without chips) is removed from the cylinder and a new substrate is fixed to the cylinder.

EXAMPLE 36

Analysis Of A Target Nucleic Acid With Probes Complexed To Discrete Particles

In this embodiment, a target nucleic acid is interrogated with probes that are complexed (covalently or noncovalently) to a plurality of discrete particles. The discrete particles can be discriminated from each other based on a physical property (or a combination of physical properties), and particles differentiated by the physical property are complexed with different probes. In a preferred embodiment, the probe is an oligonucleotide of a known sequence and length. Thus, a probe may be identified by the physical property of the discrete particle. Suitable probes for this embodiment include all the probes that are described above in previous sections, including probes which are shorter in an informative sense than the probe's full length.

The physical property of the discrete particle may be any property, well known in the art, which allows particles to be differentiated into sets. For example, the particles could be differentiated into sets based on their size, fluorescence, absorbance, electromagnetic charge, or weight, or the particles could be labeled with dyes, radionuclides, or EMLs. Other suitable labels include ligands which can serve as specific binding members to a labeled antibody, chemiluminescers, enzymes, antibodies which can serve as a specific binding pair member for a labeled ligand, and the like. A wide variety of labels have been employed in immunoassays which can readily be employed. Still other labels include antigens, groups with specific reactivity, and electrochemically detectable moieties. Still further labels include any of the labels recited above in previous sections. These labels and properties may be measured quantitatively by methods well known in the art, including, for example, those methods described above in previous sections, and the particles may be differentiated on the basis of signal intensity or signal type (for one of the labels, e.g., different dye densities may be applied to a particle, or different types of dyes). In a preferred embodiment, several physical properties are combined and the different combinations of properties allow discrimination of the particles (e.g., ten sizes and ten colors could be combined to differentiate 100 particle groups).

The particle-probes allow the exploitation of standard combinatorial approaches so that, for example, all possible 10-mers can be synthesized using about 2000 reaction containers. A first set of 1024 reactions are done to synthesize all possible 5-mers on 1024 differentially labeled particles. The resulting probe-particles are mixed together, and split into another set of 1024 reaction containers. A second set of reactions are done with these samples to synthesize all possible 5-mer extensions on the probes in the pools of particles. The physical property identifies the first five nucleotides of each probe and the reaction container identifies the second five nucleotides of every probe. Thus, all possible 10-mer probes are synthesized using 2048 reaction containers. This approach is easily modified to make all possible n-mers for a large range of probe lengths.
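
The split-and-pool bookkeeping described here can be sketched in a few lines. This is only an illustration under stated assumptions: particle labels and reaction containers are represented as integer indices into an alphabetical list of 5-mers, and the second 5-mer is written after the first. Neither convention comes from the disclosure, and all names are hypothetical.

```python
# Split-and-pool synthesis bookkeeping: a particle label encodes the first
# 5-mer and the reaction container encodes the second 5-mer.
from itertools import product

HALF_LENGTH = 5
FIVE_MERS = ["".join(p) for p in product("ACGT", repeat=HALF_LENGTH)]  # 1024 entries


def containers_needed(half_length: int = HALF_LENGTH) -> int:
    """One round of 4**half_length reactions for each half of the probe."""
    return 2 * 4 ** half_length


def probe_identity(particle_label: int, container_index: int) -> str:
    """Recover a 10-mer from its particle label and second-round container."""
    return FIVE_MERS[particle_label] + FIVE_MERS[container_index]


distinct_probes = len(FIVE_MERS) ** 2
print(containers_needed(), distinct_probes, probe_identity(0, 1023))
```

The point of the scheme is that the number of containers grows as the sum of the two rounds (1024 + 1024) while the number of distinct probes grows as their product (1024 x 1024).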

In a preferred embodiment, the particles are separated into sets by the intensity of fluorescence of the particles. The particles in each set are prepared with varying densities of fluorescent label, and thus, the particles have different fluorescence intensities. The fluorescent intensity of fluorescein is related to concentration over a range of 1:300 to 1:300,000 (Lockhart et al., 1986), and between 1:3000 to 1:300,000 there is a linear relationship (so the fluorescein intensity is linear over a range of about 1-300). In the linear range of detection, 256 sets of particles are labeled with fluorescein (e.g., 3-259). 256 sets of particles allow all possible 4-mers to be attached to different sets of particles. By pooling the particles, all possible 5-mers can be made by having four pools of all possible 4-mers and then extending the probes in each pool by A, G, C, or T. Similarly, all possible 6-mers can be made by having 16 pools of all possible 4-mers in which each pool of 4-mers is extended by one of the 16 possible two-base permutations of A, G, C, and T (and so on: for 7-mers there are 64 pools, for 8-mers there are 256 pools, and so forth).
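
The pool counts in this paragraph follow from extending the particle-bound 4-mers one base at a time. A minimal sketch, assuming 256 distinguishable particle sets as in the text; the function names are illustrative only.

```python
# Number of pools needed to cover all n-mers when the particle label alone
# can distinguish `particle_sets` different base oligomers.

def base_length_for(particle_sets: int) -> int:
    """Longest oligomer length whose 4**length variants fit in the particle sets."""
    length = 0
    while 4 ** (length + 1) <= particle_sets:
        length += 1
    return length


def pools_needed(probe_length: int, particle_sets: int = 256) -> int:
    """Each pool carries every base oligomer extended by one fixed suffix."""
    base = base_length_for(particle_sets)
    if probe_length < base:
        raise ValueError("probe is shorter than the particle-encoded oligomer")
    return 4 ** (probe_length - base)


for n in (5, 6, 7, 8):
    print(n, pools_needed(n))
```

With 256 particle sets the particle encodes a 4-mer, so 5-, 6-, 7- and 8-mers need 4, 16, 64 and 256 pools respectively, matching the numbers given above.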

The 5-mer probes (in four pools) are used to interrogate a target nucleic acid. The target nucleic acid is labeled with another fluorescent dye, or other different label (as described above). Labeled target is mixed with the four pools, and complementary probes in each pool hybridize to the target nucleic acid. These hybridization complexes are detected by methods well known in the art, and the positively hybridizing probes are then identified by detecting the fluorescence intensity of the particle. In a most preferred embodiment, the mixture of probe-particles and target nucleic acid are fed through a flow-cytometer or other separating instrument one particle at a time, and the particle label and the target are measured to determine which probes are complementary to the target nucleic acid.
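
Decoding such single-particle measurements can be sketched as below. The event structure, the binning of particle fluorescence into 256 integer levels, the alphabetical mapping of levels to 4-mers, the placement of the pool base at the end of the probe, and the signal threshold are all assumptions made for illustration; the text does not specify any of them.

```python
# Decode flow-cytometer style events into positively hybridizing 5-mer probes.
from itertools import product
from typing import NamedTuple

FOUR_MERS = ["".join(p) for p in product("ACGT", repeat=4)]  # one per particle set


class Event(NamedTuple):
    pool_base: str        # base used to extend the 4-mers in this pool
    intensity_level: int  # particle fluorescence bin, 0..255 (hypothetical binning)
    target_signal: float  # signal from the labeled target on this particle


def positive_probes(events, threshold: float) -> set:
    """Return the 5-mers whose particles carry target label at or above threshold."""
    hits = set()
    for event in events:
        if event.target_signal >= threshold:
            hits.add(FOUR_MERS[event.intensity_level] + event.pool_base)
    return hits


events = [
    Event("A", 0, 5.0),     # particle set 0 in the A pool, strong target signal
    Event("T", 255, 0.1),   # particle set 255 in the T pool, background only
    Event("G", 27, 3.2),    # particle set 27 in the G pool, strong target signal
]
hits = positive_probes(events, threshold=1.0)
print(sorted(hits))
```

In practice the threshold would have to be set from perfect-match and mismatch controls rather than chosen arbitrarily as it is here.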

In an alternative embodiment, a set of free probes is labeled with another fluorescent dye, or other label (as described above), and individual free probes are mixed with each pool of 5-mer probes (four pools) and then the mixtures are hybridized with the target nucleic acid. An agent is added to covalently attach free probe to 5-mer probe (see previous sections for a description of suitable agents), when the free probe is bound to a site on the target nucleic acid that is adjacent to the site on which the 5-mer probe is bound (the free probe site must be adjacent to the end of the 5-mer probe which can be ligated). The particles are then assayed, by methods well known in the art, to determine which particles have been covalently coupled to the free probe, i.e., the particles which have the free probe label, and the 5-mer probe is identified by the fluorescent intensity of the particle. In a most preferred embodiment, the mixture of probe-particles, free probes, and target nucleic acid are fed through a flow-cytometer one particle at a time, and the particle label and the free probe label are measured to determine which probes are complementary to the target nucleic acid.

In a preferred embodiment, a single apparatus houses all or most of the manipulations for an analysis of a target nucleic acid with the probe-particle complexes. The apparatus has one or more reagent chambers in which buffer and labeled target nucleic acid are thoroughly mixed (target nucleic acid may be added manually or automatically). The mixture is aliquoted from the reagent chamber into a plurality of reaction chambers, and each reaction chamber has a pool of probe-particle complexes. The probe-particles and target nucleic acid react under conditions which allow complementary probes to bind with the target nucleic acid. Excess target nucleic acid, i.e., nonbound, is removed from the reaction chamber (e.g., by washing), and the particles bound to target nucleic acid are identified by the association of target nucleic acid label with the particles, and the probe is identified by the physical property of the particle. In a preferred embodiment, after removing excess target, the particles move single file through a channel from the reaction chamber to the detecting device(s). As single particles move past the detecting device(s), they measure the target label and the physical property of the particle. In an alternative preferred embodiment, before or after removing excess target, the particles are fractionated, for example, by size (e.g., exclusion chromatography), charge (e.g., ion exchange chromatography), and/or density-weight into their sets using one or a combination of these physical properties. These fractionated particles are then assayed by the detecting device(s).

In an alternative of this embodiment, the main reagent chamber is supplied with buffer, target nucleic acid, a pool of probe-particle complexes, and a chemical or enzymatic ligating reagent. These components are thoroughly mixed and then aliquoted from the reagent chamber into a plurality of reaction chambers. Each of the reaction chambers has a labeled free probe.

Alternatively, the pool of probe-particle complexes may be placed in the reaction chamber with the free probe instead of adding them to the reagent chamber. Additionally, the free probes could be added to the reagent chamber, and the pool of probe-particles could be added to the reaction chamber. The probe-particles, target nucleic acid, and free probe react under conditions which allow free- and particle-probes to bind with adjacent sites on the target nucleic acid so that free probe is ligated to the probe-particle. Excess free probe (i.e., nonligated) and target nucleic acid are removed from the reaction chamber (e.g., by washing), and the ligated probes are identified by the association of free probe label with the particles, and the probes complexed to the particles are identified by the physical property of the particle. In a preferred embodiment, after removing excess probe and the target, the particles move single file through a channel from the reaction chamber to the detecting device(s). As single particles move past the detecting device(s), they measure free probe label covalently attached to the particles, and the physical property of the particle. In an alternative preferred embodiment, before or after removing excess probe and the target, the particles are fractionated, for example, by size (e.g., exclusion chromatography), charge (e.g., ion exchange chromatography), and/or density-weight into their sets using the physical property. These fractionated particles are assayed by the detecting device(s).

In a preferred embodiment, there is a set of second reaction chambers in the apparatus, and the pool of probe-particles are placed in the second reaction chamber. Target and buffer are mixed in the reagent chamber and these are fed into the first reaction chamber which contains the labeled free probe. The probe and target are mixed, and optionally the probe may hybridize to the target. This mixture of labeled probe and target is then passed to the second reaction chamber which contains the pool of probe-particles. The free probe and probe-particles hybridize to target and appropriate probes are ligated in the second reaction chamber. The ligating agent may be added at the reagent chamber or in either reaction chamber; preferably, the ligating agent is added in the second reaction chamber. The probe-particle hybridization products in the second reaction chamber are analyzed as above.

In one embodiment, the target nucleic acid is not amplified prior to analysis (either by PCR or in a vector, e.g., a lambda library). Preferably, longer free and particle probes are used in this embodiment because of the increase in sequence complexity of the sample (i.e., to distinguish positives over background).

The probe-particle embodiments described in this example are suitable for use in any of the applications previously described, including, but not limited to, the previously described diagnostic and sequencing applications. Additionally, these probe-particle embodiments may be modified by any of the previously described variations or modifications.

EXAMPLE 37

The Interaction of Complementary Polynucleotides in the Presence of Agents Which Modify the Binding Between the Polynucleotides.

In this embodiment, the discrimination of perfect matches from mismatches in the binding of complementary polynucleotides is modulated by the addition of an agent or agents. In a preferred embodiment, the complementary polynucleotides are a target polynucleotide and a polynucleotide probe. The discrimination of perfect matches from mismatches may be modulated by adding an agent wherein the agent is a salt such as tetraalkyl ammonium salt (e.g., TMAC, Ricelli et al., Nucl. Acids Res. 21:3785-3788 (1993)), sodium chloride, phosphate salts, and borate salts, organic solvents such as formamide, glycol, dimethylsulfoxide, and dimethylformamide, urea, guanidinium, amino acid analogs such as betaine (Henke et al., Nucl. Acids Res. 19:3957-3958 (1997); Rees et al., Biochemistry 32:137-144 (1993)), polyamines such as spermidine and spermine (Thomas et al., Nucl. Acids Res. 25:2396-2402 (1997)), or other positively charged molecules which neutralize the negative charge of the phosphate backbone, detergents such as sodium dodecyl sulfate, and sodium lauryl sarcosinate, minor/major groove binding agents, positively charged polypeptides, and intercalating agents such as acridine, ethidium bromide, and anthracene. In a preferred embodiment, a mixture of agents is added to the hybridization reaction to modulate the discrimination of perfect matches from mismatches. Some of these agents effect discrimination by reducing the entropy of melting between two complementary strands. In a preferred embodiment, the discrimination of perfect matches from mismatches is improved by the agent or agents.

For example, formamide, a commonly used denaturing agent, has been shown to preferentially destabilize mismatches versus perfect matches in a Format III reaction. As described above, a Format III reaction was set up, and then varying amounts of formamide were added (0%, 10%, 20%, 30%, 40%, and 50%). At 0% a perfect match signal was detected and the background (mismatches) was high. At 10% formamide, there was a good perfect match signal and the background/mismatch signal was reduced. At 20% formamide, the perfect match signal was reduced (but detectable) and the background/mismatch signal was eliminated. At 30%-50% formamide there was no perfect match or background/mismatch signal.
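
The titration outcome can be tabulated and queried as in the sketch below. The qualitative labels are transcribed from the description above; the data structure, the label strings, and the selection rule (detectable perfect match, background eliminated) are illustrative assumptions, not part of the disclosure.

```python
# Qualitative results of the formamide titration, keyed by percent formamide.
TITRATION = {
    0:  {"perfect_match": "detected", "background": "high"},
    10: {"perfect_match": "good", "background": "reduced"},
    20: {"perfect_match": "reduced", "background": "eliminated"},
    30: {"perfect_match": "none", "background": "none"},
    40: {"perfect_match": "none", "background": "none"},
    50: {"perfect_match": "none", "background": "none"},
}


def clean_discrimination(titration: dict) -> list:
    """Concentrations with a detectable perfect match and no background signal."""
    return [pct for pct, result in sorted(titration.items())
            if result["perfect_match"] != "none"
            and result["background"] in ("eliminated", "none")]


best = clean_discrimination(TITRATION)
print(best)
```

Under this selection rule only the 20% condition qualifies, which reflects the trade-off described: more formamide removes mismatch signal first, then the perfect match signal as well.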

In an alternative embodiment, an agent is used to reduce or increase the T_m of a pair of complementary polynucleotides. In a more preferred embodiment, a mixture of the agents is used to reduce or increase the T_m of a pair of complementary polynucleotides. The agents may alter the T_m in a number of ways; two examples, which are not meant to limit the invention, are (1) agents which disrupt the hydrogen bonding between the bases of two complementary polynucleotides (Goodman, Proc. Nat'l Acad. Sci. 94:10493-10495 (1997); Moran et al., Proc. Nat'l Acad. Sci. 94:10506-10511 (1997); Nguyen et al., Nucl. Acids Res. 25:3059-3065 (1997)), and (2) agents which neutralize or shield the negative charges of the phosphates in the sugar phosphate backbone of the polynucleotide (Thomas et al., Nucl. Acids Res. 25:2396-2402 (1997)). By strengthening or weakening (1) and/or (2) one can modulate the T_m of a complementary pair of polynucleotides. In a preferred embodiment, the formation of complexes between the probe and target nucleic acid is inhibited by the addition of an agent. For example, the formation of ligation products in a Format III type reaction was eliminated by the addition of an alkyl polysulphonic acid. When 0.1-1.0% polyanethole sulfonic acid (or polyanethole sulfonate) was added, the signal was eliminated.

In a most preferred embodiment, an agent or agents are added to decrease the binding energy of GC base pairs, or increase the binding energy of AT base pairs, or both. In a preferred embodiment, the agent or agents are added so that the binding energy from an AT base pair is approximately equivalent to the binding energy of a GC base pair. Thus, the energy of binding between two complementary polynucleotides is solely dependent on length. The energy of binding of these complementary polynucleotides may be increased by adding an agent that neutralizes or shields the negative charges of the phosphate groups in the polynucleotide backbone.

EXAMPLE 38

Enhancing The Activity Of A Nucleic Acid Modifying Polypeptide

In one embodiment, the discrimination of perfect matches from mismatches is enhanced in a format III reaction. The discrimination is enhanced by an agent selected from the group comprising a polyamine such as spermidine or spermine, other positively charged molecules which neutralize the negative charge of the phosphate backbone, and a Mg²⁺ ion. The discrimination is also enhanced by changing a physical condition selected from the group comprising temperature, reaction time, or ionic strength. In a most preferred embodiment, several agents are added and several physical conditions are changed. For example, the discrimination of perfect matches from mismatches was increased about 10-100 fold by adding or altering the following agents and physical conditions: 100 mM MgCl₂ (increase from 10 mM MgCl₂), 100 mM dithiothreitol, 100 μg/ml BSA (increase from 25 μg/ml), 10 mM ATP (increase from 1 mM), 10 mM spermidine, 10 units/μl ligase (increase from 4 units/μl), at room temperature (increase from 4 °C) for 30 minutes (decrease from 120 minutes). The MgCl₂, ATP, dithiothreitol, and BSA act to stabilize the ligase during the reaction. The increased temperature and ligase concentration increase the rate at which ligation products are produced so that the reaction time can be decreased. These factors may also impact the ligation reaction through a kinetic effect. The MgCl₂ and the spermidine enhance discrimination by favoring the formation of perfect matches over mismatches (they preferentially increase the ΔG of formation for the perfect match over the mismatch). In addition, discrimination in a format III reaction was enhanced by the following agents: 20-100 mM MgCl₂ and 5-10 mM ATP, 10-100 mM dithiothreitol, 50-100 μg/ml BSA, or 5-20 mM spermidine. Other agents that may be used to enhance discrimination include those recited supra. Discrimination was also enhanced by raising the temperature from 4 °C to 16-37 °C, or increasing the ligase concentration to 5-20 units/μl.

In an alternative embodiment, the activity of a polynucleic acid polymerase is enhanced by the agents which enhance the discrimination of perfect matches and mismatches between a target nucleic acid and a complementary polynucleotide. The reaction mixture for the polymerase includes a target nucleic acid, a polynucleotide primer and an agent(s) which enhances the discrimination of the perfect matches from the mismatches. The polynucleic acid polymerase reacts with the primer to replicate the target nucleic acid, and correct (perfect match) priming is favored over mismatch priming by the agent. The agent(s) may include a salt such as tetraalkyl ammonium salt (e.g., TMAC, Ricelli et al., Nucl. Acids Res. 21:3785-3788 (1993)), sodium chloride, phosphate salts, and borate salts, organic solvents such as formamide, glycol, dimethylsulfoxide, and dimethylformamide, urea, guanidinium, amino acid analogs such as betaine (Henke et al., Nucl. Acids Res. 19:3957-3958 (1997); Rees et al., Biochemistry 32:137-144 (1993)), polyamines such as spermidine and spermine (Thomas et al., Nucl. Acids Res. 25:2396-2402 (1997)), or other positively charged molecules which neutralize the negative charge of the phosphate backbone, detergents such as sodium dodecyl sulfate and sodium lauryl sarcosinate, minor/major groove binding agents, positively charged polypeptides, and intercalating agents such as acridine, ethidium bromide, and anthracene. In a preferred embodiment, a mixture of agents is added to the hybridization reaction to modulate the discrimination of perfect matches from mismatches. In a most preferred embodiment, the agent(s) is used to enhance proper priming in a PCR reaction. For example, when 10 mM spermidine was added to a PCR reaction, there was at least a 5-fold increase in product.
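To put the reported 10-100 fold increase in discrimination on a free-energy scale, the following sketch converts a fold change in the perfect-match to mismatch ratio into the corresponding change in the free-energy gap, using the standard equilibrium relation ratio = exp(ΔΔG/RT). The function names and the choice of 25 °C for "room temperature" are assumptions made for illustration; the specification does not report ΔΔG values.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg_for_fold_change(fold: float, temp_c: float = 25.0) -> float:
    """Extra free-energy gap (kcal/mol) that raises the match/mismatch ratio by `fold`."""
    if fold <= 0:
        raise ValueError("fold change must be positive")
    temp_k = temp_c + 273.15
    return R_KCAL * temp_k * math.log(fold)


def fold_change_for_ddg(ddg_kcal: float, temp_c: float = 25.0) -> float:
    """Inverse relation: fold change in the ratio produced by a free-energy gap."""
    temp_k = temp_c + 273.15
    return math.exp(ddg_kcal / (R_KCAL * temp_k))


if __name__ == "__main__":
    # Assumed temperature of 25 C; the text says only "room temperature".
    for fold in (10, 100):
        print(fold, round(ddg_for_fold_change(fold), 2))
```

Under these assumptions a 10-fold improvement corresponds to roughly 1.4 kcal/mol and a 100-fold improvement to roughly 2.7 kcal/mol, which gives a sense of how small an energetic shift is needed if the effect is purely thermodynamic.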

In still another alternative embodiment, the activity of a polypeptide which modifies a nucleic acid, such as, for example, an integrase, a gyrase, a nuclease, a helicase, a methylase, and a capping enzyme, is enhanced by the agents which enhance the discrimination of perfect matches and mismatches between a target nucleic acid and a complementary polynucleotide. The reaction mixture for the polypeptide includes a target nucleic acid, a complementary polynucleotide and an agent(s) which enhances the discrimination of the perfect matches from the mismatches. The polypeptide reacts with the complex of the polynucleotide and the target nucleic acid, and the perfect match complexes are favored over mismatch complexes by the agent. Agents that may be used to enhance discrimination include those recited supra.

EXAMPLE 39

Enhancing The Discrimination of Perfect Matches from Mismatches Using a Modified Ligase

The invention relates to modified DNA ligases which increase the discrimination of perfect matches from mismatches for complementary polynucleotides. The modified ligase enhances discrimination in a number of ways, for example, the ligase may increase the difference in the on rates and/or the off rates between a perfect match product and a mismatch product (a kinetic effect); or the ligase may increase the binding energy difference between a perfect match and a mismatch (a free energy [ΔG] effect); or the ligase may itself discriminate between perfect matches and mismatches (ΔG or kinetic effect); or some combination of these and other factors.

39.1 Modified Ligases

The modified ligase of the invention may be prepared by methods well known in the art for modifying polypeptides, such as are found in, e.g., Current Protocols in Protein Science (1997) JE Coligan, et al., eds. John Wiley & Sons, New York; and Kaiser ET, Lawrence DS, Rokita SE. (1985) "The chemical modification of enzymatic specificity." Annu Rev Biochem, 54:565-595.

The ligases of the invention may also be modified by producing variants of the ligase nucleic acids. These amino acid sequence variants may be prepared by methods known in the art by introducing appropriate nucleotide changes into a native or variant polynucleotide. There are two variables in the construction of amino acid sequence variants: the location of the mutation and the nature of the mutation. The amino acid sequence variants of the ligase nucleic acids are preferably constructed by mutating the polynucleotide to give an amino acid sequence that does not occur in nature. These amino acid alterations can be made at sites that differ in the nucleic acids from different species (variable positions) or in highly conserved regions (constant regions). Sites at such locations will typically be modified in series, e.g., by substituting first with conservative choices (e.g., hydrophobic amino acid to a different hydrophobic amino acid) and then with more distant choices (e.g., hydrophobic amino acid to a charged amino acid), and then deletions or insertions may be made at the target site.
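The "conservative first, then more distant" ordering described above can be sketched as a simple ranking of candidate substitutions by side-chain class. The class assignments below are a common textbook simplification chosen for illustration, and the names `SIDE_CHAIN_CLASS`, `classify_substitution`, and `substitution_series` are hypothetical; the specification does not prescribe any particular grouping.

```python
# Simplified side-chain classes (one-letter amino acid codes).
# This grouping is an illustrative assumption, not taken from the specification.
_CLASSES = {
    "hydrophobic": "AVLIMFW",
    "polar": "STNQYC",
    "positive": "KRH",
    "negative": "DE",
    "special": "GP",
}
SIDE_CHAIN_CLASS = {aa: name for name, group in _CLASSES.items() for aa in group}


def classify_substitution(wild_type: str, mutant: str) -> str:
    """Label a substitution as 'identical', 'conservative', or 'distant'."""
    wt, mt = wild_type.upper(), mutant.upper()
    if wt not in SIDE_CHAIN_CLASS or mt not in SIDE_CHAIN_CLASS:
        raise ValueError("unknown amino acid code")
    if wt == mt:
        return "identical"
    same_class = SIDE_CHAIN_CLASS[wt] == SIDE_CHAIN_CLASS[mt]
    return "conservative" if same_class else "distant"


def substitution_series(wild_type: str) -> list:
    """Order the 19 possible replacements: conservative choices first, then distant ones."""
    wt = wild_type.upper()
    others = [aa for aa in sorted(SIDE_CHAIN_CLASS) if aa != wt]
    conservative = [aa for aa in others if classify_substitution(wt, aa) == "conservative"]
    distant = [aa for aa in others if classify_substitution(wt, aa) == "distant"]
    return conservative + distant


if __name__ == "__main__":
    print(classify_substitution("L", "I"))  # hydrophobic to hydrophobic
    print(classify_substitution("L", "K"))  # hydrophobic to charged
    print(substitution_series("L"))
```

A ranking like this only orders the substitutions to try; whether a given variant actually improves discrimination would still have to be determined experimentally.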

Amino acid sequence deletions generally range from about 1 to 30 residues, preferably about 1 to 10 residues, and are typically contiguous. Amino acid insertions include amino- and/or carboxyl-terminal fusions ranging in length from one to one hundred or more residues, as well as intrasequence insertions of single or multiple amino acid residues. Intrasequence insertions may range generally from about 1 to 10 amino acid residues, preferably from 1 to 5 residues. Examples of terminal insertions include the heterologous signal sequences necessary for secretion or for intracellular targeting in different host cells. In a preferred method, polynucleotides encoding the ligase nucleic acids are changed via site-directed mutagenesis. This method uses oligonucleotide sequences that encode the polynucleotide sequence of the desired amino acid variant, as well as sufficient adjacent nucleotides on both sides of the changed amino acid to form a stable duplex on either side of the site being changed. In general, the techniques of site-directed mutagenesis are well known to those of skill in the art, and this technique is exemplified by publications such as Edelman et al., DNA 2:183 (1983). A versatile and efficient method for producing site-specific changes in a polynucleotide sequence was published by Zoller and Smith, Nucleic Acids Res. 10:6487-6500 (1982).
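The oligonucleotide design described above (the changed codon plus enough flanking sequence on each side to form a stable duplex) can be sketched as follows. The function name `design_mutagenic_oligo`, the example template, the replacement codon, and the default flank length of 12 nucleotides are hypothetical values chosen for illustration. The sketch returns the sense-strand sequence only and does not check duplex stability.

```python
def design_mutagenic_oligo(template: str, codon_index: int, new_codon: str,
                           flank: int = 12) -> str:
    """Return the sense-strand oligo carrying `new_codon` at the given codon.

    `codon_index` is zero-based and counted in the reading frame that starts
    at template[0]. `flank` is the number of unchanged nucleotides kept on
    each side of the changed codon (an illustrative default).
    """
    template = template.upper()
    new_codon = new_codon.upper()
    if len(new_codon) != 3 or set(new_codon) - set("ACGT"):
        raise ValueError("new_codon must be three of A, C, G, T")
    start = codon_index * 3
    end = start + 3
    if codon_index < 0 or end > len(template):
        raise ValueError("codon lies outside the template")
    if start < flank or len(template) - end < flank:
        raise ValueError("not enough flanking sequence on one side")
    return template[start - flank:start] + new_codon + template[end:end + flank]


if __name__ == "__main__":
    # Hypothetical 10-codon template: ATG GCT AAA GGT CTG GAA GTT CCG CAT TAA
    template = "ATGGCTAAAGGTCTGGAAGTTCCGCATTAA"
    # Change codon 4 (CTG, Leu) to ATT (Ile), keeping 9 nt on each side.
    print(design_mutagenic_oligo(template, 4, "ATT", flank=9))
```

In practice the flank length would be chosen from the melting behavior of the two arms rather than fixed in advance, and the oligonucleotide would anneal to the strand complementary to the sequence shown.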

PCR may also be used to create amino acid sequence variants of the ligase nucleic acids. When small amounts of template DNA are used as starting material, primer(s) that differs slightly in sequence from the corresponding region in the template DNA can generate the desired amino acid variant. PCR amplification results in a population of product DNA fragments that differ from the polynucleotide template encoding the ligase at the position specified by the primer. The product DNA fragments replace the corresponding region in the plasmid and this gives the desired amino acid variant.

A further technique for generating amino acid variants is the cassette mutagenesis technique described in Wells et al., Gene 34:315 (1985); and other mutagenesis techniques well known in the art, such as, for example, the techniques in Sambrook et al., supra, and Current Protocols in Molecular Biology, Ausubel et al.

39.2 Recombinant Expression of the Modified Ligase.

The present invention further provides recombinant constructs comprising a modified ligase nucleic acid. The recombinant constructs of the present invention comprise a vector, such as a plasmid or viral vector, into which a modified ligase nucleic acid is inserted, in a forward or reverse orientation. The vector may further comprise regulatory sequences, including for example, a promoter, operably linked to the ORF. The vector may further comprise a marker sequence or heterologous ORF operably linked to an expression modulating fragment ("EMF") or uptake modulating fragment ("UMF"). Large numbers of suitable vectors and promoters are known to those of skill in the art and are commercially available for generating the recombinant constructs of the present invention. The following vectors are provided by way of example. Bacterial: pBs, phagescript, PsiX174, pBluescript SK, pBs KS, pNH8a, pNH16a, pNH18a, pNH46a (Stratagene); pTrc99A, pKK223-3, pKK233-3, pDR540, pRIT5 (Pharmacia). Eukaryotic: pWLneo, pSV2cat, pOG44, PXTI, pSG (Stratagene); pSVK3, pBPV, pMSG, pSVL (Pharmacia).

Promoter regions can be selected from any desired gene using CAT (chloramphenicol transferase) vectors or other vectors with selectable markers. Two appropriate vectors are pKK232-8 and pCM7. Particular named bacterial promoters include lacI, lacZ, T3, T7, gpt, lambda P_R, and trc. Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, early and late SV40, LTRs from retrovirus, and mouse metallothionein-I. Selection of the appropriate vector and promoter is well within the level of ordinary skill in the art.

Generally, recombinant expression vectors will include origins of replication and selectable markers permitting transformation of the host cell, e.g., the ampicillin resistance gene of E. coli and S. cerevisiae TRP1 gene, and a promoter derived from a highly-expressed gene to direct transcription of a downstream structural sequence. Such promoters can be derived from operons encoding glycolytic enzymes such as 3-phosphoglycerate kinase (PGK), α-factor, acid phosphatase, or heat shock proteins, among others. The heterologous structural sequence is assembled in appropriate phase with translation initiation and termination sequences, and preferably, a leader sequence capable of directing secretion of translated protein into the periplasmic space or extracellular medium. Optionally, the heterologous sequence can encode a fusion protein including an N-terminal identification peptide imparting desired characteristics, e.g., stabilization or simplified purification of expressed recombinant product.

Useful expression vectors for bacterial use are constructed by inserting the modified ligase nucleic acid together with suitable translation initiation and termination signals in operable reading phase with a functional promoter. The vector will comprise one or more phenotypic selectable markers and an origin of replication to ensure maintenance of the vector and to, if desirable, provide amplification within the host. Suitable prokaryotic hosts for transformation include E. coli, Bacillus subtilis, Salmonella typhimurium and various species within the genera Pseudomonas, Streptomyces, and Staphylococcus, although others may also be employed as a matter of choice. As a representative but non-limiting example, useful expression vectors for bacterial use can comprise a selectable marker and bacterial origin of replication derived from commercially available plasmids comprising genetic elements of the well known cloning vector pBR322 (ATCC 37017). Such commercial vectors include, for example, pKK223-3 (Pharmacia Fine Chemicals, Uppsala, Sweden) and GEM 1 (Promega Biotec, Madison, WI, USA). These pBR322 "backbone" sections are combined with an appropriate promoter and the structural sequence to be expressed. Following transformation of a suitable host strain and growth of the host strain to an appropriate cell density, the selected promoter is derepressed by appropriate means (e.g., temperature shift or chemical induction) and cells are cultured for an additional period. Cells are typically harvested by centrifugation, disrupted by physical or chemical means, and the resulting crude extract retained for further purification.

Any host/vector system can be used to express the modified ligases of the present invention. These include, but are not limited to, eukaryotic hosts such as HeLa cells, CV-1 cells, COS cells, and Sf9 cells, as well as prokaryotic hosts such as E. coli and B. subtilis. The most preferred cells are those which do not normally express the modified ligase or which express the modified ligase at a low natural level.

The modified ligase can be expressed in mammalian cells, yeast, bacteria, or other cells under the control of appropriate promoters. Cell-free translation systems can also be employed to produce the modified ligase using RNAs derived from the DNA constructs of the present invention. Appropriate cloning and expression vectors for use with prokaryotic and eukaryotic hosts are described by Sambrook, et al., in Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor, New York (1989), the disclosure of which is hereby incorporated by reference. Various mammalian cell culture systems can also be employed to express the modified ligase. Examples of mammalian expression systems include the COS-7 lines of monkey kidney fibroblasts, described by Gluzman, Cell 23:115 (1981), and other cell lines capable of expressing a compatible vector, for example, the C127, 3T3, CHO, HeLa and BHK cell lines. Mammalian expression vectors will comprise an origin of replication, a suitable promoter and enhancer, and also any necessary ribosome binding sites, polyadenylation site, splice donor and acceptor sites, transcriptional termination sequences, and 5' flanking nontranscribed sequences. DNA sequences derived from the SV40 viral genome, for example, SV40 origin, early promoter, enhancer, splice, and polyadenylation sites may be used to provide the required nontranscribed genetic elements. Recombinant modified ligase produced in bacterial culture is usually isolated by initial extraction from cell pellets, followed by one or more salting-out, aqueous ion exchange or size exclusion chromatography steps. Protein refolding steps can be used, as necessary, in completing configuration of the mature protein. Finally, high performance liquid chromatography (HPLC) can be employed for final purification steps. Microbial cells employed in expression of proteins can be disrupted by any convenient method, including freeze-thaw cycling, sonication, mechanical disruption, or use of cell lysing agents.

A variety of methodologies known in the art can be utilized to obtain the modified ligase of the present invention. At the simplest level, the amino acid sequence can be synthesized using commercially available peptide synthesizers. This is particularly useful in producing small peptides and fragments of larger polypeptides. Fragments are useful, for example, in generating antibodies against the modified ligase. In an alternative method, the modified ligase is purified from bacterial cells which produce the modified ligase. One skilled in the art can readily follow known methods for isolating polypeptides and proteins in order to purify the modified ligase of the present invention. These include, but are not limited to, immunochromatography, HPLC, size-exclusion chromatography, ion-exchange chromatography, and immuno-affinity chromatography. See, e.g., Scopes, Protein Purification: Principles and Practice, Springer-Verlag (1994); Sambrook, et al., in Molecular Cloning: A Laboratory Manual; Ausubel et al., Current Protocols in Molecular Biology. The modified ligase of the present invention can alternatively be purified from cells which have been altered to express the modified ligase. As used herein, a cell is said to be altered to express the modified ligase when the cell, through genetic manipulation, is made to produce the modified ligase which it normally does not produce or which the cell normally produces at a lower level. One skilled in the art can readily adapt procedures for introducing and expressing either recombinant or synthetic sequences into eukaryotic or prokaryotic cells in order to generate a cell which produces the modified ligase of the present invention.

39.3 Modified Ligases Which Enhance Discrimination.

In one embodiment, the discrimination of perfect matches from mismatches is enhanced in a format III reaction. In the format III reaction, the target nucleic acid interacts with complementary probes and the discrimination of perfect matches from mismatches is enhanced by the modified ligase. The modified ligase enhances discrimination in a number of ways, for example, the ligase may increase the difference in the on rates and/or the off rates between a perfect match product and a mismatch product (a kinetic effect - e.g., the ligase may preferentially bind to perfect matches and slow the off-rate of perfect matches versus mismatches); or the ligase may increase the binding energy difference between a perfect match and a mismatch (a free energy [ΔG] effect - e.g., the ligase may preferentially bind to perfect matches and increase the stability of perfect matches versus mismatches); or the ligase may itself discriminate between perfect matches and mismatches (ΔG or kinetic effect - e.g., the modified ligase may ligate only perfect matches); or some combination of these and other factors.
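The kinetic effect described above can be pictured with a simple two-state binding model in which the dissociation constant is K_d = k_off / k_on. The sketch below compares a perfect match with a mismatch and shows how slowing only the perfect-match off-rate widens the gap. The model, the function names, and every rate constant are hypothetical illustrations and are not measurements from this specification.

```python
def dissociation_constant(k_on: float, k_off: float) -> float:
    """K_d for simple two-state binding (k_on in 1/(M*s), k_off in 1/s)."""
    if k_on <= 0 or k_off <= 0:
        raise ValueError("rate constants must be positive")
    return k_off / k_on


def discrimination(pm_on: float, pm_off: float, mm_on: float, mm_off: float) -> float:
    """Ratio of mismatch K_d to perfect-match K_d; larger means better discrimination."""
    return dissociation_constant(mm_on, mm_off) / dissociation_constant(pm_on, pm_off)


if __name__ == "__main__":
    # Hypothetical rates: equal on-rates, mismatch dissociates 10x faster.
    baseline = discrimination(pm_on=1e6, pm_off=1e-2, mm_on=1e6, mm_off=1e-1)
    # Hypothetical ligase effect: perfect-match off-rate slowed a further 10x.
    with_ligase = discrimination(pm_on=1e6, pm_off=1e-3, mm_on=1e6, mm_off=1e-1)
    print(baseline, with_ligase)
```

In this toy model a tenfold slower perfect-match off-rate gives a tenfold gain in discrimination; a ligase that also changed on-rates, or that refused to join mismatched ends, would add to that effect.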

The present invention is not to be limited in scope by the exemplified embodiments which are intended as illustrations of single aspects of the invention, and compositions and methods which are functionally equivalent are within the scope of the invention. Indeed, numerous modifications and variations in the practice of the invention are expected to occur to those skilled in the art upon consideration of the present preferred embodiments. Consequently, the only limitations which should be placed upon the scope of the invention are those which appear in the appended claims.

All references cited within the body of the instant specification are hereby incorporated by reference in their entirety.

Claims

WHAT IS CLAIMED IS:

1. A method for analyzing a target nucleic acid, comprising the steps of: providing an array of a plurality of fixed oligonucleotide probes; providing a plurality of labeled oligonucleotide probes; contacting the target nucleic acid with the fixed probes and the labeled probes under conditions that allow the probes which form perfect matches with the target nucleic acid to be distinguished from the probes bound with an end mismatch to the target nucleic acid, wherein a modified ligase is added which increases the discrimination of the perfect match from the end mismatch; covalently joining the fixed probe, bound at a site in the target nucleic acid, to the labeled probe, which is hybridized to a site on the target nucleic acid that is adjacent to the site on which the fixed probe is bound; and identifying the fixed probes and the labeled probes that are covalently joined.