WO2001043051A2

WO2001043051A2 - Computer method and apparatus for revealing promotor motifs

Info

Publication number: WO2001043051A2
Application number: PCT/US2000/042469
Authority: WO
Inventors: Betsey D. Dyer; Mark D. Leblanc; Glen Aspeslagh; Nathan P. Buggia
Original assignee: Board Of Trustees Of Wheaton College
Priority date: 1999-11-30
Filing date: 2000-11-30
Publication date: 2001-06-14
Also published as: WO2001043051A3; WO2001043051A9

Abstract

A computer search tool and supporting database is presented for use in analyzing genomes. The database is formed from a search of inverted and other repeats in a given genome, especially of motifs having such repeats. Indications of found repeats, length, location in the genome sequence and nearest gene are recorded in the database. The computer search tool includes a search engine, browser and drawing member. The search engine responds to user requests for certain repeats (e.g., specified in terms of length and/or subsequence within). The browser enables viewing of a graphical representation (generated by the drawing member) of the genome, with the repeats highlighted. The invention computer search tool enables analysis of the highlighted repeats against known regulatory sites/promotor motifs of other known sequences or relative to each other across the genome.

Description

COMPUTER METHOD AND APPARATUS FOR REVEALING PROMOTOR MOTIFS

BACKGROUND OF THE INVENTION

Genomics is the field concerning the analysis of the structure and function of the complete DNA sequence (genome) of any organism or in the case of viruses, the complete DNA or RNA sequence. This includes the parts of the sequence designated as genes (or putative genes) as well as all of the intergenic sequences, some of which regulate gene use and chromosomal structure, but much of which is of yet unknown function. By way of overview, a cell has an operational center called the nucleus which contains structures called chromosomes. Chemically, chromosomes are formed of deoxyribonucleic acid (DNA) and associated protein molecules. Structurally, each chromosome may have tens of thousands of genes. Some genes are referred to as "encoding" (or carrying information for constructing) proteins which are essential in the structuring, functioning and regulating of cells, tissues and organs. Thus, for each organism, the components of the DNA molecules encode all the information necessary for creating and maintaining life of the organism. See Human Genome Program, U.S. Department of Energy, "Primer on Molecular Genetics", Washington, D.C., 1992. The shape of a DNA molecule can be thought of as a twisted ladder. That is, the DNA molecule is formed of two parallel side strands of sugar and phosphate molecules connected by orthogonal/cross pieces (rungs) of nitrogen-containing chemicals called bases. Each long side strand is formed of a particular series of units called nucleotides. Each nucleotide comprises one sugar, one phosphate and a nitrogenous base. The order of the bases in this series (the side strands series of nucleotides) is called the DNA sequence. Each rung forms a relatively weak bond between respective bases, one on each side strand. The term "base pairs" refers to the bases at opposite ends of a rung, with one base being on one side strand of the DNA molecule and the other base being on the second side strand of the DNA molecule. 
Genome size or sequence length is typically stated in terms of number of base pairs.

There are four different bases present in DNA: adenine (A), thymine (T), cytosine (C) and guanine (G). Adenine will pair only with thymine (an A-T pair) and cytosine will pair only with guanine (a C-G pair). A DNA sequence is represented in writing using A's, C's, T's and G's (respective abbreviations for the bases) in corresponding series or character strings. That is, the ACTG's are written in the order of the nucleotides of the subject DNA molecule.
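By way of illustration only, the pairing rule above (A with T, C with G) can be expressed as a short Python sketch operating on the character-string representation. The function names `complement` and `reverse_complement` are chosen here for illustration and do not appear in the original disclosure.

```python
# Watson-Crick pairing as stated in the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA character string."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return complement(seq)[::-1]
```

The reverse complement is the form that matters for inverted repeats, because the two side strands run in opposite directions.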

As previously mentioned, each DNA molecule contains many genes. A gene is a specific sequence of nucleotide bases. These sequences carry the information required for constructing proteins. A protein is a large molecule formed of one or more chains of amino acids in a specific order. Order is determined by base sequence of nucleotides in the gene coding for the protein. Each protein has a unique function. In a DNA molecule, there are protein-coding sequences (genes) called "exons"; and non-coding-function sequences called "introns" interspersed within many genes. The balance of DNA sequences in the genome are other non-coding regions or intergenic regions.

According to the foregoing method of representing genome and DNA sequences, the DNA sequence specifies the genetic instructions required to create a particular organism with its own unique traits and at the same time provides a text (character string) environment in which to study the same. The completion of the genome sequence of Caenorhabditis elegans marks the beginning of what is likely to be years of database mining of this genome, for the purpose of cataloguing, organizing and interpreting actual or putative regulatory motifs (i.e., interesting gene subsequences) by which this multicellular eukaryote coordinates the development and maintenance of differentiation (e.g., Brown, S.M., Biotechniques 26, 266-268 (1999); Clarke, N. et al, Science 282, 2018-2022 (1998); and Goffeau, A., Nature Biotechnology 16, 907-908 (1998)). Transcription profiling has helped to focus some genome-wide studies such as Roth et al. (Roth, et al, Nature Biotechnology 16, 939-945 (1998)) and van Helden et al. (van Helden, et al, J.Mol.Biol 281, 827-842 (1998)), both of which looked for conserved motifs about 600-800 bp upstream of well analyzed assemblages of co-regulated genes in Saccharomyces. Numerous other studies of model multicellular eukaryotes such as Drosophila and Caenorhabditis have revealed many conserved regulatory motifs and these have been widely used in chromosome and genome-wide searches (e.g., Wong, Y.C. et al, Chromosoma 92, 124-135 (1985)).

SUMMARY OF THE INVENTION

The present invention provides a timely and potentially useful in silico (computer-based) discovery tool for promotor elements. The principle behind the invention is that repeat sequences of all kinds including inverted repeats, direct repeats, mirror repeats and everted repeats are known to be functional motifs in the promotor regions of many genes in a diversity of organisms. Functions of such repeats include, but are not limited to: (1) binding sites for the binding of regulatory factors; (2) opportunities for internal base-pairing and subsequent regulation in mRNAs; and (3) transposon-like mechanisms for the rearrangement and regulation of genes. Furthermore, functional repeats of all kinds are not limited to perfect ones but may include a certain number of mismatches and other irregularities.
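As a minimal sketch of how the repeat types named above differ at the character-string level, the following Python function classifies two equal-length arms as a direct, mirror or inverted repeat. It handles perfect matches only. The name `repeat_type` is an assumption made for illustration. Everted repeats are not distinguished here, since telling them apart from inverted repeats depends on an orientation convention for the half-site that the text does not spell out.

```python
from typing import Optional

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq.upper()))


def repeat_type(left: str, right: str) -> Optional[str]:
    """Classify two equal-length arms (perfect matches only).

    direct:   the second arm repeats the first (GTGAC ... GTGAC)
    mirror:   the second arm is the first read backwards (GTGAC ... CAGTG)
    inverted: the second arm is the reverse complement (GTGAC ... GTCAC)
    A self-symmetric arm can satisfy more than one test; the first match wins.
    """
    left, right = left.upper(), right.upper()
    if len(left) != len(right) or not left:
        return None
    if right == left:
        return "direct"
    if right == left[::-1]:
        return "mirror"
    if right == reverse_complement(left):
        return "inverted"
    return None
```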

Laboratory studies of gene regulation have usually proceeded from research on specific promotor regions to the discovery and characterization of relevant motifs in that region. With the present invention, the procedure is reversed and made more exploratory. The starting point for use of this discovery tool is an extensive search of all intergenic regions of a chromosome or entire genome and the subsequent production of a complete catalogue of all types of motifs including repeats within a particular threshold range of mismatch or irregularity. From this catalogue, conserved motifs may be discovered and hypotheses formed as to the significance of their appearances in the promotor regions of particular genes. The scale of discovery for this invention is similar to that of "Transcription Profiling" in which all up- or down-regulated genes are catalogued for a particular organism (or cell) at a particular point in development or metabolism. Hypotheses may then be formed concerning the significance of the apparent co-regulation of those genes. With this promotor discovery tool all genes with similar motif patterns in their promotor regions may be collected. Indeed this invention combined with "Transcription Profiling" is a potentially powerful way to begin to decipher some of the grammar and syntax of gene regulation.

To demonstrate, Applicants have produced a computer-based searchable, annotated catalogue for Caenorhabditis chromosome I and chromosome X of all possible repeats of sizes ranging from 20-200 base pairs with 0-10% mismatches and loops of up to one third the length of the repeat. The database may be accessed through queries based on size, location or sequence. Each repeat is identified in respect to location, nearest downstream gene (with links to the Caenorhabditis genome project), and frequency, and includes a list of similar sequences. Furthermore, there is a selectable option to show a graphical representation of any repeat, internally base paired with the minimum number of mismatches.
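The tolerances quoted above (0-10% mismatches, loops of up to one third the length of the repeat) might be checked as in the following sketch for the inverted-repeat case. Two points are assumptions made here and not taken from the text: the mismatch percentage is measured against the arm length, and the loop limit is measured against the whole window including the loop. The name `is_inverted_repeat` is likewise illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(seq.upper()))


def is_inverted_repeat(window: str, loop_len: int = 0,
                       max_mismatch_percent: int = 10) -> bool:
    """Test whether `window` is arm + loop + reverse-complement arm.

    Assumptions (not stated in the source): mismatches are counted per arm,
    and the loop may be at most one third of the full window length.
    """
    n = len(window)
    arm_len, remainder = divmod(n - loop_len, 2)
    if loop_len < 0 or remainder != 0 or arm_len < 1:
        return False
    if 3 * loop_len > n:  # loop longer than one third of the window
        return False
    left = window[:arm_len].upper()
    right = window[n - arm_len:].upper()
    mismatches = sum(a != b for a, b in zip(reverse_complement(left), right))
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return 100 * mismatches <= max_mismatch_percent * arm_len
```

When the two arms base pair with each other, the loop is the unpaired stretch between them, which is why it is excluded from the mismatch count.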

In addition, Applicants have produced a more general, computer-based, searchable lexicon of any possible motif. The lexicon database serves as a "dictionary" for queries of specific motifs of any length. Annotated results are returned from a query and include the locations of each occurrence, the nearest downstream gene, and statistics that include but are not limited to comparing the likelihood of finding "this" particular motif of a certain length in an organism versus the likelihood of finding the motif in a sequence of random pairs.
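One simple way to form the comparison described above is to count occurrences of a motif in the genome string and set that count against the number expected in a random sequence of the same length. The sketch below assumes a uniform random model in which each base has probability 1/4; the source does not specify the statistical model, and the function names are illustrative.

```python
def count_occurrences(genome: str, motif: str) -> int:
    """Count occurrences of `motif` in `genome`, allowing overlaps."""
    count = 0
    start = genome.find(motif)
    while start != -1:
        count += 1
        start = genome.find(motif, start + 1)
    return count


def expected_random_count(genome_len: int, motif_len: int) -> float:
    """Expected occurrences in a random sequence with equiprobable bases.

    Assumption: each base is independent with probability 1/4, so a given
    motif of length k matches at any one position with probability (1/4)**k.
    """
    positions = max(genome_len - motif_len + 1, 0)
    return positions * 0.25 ** motif_len


def observed_vs_expected(genome: str, motif: str) -> tuple:
    """Return (observed count, expected count under the uniform model)."""
    return (count_occurrences(genome, motif),
            expected_random_count(len(genome), len(motif)))
```

An observed count far above the expected count suggests the motif is over-represented; a real analysis would likely correct for the base composition of the organism, which this sketch does not do.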

In general, the present invention provides a method for analyzing a known genome, where the genome is represented by a series of characters in a certain sequential order. The method includes the steps of: (i) locating motifs having inverted, everted, direct and/or mirror repeats in the genome; (ii) recording the located motifs and repeats in a data store; and (iii) connecting the database to a computer network and enabling network users to search and browse the motifs from the data store and hypothesize functionality of the same. For each motif recorded in the data store, there are indications of location in the genome series, length of the motif and nearest gene. Computer apparatus of the present invention thus provides a user-interactive computer search tool for analyzing the subject genome and for considering motifs in context of each other across the genome, as well as comparatively to known promotor motifs/regulatory sites of other genomes. In a preferred embodiment, the invention apparatus (a) includes a search engine, a browser and drawing member supported by the database, and (b) enables revelation of promotor motifs of the subject genome as a function of motifs rendered from the database.

In the preferred embodiment, computer apparatus searches the provided genomic information for motifs, stores found motifs and corresponding information about the motifs into a data store, and enables various subsequent displays of the repeats for visual analysis. Before storage, the invention computer apparatus verifies each motif's uniqueness. In other words, the computer apparatus verifies that there are no nested motifs. In addition, the data store may be located on a separate computer system connected via a computer network. The data store is accessible by end users throughout the computer network such as the World Wide Web. The invention computer apparatus locates individual repeats, motifs or any other specified sequence upon user command. The user selectable search criteria may include the location, the sequence, or the length of the repeat or motif. The invention computer apparatus enables organizing and arranging of the repeat or motif information as a function of user specified terms and provides screen displays for comparison with the existing genes. The screen displays employ coloring or other visual effects (underlining, windowing, highlighting, etc.), such that the invention computer apparatus allows the analysis of repeats or motifs on the basis of other highlighted or known motifs. In effect, the present invention generates custom screen view displays annotating motifs which lead to deciphering gene regulation of the subject genome.
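The uniqueness (no nested motifs) check can be sketched as interval containment. The following is one possible reading, not the patented routine: a new repeat is dropped if an already stored repeat fully contains it, and stored repeats fully contained in the new one are replaced by it. The `Repeat` class and `add_unique` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Repeat:
    start: int   # offset from the 5' end of the input sequence
    length: int  # length in base pairs

    @property
    def end(self) -> int:
        return self.start + self.length

    def contains(self, other: "Repeat") -> bool:
        """True if `other` lies entirely within this repeat."""
        return self.start <= other.start and other.end <= self.end


def add_unique(store: List[Repeat], new: Repeat) -> List[Repeat]:
    """Return a store in which no repeat is nested inside another."""
    if any(old.contains(new) for old in store):
        return store  # the new repeat is nested in one already recorded
    kept = [old for old in store if not new.contains(old)]
    kept.append(new)  # the new repeat supersedes any repeats nested in it
    return kept
```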

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

Figs. 1A and 1B are block diagrams of a computer system 14 embodying the present invention, including a database builder procedure employed to find desired motifs in a given genome and store repeat and motif data in a database of the present invention.

Fig. 2 is a flow diagram illustrating use of the search engine, drawing utility and browser of Fig. 1B in the present invention.

Fig. 3 is a schematic representation of one embodiment of the invention database of Figs. 1A and 1B.

Figs. 4A and 4B are graphical illustrations of display screen views supported by the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Generally speaking, the present invention provides a computer method and apparatus for analyzing genomes and revealing likely promotor motifs as illustrated in Figs. 1A, 1B and 2.

Illustrated in Fig. 1A is a computer system embodying the present invention. Included in the computer system is a digital processor 12 and a set of software programs or other digitally executed means 14 for forming a database 23 of repeat and motif information from an input sequence 10. That is, digital processor 12 executes invention software 14 to perform the steps discussed below in Fig. 1B.

Generally input 10 to digital processor 12 is from another computer (over a local or global communication link), another program, input devices (e.g., keyboard, mouse, etc.) and the like. Output 24 is provided to I/O devices (such as a display monitor), another program, another computer (over a communication link) and the like. It is understood that database 23 may reside on digital processor 12 or another processor/computer system.

With reference to Fig. 1B, genomic DNA sequence is downloaded (step 11) from a file such as those at the NCBI web-site into computer memory or stored on disk. The bases are ordered beginning with the 5' end of the DNA sequence to the 3' end. At step 13, an 8 base pair (bp) window is used to analyze one portion of the input sequence at a time. If the sequence portion in the 8 base pair window is a known motif, then the motif is recorded in database 23 and then the subject window is analyzed (step 15) as to whether it is an inverted repeat, mirror repeat, everted repeat or direct repeat (with or without loops of designated size) subject to allowable tolerance levels. If the windowed sequence portion is one of these types of repeats, it is stored 17 in a database 23 along with other information. The 8bp window is then moved one bp to the right (toward the 3' end of the DNA sequence) and the analysis is repeated 19. When the entire DNA sequence has been analyzed, steps 13 and 15 are repeated for a larger window of 9 bp and so on, up to a window length of 200 bp (or greater) using loop 21.

Before storing the repeat data in the database 23, the windowed sequence portion determined to be a repeat is checked for uniqueness (step 16). If the repeat is unique, then step 17 stores the repeat in the database 23. In other words, a check is made to ensure that the repeat does not contain nested repeats already in the database 23. If the repeat is not unique (step 20), the previous repeat is removed from the database 23 and the new repeat information is added (step 22) to the database 23.

Preferably, the database 23 stores information for each repeat discovered by invention software routine 14 in what is known as a flat file format. In one embodiment illustrated in Fig. 3, the information may consist of a unique identifier or site name 57, repeat type 66, location with respect to the nearest upstream gene 59, name of the nearest upstream gene 61, location with respect to the 5' end of the DNA sequence 63, the sequence 65 of the repeat, and length 67 of the repeat. In the preferred embodiment, the database 23 is a relational database such as a Unix flat file or Microsoft Access database. In addition, the database links to or otherwise provides a data store of known genes located within the input DNA sequence 10. This enables the database 23 to provide the downstream position for a given repeat relative to the known gene.

Referring back to Fig. 1B, the database storage 23 is accessed by a search engine 25, drawing utility 27, and browser or other front end applications 29. The front end application 29 queries the database 23 via the Structured Query Language (SQL) or the like. The front end application 29 may also provide information such as frequency, pattern recognition of the sequence data, and may filter or sort the data. For example, AT rich repeats can be filtered. An AT rich repeat consists exclusively of A's and T's in the character representation of the repeat sequence. Where digital processor 12 is a server or network node (i.e., Internet Web site), the search engine 25, drawing utility 27, and browser/front end 29 are available via the Internet (HTTP protocol) and provide in essence a sharable, searchable, browsable catalogue of repeats for a particular genome.

The presentation of such a genomic catalogue for extensive and methodical exploration for the purpose of discovering and analyzing promoters is an important part of the novel aspect of the invention. Furthermore, the user strategies and procedures as guided, enhanced and facilitated by the catalogue presentation (through drawing utility 27) are novel aspects of the invention.
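The window-scanning loop of Fig. 1B (steps 13 to 21) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the invention software 14 itself: each window is split into two arms with at most a one-base loop, the known-motif lookup of step 13 and the uniqueness check of step 16 are omitted, everted repeats are not distinguished, and the mismatch allowance is applied per arm. The names `classify_window` and `scan` are hypothetical.

```python
from typing import List, Optional, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def classify_window(window: str, max_mismatch_percent: int = 10) -> Optional[str]:
    """Step 15 (simplified): compare the two halves of the window."""
    arm = len(window) // 2
    left, right = window[:arm], window[len(window) - arm:]
    allowed = arm * max_mismatch_percent // 100
    candidates = (("direct", left),
                  ("mirror", left[::-1]),
                  ("inverted", reverse_complement(left)))
    for kind, target in candidates:
        if mismatches(target, right) <= allowed:
            return kind
    return None


def scan(sequence: str, min_window: int = 8, max_window: int = 200,
         max_mismatch_percent: int = 10) -> List[Tuple[int, int, str]]:
    """Return (start, window size, repeat type) for every qualifying window."""
    seq = sequence.upper()
    hits = []
    # Loop 21: grow the window from min_window up to max_window.
    for size in range(min_window, min(max_window, len(seq)) + 1):
        # Step 19: slide the window one bp at a time toward the 3' end.
        for start in range(len(seq) - size + 1):
            kind = classify_window(seq[start:start + size], max_mismatch_percent)
            if kind is not None:
                hits.append((start, size, kind))  # step 17: record the repeat
    return hits
```

Because every window size is passed over the whole sequence, the work grows with the product of sequence length and the number of window sizes, which is one reason the catalogue is built once and stored in database 23 for later queries.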

A typical procedure utilizing the invention system of Figs. 1A and 1B for promotor discovery might include the following user activities shown in Fig. 2. An end user, through a browser 31, queries the database/catalogue of repeats 23 and extracts repeat information. The browser 31 is used to note any anomalous abundances or absences of repeats (especially in promotor regions), as well as which genes and other repeats are associated and in which configurations (steps 33, 35). In addition, the browser 31 is used to search in the vicinity of genes of interest (step 37) to see what other genes or repeats are associated. "Genes of interest" might include those that appear to be up- or down-regulated in transcription profiling studies as well as those known to be part of co-regulated pathways. Indeed, transcription profiling is an important complementary tool to this invention.

Further discovery procedures are done with the search function 39 (such as implemented by a search engine 25 of Fig. 1B). This may be used in conjunction with the browser 31 to generate and note frequency lists of repeats 41, 47, to search published or putative motifs (step 43) and to search in the vicinity of particular genes (step 45). Whether the browser 31 or the search function 39 is used, the subsequent steps include the comparison of intergenic regions either intra or interspecifically to find potential new motifs and patterns of repeats 49. In these cases, the browser 31 and especially the search function 39 may be used again to determine frequencies and positions of potential motifs (step 51).
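A frequency list of repeats of the kind generated at steps 41 and 47 might be produced as in the sketch below, which also includes the AT rich filter mentioned earlier in the description. The function names are illustrative, and the input is assumed to be a plain list of repeat sequences already retrieved from the database.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def frequency_list(sequences: Iterable[str],
                   top: Optional[int] = None) -> List[Tuple[str, int]]:
    """Return (sequence, count) pairs, most frequent first."""
    return Counter(s.upper() for s in sequences).most_common(top)


def is_at_rich(seq: str) -> bool:
    """True if the repeat consists exclusively of A's and T's."""
    return bool(seq) and set(seq.upper()) <= {"A", "T"}


def without_at_rich(sequences: Iterable[str]) -> List[str]:
    """Filter out AT rich repeats before building a frequency list."""
    return [s for s in sequences if not is_at_rich(s)]
```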

The accumulated data on repeats in promotors from the invention system may then be used to interpret the results of transcription profiles (step 53) or to focus mutational studies in the lab (step 55) as well as other lab studies. Thus this invention is an in silico approach to guiding promotor research in the laboratory by making possible the rapid, extensive and methodical exploration of promotor regions.

Throughout the foregoing steps, the drawing utility 27 provides visual display and graphical illustration of the repeat and gene data retrieved from data store 23. For example, in Fig. 4A, the drawing utility 27 presents a graphical view of repeats clustering near a certain gene. Such a view is supported by data retrieved from database 23 on repeats located near the gene, i.e., data resulting from a search on the nearest gene field 61 (being set to the subject gene) and from location with respect to that gene (field 59 in Fig. 3). Other information (e.g., repeat length, input sequence position/location, etc.) regarding each displayed repeat is also provided as a function of user selection or action in the screen view of Fig. 4A.

Techniques known in the art are used for rendering such a display/screen view.
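The retrieval behind a view such as Fig. 4A, i.e., a search on the nearest gene field 61 combined with the location field 59, can be sketched with an SQL query. The example below uses Python's built-in sqlite3 module purely for illustration; the description names Unix flat files and Microsoft Access, not SQLite. The table layout follows the fields of Fig. 3, but the column names, gene names and rows are invented for this sketch.

```python
import sqlite3


def build_demo_store() -> sqlite3.Connection:
    """Create an in-memory table mirroring the fields of Fig. 3."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE repeats (
               site_name    TEXT PRIMARY KEY,  -- field 57
               repeat_type  TEXT,              -- field 66
               gene_offset  INTEGER,           -- field 59
               nearest_gene TEXT,              -- field 61
               position     INTEGER,           -- field 63
               sequence     TEXT,              -- field 65
               length       INTEGER            -- field 67
           )"""
    )
    # Hypothetical rows; they are not data from the catalogue.
    rows = [
        ("site1", "inverted", -120, "geneA", 1000, "AAACGTTT", 8),
        ("site2", "direct", -640, "geneA", 480, "ACGTACGT", 8),
        ("site3", "mirror", -90, "geneB", 5000, "GTGACCAGTG", 10),
    ]
    conn.executemany("INSERT INTO repeats VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def repeats_near(conn: sqlite3.Connection, gene: str, max_distance: int) -> list:
    """Return repeats whose nearest gene is `gene`, within `max_distance` bp."""
    cursor = conn.execute(
        """SELECT site_name, repeat_type, gene_offset, length
           FROM repeats
           WHERE nearest_gene = ? AND ABS(gene_offset) <= ?
           ORDER BY gene_offset""",
        (gene, max_distance),
    )
    return cursor.fetchall()
```

Ordering the result by offset from the gene gives the drawing utility the repeats in the order in which they would be laid out along the sequence.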

In the example screen view of Fig. 4B, the drawing utility 27 supports different colored display of different types of repeats and different regions. Common display coloration techniques are employed. By way of illustration and not limitation, Fig. 4B shows inverted repeats displayed in one color (boxed) and 14bp overlap regions in another color (square brackets). Motifs GTGAC and TAGGTCA are highlighted (underlined) which visually reveals/illustrates their numerous count in this region. With such a display, the end user is able to visually see certain patterns and make certain assessments, such as the satellite-like configuration here resembles those reported in other heat shock promoters. Restated, the present invention generates custom screen views annotating motifs which leads to deciphering gene regulation of the subject genome.
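As a plain-text stand-in for the highlighted display of Fig. 4B, the following sketch locates motifs in a sequence and prints a marker line beneath the matched bases. The actual drawing utility 27 uses color, boxes and underlining; the caret markers, the function names and the test sequence used with them are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def motif_positions(sequence: str, motifs: Iterable[str]) -> List[Tuple[int, str]]:
    """Return sorted (start, motif) pairs for every occurrence, overlaps included."""
    hits = []
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            hits.append((start, motif))
            start = sequence.find(motif, start + 1)
    return sorted(hits)


def underline(sequence: str, motifs: Iterable[str]) -> str:
    """Return the sequence followed by a line marking motif bases with '^'."""
    marks = [" "] * len(sequence)
    for start, motif in motif_positions(sequence, motifs):
        for i in range(start, start + len(motif)):
            marks[i] = "^"
    return sequence + "\n" + "".join(marks)
```

Calling `print(underline(region, ["GTGAC", "TAGGTCA"]))` on a promoter region string gives a quick visual impression of how densely the two motifs occur there.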

According to the foregoing, the present invention enables end users to view data and graphical illustrations of motifs and repeats and analyze motifs including repeats relative to each other and/or across the given genome. In addition, the present invention enables comparisons of motifs in the database 23 to known regulatory sites of other known genomes. As such, the present invention in effect provides annotation to repeats and motifs in the given genome. In turn, this allows the end user to make initial determinations about regulatory sites at the motif sites in the subject genome. In this manner, the present invention provides a novel apparatus and method for discovering or revealing promotor motifs (sites that regulate the expression of genes) in a genome sequence.

While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

For example, although the preferred embodiment when focused on repeats is directed to inverted repeats, other types of indirect (or direct) repeats may be included. Such other types include mirrored repeats and everted repeats. Further the data store of repeat information may include indications of motifs, regulatory sites or sequences known to have interesting functionality.

Claims

CLAIMS

What is claimed is:

1. A method for analyzing a known genome, the genome represented by a series of characters in a sequential order, the method comprising the steps of: determining repeats in a subject genome series, the genome series being a series of characters representing a subject genome; recording the determined repeats in a data store, including for each repeat, recording indications of location in the genome series, length of the repeat and relative position to closest genes; and using a digital processor that accesses the datastore and supports display of graphical illustrations of the genome series, searching for certain ones of repeats and browsing graphical display of the genome series with the certain repeats highlighted, such that analyses about the subject genome are enabled.

2. A method as claimed in Claim 1 further comprising the step of connecting the datastore to a computer network and enabling network users to search and browse the desired repeats from the data store.

3. A method as claimed in Claim 2 wherein the computer network is the Internet; and the step of connecting the data store includes providing a working digital processor as a host server on the Internet, and supporting the data store on the working digital processor.

4. A method as claimed in Claim 1 wherein the step of recording includes indicating a respective nearest gene for each repeat, and the step of using a digital processor includes making analyses about gene regulation based on the repeats and corresponding indications recorded in the datastore as viewed together in context of each other across the genome.

5. A method as claimed in Claim 1 wherein the step of determining repeats includes determining inverted repeats, everted repeats, direct repeats and/or mirror repeats.

6. A method as claimed in Claim 1 wherein the step of searching for certain ones of repeats includes searching for desired repeats as a function of length and/or subsequence contained in the desired repeat.

7. A method as claimed in Claim 6 further comprising the step of determining existence of a regulatory site based on analysis of highlighted repeats relative to each other and/or relative to other known regulatory sites and motifs.

8. A method as claimed in Claim 6 further comprising the step of determining distribution of repeats throughout the subject genome from the display of the genome series with the certain repeats highlighted.

9. A method as claimed in Claim 1 further comprising the step of including motifs in the graphical display of the genome series with the certain repeats highlighted, such that working displays annotating particular motifs are generated for furthering analysis about the subject genome.

10. A method as claimed in Claim 1 further comprising the steps of: repeating the searching and browsing with different ones of certain repeats; accumulating data from the searching and browsing of different certain repeats; and using the accumulated data to interpret results of transcription profiles or to focus lab studies.

11. A method for deciphering gene regulation of a genome, the genome represented by a series of characters in a sequential order, the method comprising the steps of: building a motif data store for a subject genome by: (i) locating motifs, including repeats in a subject genome series, the genome series being a series of characters representing the subject genome;

(ii) recording the located motifs and repeats in the data store, including for each, recording an indication of location in the genome series, length of the motif or repeat and relative position to closest genes; using a computer that accesses the motif datastore and supports display of graphical illustrations of the genome series, searching for certain ones of motifs and browsing graphical display of the genome with the certain motifs highlighted; and generating custom screen views annotating motifs as a result of the searching and browsing, said generated screen views annotating motifs enabling analysis about gene regulation of the subject genome.

12. A method as claimed in Claim 11 wherein the step of using a computer includes a computer of a computer network enabling network users to perform the steps of searching and browsing and generating custom screen views annotating particular motifs.

13. A method as claimed in Claim 12 wherein the computer network is the Internet; and the motif datastore resides on a host server.

14. A method as claimed in Claim 11 wherein the step of locating includes locating inverted repeats, everted repeats, direct repeats and/or mirror repeats.

15. A method as claimed in Claim 11 further comprising the step of determining existence of a regulatory site based on analysis of repeats displayed near the highlighted motifs relative to each other and/or relative to other known regulatory sites and motifs.

16. A method as claimed in Claim 11 further comprising the step of determining distribution of motifs throughout the subject genome from the display of the genome series with the certain motifs highlighted.

17. A method as claimed in Claim 11 further comprising the steps of: repeating the searching and browsing with different ones of certain repeats; accumulating data from the searching and browsing of different certain repeats; and using the accumulated data to interpret results of transcription profiles or to focus lab studies.

18. Apparatus for analyzing a known genome, the genome represented by a series of characters in a sequential order, the apparatus comprising: a database storing motif information from motifs and repeats in a subject genome series, the genome series being a series of characters representing a subject genome, the motif information including for each motif and repeat, location in the genome series, length of the repeat or motif and relative position to closest gene; a computer in access communication with the database and supporting display of graphical illustrations of the genome series; and a search member executed by the computer for searching the database for certain ones of repeats in response to user request, the search member providing the certain repeats as search results, the computer receiving the search results and therefrom generating graphical display of a portion of the genome series with the certain repeats highlighted in a manner enabling analysis of the subject genome.

19. Apparatus as claimed in Claim 18 wherein the computer includes motifs in the generated graphical display, in a manner such that the graphical display serves as a working screen view annotating particular motifs and enabling analysis about gene regulation of the subject genome.

20. Apparatus as claimed in Claim 18 wherein the computer is a network computer enabling access to the database across the Internet.

21. Apparatus as claimed in Claim 18 wherein the computer further generates screen views in which repeats and motifs from the database are viewed together in context of each other across the genome and enable analyses about gene regulation.

22. Apparatus as claimed in Claim 18 wherein the database stores motif information about inverted repeats, everted repeats, direct repeats and/or mirror repeats in the subject genome series.

23. Apparatus as claimed in Claim 18 wherein the computer generated graphical display enables determination of existence of a regulatory site based on analysis of highlighted repeats relative to each other and/or relative to other known regulatory sites and motifs.

24. Apparatus as claimed in Claim 18 wherein the computer generated graphical display enables determination of distribution of repeats throughout the subject genome.