WO2009045115A1

WO2009045115A1 - Proliferation signature and prognosis for gastrointestinal cancer

Info

Publication number: WO2009045115A1
Application number: PCT/NZ2008/000260
Authority: WO
Inventors: Ahmad Anjomshoaa; Anthony Edmund Reeve; Yu-Hsin Lin; Michael A Black
Original assignee: Pacific Edge Biotechnology Ltd
Priority date: 2007-10-05
Filing date: 2008-10-06
Publication date: 2009-04-09
Also published as: JP2017060517A; JP5745848B2; KR101982763B1; NZ562237A; KR20200015788A; KR20200118226A; JP2015165811A; US20180010198A1; JP2018126154A; SG10201602601QA; KR20180089565A; CA2739004C; KR20100084648A; CN108753975A; JP2010539973A; CN101932724A; CA3090677A1; CA2739004A1; KR101727649B1; KR20220020404A

Abstract

This invention relates to methods and compositions for determining the prognosis of cancer in a patient, particularly for gastrointestinal cancer, such as gastric or colorectal cancer. Specifically, this invention relates to the use of genetic markers for the prediction of the prognosis of cancer, such as gastric or colorectal cancer, based on cell proliferation signatures. In various aspects, the invention relates to a method of predicting the likelihood of long-term survival of a cancer patient, a method of determining a treatment regime for a cancer patient, a method of preparing a personalized genomics profile for a cancer patient, among other methods as well as kits and devices for carrying out these methods.

Description

PROLIFERATION SIGNATURES AND PROGNOSIS FOR GASTROINTESTINAL CANCER

FIELD OF THE INVENTION

This invention relates to methods and compositions for determining the prognosis of cancer, particularly gastrointestinal cancer, in a patient. Specifically, this invention relates to the use of genetic markers for determining the prognosis of cancer, such as gastrointestinal cancer, based on cell proliferation signatures.

BACKGROUND OF THE INVENTION

Cellular proliferation is the most fundamental process in living organisms, and as such is precisely regulated by the expression level of proliferation-associated genes (1). Loss of proliferation control is a hallmark of cancer, and it is thus not surprising that growth-regulating genes are abnormally expressed in tumours relative to the neighbouring normal tissue (2). Proliferative changes may accompany other changes in cellular properties, such as invasion and ability to metastasize, and therefore could affect patient outcome. This association has attracted substantial interest and many studies have been devoted to the exploration of tumour cell proliferation as a potential indicator of outcome.

Cell proliferation is usually assessed by flow cytometry or, more commonly, in tissues, by immunohistochemical evaluation of proliferation markers (3). The most widely used proliferation marker is Ki-67, a protein expressed in all cell cycle phases except for the resting phase G₀ (4). Using Ki-67, a clear association between the proportion of cycling cells and clinical outcome has been established in malignancies such as breast cancer, lung cancer, soft tissue tumours, and astrocytoma (5). In breast cancer, this association has also been confirmed by microarray analysis, leading to a proliferative gene expression profile that has been employed for identifying patients at increased risk of recurrence (6).

However, in colorectal cancer (CRC), the proliferation index (PI) has produced conflicting results as a prognostic factor and therefore cannot be applied in a clinical context (see below). Studies vary with respect to patient selection, sampling methods, cut-off point levels, antibody choices, staining techniques and the way data have been collected and interpreted. The methodological differences and heterogeneity of these studies may partly explain the contradictory results (7), (8). The use of Ki-67 as a proliferation marker also has limitations. The Ki-67 PI estimates the fraction of actively cycling cells, but gives no indication of cell cycle length (3), (9). Thus, tumours with a similar PI may grow at dissimilar rates due to different cycling speeds. In addition, while Ki-67 mRNA is not produced in resting cells, protein may still be detectable in a proportion of colorectal tumours, leading to an overestimated proliferation rate (10).

Since the assessment of a prognosis using a single proliferation marker does not appear to be reliable in CRC (see below), there is a need for further tools to predict the prognosis of gastrointestinal cancer. This invention provides further methods and compositions based on prognostic cancer markers, specifically gastrointestinal cancer prognostic markers, to aid in the prognosis and treatment of cancer.

SUMMARY OF THE INVENTION

In certain aspects of the invention, microarray analysis is used to identify genes that provide a proliferation signature for cancer cells. These genes, and the proteins encoded by those genes, are herein termed gastrointestinal cancer proliferation markers (GCPMs). In one aspect of the invention, the cancer for prognosis is gastrointestinal cancer, particularly gastric or colorectal cancer.

In particular aspects, the invention includes a method for determining the prognosis of a cancer by identifying the expression levels of at least one GCPM in a sample. Selected GCPMs encode proteins that are associated with cell proliferation, e.g., cell cycle components. These GCPMs have added utility in methods for determining the best treatment regime for a particular cancer based on the prognosis. In particular aspects, GCPM levels are higher in non-recurring tumour tissue as compared to recurring tumour tissue. These markers can be used either alone or in combination with each other, or with other known cancer markers.

In an additional aspect, this invention includes a method for determining the prognosis of a cancer, comprising: (a) providing a sample of the cancer; (b) detecting the expression level of at least one GCPM family member in the sample; and (c) determining the prognosis of the cancer.

In another aspect, the invention includes a step of detecting the expression level of at least one GCPM RNA, for example, at least one mRNA. In a further aspect, the invention includes a step of detecting the expression level of at least one GCPM protein. In yet a further aspect, the invention includes a step of detecting the level of at least one GCPM peptide. In yet another aspect, the invention includes detecting the expression level of at least one GCPM family member in the sample. In an additional aspect, the GCPM is a gene associated with cell proliferation, such as a cell cycle component. In other aspects, the at least one GCPM is selected from Table A, Table B, Table C or Table D, herein.

In a still further aspect, the invention includes a method for detecting the expression level of at least one GCPM set forth in Table A, Table B, Table C or Table D, herein. In an even further aspect, the invention includes a method for detecting the expression level of at least one of CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37. In yet a further aspect, the invention comprises detecting the expression level of at least one of CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes, FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

In additional aspects, the expression levels of at least two, or at least 5, or at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, or at least 75 of the proliferation markers or their expression products are determined, for example, as selected from Table A, Table B, Table C or Table D; as selected from CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; or as selected from CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

In other aspects, the expression levels of all proliferation markers or their expression products are determined, for example, as listed in Table A, Table B, Table C or Table D; as listed for the group CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; or as listed for the group CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

In yet a further aspect, the invention includes a method of determining a treatment regime for a cancer comprising: (a) providing a sample of the cancer; (b) detecting the expression level of at least one GCPM family member in the sample; (c) determining the prognosis of the cancer based on the expression level of at least one GCPM family member; and (d) determining the treatment regime according to the prognosis.

In yet another aspect, the invention includes a device for detecting at least one GCPM, comprising: (a) a substrate having at least one GCPM capture reagent thereon; and (b) a detector capable of detecting the at least one captured GCPM, the capture reagent, or a complex thereof.

An additional aspect of the invention includes a kit for detecting cancer, comprising: (a) a GCPM capture reagent; (b) a detector capable of detecting the captured GCPM, the capture reagent, or a complex thereof; and, optionally, (c) instructions for use. In certain aspects, the kit also includes a substrate for the GCPM as captured.

Yet a further aspect of the invention includes a method for detecting at least one GCPM using quantitative PCR, comprising: (a) a forward primer specific for the at least one GCPM; (b) a reverse primer specific for the at least one GCPM; (c) PCR reagents; and, optionally, at least one of: (d) a reaction vial; and (e) instructions for use.

Additional aspects of this invention include a kit for detecting the presence of at least one GCPM protein or peptide, comprising: (a) an antibody or antibody fragment specific for the at least one GCPM protein or peptide; and, optionally, at least one of: (b) a label for the antibody or antibody fragment; and (c) instructions for use. In certain aspects, the kit also includes a substrate having a capture agent for the at least one GCPM protein or peptide.

In specific aspects, this invention includes a method for determining the prognosis of gastrointestinal cancer, especially colorectal or gastric cancer, comprising the steps of: (a) providing a sample, e.g., tumour sample, from a patient suspected of having gastrointestinal cancer; (b) measuring the presence of a GCPM protein using an ELISA method.

In additional aspects of this invention, one or more GCPMs of the invention are selected from the group outlined in Table A, Table B, Table C or Table D, herein. Other aspects and embodiments of the invention are described herein below.

BRIEF DESCRIPTION OF THE DRAWINGS

This invention is described with reference to specific embodiments thereof and with reference to the figures.

FIG. 1: An overview of the approach used to derive and apply the gene proliferation signature (GPS) disclosed herein.

FIG. 2A: K-means clustering of 73 Cohort A tumours into two groups according to the expression level of the gene proliferation signature.

FIG. 2B: Bar graph of Ki-67 PI (%); vertical line represents the mean Ki-67 PI across all samples. Tumours with a proliferation index above and below the mean are shown in red and green, respectively. The results show that over-expression of the proliferation signature is not always associated with a higher Ki-67 PI.

FIG. 3: Kaplan-Meier survival curves according to the expression level of the GPS (gene proliferation signature) and Ki-67 PI. Both overall survival (OS) and recurrence-free survival (RFS) are significantly shorter in patients with low GPS expression in colorectal cancer Cohort A (a, b) and colorectal cancer Cohort B (c, d). No difference was observed in the survival rates of Cohort A patients according to Ki-67 PI (e, f). P values from the log-rank test are indicated.

FIG. 4: Kaplan-Meier survival curves according to the expression level of the GPS (gene proliferation signature) in gastric cancer patients. Overall survival is significantly shorter in patients with low GPS expression in this cohort of 38 gastric cancer patients of mixed stage. P values from the log-rank test are indicated.
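The survival curves of FIG. 3 and FIG. 4 are Kaplan-Meier estimates. As a minimal sketch of how such a curve can be computed (this is not the analysis code used for the figures; the function name `kaplan_meier` and the toy follow-up data are hypothetical), the product-limit estimator can be written as:

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    times  -- follow-up time for each patient (any consistent unit)
    events -- 1 if the event (e.g. death or recurrence) was observed,
              0 if the patient was censored at that time
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    removed = Counter(times)      # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(removed):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= removed[t]     # drop events and censored patients
    return curve

# Hypothetical toy data: months of follow-up and event indicators.
toy_times = [6, 12, 12, 20, 30, 42]
toy_events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(toy_times, toy_events)
```

Comparing two such curves (for example, high versus low GPS expression) requires a separate significance test such as the log-rank test, which is not shown here.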

FIG. 5: A box-and-whisker plot showing differential expression between cycling cells in the exponential phase (EP) and growth-inhibited cells in the stationary phase (SP) of 11 QRT-PCR-validated genes. The box range includes the 25th to the 75th percentiles of the data. The horizontal line in the box represents the median value. The "whiskers" are the largest and smallest values (excluding outliers). Any points more than 3/2 times the interquartile range from the end of a box are outliers and are presented as dots. The Y axis represents the log2 fold change of the ratio between cell line RNA and reference RNA. Analysis was performed using SPSS software.
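The box-and-whisker conventions described for FIG. 5 can be illustrated with a short sketch. It assumes the "inclusive" quartile definition from Python's standard library, which may differ slightly from the quartile method SPSS applies; the function name `box_plot_summary` and the sample values are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Summarise values using the 3/2 x interquartile-range outlier rule."""
    # Quartiles: 25th percentile, median, 75th percentile.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "box": (q1, q3),                        # 25th to 75th percentile
        "median": median,
        "whiskers": (min(inside), max(inside)),  # extremes excluding outliers
        "outliers": outliers,                   # drawn as individual dots
    }

# Hypothetical log2 fold-change values for one gene.
summary = box_plot_summary([0.5, 0.8, 1.0, 1.2, 1.5, 4.0])
```

With these toy values, the point 4.0 lies more than 1.5 times the interquartile range above the box and is reported as an outlier, so the upper whisker stops at 1.5.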

DETAILED DESCRIPTION OF THE INVENTION

Because a single proliferation marker is insufficient for obtaining reliable CRC prognosis, the simultaneous analysis of several growth-related genes by microarray was employed to provide a more quantitative and objective method to determine the proliferation state of a gastrointestinal tumour. Table 1 (below) illustrates the previously published and conflicting results shown for use of the proliferation index (PI) as a prognostic factor for colorectal cancer.

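Table 1: Summary of studies on the association of proliferation indices with CRC patients' survival

| Study | Number of patients | Dukes stage | Marker | Association with survival |
|---|---|---|---|---|
| Evans et al, 2006¹¹ | 40 | A-C | Ki-67 | No association was found between proliferation index and survival |
| Rosati et al, 2004¹² | 103 | B-C | Ki-67 | No association was found between proliferation index and survival |
| Ishida et al, 2004¹³ | 51 | C | Ki-67 | No association was found between proliferation index and survival |
| Buglioni et al, 1999¹⁴ | 171 | A-D | Ki-67 | No association was found between proliferation index and survival |
| Guerra et al, 1998¹⁵ | 108 | A-C | PCNA | No association was found between proliferation index and survival |
| Kyzer and Gordon, 1997¹⁶ | 30 | B-D | Ki-67 | No association was found between proliferation index and survival |
| Jansson and Sun, 1997¹⁷ | 255 | A-D | Ki-67 | No association was found between proliferation index and survival |
| Baretton et al, 1996¹⁸ | 95 | A-B | Ki-67 | No association was found between proliferation index and survival |
| Sun et al, 1996¹⁹ | 293 | A-C | PCNA | No association was found between proliferation index and survival |
| Kubota et al, 1992²⁰ | 100 | A-D | Ki-67 | No association was found between proliferation index and survival |
| Valera et al, 2005²¹ | 106 | A-D | Ki-67 | High proliferation index was associated with shorter survival |
| Dziegiel et al, 2003²² | 81 | NI | Ki-67 | High proliferation index was associated with shorter survival |
| Scopa et al, 2003²³ | 117 | A-D | Ki-67 | High proliferation index was associated with shorter survival |
| Bhatavdekar et al, 2001²⁴ | 98 | B-C | Ki-67 | High proliferation index was associated with shorter survival |
| Chen et al, 1997²⁵ | 70 | B-C | Ki-67 | High proliferation index was associated with shorter survival |
| Choi et al, 1997²⁶ | 86 | B-D | PCNA | High proliferation index was associated with shorter survival |
| Hilska et al, 2005²⁷ | 363 | A-D | Ki-67 | Low proliferation index was associated with shorter survival |
| Salminen et al, 2005²⁸ | 146 | A-D | Ki-67 | Low proliferation index was associated with shorter survival |
| Garrity et al, 2004²⁹ | 366 | B-C | Ki-67 | Low proliferation index was associated with shorter survival |
| Allegra et al, 2003³⁰ | 706 | B-C | Ki-67 | Low proliferation index was associated with shorter survival |
| Palmqvist et al, 1999³¹ | 56 | B | Ki-67 | Low proliferation index was associated with shorter survival |
| Paradiso et al, 1996³² | 71 | NI | PCNA | Low proliferation index was associated with shorter survival |
| Neoptolemos et al, 1995³³ | 79 | A-C | PCNA | Low proliferation index was associated with shorter survival |

NI: No information available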

In contrast, the present disclosure has succeeded in (i) defining a CRC-specific gene proliferation signature (GPS) using a cell line model; and (ii) determining the prognostic significance of the GPS in the prediction of patient outcome and its association with clinico-pathologic variables in two independent cohorts of CRC patients.

Definitions

Before describing embodiments of the invention in detail, it will be useful to provide some definitions of terms used herein.

As used herein "antibodies" and like terms refer to immunoglobulin molecules and immunologically active portions of immunoglobulin (Ig) molecules, i.e., molecules that contain an antigen binding site that specifically binds (immunoreacts with) an antigen. These include, but are not limited to, polyclonal, monoclonal, chimeric, single chain, Fc, Fab, Fab', and Fab₂ fragments, and a Fab expression library. Antibody molecules relate to any of the classes IgG, IgM, IgA, IgE, and IgD, which differ from one another by the nature of the heavy chain present in the molecule. These include subclasses as well, such as IgG1, IgG2, and others. The light chain may be a kappa chain or a lambda chain. Reference herein to antibodies includes a reference to all classes, subclasses, and types. Also included are chimeric antibodies, for example, monoclonal antibodies or fragments thereof that are specific to more than one source, e.g., a mouse or human sequence. Further included are camelid antibodies, shark antibodies or nanobodies.

The term "marker" refers to a molecule that is associated quantitatively or qualitatively with the presence of a biological phenomenon. Examples of "markers" include a polynucleotide, such as a gene or gene fragment, RNA or RNA fragment; or a polypeptide such as a peptide, oligopeptide, protein, or protein fragment; or any related metabolites, by-products, or any other identifying molecules, such as antibodies or antibody fragments, whether related directly or indirectly to a mechanism underlying the phenomenon. The markers of the invention include the nucleotide sequences (e.g., GenBank sequences) as disclosed herein, in particular, the full-length sequences, any coding sequences, any fragments, or any complements thereof.

The terms "GCPM" or "gastrointestinal cancer proliferation marker" or "GCPM family member" refer to a marker with increased expression that is associated with a positive prognosis, e.g., a lower likelihood of cancer recurrence, as described herein, but can exclude molecules that are known in the prior art to be associated with prognosis of gastrointestinal cancer. It is to be understood that the term GCPM does not require that the marker be specific only for gastrointestinal tumours. Rather, expression of GCPM can be altered in other types of tumours, including malignant tumours.

Non-limiting examples of GCPMs are included in Table A, Table B, Table C or Table D, herein below, and include, but are not limited to, the specific group CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; and the specific group CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

The terms "cancer" and "cancerous" refer to or describe the physiological condition in mammals that is typically characterized by abnormal or unregulated cell growth. Cancer and cancer pathology can be associated, for example, with metastasis, interference with the normal functioning of neighbouring cells, release of cytokines or other secretory products at abnormal levels, suppression or aggravation of inflammatory or immunological response, neoplasia, premalignancy, malignancy, invasion of surrounding or distant tissues or organs, such as lymph nodes, etc. Specifically included are gastrointestinal cancers, such as esophageal, stomach, small bowel, large bowel, anal, and rectal cancers; particularly included are gastric and colorectal cancers.

The term "colorectal cancer" includes cancer of the colon, rectum, and/or anus, and especially, adenocarcinomas, and may also include carcinomas (e.g., squamous cloacogenic carcinomas), melanomas, lymphomas, and sarcomas. Epidermoid (nonkeratinizing squamous cell or basaloid) carcinomas are also included. The cancer may be associated with particular types of polyps or other lesions, for example, tubular adenomas, tubulovillous adenomas (e.g., villoglandular polyps), villous (e.g., papillary) adenomas (with or without adenocarcinoma), hyperplastic polyps, hamartomas, juvenile polyps, polypoid carcinomas, pseudopolyps, lipomas, or leiomyomas. The cancer may be associated with familial polyposis and related conditions such as Gardner's syndrome or Peutz-Jeghers syndrome. The cancer may be associated, for example, with chronic fistulas, irradiated anal skin, leukoplakia, lymphogranuloma venereum, Bowen's disease (intraepithelial carcinoma), condyloma acuminatum, or human papillomavirus. In other aspects, the cancer may be associated with basal cell carcinoma, extramammary Paget's disease, cloacogenic carcinoma, or malignant melanoma.

The terms "differentially expressed gene," "differential gene expression," and like phrases, refer to a gene whose expression is activated to a higher or lower level in a subject (e.g., test sample), specifically cancer, such as gastrointestinal cancer, relative to its expression in a control subject (e.g., control sample). The terms also include genes whose expression is activated to a higher or lower level at different stages of the same disease; in recurrent or non-recurrent disease; or in cells with higher or lower levels of proliferation. A differentially expressed gene may be either activated or inhibited at the polynucleotide level or polypeptide level, or may be subject to alternative splicing to result in a different polypeptide product. Such differences may be evidenced by a change in mRNA levels, surface expression, secretion or other partitioning of a polypeptide, for example.

Differential gene expression may include a comparison of expression between two or more genes or their gene products; or a comparison of the ratios of the expression between two or more genes or their gene products; or a comparison of two differently processed products of the same gene, which differ between normal subjects and diseased subjects; or between various stages of the same disease; or between recurring and nonrecurring disease; or between cells with higher and lower levels of proliferation; or between normal tissue and diseased tissue, specifically cancer, or gastrointestinal cancer. Differential expression includes both quantitative, as well as qualitative, differences in the temporal or cellular expression pattern in a gene or its expression products among, for example, normal and diseased cells, or among cells which have undergone different disease events or disease stages, or cells with different levels of proliferation.

The term "expression" includes production of polynucleotides and polypeptides, in particular, the production of RNA (e.g., mRNA) from a gene or portion of a gene, and includes the production of a protein encoded by an RNA or gene or portion of a gene, and the appearance of a detectable material associated with expression. For example, the formation of a complex, for example, from a protein-protein interaction, protein-nucleotide interaction, or the like, is included within the scope of the term "expression". Another example is the binding of a binding ligand, such as a hybridization probe or antibody, to a gene or other oligonucleotide, a protein or a protein fragment and the visualization of the binding ligand. Thus, increased intensity of a spot on a microarray, on a hybridization blot such as a Northern blot, or on an immunoblot such as a Western blot, or on a bead array, or by PCR analysis, is included within the term "expression" of the underlying biological molecule.

The term "gastric cancer" includes cancer of the stomach and surrounding tissue, especially adenocarcinomas, and may also include lymphomas and leiomyosarcomas. The cancer may be associated with gastric ulcers or gastric polyps, and may be classified as protruding, penetrating, spreading, or any combination of these categories, or, alternatively, classified as superficial (elevated, flat, or depressed) or excavated.

The term "long-term survival" is used herein to refer to survival for at least 5 years, more preferably for at least 8 years, most preferably for at least 10 years following surgery or other treatment.

The term "microarray" refers to an ordered arrangement of capture agents, preferably polynucleotides (e.g., probes) or polypeptides on a substrate. See, e.g., Microarray Analysis, M. Schena, John Wiley & Sons, 2002; Microarray Biochip Technology, M. Schena, ed., Eaton Publishing, 2000; Guide to Analysis of DNA Microarray Data, S. Knudsen, John Wiley & Sons, 2004; and Protein Microarray Technology, D. Kambhampati, ed., John Wiley & Sons, 2004.

The term "oligonucleotide" refers to a polynucleotide, typically a probe or primer, including, without limitation, single-stranded deoxyribonucleotides, single- or double-stranded ribonucleotides, RNA:DNA hybrids, and double-stranded DNAs. Oligonucleotides, such as single-stranded DNA probe oligonucleotides, are often synthesized by chemical methods, for example using automated oligonucleotide synthesizers that are commercially available, or by a variety of other methods, including in vitro expression systems, recombinant techniques, and expression in cells and organisms.

The term "polynucleotide," when used in the singular or plural, generally refers to any polyribonucleotide or polydeoxyribonucleotide, which may be unmodified RNA or DNA or modified RNA or DNA. This includes, without limitation, single- and double-stranded DNA, DNA including single- and double-stranded regions, single- and double-stranded RNA, and RNA including single- and double-stranded regions, hybrid molecules comprising DNA and RNA that may be single-stranded or, more typically, double-stranded or include single- and double-stranded regions. Also included are triple-stranded regions comprising RNA or DNA or both RNA and DNA. Specifically included are mRNAs, cDNAs, and genomic DNAs. The term includes DNAs and RNAs that contain one or more modified bases, such as tritiated bases, or unusual bases, such as inosine. The polynucleotides of the invention can encompass coding or non-coding sequences, or sense or antisense sequences.

"Polypeptide," as used herein, refers to an oligopeptide, peptide, or protein sequence, or fragment thereof, and to naturally occurring, recombinant, synthetic, or semi-synthetic molecules. Where "polypeptide" is recited herein to refer to an amino acid sequence of a naturally occurring protein molecule, "polypeptide" and like terms, are not meant to limit the amino acid sequence to the complete, native amino acid sequence for the full-length molecule. It will be understood that each reference to a "polypeptide" or like term, herein, will include the full-length sequence, as well as any fragments, derivatives, or variants thereof.

The term "prognosis" refers to a prediction of medical outcome (e.g., likelihood of long-term survival); a negative prognosis, or bad outcome, includes a prediction of relapse, disease progression (e.g., tumour growth or metastasis, or drug resistance), or mortality; a positive prognosis, or good outcome, includes a prediction of disease remission (e.g., disease-free status), amelioration (e.g., tumour regression), or stabilization.

The terms "prognostic signature," "signature," and the like refer to a set of two or more markers, for example GCPMs, that when analysed together as a set allow for the determination of or prediction of an event, for example the prognostic outcome of colorectal cancer. The use of a signature comprising two or more markers reduces the effect of individual variation and allows for a more robust prediction. Non-limiting examples of GCPMs are included in Table A, Table B, Table C or Table D, herein below, and include, but are not limited to, the specific group CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; and the specific group CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

In the context of the present invention, reference to "at least one," "at least two," "at least five," etc., of the markers listed in any particular set (e.g., any signature) means any one or any and all combinations of the markers listed.

The term "prediction method" is defined to cover the broader genus of methods from the fields of statistics, machine learning, artificial intelligence, and data mining, which can be used to specify a prediction model. These are discussed further in the Detailed Description section.

The term "prediction model" refers to the specific mathematical model obtained by applying a prediction method to a collection of data. In the examples detailed herein, such data sets consist of measurements of gene activity in tissue samples taken from recurrent and non-recurrent colorectal cancer patients, for which the class (recurrent or non-recurrent) of each sample is known. Such models can be used to (1) classify a sample of unknown recurrence status as being one of recurrent or non-recurrent, or (2) make a probabilistic prediction (i.e., produce either a proportion or percentage to be interpreted as a probability) which represents the likelihood that the unknown sample is recurrent, based on the measurement of mRNA expression levels or expression products, of a specified collection of genes, in the unknown sample. The exact details of how these gene-specific measurements are combined to produce classifications and probabilistic predictions are dependent on the specific mechanisms of the prediction method used to construct the model.

The term "proliferation" refers to the processes leading to increased cell size or cell number, and can include one or more of: tumour or cell growth, angiogenesis, innervation, and metastasis.
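To make the two uses of a "prediction model" concrete (a class label and a probabilistic prediction), the following is a minimal sketch only. It assumes a logistic model applied to the mean expression of a few signature genes; the function names, the intercept, the slope and the sample values are all hypothetical placeholders and are not fitted to, or taken from, any data in this disclosure. The negative slope merely reflects the stated direction that lower GCPM expression is associated with recurrence.

```python
import math

def signature_score(expression, signature_genes):
    """Mean expression (e.g. log2 ratio) over the signature genes measured."""
    values = [expression[g] for g in signature_genes if g in expression]
    if not values:
        raise ValueError("no signature genes measured in this sample")
    return sum(values) / len(values)

def recurrence_probability(score, intercept=0.0, slope=-2.0):
    """Probabilistic prediction from a logistic model.

    intercept and slope are placeholder values (assumptions), not
    coefficients estimated from patient data.
    """
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))

def classify(score, cutoff=0.5):
    """Class prediction: 'recurrent' or 'non-recurrent'."""
    if recurrence_probability(score) >= cutoff:
        return "recurrent"
    return "non-recurrent"

# Hypothetical samples measured on three of the listed GCPMs.
genes = ["CDC2", "PCNA", "MCM7"]
low_sample = {"CDC2": -0.8, "PCNA": -0.5, "MCM7": -0.2}
high_sample = {"CDC2": 0.9, "PCNA": 0.6, "MCM7": 0.3}

low_score = signature_score(low_sample, genes)
high_score = signature_score(high_sample, genes)
low_call = classify(low_score)
high_call = classify(high_score)
```

In practice the coefficients would be estimated by whichever prediction method is chosen, using samples whose recurrence status is known.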

The term "qPCR" or "QPCR" refers to quantitative polymerase chain reaction as described, for example, in PCR Technique: Quantitative PCR, J.W. Larrick, ed., Eaton Publishing, 1997, and A-Z of Quantitative PCR, S. Bustin, ed., IUL Press, 2004.

The term "tumour" refers to all neoplastic cell growth and proliferation, whether malignant or benign, and all pre-cancerous and cancerous cells and tissues.

"Sensitivity", "specificity" (or "selectivity"), and "classification rate", when applied to describing the effectiveness of prediction models, mean the following:

"Sensitivity" means the proportion of truly positive samples that are also predicted (by the model) to be positive. In a test for cancer recurrence, that would be the proportion of recurrent tumours predicted by the model to be recurrent. "Specificity" or "selectivity" means the proportion of truly negative samples that are also predicted (by the model) to be negative. In a test for CRC recurrence, this equates to the proportion of non-recurrent samples that are predicted to by non-recurrent by the model. "Classification Rate" is the proportion of all samples that are correctly classified by the prediction model (be that as positive or negative).

"Stringent conditions" or "high stringency conditions", as defined herein, typically: (1) employ low ionic strength and high temperature for washing, for example 0.015 M sodium chloride/0.0015 M sodium citrate/0.1% sodium dodecyl sulfate at 50⁰C; (2) employ a denaturing agent during hybridization, such as formamide, for example, 50% (v/v) formamide with 0.1% bovine serum albumin/0.1% Ficoll/0.1% polyvinylpyrrolidone/50 mM sodium phosphate buffer at pH 6.5 with 750 mM sodium chloride, 75 mM sodium citrate at 42°C; or (3) employ 50% formamide, 5X SSC (0.75 M NaCI₁ 0.075 M sodium citrate), 50 mM sodium phosphate (pH 6.8), 0.1% sodium pyrophosphate, 5X, Denhardt's solution, sonicated salmon sperm DNA (50 μg/ml), 0.1% SDS, and 10% dextran sulfate at 42°C, with washes at 42°C in 0.2X SSC (sodium chloride/sodium citrate) and 50% formamide at 55°C, followed by a high-stringency wash comprising 0.1X SSC containing EDTA at 55°C.

"Moderately stringent conditions" may be identified as described by Sambrook et al., Molecular Cloning: A Laboratory Manual, New York: Cold Spring Harbor Press, 1989, and include the use of washing solution and hybridization conditions (e. g., temperature, ionic strength, and % SDS) less stringent that those described above. An example of moderately stringent conditions is overnight incubation at 37°C in a solution comprising: 20% formamide, 5X SSC (150 mM NaCI₁ 15 mM trisodium citrate), 50 mM sodium phosphate (pH 7.6), 5X Denhardt's solution, 10% dextran sulfate, and 20 mg/ml denatured sheared salmon sperm DNA, followed by washing the filters in 1X SSC at about 37-50⁰C. The skilled artisan will recognize how to adjust the temperature, ionic strength, etc. as necessary to accommodate factors such as probe length and the like.

The practice of the present invention will employ, unless otherwise indicated, conventional techniques of molecular biology (including recombinant techniques), microbiology, cell biology, and biochemistry, which are within the skill of the art. Such techniques are explained fully in the literature, such as, Molecular Cloning: A Laboratory Manual, 2nd edition, Sambrook et al., 1989; Oligonucleotide Synthesis, M.J. Gait, ed., 1984; Animal Cell Culture, R.I. Freshney, ed., 1987; Methods in Enzymology, Academic Press, Inc.; Handbook of Experimental Immunology, 4th edition, D.M. Weir & C.C. Blackwell, eds., Blackwell Science Inc., 1987; Gene Transfer Vectors for Mammalian Cells, J.M. Miller & M.P. Calos, eds., 1987; Current Protocols in Molecular Biology, F.M. Ausubel et al., eds., 1987; and PCR: The Polymerase Chain Reaction, Mullis et al., eds., 1994.

Description of Embodiments of the Invention

Cell proliferation is an indicator of outcome in some malignancies. In colorectal cancer, however, discordant results have been reported. As these results are based on a single proliferation marker, the present invention discloses the use of microarrays to overcome this limitation, to reach a firmer conclusion, and to determine the prognostic role of cell proliferation in colorectal cancer. The microarray-based proliferation studies shown herein indicate that reduced rate of the proliferation signature in colorectal cancer is associated with poor outcome. The invention can therefore be used to identify patients at high risk of early death from cancer.

The present invention provides for markers for the determination of disease prognosis, for example, the likelihood of recurrence of tumours, including gastrointestinal tumours. Using the methods of the invention, it has been found that numerous markers are associated with the progression of gastrointestinal cancer, and can be used to determine the prognosis of cancer. Microarray analysis of samples taken from patients with various stages of colorectal tumours has led to the surprising discovery that specific patterns of marker expression are associated with prognosis of the cancer. An increase in certain GCPMs, for example, markers associated with cell proliferation, is indicative of positive prognosis. This can include decreased likelihood of cancer recurrence after standard treatment, especially for gastrointestinal cancer, such as gastric or colorectal cancer. Conversely, a decrease in these markers is indicative of a negative prognosis. This can include disease progression or the increased likelihood of cancer recurrence, especially for gastrointestinal cancer, such as gastric or colorectal cancer. A decrease in expression can be determined, for example, by comparison of a test sample (e.g., tumour sample) to samples associated with a positive prognosis. An increase in expression can be determined, for example, by comparison of a test sample (e.g., tumour samples) to samples associated with a negative prognosis.

For example, to obtain a prognosis, a patient's sample (e.g., tumour sample) can be compared to samples with known patient outcome. If the patient's sample shows increased expression of GCPMs that is comparable to samples with good outcome, and/or higher than samples with poor outcome, then a positive prognosis is implicated. If the patient's sample shows decreased expression of GCPMs that is comparable to samples with poor outcome, and/or lower than samples with good outcome, then a negative prognosis is implicated. Alternatively, a patient's sample can be compared to samples of actively proliferating/non-proliferating tumour cells. If the patient's sample shows increased expression of GCPMs that is comparable to actively proliferating cells, and/or higher than non-proliferating cells, then a positive prognosis is implicated. If the patient's sample shows decreased expression of GCPMs that is comparable to non-proliferating cells, and/or lower than actively proliferating cells, then a negative prognosis is implicated.

The invention provides for a set of genes, identified from cancer patients with various stages of tumours, outlined in Table C that are shown to be prognostic for colorectal cancer. These genes are all associated with cell proliferation and establish a relationship between cell proliferation genes and their utility in cancer prognosis. It has also been found that the genes in the prognostic signature listed in Table C are also correlated with additional cell proliferation genes. Based on these findings, the invention also provides for a set of cell cycle genes, shown in Table D, that are differentially expressed between high and low proliferation groups, for use as prognostic markers. Further, based on the surprising finding of the correlation between prognosis and cell proliferation-related genes, the invention also provides for a set of proliferation-related genes differentially expressed between cell lines in high and low proliferative states (Table A) and known proliferative-related genes (Table B). The genes outlined in Table A, Table B, Table C and Table D provide for a set of gastrointestinal cancer prognostic markers (GCPMs).

As one approach, the expression of a panel of markers (e.g., GCPMs) can be analysed by techniques including Linear Discriminant Analysis (LDA) to work out a prognostic score. The marker panel selected and prognostic score calculation can be derived through extensive laboratory testing and multiple independent clinical development studies.
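As a non-limiting illustration of how a prognostic score might be worked out by LDA, the sketch below fits a two-class Fisher linear discriminant to two markers using only the Python standard library. It assumes equal prior probabilities for the two outcome groups and a pooled within-class covariance; the expression values, the function names `fit_lda` and `prognostic_score`, and the choice of two markers are hypothetical and do not represent the marker panel or coefficients of the invention.

```python
from statistics import mean


def fit_lda(good, poor):
    """Fit a two-class, two-marker linear discriminant.

    good, poor: lists of (marker1, marker2) expression pairs from
    training samples with good and poor outcome respectively.
    Returns (weights, bias); a positive score favours the good group.
    """
    def centroid(rows):
        return [mean(r[0] for r in rows), mean(r[1] for r in rows)]

    mg, mp = centroid(good), centroid(poor)

    # Pooled within-class covariance matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((good, mg), (poor, mp)):
        for r in rows:
            d = (r[0] - m[0], r[1] - m[1])
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(good) + len(poor) - 2
    s = [[v / dof for v in row] for row in s]

    # Invert the 2 x 2 covariance matrix directly.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]

    # Discriminant direction: inverse covariance times centroid difference.
    diff = [mg[0] - mp[0], mg[1] - mp[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]

    # With equal priors the decision boundary passes through the midpoint.
    mid = [(mg[0] + mp[0]) / 2, (mg[1] + mp[1]) / 2]
    bias = -(w[0] * mid[0] + w[1] * mid[1])
    return w, bias


def prognostic_score(model, sample):
    w, bias = model
    return w[0] * sample[0] + w[1] * sample[1] + bias


# Hypothetical log-scale expression of two proliferation markers.
good_outcome = [(8.0, 7.5), (8.5, 8.0), (9.0, 7.8), (8.2, 8.4)]
poor_outcome = [(6.0, 6.1), (6.5, 5.5), (5.8, 6.4), (6.3, 5.9)]
model = fit_lda(good_outcome, poor_outcome)
```

In this sketch a score above zero places a new sample nearer the good-outcome group, and a score below zero nearer the poor-outcome group. A clinically useful model would require a far larger training set and independent validation.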

The disclosed GCPMs therefore provide a useful tool for determining the prognosis of cancer, and establishing a treatment regime specific for that tumour. In particular, a positive prognosis can be used by a patient to decide to pursue standard or less invasive treatment options. A negative prognosis can be used by a patient to decide to terminate treatment or to pursue highly aggressive or experimental treatments. In addition, a patient can choose treatments based on their impact on cell proliferation or the expression of cell proliferation markers (e.g., GCPMs). In accordance with the present invention, treatments that specifically target cells with high proliferation or specifically decrease expression of cell proliferation markers (e.g., GCPMs) would not be preferred for patients with gastrointestinal cancer, such as colorectal cancer or gastric cancer.

Levels of GCPMs can be detected in tumour tissue, tissue proximal to the tumour, lymph node samples, blood samples, serum samples, urine samples, or faecal samples, using any suitable technique, and can include, but is not limited to, oligonucleotide probes, quantitative PCR, or antibodies raised against the markers. The expression level of one GCPM in the sample will be indicative of the likelihood of recurrence in that subject. However, it will be appreciated that by analyzing the presence and amounts of expression of a plurality of GCPMs, and constructing a proliferation signature, the sensitivity and accuracy of prognosis will be increased. Therefore, multiple markers according to the present invention can be used to determine the prognosis of a cancer.

The present invention relates to a set of markers, in particular, GCPMs, the expression of which has prognostic value, specifically with respect to cancer-free survival. In specific aspects, the cancer is gastrointestinal cancer, particularly, gastric or colorectal cancer, and, in further aspects, the colorectal cancer is an adenocarcinoma.

In one aspect, the invention relates to a method of predicting the likelihood of long-term survival of a cancer patient without the recurrence of cancer, comprising determining the expression level of one or more proliferation markers or their expression products in a sample obtained from the patient, normalized against the expression level of all RNA transcripts or their products in the sample, or of a reference set of RNA transcripts or their expression products, wherein the proliferation marker is the transcript of one or more markers listed in Table A, Table B, Table C or Table D, herein. In particular aspects, a decrease in expression levels of one or more GCPM indicates a decreased likelihood of long-term survival without cancer recurrence, while an increase in expression levels of one or more GCPM indicates an increased likelihood of long-term survival without cancer recurrence.

In a further aspect, the expression levels of one or more, for example at least two, or at least 3, or at least 4, or at least 5, or at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, or at least 75 of the proliferation markers or their expression products are determined, e.g., as selected from Table A, Table B, Table C or Table D; as selected from CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; or as selected from CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

In another aspect, the method comprises the determination of the expression levels of all proliferation markers or their expression products, e.g., as listed in Table A, Table B, Table C or Table D; as listed for the group CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; or as listed for the group CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

The invention includes the use of archived paraffin-embedded biopsy material for assay of all markers in the set, and therefore is compatible with the most widely available type of biopsy material. It is also compatible with several different methods of tumour tissue harvest, for example, via core biopsy or fine needle aspiration. In a further aspect, RNA is isolated from a fixed, wax-embedded cancer tissue specimen of the patient. Isolation may be performed by any technique known in the art, for example from core biopsy tissue or fine needle aspirate cells. In another aspect, the invention relates to an array comprising polynucleotides hybridizing to two or more markers as selected from Table A, Table B, Table C or Table D; as selected from CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; or as selected from CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

In particular aspects, the array comprises polynucleotides hybridizing to at least 3, or at least 5, or at least 10, or at least 15, or at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, or at least 75 or all of the markers listed in Table A, Table B, Table C or Table D; as listed in the group CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; or as listed in the group CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

In another specific aspect, the array comprises polynucleotides hybridizing to the full set of markers listed in Table A, Table B, Table C or Table D; as listed for the group CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37; or as listed for the group CDC2, RFC4, PCNA, CCNE1, CCND1, CDK7, MCM genes (e.g., one or more of MCM3, MCM6, and MCM7), FEN1, MAD2L1, MYBL2, RRM2, and BUB3.

The polynucleotides can be cDNAs, or oligonucleotides, and the solid surface on which they are displayed can be glass, for example. The polynucleotides can hybridize to one or more of the markers as disclosed herein, for example, to the full-length sequences, any coding sequences, any fragments, or any complements thereof.

In still another aspect, the invention relates to a method of predicting the likelihood of long-term survival of a patient diagnosed with cancer, without the recurrence of cancer, comprising the steps of: (1) determining the expression levels of the RNA transcripts or the expression products of the full set or a subset of the markers listed in Table A, Table B, Table C or Table D, herein, in a sample obtained from the patient, normalized against the expression levels of all RNA transcripts or their expression products in the sample, or of a reference set of RNA transcripts or their products; (2) subjecting the data obtained in step (1) to statistical analysis; and (3) determining whether the likelihood of the long-term survival has increased or decreased.

In yet another aspect, the invention concerns a method of preparing a personalized genomics profile for a patient, e.g., a cancer patient, comprising the steps of: (a) subjecting a sample obtained from the patient to expression analysis; (b) determining the expression level of one or more markers selected from the marker set listed in any one of Table A, Table B, Table C or Table D, wherein the expression level is normalized against a control gene or genes and optionally is compared to the amount found in a reference set; and (c) creating a report summarizing the data obtained by the expression analysis. The report may, for example, include prediction of the likelihood of long term survival of the patient and/or recommendation for a treatment modality of the patient.

In additional aspects, the invention relates to a prognostic method comprising: (a) subjecting a sample obtained from a patient to quantitative analysis of the expression level of the RNA transcript of at least one marker selected from Table A, Table B, Table C or Table D, herein, or its product, and (b) identifying the patient as likely to have an increased likelihood of long-term survival without cancer recurrence if the normalized expression levels of the marker or markers, or their products, are above a defined expression threshold. In alternate aspects, step (b) comprises identifying the patient as likely to have a decreased likelihood of long-term survival without cancer recurrence if the normalized expression levels of the marker or markers, or their products, are decreased below a defined expression threshold.
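The normalization and threshold comparison of steps (a) and (b) can be sketched as follows. This is an illustrative assumption only: expression is taken to be on a log2 scale, normalization is performed by subtracting the mean of a reference gene set, and the marker values are averaged before comparison. The reference gene names `REF1` and `REF2`, the expression values, and the threshold of 1.0 are hypothetical.

```python
from statistics import mean


def normalize(log2_expr, marker_genes, reference_genes):
    """Subtract the mean log2 level of the reference genes from each marker."""
    ref = mean(log2_expr[g] for g in reference_genes)
    return {g: log2_expr[g] - ref for g in marker_genes}


def above_threshold(normalized, threshold):
    """True if the mean normalized marker level exceeds the threshold."""
    return mean(normalized.values()) > threshold


# Hypothetical log2 expression values for one tumour sample.
sample = {"MCM6": 9.0, "PCNA": 10.0, "RRM2": 8.0, "REF1": 7.0, "REF2": 8.0}
normalized = normalize(sample, ["MCM6", "PCNA", "RRM2"], ["REF1", "REF2"])
likely_long_term_survival = above_threshold(normalized, threshold=1.0)
```

Here the mean normalized level is 1.5, which exceeds the illustrative threshold, so the sample would fall into the group with an increased likelihood of long-term survival without recurrence.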

In particular, the relatively low expression of proliferation markers is associated with poor outcome. This can include disease progression or the increased likelihood of cancer recurrence, especially for gastrointestinal cancer, such as gastric or colorectal cancer. By contrast, the relatively high expression of proliferation markers is associated with a good outcome. This can include decreased likelihood of cancer recurrence after standard treatment, especially for gastrointestinal cancer, such as gastric or colorectal cancer. Low expression can be determined, for example, by comparison of a test sample (e.g., tumour sample) to samples associated with a positive prognosis. High expression can be determined, for example, by comparison of a test sample (e.g., tumour sample) to samples associated with a negative prognosis. For example, to obtain a prognosis, a patient's sample (e.g., tumour sample) can be compared to samples with known patient outcome. If the patient's sample shows high expression of GCPMs that is comparable to samples with good outcome, and/or higher than samples with poor outcome, then a positive prognosis is implicated. If the patient's sample shows low expression of GCPMs that is comparable to samples with poor outcome, and/or lower than samples with good outcome, then a negative prognosis is implicated. Alternatively, a patient's sample can be compared to samples of actively proliferating/non-proliferating tumour cells. If the patient's sample shows high expression of GCPMs that is comparable to actively proliferating cells, and/or higher than non-proliferating cells, then a positive prognosis is implicated. If the patient's sample shows low expression of GCPMs that is comparable to non-proliferating cells, and/or lower than actively proliferating cells, then a negative prognosis is implicated.

As further examples, the expression levels of a prognostic signature comprising two or more GCPMs from a patient's sample (e.g., tumour sample) can be compared to samples of recurrent/non-recurrent cancer. If the patient's sample shows increased or decreased expression of GCPMs by comparison to samples of non-recurrent cancer, and/or comparable expression to samples of recurrent cancer, then a negative prognosis is implicated. If the patient's sample shows expression of GCPMs that is comparable to samples of non-recurrent cancer, and/or lower or higher expression than samples of recurrent cancer, then a positive prognosis is implicated.

As one approach, a prediction method can be applied to a panel of markers, for example the panel of GCPMs outlined in Table A, Table B, Table C or Table D, in order to generate a predictive model. This involves the generation of a prognostic signature, comprising two or more GCPMs.

The disclosed GCPMs in Table A, Table B, Table C or Table D therefore provide a useful set of markers to generate prediction signatures for determining the prognosis of cancer, and establishing a treatment regime, or treatment modality, specific for that tumour. In particular, a positive prognosis can be used by a patient to decide to pursue standard or less invasive treatment options. A negative prognosis can be used by a patient to decide to terminate treatment or to pursue highly aggressive or experimental treatments. In addition, a patient can choose treatments based on their impact on the expression of prognostic markers (e.g., GCPMs). Levels of GCPMs can be detected in tumour tissue, tissue proximal to the tumour, lymph node samples, blood samples, serum samples, urine samples, or faecal samples, using any suitable technique, and can include, but is not limited to, oligonucleotide probes, quantitative PCR, or antibodies raised against the markers. It will be appreciated that by analyzing the presence and amounts of expression of a plurality of GCPMs in the form of prediction signatures, and constructing a prognostic signature, the sensitivity and accuracy of prognosis will be increased. Therefore, multiple markers according to the present invention can be used to determine the prognosis of a cancer.

The invention includes the use of archived paraffin-embedded biopsy material for assay of the markers in the set, and therefore is compatible with the most widely available type of biopsy material. It is also compatible with several different methods of tumour tissue harvest, for example, via core biopsy or fine needle aspiration. In certain aspects, RNA is isolated from a fixed, wax-embedded cancer tissue specimen of the patient. Isolation may be performed by any technique known in the art, for example from core biopsy tissue or fine needle aspirate cells.

In one aspect, the invention relates to a method of predicting a prognosis, e.g., the likelihood of long-term survival of a cancer patient without the recurrence of cancer, comprising determining the expression level of one or more prognostic markers or their expression products in a sample obtained from the patient, normalized against the expression level of other RNA transcripts or their products in the sample, or of a reference set of RNA transcripts or their expression products. In specific aspects, the prognostic marker is one or more markers listed in Table A, Table B, Table C or Table D or is included as one or more of the prognostic signatures derived from the markers listed in Table A, Table B, Table C or Table D.

In further aspects, the expression levels of the prognostic markers or their expression products are determined, e.g., for the markers listed in Table A, Table B, Table C or Table D, or a prognostic signature derived from the markers listed in Table A, Table B, Table C or Table D. In another aspect, the method comprises the determination of the expression levels of a full set of prognosis markers or their expression products, e.g., for the markers listed in Table A, Table B, Table C or Table D, or a prognostic signature derived from the markers listed in Table A, Table B, Table C or Table D.

In an additional aspect, the invention relates to an array (e.g., microarray) comprising polynucleotides hybridizing to two or more markers, e.g., for the markers listed in Table A, Table B, Table C or Table D, or a prognostic signature derived from the markers listed in Table A, Table B, Table C or Table D. In particular aspects, the array comprises polynucleotides hybridizing to a prognostic signature derived from the markers listed in Table A, Table B, Table C or Table D, or, e.g., for a prognostic signature. In another specific aspect, the array comprises polynucleotides hybridizing to the full set of markers, e.g., for the markers listed in Table A, Table B, Table C or Table D, or, e.g., for a prognostic signature.

For these arrays, the polynucleotides can be cDNAs, or oligonucleotides, and the solid surface on which they are displayed can be glass, for example. The polynucleotides can hybridize to one or more of the markers as disclosed herein, for example, to the full-length sequences, any coding sequences, any fragments, or any complements thereof. In particular aspects, an increase or decrease in expression levels of one or more GCPM indicates a decreased likelihood of long-term survival, e.g., due to cancer recurrence, while a lack of an increase or decrease in expression levels of one or more GCPM indicates an increased likelihood of long-term survival without cancer recurrence.

In further aspects, the invention relates to a kit comprising one or more of: (1) extraction buffer/reagents and protocol; (2) reverse transcription buffer/reagents and protocol; and (3) quantitative PCR buffer/reagents and protocol suitable for performing any of the foregoing methods. Other aspects and advantages of the invention are illustrated in the description and examples included herein.

Table A: Proliferation-related genes differentially expressed between cell lines in high and low proliferative states. Genes that were differentially expressed between cell lines in confluent (low proliferation) and semi-confluent (high proliferation) states (see Figure 1) were identified by microarray analysis on 30K MWG Biotech arrays. Table A comprises the subset of these genes that were categorized by gene ontology analysis as cell proliferation-related.

Table B: GCPMs for cell proliferation signature

Table B: Known cell proliferation-related genes. All genes categorized as cell proliferation-related by gene ontology analysis and present on the Affymetrix HG-U133 platform.

General Approaches to Prognostic Marker Detection

The following approaches are non-limiting methods that can be used to detect the proliferation markers, including GCPM family members: microarray approaches using oligonucleotide probes selective for a GCPM; real-time qPCR on tumour samples using GCPM specific primers and probes; real-time qPCR on lymph node, blood, serum, faecal, or urine samples using GCPM specific primers and probes; enzyme-linked immunological assays (ELISA); immunohistochemistry using anti-marker antibodies; and analysis of array or qPCR data using computers.

Other useful methods include northern blotting and in situ hybridization (Parker and Barnes, Methods in Molecular Biology 106: 247-283 (1999)); RNase protection assays (Hod, BioTechniques 13: 852-854 (1992)); reverse transcription polymerase chain reaction (RT-PCR; Weis et al., Trends in Genetics 8: 263-264 (1992)); serial analysis of gene expression (SAGE; Velculescu et al., Science 270: 484-487 (1995); and Velculescu et al., Cell 88: 243-51 (1997)), MassARRAY technology (Sequenom, San Diego, CA), and gene expression analysis by massively parallel signature sequencing (MPSS; Brenner et al., Nature Biotechnology 18: 630-634 (2000)). Alternatively, antibodies may be employed that can recognize specific complexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes.

Primary data can be collected and fold change analysis can be performed, for example, by comparison of marker expression levels in tumour tissue and non-tumour tissue; by comparison of marker expression levels to levels determined in recurring tumours and non-recurring tumours; by comparison of marker expression levels to levels determined in tumours with or without metastasis; by comparison of marker expression levels to levels determined in differently staged tumours; or by comparison of marker expression levels to levels determined in cells with different levels of proliferation. A negative or positive prognosis is determined based on this analysis. Further analysis of tumour marker expression includes matching those markers exhibiting increased or decreased expression with expression profiles of known gastrointestinal tumours to provide a prognosis.

A threshold for concluding that expression is increased is provided as, for example, at least a 1.5-fold or 2-fold increase, and in alternative embodiments, at least a 3-fold increase, 4-fold increase, or 5-fold increase. A threshold for concluding that expression is decreased is provided as, for example, at least a 1.5-fold or 2-fold decrease, and in alternative embodiments, at least a 3-fold decrease, 4-fold decrease, or 5-fold decrease. It can be appreciated that other thresholds for concluding that increased or decreased expression has occurred can be selected without departing from the scope of this invention.
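The fold-change thresholds described above can be applied as in the following minimal sketch. It assumes that expression levels are positive values on a linear (not log) scale and that the reference level comes from an appropriate comparison sample; the function name `fold_change_call` and the example values are hypothetical.

```python
def fold_change_call(test_level, reference_level, threshold=2.0):
    """Classify expression as increased, decreased or unchanged.

    A call of "increased" requires at least a `threshold`-fold increase
    over the reference, and "decreased" at least a `threshold`-fold
    decrease. Levels are assumed positive and on a linear scale.
    """
    ratio = test_level / reference_level
    if ratio >= threshold:
        return "increased"
    if ratio <= 1 / threshold:
        return "decreased"
    return "unchanged"


# Hypothetical marker levels in arbitrary units.
call_a = fold_change_call(300, 100)                 # 3-fold higher
call_b = fold_change_call(40, 100)                  # 2.5-fold lower
call_c = fold_change_call(120, 100, threshold=1.5)  # within 1.5-fold
```

Other thresholds, such as 1.5-fold, 3-fold, 4-fold, or 5-fold, can be substituted through the `threshold` argument.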

It will also be appreciated that a threshold for concluding that expression is increased will be dependent on the particular marker and also the particular predictive model that is to be applied. The threshold is generally set to achieve the highest sensitivity and selectivity with the lowest error rate, although variations may be desirable for a particular clinical situation. The desired threshold is determined by analysing a population of sufficient size taking into account the statistical variability of any predictive model and is calculated from the size of the sample used to produce the predictive model. The same applies for the determination of a threshold for concluding that expression is decreased. It can be appreciated that other thresholds, or methods for establishing a threshold, for concluding that increased or decreased expression has occurred can be selected without departing from the scope of this invention.
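One simple way of setting such a threshold, offered here only as an illustrative assumption, is to scan candidate cut-offs over a training population and keep the one giving the lowest error rate. In the sketch below, samples whose expression score falls below the cut-off are called recurrent, consistent with low proliferation marker expression indicating a negative prognosis. The function name `choose_cutoff` and the data are hypothetical, and at least two distinct scores are required.

```python
def choose_cutoff(scores, recurred):
    """Return (cutoff, error_rate) minimising misclassifications.

    scores: expression scores, one per training sample.
    recurred: booleans, True if the cancer recurred.
    Samples scoring below the cut-off are predicted to recur.
    """
    values = sorted(set(scores))
    # Candidate cut-offs lie midway between adjacent distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cutoff):
        return sum((s < cutoff) != r for s, r in zip(scores, recurred))

    best = min(candidates, key=errors)
    return best, errors(best) / len(scores)


# Hypothetical training population of eight tumours.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
recurred = [True, True, True, False, False, True, False, False]
cutoff, error_rate = choose_cutoff(scores, recurred)
```

With these invented data the chosen cut-off is 3.5, which misclassifies one of the eight samples. A threshold derived this way would be expected to vary with the size and composition of the training population.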

It is also possible that a prediction model may produce as its output a numerical value, for example a score, likelihood value or probability. In these instances, it is possible to apply thresholds to the results produced by prediction models, and in these cases similar principles apply as those used to set thresholds for expression values.

Once the expression level of one or more proliferation markers in a tumour sample has been obtained, the likelihood of the cancer recurring can then be determined. In accordance with the invention, a negative prognosis is associated with decreased expression of at least one proliferation marker, while a positive prognosis is associated with increased expression of at least one proliferation marker. In various aspects, an increase in expression is shown by at least 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or 75 of the markers disclosed herein. In other aspects, a decrease in expression is shown by at least 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or 75 of the markers disclosed herein.

From the genes identified, proliferation signatures comprising one or more GCPMs can be used to determine the prognosis of a cancer, by comparing the expression level of the one or more genes to the disclosed proliferation signature. By comparing the expression of one or more of the GCPMs in a tumour sample with the disclosed proliferation signature, the likelihood of the cancer recurring can be determined. The comparison of expression levels of the prognostic signature to establish a prognosis can be done by applying a predictive model as described previously.

Determining the likelihood of the cancer recurring is of great value to the medical practitioner. A high likelihood of recurrence means that a longer or higher dose treatment should be given, and the patient should be more closely monitored for signs of recurrence of the cancer. An accurate prognosis is also of benefit to the patient. It allows the patient, along with their partners, family, and friends to also make decisions about treatment, as well as decisions about their future and lifestyle changes. Therefore, the invention also provides for a method of establishing a treatment regime for a particular cancer based on the prognosis established by matching the expression of the markers in a tumour sample with the differential proliferation signature.

It will be appreciated that the marker selection, or construction of a proliferation signature, does not have to be restricted to the GCPMs disclosed in Table A, Table B, Table C or Table D, herein, but could involve the use of one or more GCPMs from the disclosed signature, or a new signature may be established using GCPMs selected from the disclosed marker lists. The requirement of any signature is that it predicts the likelihood of recurrence with enough accuracy to assist a medical practitioner to establish a treatment regime.

Surprisingly, it was discovered that many of the GCPMs were associated with increased levels of cell proliferation, and were also associated with a positive prognosis. It has similarly been found that there is a close correlation between the decreased expression level of GCPMs and a negative prognosis, e.g., an increased likelihood of gastrointestinal cancer recurring. Therefore, the present invention also provides for the use of a marker associated with cell proliferation, e.g., a cell cycle component, as a GCPM.

As described herein, determination of the likelihood of a cancer recurring can be accomplished by measuring expression of one or more proliferation-specific markers. The methods provided herein also include assays of high sensitivity. In particular, qPCR is extremely sensitive, and can be used to detect markers in very low copy number (e.g., 1 - 100) in a sample. With such sensitivity, prognosis of gastrointestinal cancer is made reliable, accurate, and easily tested.

Reverse Transcription PCR (RT-PCR)

Of the techniques listed above, the most sensitive and most flexible quantitative method is RT-PCR, which can be used to compare RNA levels in different sample populations, in normal and tumour tissues, with or without drug treatment, to characterize patterns of expression, to discriminate between closely related RNAs, and to analyze RNA structure.

For RT-PCR, the first step is the isolation of RNA from a target sample. The starting material is typically total RNA isolated from human tumours or tumour cell lines, and corresponding normal tissues or cell lines, respectively. RNA can be isolated from a variety of samples, such as tumour samples from breast, lung, colon (e.g., large bowel or small bowel), colorectal, gastric, esophageal, anal, rectal, prostate, brain, liver, kidney, pancreas, spleen, thymus, testis, ovary, uterus, etc., tissues, from primary tumours, or tumour cell lines, and from pooled samples from healthy donors. If the source of RNA is a tumour, RNA can be extracted, for example, from frozen or archived paraffin-embedded and fixed (e.g., formalin-fixed) tissue samples.

Once RNA has been isolated, the first step in gene expression profiling by RT-PCR is the reverse transcription of the RNA template into cDNA, followed by its exponential amplification in a PCR reaction. The two most commonly used reverse transcriptases are avian myeloblastosis virus reverse transcriptase (AMV-RT) and Moloney murine leukaemia virus reverse transcriptase (MMLV-RT). The reverse transcription step is typically primed using specific primers, random hexamers, or oligo-dT primers, depending on the circumstances and the goal of expression profiling. For example, extracted RNA can be reverse-transcribed using a GeneAmp RNA PCR kit (Perkin Elmer, CA, USA), following the manufacturer's instructions. The derived cDNA can then be used as a template in the subsequent PCR reaction.

Although the PCR step can use a variety of thermostable DNA-dependent DNA polymerases, it typically employs the Taq DNA polymerase, which has a 5'-3' nuclease activity but lacks a 3'-5' proofreading exonuclease activity. Thus, TaqMan® PCR typically utilizes the 5' nuclease activity of Taq or Tth polymerase to hydrolyze a hybridization probe bound to its target amplicon, but any enzyme with equivalent 5' nuclease activity can be used.

Two oligonucleotide primers are used to generate an amplicon typical of a PCR reaction. A third oligonucleotide, or probe, is designed to detect nucleotide sequence located between the two PCR primers. The probe is non-extendible by Taq DNA polymerase enzyme, and is labeled with a reporter fluorescent dye and a quencher fluorescent dye. Any laser-induced emission from the reporter dye is quenched by the quenching dye when the two dyes are located close together as they are on the probe. During the amplification reaction, the Taq DNA polymerase enzyme cleaves the probe in a template-dependent manner. The resultant probe fragments disassociate in solution, and signal from the released reporter dye is free from the quenching effect of the second fluorophore. One molecule of reporter dye is liberated for each new molecule synthesized, and detection of the unquenched reporter dye provides the basis for quantitative interpretation of the data. TaqMan RT-PCR can be performed using commercially available equipment, such as, for example, ABI PRISM 7700™ Sequence Detection System (Perkin-Elmer-Applied Biosystems, Foster City, CA, USA), or Lightcycler (Roche Molecular Biochemicals, Mannheim, Germany). In a preferred embodiment, the 5' nuclease procedure is run on a real-time quantitative PCR device such as the ABI PRISM 7700™ Sequence Detection System. The system consists of a thermocycler, laser, charge-coupled device (CCD) camera, and computer. The system amplifies samples in a 96-well format on a thermocycler. During amplification, laser-induced fluorescent signal is collected in real-time through fibre optics cables for all 96 wells, and detected at the CCD. The system includes software for running the instrument and for analyzing the data.

5' nuclease assay data are initially expressed as Ct, or the threshold cycle. As discussed above, fluorescence values are recorded during every cycle and represent the amount of product amplified to that point in the amplification reaction. The point when the fluorescent signal is first recorded as statistically significant is the threshold cycle.
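As a rough illustration of how a threshold cycle can be read from a fluorescence trace, the following Python sketch sets the threshold at the baseline mean plus ten standard deviations and linearly interpolates the crossing point. The function name, the number of baseline cycles and the ten-standard-deviation rule are assumptions made for illustration only; instrument software may apply a different criterion for statistical significance.

```python
from statistics import mean, stdev

def threshold_cycle(fluorescence, baseline_cycles=10, n_sd=10.0):
    """Return the fractional cycle at which the signal first exceeds the threshold.

    fluorescence: one reading per cycle, cycle 1 first.
    The threshold (baseline mean + n_sd * baseline SD) is an assumed rule.
    Returns None if the threshold is never exceeded.
    """
    baseline = fluorescence[:baseline_cycles]
    threshold = mean(baseline) + n_sd * stdev(baseline)
    for i in range(baseline_cycles, len(fluorescence)):
        prev, curr = fluorescence[i - 1], fluorescence[i]
        if curr > threshold:
            if curr == prev:
                return float(i + 1)
            # prev was read at cycle i and curr at cycle i + 1 (cycles count from 1).
            frac = (threshold - prev) / (curr - prev)
            return i + min(1.0, max(0.0, frac))
    return None

# Hypothetical trace: ten noisy baseline cycles followed by amplification.
trace = [1.0, 1.2] * 5 + [1.5, 2.0, 3.0, 5.0, 9.0]
ct = threshold_cycle(trace)
print(ct)
```

A lower Ct indicates that more template was present at the start of the reaction, since fewer cycles were needed to reach the threshold.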

To minimize errors and the effect of sample-to-sample variation, RT-PCR is usually performed using an internal standard. The ideal internal standard is expressed at a constant level among different tissues, and is unaffected by the experimental treatment. RNAs most frequently used to normalize patterns of gene expression are mRNAs for the housekeeping genes glyceraldehyde-3-phosphate-dehydrogenase (GAPDH) and β-actin.
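Normalization against an internal standard can be sketched with the widely used comparative Ct calculation (often written 2^-ΔΔCt). The text does not state which calculation is used, so the formula, the function name and the Ct values below are illustrative assumptions; the sketch also assumes that product doubles exactly once per cycle.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in a sample relative to a control,
    each normalised to a housekeeping (reference) gene.

    Assumes 100% amplification efficiency (doubling every cycle).
    """
    delta_sample = ct_target_sample - ct_ref_sample      # normalise the sample
    delta_control = ct_target_control - ct_ref_control   # normalise the control
    return 2.0 ** -(delta_sample - delta_control)

# Hypothetical Ct values: a marker and GAPDH in tumour and in normal tissue.
fold = relative_expression(24.0, 18.0, 26.0, 18.0)
print(fold)  # 4.0: the marker crosses the threshold two cycles earlier in tumour
```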

Real-time quantitative PCR (qPCR)

A more recent variation of the RT-PCR technique is real-time quantitative PCR, which measures PCR product accumulation through a dual-labeled fluorogenic probe (i.e., TaqMan® probe). Real-time PCR is compatible both with quantitative competitive PCR and with quantitative comparative PCR. The former uses an internal competitor for each target sequence for normalization, while the latter uses a normalization gene contained within the sample, or a housekeeping gene for RT-PCR. For further details see, e.g., Heid et al., Genome Research 6: 986-994 (1996).

Expression levels can be determined using fixed, paraffin-embedded tissues as the RNA source. According to one aspect of the present invention, PCR primers and probes are designed based upon intron sequences present in the gene to be amplified. In this embodiment, the first step in the primer/probe design is the delineation of intron sequences within the genes. This can be done by publicly available software, such as the DNA BLAT software developed by Kent, W. J., Genome Res. 12 (4): 656-64 (2002), or by the BLAST software including its variations. Subsequent steps follow well established methods of PCR primer and probe design.

In order to avoid non-specific signals, it is useful to mask repetitive sequences within the introns when designing the primers and probes. This can be easily accomplished by using the Repeat Masker program available on-line through the Baylor College of Medicine, which screens DNA sequences against a library of repetitive elements and returns a query sequence in which the repetitive elements are masked. The masked sequences can then be used to design primer and probe sequences using any commercially or otherwise publicly available primer/probe design packages, such as Primer Express (Applied Biosystems); MGB assay-by-design (Applied Biosystems); Primer3 (Steve Rozen and Helen J. Skaletsky (2000) Primer3 on the WWW for general users and for biologist programmers in: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, NJ, pp 365-386).

The most important factors considered in PCR primer design include primer length, melting temperature (T_m), G/C content, specificity, complementary primer sequences, and 3' end sequence. In general, optimal PCR primers are 17-30 bases in length, and contain about 20-80%, such as, for example, about 50-60% G+C bases. T_m values between 50 and 80°C, e.g., about 50 to 70°C, are typically preferred. For further guidelines for PCR primer and probe design see, e.g., Dieffenbach, C. W. et al., General Concepts for PCR Primer Design in: PCR Primer, A Laboratory Manual, Cold Spring Harbor Laboratory Press, New York, 1995, pp. 133-155; Innis and Gelfand, Optimization of PCRs in: PCR Protocols, A Guide to Methods and Applications, CRC Press, London, 1994, pp. 5-11; and Plasterer, T. N. Primerselect: Primer and probe design. Methods Mol. Biol. 70: 520-527 (1997), the entire disclosures of which are hereby expressly incorporated by reference.
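The length, G+C and T_m ranges given above can be checked with a few lines of Python. This is a minimal sketch: the function names and the example sequence are hypothetical, and the melting temperature uses the simple Wallace rule of thumb (2°C per A/T, 4°C per G/C), which is an assumption here and is only a rough guide for short oligonucleotides. Dedicated design packages such as those named above use more refined models.

```python
def gc_fraction(primer):
    """Fraction of bases that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Approximate melting temperature in degrees C: 2(A+T) + 4(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

def check_primer(primer, length=(17, 30), gc=(0.20, 0.80), tm=(50, 80)):
    """Compare a candidate primer against the ranges quoted in the text."""
    return {
        "length_ok": length[0] <= len(primer) <= length[1],
        "gc_ok": gc[0] <= gc_fraction(primer) <= gc[1],
        "tm_ok": tm[0] <= wallace_tm(primer) <= tm[1],
    }

# Hypothetical 20-mer, not taken from any gene in this document.
candidate = "AGCTGACCTGAAGGCTCATG"
print(len(candidate), gc_fraction(candidate), wallace_tm(candidate))
print(check_primer(candidate))
```

Specificity, primer complementarity and the 3' end sequence are not covered by this check and would still need to be assessed separately.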

Microarray analysis

Differential gene expression can also be identified, or confirmed, using the microarray technique. Thus, the expression profile of GCPMs can be measured in either fresh or paraffin-embedded tumour tissue, using microarray technology. In this method, polynucleotide sequences of interest (including cDNAs and oligonucleotides) are plated, or arrayed, on a microchip substrate. The arrayed sequences (i.e., capture probes) are then hybridized with specific polynucleotides from cells or tissues of interest (i.e., targets). Just as in the RT-PCR method, the source of RNA typically is total RNA isolated from human tumours or tumour cell lines, and corresponding normal tissues or cell lines. Thus RNA can be isolated from a variety of primary tumours or tumour cell lines. If the source of RNA is a primary tumour, RNA can be extracted, for example, from frozen or archived paraffin-embedded and fixed (e.g., formalin-fixed) tissue samples, which are routinely prepared and preserved in everyday clinical practice.

In a specific embodiment of the microarray technique, PCR amplified inserts of cDNA clones are applied to a substrate. The substrate can include up to 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or 75 nucleotide sequences. In other aspects, the substrate can include at least 10,000 nucleotide sequences. The microarrayed sequences, immobilized on the microchip, are suitable for hybridization under stringent conditions. As other embodiments, the targets for the microarrays can be at least 50, 100, 200, 400, 500, 1000, or 2000 bases in length; or 50-100, 100-200, 100-500, 100-1000, 100-2000, or 500-5000 bases in length. As further embodiments, the capture probes for the microarrays can be at least 10, 15, 20, 25, 50, 75, 80, or 100 bases in length; or 10-15, 10-20, 10-25, 10-50, 10-75, 10-80, or 20-80 bases in length.

Fluorescently labeled cDNA probes may be generated through incorporation of fluorescent nucleotides by reverse transcription of RNA extracted from tissues of interest. Labeled cDNA probes applied to the chip hybridize with specificity to each spot of DNA on the array. After stringent washing to remove non-specifically bound probes, the chip is scanned by confocal laser microscopy or by another detection method, such as a CCD camera. Quantitation of hybridization of each arrayed element allows for assessment of corresponding mRNA abundance. With dual colour fluorescence, separately labeled cDNA probes generated from two sources of RNA are hybridized pairwise to the array. The relative abundance of the transcripts from the two sources corresponding to each specified gene is thus determined simultaneously.
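For a dual colour experiment, the relative abundance of each transcript is commonly summarised as a log2 ratio of the two channel intensities. The sketch below assumes background-corrected, positive intensities; the function name, gene names and values are hypothetical.

```python
import math

def log2_ratios(channel_a, channel_b):
    """Per-gene log2(channel_a / channel_b) for genes present in both channels."""
    return {gene: math.log2(channel_a[gene] / channel_b[gene])
            for gene in channel_a if gene in channel_b}

# Hypothetical spot intensities for tumour (channel A) and normal (channel B) RNA.
tumour = {"geneA": 800.0, "geneB": 200.0, "geneC": 400.0}
normal = {"geneA": 200.0, "geneB": 200.0, "geneC": 1600.0}
ratios = log2_ratios(tumour, normal)
print(ratios)
```

On this scale a value of 1 corresponds to a two-fold increase and a value of -1 to a two-fold decrease, so increases and decreases are treated symmetrically.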

The miniaturized scale of the hybridization affords a convenient and rapid evaluation of the expression pattern for large numbers of genes. Such methods have been shown to have the sensitivity required to detect rare transcripts, which are expressed at a few copies per cell, and to reproducibly detect at least approximately two-fold differences in the expression levels (Schena et al., Proc. Natl. Acad. Sci. USA 93 (20): 10614-10619 (1996)).

Microarray analysis can be performed by commercially available equipment, following manufacturer's protocols, such as by using the Affymetrix GeneChip technology, or Incyte's microarray technology. The development of microarray methods for large-scale analysis of gene expression makes it possible to search systematically for molecular markers of cancer classification and outcome prediction in a variety of tumour types.

RNA isolation, purification, and amplification

General methods for mRNA extraction are well known in the art and are disclosed in standard textbooks of molecular biology, including Ausubel et al., Current Protocols in Molecular Biology, John Wiley and Sons (1997). Methods for RNA extraction from paraffin-embedded tissues are disclosed, for example, in Rupp and Locker, Lab Invest. 56: A67 (1987), and De Andrés et al., BioTechniques 18: 42-44 (1995). In particular, RNA isolation can be performed using a purification kit, buffer set, and protease from commercial manufacturers, such as Qiagen, according to the manufacturer's instructions. For example, total RNA from cells in culture can be isolated using Qiagen RNeasy mini-columns. Other commercially available RNA isolation kits include MasterPure Complete DNA and RNA Purification Kit (EPICENTRE®, Madison, WI), and Paraffin Block RNA Isolation Kit (Ambion, Inc.). Total RNA from tissue samples can be isolated using RNA Stat-60 (Tel-Test). RNA prepared from tumour can be isolated, for example, by cesium chloride density gradient centrifugation.

The steps of a representative protocol for profiling gene expression using fixed, paraffin-embedded tissues as the RNA source, including mRNA isolation, purification, primer extension and amplification are given in various published journal articles (for example: T. E. Godfrey et al. J. Molec. Diagnostics 2: 84-91 (2000); K. Specht et al., Am. J. Pathol. 158: 419-29 (2001)). Briefly, a representative process starts with cutting about 10 μm thick sections of paraffin-embedded tumour tissue samples. The RNA is then extracted, and protein and DNA are removed. After analysis of the RNA concentration, RNA repair and/or amplification steps may be included, if necessary, and RNA is reverse transcribed using gene-specific primers followed by RT-PCR. Finally, the data are analyzed to identify the best treatment option(s) available to the patient on the basis of the characteristic gene expression pattern identified in the tumour sample examined.

Immunohistochemistry and proteomics

Immunohistochemistry methods are also suitable for detecting the expression levels of the proliferation markers of the present invention. Thus, antibodies or antisera, preferably polyclonal antisera, and most preferably monoclonal antibodies specific for each marker, are used to detect expression. The antibodies can be detected by direct labeling of the antibodies themselves, for example, with radioactive labels, fluorescent labels, hapten labels such as biotin, or an enzyme such as horseradish peroxidase or alkaline phosphatase. Alternatively, unlabeled primary antibody is used in conjunction with a labeled secondary antibody, comprising antisera, polyclonal antisera or a monoclonal antibody specific for the primary antibody. Immunohistochemistry protocols and kits are well known in the art and are commercially available.

Proteomics can be used to analyze the polypeptides present in a sample (e.g., tissue, organism, or cell culture) at a certain point of time. In particular, proteomic techniques can be used to assess the global changes of protein expression in a sample (also referred to as expression proteomics). Proteomic analysis typically includes: (1) separation of individual proteins in a sample by 2-D gel electrophoresis (2-D PAGE); (2) identification of the individual proteins recovered from the gel, e.g., by mass spectrometry or N-terminal sequencing; and (3) analysis of the data using bioinformatics. Proteomics methods are valuable supplements to other methods of gene expression profiling, and can be used, alone or in combination with other methods, to detect the products of the proliferation markers of the present invention.

Selection of Differentially Expressed Genes.

An early approach to the selection of genes deemed significant involved simply looking at the "fold change" of a given gene between the two groups of interest. While this approach homes in on genes that seem to change the most spectacularly, consideration of basic statistics leads one to realize that if the variance (or noise level) is quite high (as is often seen in microarray experiments), then seemingly large fold changes can happen frequently by chance alone.

Microarray experiments, such as those described here, typically involve the simultaneous measurement of thousands of genes. If one is comparing the expression levels for a particular gene between two groups (for example recurrent and non-recurrent tumours), the typical tests for significance (such as the t-test) are not adequate. This is because, in an ensemble of thousands of experiments (in this context each gene constitutes an "experiment"), the probability of at least one experiment passing the usual criteria for significance by chance alone is essentially unity. In a test for significance, one typically calculates the probability of obtaining a result at least as extreme as the one observed if the "null hypothesis" is correct. In the case of comparing two groups, the null hypothesis is that there is no difference between the two groups. If a statistical test produces such a probability below some threshold (usually 0.05 or 0.01), it is stated that we can reject the null hypothesis, and accept the hypothesis that the two groups are significantly different. Clearly, in such a test, a rejection of the null hypothesis by chance alone could be expected 1 in 20 times (or 1 in 100). The use of t-tests, or other similar statistical tests for significance, fails in the context of microarrays, producing far too many false positives (or type I errors).
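The scale of the problem can be made concrete with a short calculation. The sketch below assumes that every null hypothesis is true and that the tests are independent, which is a simplification for genes measured on the same array; the function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def expected_false_positives(alpha, n_tests):
    """Average number of tests that pass the threshold by chance alone."""
    return alpha * n_tests

for n_genes in (1, 20, 1000, 10000):
    print(n_genes,
          familywise_error_rate(0.05, n_genes),
          expected_false_positives(0.05, n_genes))
```

Under these assumptions, a threshold of 0.05 applied to 10,000 genes would be expected to flag roughly 500 genes even when no gene truly differs between the groups.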

In this type of situation, where one is testing multiple hypotheses at the same time, one applies typical multiple comparison procedures, such as the Bonferroni Method (43). However such tests are too conservative for most microarray experiments, resulting in too many false negative (type II) errors.

A more recent approach is to do away with attempting to apply a probability for a given test being significant, and establish a means for selecting a subset of experiments, such that the expected proportion of Type I errors (or false discovery rate; 47) is controlled for. It is this approach that has been used in this investigation, through various implementations, namely the methods provided with BRB Array Tools (48), and the limma (11, 42) package of Bioconductor (that uses the R statistical environment; 10, 39).
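The contrast between a Bonferroni correction and false discovery rate control can be sketched as follows. The Benjamini-Hochberg step-up procedure is used here as one common way of controlling the false discovery rate; the text does not state which procedure the cited tools apply, so this choice, the function names and the p-values are assumptions for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected

# Hypothetical p-values for ten genes, in no particular order.
p = [0.041, 0.001, 0.216, 0.008, 0.039, 0.06, 0.074, 0.205, 0.212, 0.042]
print(sum(bonferroni(p)), sum(benjamini_hochberg(p)))
```

With these values the Bonferroni threshold (0.005) retains one gene while the step-up procedure retains two, which illustrates why the former is described as conservative.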

General methodology for Data Mining: Generation of Prognostic Signatures

Data Mining is the term used to describe the extraction of "knowledge", in other words the "know-how", or predictive ability from (usually) large volumes of data (the dataset). This is the approach used in this study to generate prognostic signatures. In the case of this study the "know-how" is the ability to accurately predict prognosis from a given set of gene expression measurements, or "signature" (as described generally in this section and in more detail in the examples section).

The specific details of the methods used in this study are described in Examples 17-20. However, application of any of the data mining methods (both those described in the Examples, and those described here) can follow this general protocol.

Data mining (49), and the related topic of machine learning (40), is a complex, repetitive mathematical task that involves the use of one or more appropriate computer software packages (see below). The use of software is advantageous on the one hand, in that one does not need to be completely familiar with the intricacies of the theory behind each technique in order to successfully use data mining techniques, provided that one adheres to the correct methodology. The disadvantage is that the application of data mining can often be viewed as a "black box": one inserts the data and receives the answer. How this is achieved is often masked from the end-user (this is the case for many of the techniques described, and can often influence the statistical method chosen for data mining). For example, neural networks and support vector machines have a particularly complex implementation that makes it very difficult for the end user to extract the "rules" used to produce the decision. On the other hand, k-nearest neighbours and linear discriminant analysis have a very transparent process for decision making that is not hidden from the user.

There are two types of approach used in data mining: supervised and unsupervised approaches. In the supervised approach, the information that is being linked to the data is known, such as categorical data (e.g. recurrent vs. non-recurrent tumours). What is required is the ability to link the observed response (e.g. recurrence vs. non-recurrence) to the input variables. In the unsupervised approach, the classes within the dataset are not known in advance, and data mining methodology is employed to attempt to find the classes or structure within the dataset.

In the present example the supervised approach was used and is discussed in detail here, although it will be appreciated that any of the other techniques could be used.

The overall protocol involves the following steps:

• Data representation. This involves transformation of the data into a form that is most likely to work successfully with the chosen data mining technique. Where the data is numerical, such as in this study where the data being investigated represents relative levels of gene expression, this is fairly simple. If the data covers a large dynamic range (i.e. many orders of magnitude), often the log of the data is taken. If the data covers many measurements of separate samples on separate days by separate investigators, particular care has to be taken to ensure systematic error is minimised. The minimisation of systematic error (i.e. errors resulting from protocol differences, machine differences, operator differences and other quantifiable factors) is the process referred to here as "normalisation".

• Feature Selection. Typically the dataset contains many more data elements than would be practical to measure on a day-to-day basis, and additionally many elements that do not provide the information needed to produce a prediction model. The actual ability of a prediction model to describe a dataset is derived from some subset of the full dimensionality of the dataset. These dimensions are the most important components (or features) of the dataset. Note that in the context of microarray data, the dimensions of the dataset are the individual genes. Feature selection, in the context described here, involves finding those genes which are most "differentially expressed". In a more general sense, it involves finding those variables which pass some statistical test for significance, i.e. is the level of a particular variable consistently higher or lower in one or other of the groups being investigated. Sometimes the features are those variables (or dimensions) which exhibit the greatest variance.

The application of feature selection is completely independent of the method used to create a prediction model, and involves a great deal of experimentation to achieve the desired results. Within this invention, the selection of significant genes, and those which correlated with the earlier successful model (the NZ classifier), entailed feature selection. In addition, methods of data reduction (such as principal component analysis) can be applied to the dataset.

• Training. Once the classes (e.g. recurrence/non-recurrence) and the features of the dataset have been established, and the data is represented in a form that is acceptable as input for data mining, the reduced dataset (as described by the features) is applied to the prediction model of choice. The input for this model is usually in the form of a multi-dimensional numerical input (known as a vector), with associated output information (a class label or a response). In the training process, selected data is input into the prediction model, either sequentially (in techniques such as neural networks) or as a whole (in techniques that apply some form of regression, such as linear models, linear discriminant analysis, support vector machines). In some instances (e.g. k-nearest neighbours) the dataset (or subset of the dataset obtained after feature selection) is itself the model. As discussed, effective models can be established with minimal understanding of the detailed mathematics, through the use of various software packages where the parameters of the model have been pre-determined by expert analysts as most likely to lead to successful results.

• Validation. This is a key component of the data-mining protocol, and the incorrect application of this frequently leads to errors. Portions of the dataset are to be set aside, apart from feature selection and training, to test the success of the prediction model. Furthermore, if the results of validation are used to adjust feature selection and training of the model, then a further validation set must be obtained to test the model before it is applied to real-life situations. If this process is not strictly adhered to, the model is likely to fail in real-world situations. The methods of validation are described in more detail below.

• Application. Once the model has been constructed and validated, it must be packaged in some way so that it is accessible to end users. This often involves implementation of some form of spreadsheet application, into which the model has been embedded, scripting of a statistical software package, or refactoring of the model into a hard-coded application by information technology staff.
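The protocol steps above can be sketched end to end in Python on simulated data. Everything specific in this sketch is an assumption made for illustration: the data are randomly generated, features are ranked by a Welch t-statistic, and a nearest-centroid rule stands in for the prediction model. It is not the procedure used in the Examples.

```python
import math
import random
from statistics import mean, variance

def log2_transform(rows):
    """Data representation: compress a wide dynamic range."""
    return [[math.log2(v) for v in row] for row in rows]

def welch_t(a, b):
    """Welch t-statistic for two groups of measurements."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def select_features(X, y, n_keep):
    """Feature selection: keep the n_keep genes (columns) with the largest |t|."""
    scored = []
    for g in range(len(X[0])):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scored.append((abs(welch_t(pos, neg)), g))
    return [g for _, g in sorted(scored, reverse=True)[:n_keep]]

def train(X, y, genes):
    """Training: the model is one mean vector (centroid) per class."""
    model = {}
    for label in set(y):
        rows = [row for row, l in zip(X, y) if l == label]
        model[label] = [mean(row[g] for row in rows) for g in genes]
    return model

def predict(model, genes, row):
    """Assign the class whose centroid is closest in Euclidean distance."""
    point = [row[g] for g in genes]
    return min(model, key=lambda label: math.dist(point, model[label]))

# Simulated data (hypothetical): 40 samples x 50 genes; genes 0-4 are about
# four-fold higher in class 1 (e.g. recurrent), the remaining genes are noise.
rng = random.Random(0)
labels = [0, 1] * 20
raw = [[(400.0 if (label == 1 and g < 5) else 100.0) * 2 ** rng.gauss(0, 0.3)
        for g in range(50)] for label in labels]

X = log2_transform(raw)
train_X, train_y = X[:30], labels[:30]   # used for feature selection and training
held_X, held_y = X[30:], labels[30:]     # set aside; not seen until validation

genes = select_features(train_X, train_y, 5)
model = train(train_X, train_y, genes)
accuracy = sum(predict(model, genes, r) == t
               for r, t in zip(held_X, held_y)) / len(held_y)
print(sorted(genes), accuracy)
```

Note that feature selection is run on the training portion only, so that the held-out samples play no part in choosing the genes, in keeping with the validation step described above.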

Examples of software packages that are frequently used are:

- Spreadsheet plugins, obtained from multiple vendors.

- The R statistical environment.

- The commercial packages MatLab, S-plus, SAS, SPSS, STATA.

- Free open-source software such as Octave (a MatLab clone).

- Many and varied C++ libraries, which can be used to implement prediction models in a commercial, closed-source setting.

Examples of Data Mining Methods.

The methods can be applied by first performing the steps of the data mining process (above), and then applying the appropriate known software packages. The process of data mining is described in detail in many extremely well-written texts (49).

• Linear models (49, 50): The data is treated as the input of a linear regression model, of which the class labels or response variables are the output. Class labels, or other categorical data, must be transformed into numerical values (usually integer). In generalised linear models, the class labels or response variables are not themselves linearly related to the input data, but are transformed through the use of a "link function". Logistic regression is the most common form of generalized linear model.

• Linear Discriminant analysis (49, 51, 52). Provided the data is linearly separable (i.e. the groups or classes of data can be separated by a hyperplane, which is an n-dimensional extension of a threshold), this technique can be applied. A combination of variables is used to separate the classes, such that the between-group variance is maximised, and the within-group variance is minimised. The byproduct of this is the formation of a classification rule. Application of this rule to samples of unknown class allows predictions or classification of class membership to be made for that sample. There are variations of linear discriminant analysis such as nearest shrunken centroids which are commonly used for microarray analysis.

• Support vector machines (53): A collection of variables is used in conjunction with a collection of weights to determine a model that maximizes the separation between classes in terms of those weighted variables. Application of this model to a sample then produces a classification or prediction of class membership for that sample.

• Neural networks (52): The data is treated as input into a network of nodes, which superficially resemble biological neurons, which apply the input from all the nodes to which they are connected, and transform the input into an output. Commonly, neural networks use the "multiply and sum" algorithm, to transform the inputs from multiple connected input nodes into a single output. A node may not necessarily produce an output unless the inputs to that node exceed a certain threshold. Each node has as its input the output from several other nodes, with the final output node usually being linked to a categorical variable. The number of nodes, and the topology of the nodes can be varied in almost infinite ways, providing for the ability to classify extremely noisy data that may not be possible to categorize in other ways. The most common implementation of neural networks is the multi-layer perceptron.

• Classification and regression trees (54): In these, variables are used to define a hierarchy of rules that can be followed in a stepwise manner to determine the class of a sample. The typical process creates a set of rules which lead to a specific class output, or a specific statement of the inability to discriminate. An example classification tree is an implementation of an algorithm such as:

if gene A > x and gene Y > x and gene Z = Z then class A

else if gene A = q then class B

• Nearest neighbour methods (51, 52). Predictions or classifications are made by comparing a sample (of unknown class) to those around it (of known class), with closeness defined by a distance function. It is possible to define many different distance functions. Commonly used distance functions are the Euclidean distance (an extension of the Pythagorean distance, as in triangulation, to n-dimensions) and various forms of correlation (including Pearson Correlation co-efficient). There are also transformation functions that convert data points that would not normally be interconnected by a meaningful distance metric into Euclidean space, so that Euclidean distance can then be applied (e.g. Mahalanobis distance). Although the distance metric can be quite complex, the basic premise of k-nearest neighbours is quite simple, essentially being a restatement of "find the k-data vectors that are most similar to the unknown input, find out which class they correspond to, and vote as to which class the unknown input is".

• Other methods:

- Bayesian networks. A directed acyclic graph is used to represent a collection of variables in conjunction with their joint probability distribution, which is then used to determine the probability of class membership for a sample.

- Independent components analysis, in which independent signals (e.g., class membership) are isolated (into components) from a collection of variables. These components can then be used to produce a classification or prediction of class membership for a sample.

- Ensemble learning methods, in which a collection of prediction methods are combined to produce a joint classification or prediction of class membership for a sample.
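Of the methods listed above, k-nearest neighbours is the simplest to show in full. The following Python sketch uses Euclidean distance and a majority vote; the function name, the expression vectors and the class labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Find the k training vectors closest to `query` (Euclidean distance)
    and return the class that receives the most votes among them."""
    neighbours = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression vectors of known class.
vectors = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
classes = ["non-recurrent"] * 3 + ["recurrent"] * 3

print(knn_predict(vectors, classes, [1.1, 1.0]))
print(knn_predict(vectors, classes, [2.9, 3.1]))
```

Here the training data themselves are the model, and an odd value of k avoids tied votes when there are two classes.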

There are many variations of these methodologies that can be explored (49), and many new methodologies are constantly being defined and developed. It will be appreciated that any one of these methodologies can be applied in order to obtain an acceptable result. Particular care must be taken to avoid overfitting, by ensuring that all results are tested via a comprehensive validation scheme.

Validation

Application of any of the prediction methods described involves both training and cross-validation (43, 55) before the method can be applied to new datasets (such as data from a clinical trial). Training involves taking a subset of the dataset of interest (in this case gene expression measurements from colorectal tumours), such that it is stratified across the classes that are being tested for (in this case recurrent and non-recurrent tumours). This training set is used to generate a prediction model (defined above), which is tested on the remainder of the data (the testing set).

It is possible to alter the parameters of the prediction model so as to obtain better performance in the testing set. However, this can lead to the situation known as overfitting, where the prediction model works on the training dataset but not on any external dataset. In order to circumvent this, the process of validation is followed. There are two major types of validation typically applied. The first (hold-out validation) involves partitioning the dataset into three groups: testing, training, and validation. The validation set has no input into the training process whatsoever, so that any adjustment of parameters or other refinements must take place during application to the testing set (but not the validation set). The second major type is cross-validation, which can be applied in several different ways, described below.
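A hold-out partition of this kind might be sketched as follows. The split fractions, the function name and the labels are assumptions chosen for illustration; the group names follow the usage in the text, where the testing set is used for tuning and the validation set is kept untouched until the end.

```python
import random

def three_way_split(labels, fractions=(0.5, 0.25), seed=0):
    """Stratified split of sample indices into training, testing and
    validation groups. `fractions` gives the training and testing shares;
    the remainder of each class goes to validation."""
    rng = random.Random(seed)
    parts = {"training": [], "testing": [], "validation": []}
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_test = round(fractions[1] * len(indices))
        parts["training"] += indices[:n_train]
        parts["testing"] += indices[n_train:n_train + n_test]
        parts["validation"] += indices[n_train + n_test:]
    return parts

# Hypothetical cohort: 8 recurrent ("R") and 12 non-recurrent ("N") tumours.
cohort = ["R"] * 8 + ["N"] * 12
split = three_way_split(cohort)
print({name: len(indices) for name, indices in split.items()})
```

Splitting each class separately keeps the proportion of recurrent to non-recurrent samples roughly the same in all three groups.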

There are two main sub-types of cross-validation: K-fold cross-validation, and leave-one-out cross-validation.

K-fold cross-validation: The dataset is divided into K subsamples, each subsample containing approximately the same proportions of the class groups as the original. In each round of validation, one of the K subsamples is set aside, and training is accomplished using the remainder of the dataset. The effectiveness of the training for that round is gauged by how correctly the left-out group is classified. This procedure is repeated K times, and the overall effectiveness ascertained by comparison of the predicted class with the known class.

Leave-one-out cross-validation: A commonly used variation of K-fold cross validation, in which K=n, where n is the number of samples.
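Both sub-types can be expressed with one routine, since leave-one-out is the case K = n. In the sketch below a nearest-centroid rule stands in for the prediction model; that choice, the function names and the data are assumptions for illustration only.

```python
import math
from collections import defaultdict

def stratified_folds(labels, k):
    """Deal sample indices into k folds, class by class, so that each fold
    has roughly the same class proportions as the whole dataset."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for i in by_class[label]:
            folds[position % k].append(i)
            position += 1
    return folds

def nearest_centroid(train_X, train_y, query):
    """Stand-in prediction model: the class with the closest mean vector."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, l in zip(train_X, train_y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda c: math.dist(query, centroids[c]))

def cross_validate(X, y, k):
    """Fraction of samples classified correctly when each fold is left out in turn."""
    correct = 0
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train_X = [x for i, x in enumerate(X) if i not in held]
        train_y = [l for i, l in enumerate(y) if i not in held]
        correct += sum(nearest_centroid(train_X, train_y, X[i]) == y[i]
                       for i in held_out)
    return correct / len(y)

# Hypothetical two-gene expression data for 4 non-recurrent and 4 recurrent tumours.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [1.1, 1.2],
     [3.0, 3.1], [2.9, 3.2], [3.2, 2.8], [3.1, 3.0]]
y = ["N"] * 4 + ["R"] * 4

four_fold = cross_validate(X, y, k=4)
leave_one_out = cross_validate(X, y, k=len(y))
print(four_fold, leave_one_out)
```

Any feature selection would need to be repeated inside each round, using only the samples retained for training in that round, for the estimate to remain unbiased.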

Combinations of GCPMs, such as those described above in Tables 1 and 2, can be used to construct predictive models for prognosis.

Prognostic Signatures

Prognostic signatures, comprising one or more of these markers, can be used to determine the outcome of a patient, through application of one or more predictive models derived from the signature. In particular, a clinician or researcher can determine the differential expression (e.g., increased or decreased expression) of the one or more markers in the signature, apply a predictive model, and thereby predict the negative prognosis, e.g., likelihood of disease relapse, of a patient, or alternatively the likelihood of a positive prognosis (continued remission).

In still further aspects, the invention includes a method of determining a treatment regime for a cancer comprising: (a) providing a sample of the cancer; (b) detecting the expression level of a GCPM family member in said sample; (c) determining the prognosis of the cancer based on the expression level of a GCPM family member; and (d) determining the treatment regime according to the prognosis.

In still further aspects, the invention includes a device for detecting a GCPM, comprising: a substrate having a GCPM capture reagent thereon; and a detector associated with said substrate, said detector capable of detecting a GCPM associated with said capture reagent. Additional aspects include kits for detecting cancer, comprising: a substrate; a GCPM capture reagent; and instructions for use. Yet further aspects of the invention include a method for detecting a GCPM using qPCR, comprising: a forward primer specific for said GCPM; a reverse primer specific for said GCPM; PCR reagents; a reaction vial; and instructions for use.

Additional aspects of this invention comprise a kit for detecting the presence of a GCPM polypeptide or peptide, comprising: a substrate having a capture agent for said GCPM polypeptide or peptide; an antibody specific for said GCPM polypeptide or peptide; a reagent capable of labeling bound antibody for said GCPM polypeptide or peptide; and instructions for use.

In yet further aspects, this invention includes a method for determining the prognosis of colorectal cancer, comprising the steps of: providing a tumour sample from a patient suspected of having colorectal cancer; and measuring the presence of a GCPM polypeptide using an ELISA method. In specific aspects of this invention the GCPM of the invention is selected from the markers set forth in Table A, Table B, Table C or Table D. In still further aspects, the GCPM is included in a prognostic signature.

While exemplified herein for gastrointestinal cancer, e.g., gastric and colorectal cancer, the GCPMs of the invention also find use for the prognosis of other cancers, e.g., breast cancers, prostate cancers, ovarian cancers, lung cancers (such as adenocarcinoma and, particularly, small cell lung cancer), lymphomas, gliomas, blastomas (e.g., medulloblastomas), and mesothelioma, where decreased or low expression is associated with a positive prognosis, while increased or high expression is associated with a negative prognosis.

EXAMPLES

The examples described herein are for purposes of illustrating embodiments of the invention. Other embodiments, methods, and types of analyses are within the scope of persons of ordinary skill in the molecular diagnostic arts and need not be described in detail herein. Other embodiments within the scope of the art are considered to be part of this invention.

EXAMPLE 1: Cell cultures

The experimental scheme is shown in FIG. 1. Ten colorectal cell lines were cultured and harvested at semi- and full-confluence. Gene expression profiles of the two growth stages were analyzed on 30,000-oligonucleotide arrays and a gene proliferation signature (GPS; Table C) was identified by gene ontology analysis of differentially expressed genes. Unsupervised clustering was then used to independently dichotomize two cohorts of clinical colorectal samples (Cohort A: 73 stage I-IV on oligo arrays; Cohort B: 55 stage II on Affymetrix chips) based on the similarities of the GPS expression. Ki-67 immunostaining was also performed on tissue sections from Cohort A tumours. Following this, the correlation between proliferation activity and clinico-pathologic parameters was investigated.

Ten colorectal cancer cell lines derived from different disease stages were included in this study: DLD-1, HCT-8, HCT-116, HT-29, LoVo, Ls174T, SK-CO-1, SW48, SW480, and SW620 (ATCC, Manassas, VA). Cells were cultivated in a 5% CO₂ humidified atmosphere at 37°C in alpha minimum essential medium supplemented with 10% fetal bovine serum, 100 IU/ml penicillin and 100 μg/ml streptomycin (GIBCO-Invitrogen, CA). Two cell cultures were established for each cell line. The first culture was harvested upon reaching semi-confluence (50-60%). When cells in the second culture reached full-confluence (determined both microscopically and macroscopically), media was replaced, and cells were harvested twenty-four hours later to prepare RNA from the growth-inhibited cells. Array experiments were carried out on RNA extracted from each cell culture. In addition, a second culturing experiment was done following the same procedure and extracted RNA was used for dye-reversed hybridizations.

EXAMPLE 2: Patients

Two cohorts of patients were analysed. Cohort A included 73 New Zealand colorectal cancer patients who underwent surgery at Dunedin and Auckland hospitals between 1995 and 2000. These patients were part of a prospective cohort study and included all disease stages. Tumour samples were collected fresh from the operating theatre, snap frozen in liquid nitrogen and stored at -80°C. Specimens were reviewed by a single pathologist (H-S Y) and tumours were staged according to the TNM system (34). Of the 73 patients, 32 developed disease recurrence and 41 remained recurrence-free after a minimum of five years of follow-up. The median overall survival was 29.5 and 66 months for recurrent and recurrence-free patients, respectively. Twenty patients received 5-FU-based post-operative adjuvant chemotherapy and 12 patients received radiotherapy (7 pre- and 5 post-operative).

Cohort B included a group of 55 German colorectal cancer patients who underwent surgery at the Technical University of Munich between 1995 and 2001 and had fresh frozen samples stored in a tissue bank. All 55 had stage II disease; 26 developed disease recurrence (median survival 47 months) and 29 remained recurrence-free (median survival 82 months). None of the patients received chemotherapy or radiotherapy. Clinico-pathologic variables of both cohorts are summarised as part of Table 2.

Table 2: Clinico-pathologic parameters and their association with the GPS expression and Ki-67 PI

| Parameter | Category | Patients, Cohort A | Patients, Cohort B | GPS p-value§, Cohort A | GPS p-value§, Cohort B | Ki-67 PI*, mean ± SD | Ki-67 PI* p-value§ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Age¶ | < Mean | 34 | 31 | 1 | 0.79 | 74.4±17.9 | 0.6 |
| | > Mean | 39 | 24 | | | 77.9±17.3 | |
| Sex | Male | 35 | 33 | 0.16 | 1 | 77.3±15.3 | 1 |
| | Female | 38 | 22 | | | 75.3±19.5 | |
| Site£ | Right side | 30 | 12 | 1 | 0.2 | 80.4±13.3 | 0.2 |
| | Left side | 43 | 43 | | | 73.1±19.7 | |
| Grade | Well | 9 | 0 | 0.22 | 0.2 | 75.6±18.1 | 0.98 |
| | Moderate | 50 | 33 | | | 73.9±18.9 | |
| | Poor | 14 | 22 | | | 84.3±9.3 | |
| Dukes stage | A | 10 | 0 | 0.006 | NA | 78.8±17.3 | 0.73 |
| | B | 27 | 55 | | | 75.7±18.4 | |
| | C | 28 | 0 | | | 76±16.1 | |
| | D | 8 | 0 | | | 75.9±22 | |
| T stage | T1 | 5 | 0 | 0.16 | 0.62 | 71.3±22.4 | 0.16 |
| | T2 | 11 | 11 | | | 85.4±7.4 | |
| | T3 | 50 | 41 | | | 76±17 | |
| | T4 | 7 | 3 | | | 66.2±26.3 | |
| N stage | N0 | 38 | 55 | 0.03 | NA | 76.5±17.9 | 1 |
| | N1+N2 | 35 | 0 | | | 76±17.4 | |
| Vascular invasion | Yes | 5 | 1 | 0.67 | NA | 54.4±31.5 | 0.32 |
| | No | 68 | 54 | | | 78±15 | |
| Lymphatic invasion | Yes | 32 | 5 | 0.06 | 0.35 | 76.5±18.3 | 0.6 |
| | No | 41 | 50 | | | 75.1±17.3 | |
| Lymphocyte infiltration | Mild | 35 | 15 | 0.89 | 1 | 75±18.6 | 0.85 |
| | Moderate | 27 | 25 | | | 79.4±16.5 | |
| | Prominent | 11 | 15 | | | 73.5±18.3 | |
| Margin | Infiltrative | 45 | NA | 0.47 | NA | 75.8±18.9 | 1 |
| | Expansive | 28 | NA | | | 77.1±15.7 | |
| Recurrence | Yes | 32 | 26 | 0.03 | <0.001 | 75.6±19 | 0.79 |
| | No | 41 | 29 | | | 76.8±16.2 | |
| Total | | 73 | 55 | | | 76.3±17.5 | |

§ A Fisher's Exact Test or Kruskal-Wallis Test was used for testing association between clinico-pathologic parameters and GPS expression or Ki-67 PI, as appropriate.

* Ki-67 immunostaining was performed on tumour sections from Cohort A patients.

£ Proximal and distal to splenic flexure, respectively.

¶ Average age 68 and 63 years for Cohort A and B patients, respectively.

NA: not applicable

EXAMPLE 3: Array preparation and gene expression analysis

Cohort A tumours and cell lines: Tissue samples and cell lines were homogenised and RNA was extracted using Tri-Reagent (Progenz, Auckland, NZ). The RNA was then purified using an RNeasy mini column (Qiagen, Victoria, Australia) according to the manufacturer's protocol. Ten micrograms of total RNA extracted from each culture or tumour sample was oligo-dT primed and cDNA synthesis was carried out in the presence of aa-dUTP and Superscript II RNase H-Reverse Transcriptase (Invitrogen). Cy dyes were incorporated into cDNA using the indirect amino-allyl cDNA labelling method. cDNA derived from a pool of 12 different cell lines was used as the reference for all hybridizations. The Cy5-dUTP-tagged cDNA from an individual colorectal cell line or tissue sample was combined with Cy3-dUTP-tagged cDNA from the reference sample. The mixture was then purified using a QiaQuick PCR purification Kit (Qiagen, Victoria, Australia) and co-hybridized to a microarray spotted with the MWG 30K Oligo Set (MWG Biotech, NC). cDNA samples from the second culturing experiment were additionally analysed on microarrays using reverse labelling.

Arrays were scanned with a GenePix 4000B Microarray Scanner and data were analysed using GenePix Pro 4.1 Microarray Acquisition and Analysis Software (Axon, CA). The foreground intensities from each channel were log₂ transformed and normalised using the SNOMAD software (35). Normalised values were collated and filtered using BRB-Array Tools Version 3.2 (developed by Dr. Richard Simon and Amy Peng Lam, Biometric Research Branch, National Cancer Institute). Low intensity genes, and genes for which over 20% of measurements across tissue samples or cell lines were missing, were excluded from further analysis.
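The log₂ transformation and the missing-value filter described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, gene labels and intensities are hypothetical, missing measurements are represented by None, and neither the SNOMAD normalisation nor the low-intensity filter (whose threshold is not stated) is reproduced.

```python
import math


def log2_intensities(row):
    """Log2-transform positive intensities; missing values stay None."""
    return [math.log2(v) if v is not None and v > 0 else None for v in row]


def filter_genes(expression, max_missing_fraction=0.20):
    """Keep genes whose fraction of missing measurements does not exceed
    max_missing_fraction (genes with over 20% missing are excluded)."""
    kept = {}
    for gene, values in expression.items():
        missing = sum(v is None for v in values)
        if missing / len(values) <= max_missing_fraction:
            kept[gene] = values
    return kept


# Hypothetical foreground intensities for three genes across five samples.
raw = {
    "GENE_A": [1024, 2048, 512, 256, 4096],
    "GENE_B": [None, None, 128, 64, 32],  # 40% missing: excluded
    "GENE_C": [None, 8, 16, 32, 64],      # 20% missing: retained
}

logged = {gene: log2_intensities(values) for gene, values in raw.items()}
filtered = filter_genes(logged)
```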

Cohort B tumours: Total RNA was extracted from each tumour using RNeasy Mini Kit and purified on RNeasy Columns (Qiagen, Hilden, Germany). Ten micrograms of total RNA was used to synthesize double-stranded cDNA with Superscript II reverse transcriptase (GIBCO-Invitrogen, NY) and an oligo-dT-T7 primer (Eurogentec, Koeln, Germany). Biotinylated cRNA was synthesized from the double-stranded cDNA using the Promega RiboMax T7-kit (Promega, Madison, WI) and Biotin-NTP labelling mix (Loxo, Dossenheim, Germany). Then, the biotinylated cRNA was purified and fragmented. The fragmented cRNA was hybridized to Affymetrix HGU133A GeneChips (Affymetrix, Santa Clara, CA) and stained with streptavidin-phycoerythrin. The arrays were then scanned with an HP argon-ion laser confocal microscope and the digitized image data were processed using the Affymetrix® Microarray Suite 5.0 Software. All Affymetrix U133A GeneChips passed quality control to eliminate scans with abnormal characteristics. Background correction and normalization were performed in the R computing environment using the robust multi-array average function implemented in the Bioconductor package affy.

EXAMPLE 4: Quantitative real-time PCR (QPCR)

The expression of eleven genes (MAD2L1, POLE2, CDC2, MCM6, MCM7, RNASEH2A, TOPK, KPNA2, G22P1, PCNA, and GMNN) was validated using the cDNA from the cell cultures. Total RNA (2 μg) was reverse transcribed using Superscript II RNase H-Reverse Transcriptase kit (Invitrogen) and oligo dT primer (Invitrogen). QPCR was performed on an ABI Prism 7900HT Sequence Detection System (Applied Biosystems) using Taqman Gene Expression Assays (Applied Biosystems). Relative fold changes were calculated using the 2^(-ΔΔCT) method (36) with Topoisomerase 3A as the internal control. Reference RNA was used as the calibrator to enable comparison between different experiments.
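As a worked illustration of the relative fold change calculation, the following Python sketch computes 2^(-ΔΔCT) from threshold cycle (CT) values. The function name and the CT values are hypothetical; the sketch assumes, as the method itself does, that the target and internal control amplify with approximately equal efficiency.

```python
def relative_fold_change(ct_target_sample, ct_control_sample,
                         ct_target_calibrator, ct_control_calibrator):
    """Relative expression of a target gene by the 2^(-ddCT) method.

    ct_control_* are CT values of the internal control gene; the
    calibrator is the reference RNA measured in the same experiment.
    """
    delta_sample = ct_target_sample - ct_control_sample
    delta_calibrator = ct_target_calibrator - ct_control_calibrator
    delta_delta_ct = delta_sample - delta_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical CT values: the target is detected three cycles earlier
# (relative to the internal control) in the sample than in the calibrator.
fold = relative_fold_change(22.0, 20.0, 25.0, 20.0)  # 2 ** 3 = 8.0
```

A fold change above 1 indicates higher expression of the target in the sample than in the calibrator, and a value below 1 indicates lower expression.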

EXAMPLE 5: Immunohistochemical analysis

Immunohistochemical expression of Ki-67 antigen (MIB-1; DakoCytomation, Denmark) was investigated on 4 μm sections of 73 paraffin-embedded primary colorectal tumours from Cohort A. Endogenous peroxidase activity was blocked with 0.3% hydrogen peroxide in methanol and antigens were retrieved in boiling citrate buffer (pH 6). Nonspecific binding sites were blocked with 5% normal goat serum containing 1% BSA. Primary antibody (dilution 1:50) was detected using the EnVision system (Dako EnVision, CA) and the DAB substrate kit (Vector Laboratories, CA). Five high-power fields were selected using a 10 x 10 microscope grid and cell counts were performed manually in a blind fashion without knowledge of the clinico-pathologic data. The Ki-67 proliferation index (PI) was presented as the percentage of positively stained nuclei for each tumour.
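The proliferation index calculation can be sketched as follows. The text does not state whether counts were pooled across the five fields or averaged per field, so pooling is an assumption here, and the function name and counts are hypothetical.

```python
def proliferation_index(fields):
    """Ki-67 PI as the percentage of positively stained nuclei.

    fields: list of (positive_nuclei, total_nuclei) tuples, one per
    high-power field. Counts are pooled across fields (an assumption).
    """
    positive = sum(p for p, _ in fields)
    total = sum(t for _, t in fields)
    return 100.0 * positive / total


# Hypothetical counts from five high-power fields of one tumour.
fields = [(80, 100), (75, 100), (60, 100), (90, 100), (70, 100)]
ki67_pi = proliferation_index(fields)  # 375 / 500 nuclei = 75.0
```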

EXAMPLE 6: Statistical analysis

Statistical analyses were performed using SPSS® version 14.0.0 (SPSS Inc., Chicago, IL). Ki-67 proliferation indices were presented as mean ± SD. A Fisher's Exact Test or Kruskal-Wallis Test was used to evaluate the differences between categorized groups based on the expression of the GPS or the Ki-67 PI versus the clinico-pathologic parameters. A P value ≤ 0.05 was considered significant. Overall survival (OS) and recurrence-free survival (RFS) were plotted using the method of Kaplan and Meier (37). A log-rank test was used to test for differences in survival time between the categorized groups. Relative risk and associated confidence intervals were also estimated for each variable using the Cox univariate model, and a multivariate Cox proportional hazard model was developed using forward stepwise regression with predictive variables that were significant in the univariate analysis. The K-means clustering method was used to classify clinical samples based on the expression level of the GPS.
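For a two-by-two comparison, such as GPS group versus recurrence status, the Fisher's Exact Test mentioned above can be sketched in plain Python. This is not the SPSS implementation used in the study; the function name is hypothetical and the example counts are invented for illustration only. The two-sided p-value is taken here as the sum of the probabilities of all tables that are no more likely than the observed table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # A small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Invented counts: rows are low/high GPS, columns are recurrence yes/no.
p_example = fisher_exact_two_sided(12, 5, 8, 15)
significant = p_example <= 0.05  # significance threshold used in the text
```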

EXAMPLE 7: Identification of a gene proliferation signature (GPS) using a colorectal cell line model

An overview of the approach used to derive and apply a gene proliferation signature (GPS) is summarised in FIG. 1. The GPS, including 38 mitotic cell cycle genes (Table C), was relatively over-expressed in cycling cells in semi-confluent cultures. Low proliferation, defined by low GPS expression, was associated with unfavourable clinico-pathologic variables and with shorter overall and recurrence-free survival (p<0.05). No association was found between Ki-67 proliferation index and clinico-pathologic variables or clinical outcome.

Table C: GCPMs for cell proliferation signature

The GPS was identified as a subset of genes whose expression correlates with CRC cell proliferation rate. Statistical Analysis of Microarray (SAM; Reference 38) was used to identify genes differentially expressed (DE) between exponentially growing (semi-confluent) and non-cycling (fully-confluent) CRC cell lines (FIG. 1, stage 1). To adjust for gene specific dye bias and other sources of variation, each culture set was analysed independently. Analyses were limited to 502 DE genes for which a significant expression difference was observed between two growth stages in both sets of cultures (false discovery rate < 1%). Gene Ontology (GO) analysis was carried out using EASE (39) to identify the biological process categories that were significantly reflected in the DE genes. Cell-proliferation related categories were over-represented mainly due to genes upregulated in exponentially growing cells. The mitotic cell cycle category (GO:0000278) was defined as the GPS because (i) this biological process was the most over-represented GO term (EASE score=5.5211); and (ii) all 38 mitotic cell cycle genes (Table C) were expressed at higher levels in rapidly growing compared to growth-inhibited cells. The expression of eleven genes from the GPS was assessed by QPCR and correlated with corresponding values obtained from the array data. Therefore, QPCR confirmed that elevated expression of the proliferation signature genes correlates with the increased proliferation in CRC cell lines (FIG. 5).

EXAMPLE 8: Classification of CRC samples according to the expression level of gene proliferation signature

In order to examine the relative proliferation state of CRC tumours and the utility of the GPS for clinical application, CRC tumours from two cohorts were stratified into two clusters based on the expression of GPS (FIG. 1, stage 2). Expression values of the 38 genes defining the GPS were first obtained from the microarray-generated expression profiles of tumours. Tumours from each cohort were then separately classified into two clusters (K=2) based on their GPS expression level similarities using K-means unsupervised clustering. Analysis of DE genes between two defined clusters using all filtered genes revealed that the GPS was contained within the list of genes upregulated in cluster 1 (FIG. 2A, upper panel) relative to cluster 2 (lower panel) in both cohorts. Thus, the tumours in cluster 1 are characterised by high GPS expression, while the tumours in cluster 2 are characterised by low GPS expression.
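The two-cluster classification can be illustrated with a minimal K-means (K=2) sketch in Python. Several choices here are assumptions rather than details taken from the study: Euclidean distance is used, the starting centroids are the samples with the lowest and highest average signature expression, and the tumour identifiers and expression values are hypothetical.

```python
def mean_vector(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_means(profiles, max_iter=100):
    """Split samples into 'high' and 'low' clusters with K-means (K=2).

    profiles: dict mapping sample id -> list of signature gene expression
    values. Returns a dict mapping sample id -> 'high' or 'low'.
    """
    ids = list(profiles)
    # Deterministic start (an assumption): seed with the samples having
    # the lowest and highest average signature expression.
    ranked = sorted(ids, key=lambda s: sum(profiles[s]) / len(profiles[s]))
    centroids = [profiles[ranked[0]], profiles[ranked[-1]]]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            s: min((0, 1), key=lambda c: squared_distance(profiles[s], centroids[c]))
            for s in ids
        }
        if new_assignment == assignment:
            break  # converged: no sample changed cluster
        assignment = new_assignment
        for c in (0, 1):
            members = [profiles[s] for s in ids if assignment[s] == c]
            if members:
                centroids[c] = mean_vector(members)
    # Label the cluster with the higher average centroid value as 'high'.
    averages = [sum(c) / len(c) for c in centroids]
    high = 0 if averages[0] > averages[1] else 1
    return {s: ("high" if assignment[s] == high else "low") for s in ids}


# Hypothetical log-ratio expression of three signature genes in four tumours.
tumours = {
    "T1": [2.0, 2.2, 1.9],
    "T2": [1.8, 2.1, 2.0],
    "T3": [0.1, -0.2, 0.0],
    "T4": [-0.1, 0.2, 0.1],
}
clusters = two_means(tumours)
```

Here the 'high' cluster corresponds to cluster 1 (high GPS expression) and the 'low' cluster to cluster 2 (low GPS expression).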

EXAMPLE 9: Low gene proliferation signature is associated with unfavourable clinico-pathologic variables

Table 2 summarises the association between GPS expression levels and clinico-pathologic variables. An association was observed between low proliferation activity, defined by low GPS expression, and an increased risk of recurrence in both cohorts (P=0.03 and <0.001 for Cohort A and B, respectively). In Cohort A, low GPS expression was also associated with a higher disease stage and lymph node metastasis (P=0.006 and 0.03 respectively). In addition, tumours with lymphatic invasion from Cohort A tended to be less proliferative than tumours without lymphatic invasion, albeit without reaching statistical significance (P=0.06). No association was found between the GPS expression level and tumour site, age, sex, degree of differentiation, T-stage, vascular invasion, degree of lymphocyte infiltration and tumour margin.

EXAMPLE 10: Gene proliferation signature predicts clinical outcome

To examine the performance of the GPS in predicting patient outcome, Kaplan-Meier survival analysis was used to compare RFS and OS between low and high GPS tumours (FIG. 3). All patients were censored at 60 months post-operation. In colorectal cancer Cohort A, OS and RFS were shorter in patients with low GPS expression (log-rank test P=0.04 and 0.01, respectively). In colorectal cancer Cohort B, low GPS expression was also associated with decreased OS (P=0.0004) and RFS (P=0.0002). When the parameters predicting OS and RFS in univariate analysis were investigated in a multivariate model, disease stage was the only independent predictor of 5-year OS, while disease stage and T-stage were independent predictors of RFS in Cohort A. In Cohort B, low GPS expression and lymphatic invasion showed an independent contribution to both OS and RFS. If survival analysis was limited to Cohort B patients without lymphatic invasion, low GPS was still associated with shorter OS and RFS, confirming the independence of the GPS as a predictor. Analyses of single and multiple-variable associations with survival are summarized in Table 3.
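The survival comparison can be sketched in plain Python as censoring at 60 months, a Kaplan-Meier estimate, and a two-group log-rank test. This is an illustrative sketch rather than the SPSS analysis used in the study; the function names and the follow-up times are hypothetical.

```python
import math


def censor_at(times, events, horizon=60.0):
    """Censor follow-up at `horizon` months; later events become censored."""
    new_times = [min(t, horizon) for t in times]
    new_events = [bool(e) and t <= horizon for t, e in zip(times, events)]
    return new_times, new_events


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        died = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - died / at_risk
        curve.append((t, survival))
    return curve


def log_rank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value), 1 d.f."""
    times, events = times_a + times_b, events_a + events_b
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n_a = sum(1 for u in times_a if u >= t)
        n = sum(1 for u in times if u >= t)
        d_a = sum(1 for u, e in zip(times_a, events_a) if e and u == t)
        d = sum(1 for u, e in zip(times, events) if e and u == t)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    return chi_square, math.erfc(math.sqrt(chi_square / 2.0))


# Hypothetical follow-up in months (True = death observed).
low_t, low_e = censor_at([5, 8, 12, 15, 20], [True] * 5)
high_t, high_e = censor_at([40, 50, 55, 72, 80], [True, False, False, True, True])
chi_square, p_value = log_rank(low_t, low_e, high_t, high_e)
```

Censoring at a common horizon means that a death recorded after 60 months counts only as survival to 60 months, which keeps the comparison restricted to 5-year outcome.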

Low GPS expression was also associated with decreased 5-year overall survival in patients with gastric cancer (p=0.008). A Kaplan-Meier survival plot comparing the overall survival of low and high GPS gastric tumours is shown in FIG. 4.

Table 3: Uni- and multivariate analysis of prognostic factors for OS and RFS in both cohorts

EXAMPLE 11: Ki-67 is not associated with clinico-pathologic variables or survival

Ki-67 immunostaining was performed on tissue sections from Cohort A tumours only, as paraffin-embedded samples were unavailable for Cohort B (FIG. 1, stage 3). Nuclear staining was detected in all 73 CRC tumours. Ki-67 PI ranged from 25 to 96%, with a mean value of 76.3±17.5. Using the mean Ki-67 value as a cut-off point, tumours were assigned into two groups with low or high PI. Ki-67 PI was associated with neither clinico-pathologic variables (Table 2) nor survival (FIG. 3). When the survival analysis was limited to the patients with the highest and lowest Ki-67 values, no statistical difference was observed (data not shown). The sum of these results indicates that the low expression of growth-related genes is associated with poor outcome in colorectal cancer, and Ki-67 was not sensitive enough to detect an association. These findings can be used as additional criteria for identifying patients at high risk of early death from cancer.

EXAMPLE 12: Selection of correlated cell proliferation genes

Cohort B (55 German CRC patients; Table 2) was first classified into low and high proliferation groups using the 38 gene cell proliferation signature (Table C) and the K-means clustering method (Pearson uncentered, 1000 permutations, threshold of occurrence in the same cluster set at 80%). Statistical Analysis of Microarrays (SAM) was then applied to identify differentially expressed genes between low and high proliferation groups (FDR=0) when all filtered genes (16041 genes) were included for the analysis. 754 genes were found to be over-expressed in the high proliferation group. The GATHER gene ontology program was then used to identify the most over-represented gene ontology categories within the list of differentially expressed genes. The cell cycle category was the most over-represented category within the list of differentially expressed genes. 102 cell cycle genes which are differentially expressed between the low and high proliferation groups (in addition to the original 38 gene signature) are shown in Table D.

Table D: Cell Cycle Genes that are Differentially Expressed in Low and High Proliferation

Conclusions

The present invention is the first to report an association between a gene proliferation signature and major clinico-pathologic variables as well as outcome in colorectal cancer. The disclosed study investigated the proliferation state of tumours using an in vitro-derived multi-gene proliferation signature and by Ki-67 immunostaining. According to the results herein, low expression of the GPS in tumours was associated with a higher risk of recurrence and shorter survival in two independent cohorts of patients. In contrast, Ki-67 proliferation index was not associated with any clinically relevant endpoints.

The colorectal GPS encompasses 38 mitotic cell cycle genes and includes a core set of genes (CDC2, RFC4, PCNA, CCNE1, CDK7, MCM genes, FEN1, MAD2L1, MYBL2, RRM2 and BUB3) that are part of proliferation signatures defined for tumours of the breast (40), (41), ovary (42), liver (43), acute lymphoblastic leukaemia (44), neuroblastoma (45), lung squamous cell carcinoma (46), head and neck (47), prostate (48), and stomach (49). This represents a conserved pattern of expression, as most of these genes have been found to be highly overexpressed in fast-growing tumours and to reflect a high proportion of rapidly cycling cells (50). Therefore, the expression level of the colorectal GPS provides a measure for the proliferative state of a tumour.

In this study, several clinico-pathologic variables related to poor outcome (disease stage, lymph node metastasis and lymphatic invasion) were associated with low GPS expression in Cohort A patients. In Cohort B, consisting entirely of stage II tumours, the study assessed the association between the GPS and lymphatic invasion. The association failed to reach statistical significance due to the small number of tumours with lymphatic invasion in this cohort (5/55). Without being bound by theory, the low GPS expression in more advanced tumours may indicate that CRC progression is not driven by enhanced proliferation. While accelerated proliferation may still be an important driving force during the initial phases of tumourigenesis, it is possible that more advanced disease is more dependent on processes such as genetic instability to allow continuous selection. Consistent with our finding, two large-scale studies reported an association between decreased expression of CDK2, cyclin E and A, and advanced stage, deep infiltration and lymph node metastasis (51), (52).

The relationship between low GPS and unfavourable clinico-pathologic variables suggested that the GPS should also predict patient outcome. Indeed, in both Cohort A and B, low GPS expression was associated with a higher risk of recurrence and shorter overall and recurrence-free survival. In Cohort B, where all patients had stage II tumours, the association remained in multivariate analysis. However, in Cohort A, where patients had stage I-IV disease, the association was not independent of tumour stage. The number of patients with and without recurrence, within each stage of disease in Cohort A, was probably insufficient to demonstrate an independent association between the GPS and survival. In Cohort B, low GPS expression and lymphatic invasion remained independent predictors in multivariate analysis, suggesting that the GPS may improve the prediction of CRC patient outcome within the same disease stage. Not surprisingly, the presence of lymph node and distant organ involvement were the most powerful predictors of outcome as these are direct manifestations of tumour metastasis.

Treatment with radiotherapy or chemotherapy, used in 18% and 27% of Cohort A patients respectively, was a possible confounding factor in this study. Theoretically, the improved survival associated with elevated GPS expression might reflect the better response of fast proliferating tumours to cancer treatment (53), (54). However, no correlation was found between treatment and GPS expression. Furthermore, no patients in Cohort B received adjuvant therapy indicating that the association between GPS and survival is independent of treatment. It should be noted that this study was not designed to investigate the relationship between tumour proliferation and response to chemotherapy or radiotherapy.

The sample size may also explain the lack of an association between clinico-pathologic variables and survival with Ki-67 PI in the present study. As mentioned above, other studies on Ki-67 and CRC outcome have reported inconsistent findings. However, in the three other CRC studies with the largest sample size a low Ki-67 PI was associated with a worse prognosis (27), (29), (30). We came to the same conclusion applying the GPS, but based on a much smaller sample size. The multi-gene expression analysis was therefore a more sensitive tool to assess the relationship between proliferation and prognosis than the Ki-67 PI.

The biological reason behind an unfavourable prognosis in tumours with a low GPS will involve further investigation. Mechanisms that could potentially contribute to worse clinical outcome in low GPS tumours include: (i) a more effective immune response to rapidly proliferating tumours; (ii) a higher level of genetic damage that may render cancer cells more resistant to apoptosis, and increase invasiveness, but also perturb smooth replication machinery; (iii) an increased number of cancer stem cells that divide slowly, similar to normal stem cells, but have a high metastatic potential; and (iv) a higher proportion of microsatellite unstable tumours which have a high proliferation rate but a relatively good prognosis.

In sum, the present invention has clarified the previous, conflicting results relating to the prognostic role of cell proliferation in colorectal cancer. A GPS has been developed using CRC cell lines and has been applied to two independent patient cohorts. It was found that low expression of growth-related genes in CRC was associated with more advanced tumour stage (Cohort A) and poor clinical outcome within the same stage (Cohort B). Multi-gene expression analysis was shown to be a more powerful indicator than the long-established proliferation marker, Ki-67, for predicting outcome. For future studies, it will be useful to determine the reasons that CRC differs from other common epithelial cancers, such as breast and lung cancers (e.g., in reference to Ki-67). This will likely provide insights into important underlying biological mechanisms. From a practical viewpoint, the ability to stratify recurrence risk within a given pathological stage could enable adjuvant therapy to be targeted more accurately. Thus, GPS expression can be used as an adjunct to conventional staging for identifying patients at high risk of recurrence and death from colorectal cancer.

All publications and patents mentioned in the above specification are herein incorporated by reference.

Where in the foregoing description reference has been made to integers or components having known equivalents, such equivalents are herein incorporated as if individually set forth.

Although the invention has been described by way of example and with reference to possible embodiments thereof, it is to be appreciated that improvements and/or modifications may be made without departing from the scope or the spirit thereof.

References:

1. Evan GI, Vousden KH: Proliferation, cell cycle and apoptosis in cancer. Nature 411:342-8, 2001

2. Whitfield ML, George LK, Grant GD, et al: Common markers of proliferation. Nat Rev Cancer 6:99-106, 2006

3. Rew DA, Wilson GD: Cell production rates in human tissues and tumours and their significance. Part 1: an introduction to the techniques of measurement and their limitations. Eur J Surg Oncol 26:227-38, 2000

4. Endl E, Gerdes J: The Ki-67 protein: fascinating forms and an unknown function. Exp Cell Res 257:231-7, 2000

5. Brown DC, Gatter KC: Ki67 protein: The immaculate deception. Histopathology 40:2-11, 2002

6. Paik S, Shak S, Tang G, et al: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 351:2817-26, 2004

7. Ofner D, Grothaus A, Riedmann B, et al: MIB1 in colorectal carcinomas: its evaluation by three different methods reveals lack of prognostic significance. Anal Cell Pathol 12:61-70, 1996

8. Ihmann T, Liu J, Schwabe W, et al: High-level mRNA quantification of proliferation marker pKi-67 is correlated with favorable prognosis in colorectal carcinoma. J Cancer Res Clin Oncol 130:749-756, 2004

9. Van Oijen MG, Medema RH, Slootweg PJ, et al: Positivity of the proliferation marker pKi-67 in non-cycling cells. Am J Clin Pathol 110:24-31, 1998

10. Duchrow M, Ziemann T, Windhövel U, et al: Colorectal carcinomas with high MIB-1 labelling indices but low pKi67 mRNA levels correlate with better prognostic outcome. Histopathology 42:566-574, 2003

11. Evans C, Morrison I, Heriot AG, et al: The correlation between colorectal cancer rates of proliferation and apoptosis and systemic cytokine levels; plus their influence upon survival. Br J Cancer 94:1412-9, 2006

12. Rosati G, Chiacchio R, Reggiardo G, et al: Thymidylate synthase expression, p53, bcl-2, Ki-67 and p27 in colorectal cancer: relationships with tumour recurrence and survival. Tumour Biol 25:258-63, 2004

13. Ishida H, Miwa H, Tatsuta M, et al: Ki-67 and CEA expression as prognostic markers in Dukes' C colorectal cancer. Cancer Lett 207:109-115, 2004

14. Buglioni S, D'Agnano I, Cosimelli M, et al: Evaluation of multiple bio-pathological factors in colorectal adenocarcinomas: independent prognostic role of p53 and bcl-2. Int J Cancer 84:545-52, 1999

15. Guerra A, Borda F, Javier Jimenez F, et al: Multivariate analysis of prognostic factors in resected colorectal cancer: a new prognostic index. Eur J Gastroenterol Hepatol 10:51-8, 1998

16. Kyzer S, Gordon PH: Determination of proliferative activity in colorectal carcinoma using monoclonal antibody Ki67. Dis Colon Rectum 40:322-5, 1997

17. Jansson A, Sun XF: Ki-67 expression in relation to clinicopathological variables and prognosis in colorectal adenocarcinomas. APMIS 105:730-4, 1997

18. Baretton GB, Diebold J, Christoforis G, et al: Apoptosis and immunohistochemical bcl-2 expression in colorectal adenomas and carcinomas. Aspects of carcinogenesis and prognostic significance. Cancer 77:255-64, 1996

19. Sun XF, Carstensen JM, Stal O, et al: Proliferating cell nuclear antigen (PCNA) in relation to ras, c-erbB-2, p53, clinico-pathological variables and prognosis in colorectal adenocarcinoma. Int J Cancer 69:5-8, 1996

20. Kubota Y, Petras RE, Easley KA, et al: Ki-67-determined growth fraction versus standard staging and grading parameters in colorectal carcinoma. A multivariate analysis. Cancer 70:2602-9, 1992

21. Valera V, Yokoyama N, Walter B, et al: Clinical significance of Ki-67 proliferation index in disease progression and prognosis of patients with resected colorectal carcinoma. Br J Surg 92:1002-7, 2005

22. Dziegiel P, Forgacz J, Suder E, et al: Prognostic significance of metallothionein expression in correlation with Ki-67 expression in adenocarcinomas of large intestine. Histol Histopathol 18:401-7, 2003

23. Scopa CD, Tsamandas AC, Zolata V, et al: Potential role of bcl-2 and Ki-67 expression and apoptosis in colorectal carcinoma: a clinicopathologic study. Dig Dis Sci 48:1990-7, 2003

24. Bhatavdekar JM, Patel DD, Chikhlikar PR, et al: Molecular markers are predictors of recurrence and survival in patients with Dukes B and Dukes C colorectal adenocarcinoma. Dis Colon Rectum 44:523-33, 2001

25. Chen YT, Henk MJ, Carney KJ, et al: Prognostic Significance of Tumor Markers in Colorectal Cancer Patients: DNA Index, S-Phase Fraction, p53 Expression, and Ki-67 Index. J Gastrointest Surg 1:266-273, 1997

26. Choi HJ, Jung IK, Kim SS, et al: Proliferating cell nuclear antigen expression and its relationship to malignancy potential in invasive colorectal carcinomas. Dis Colon Rectum 40:51-9, 1997

27. Hilska M, Collan YU, O Laine VJ, et al: The significance of tumour markers for proliferation and apoptosis in predicting survival in colorectal cancer. Dis Colon Rectum 48:2197-208, 2005

28. Salminen E, Palmu S, Vahlberg T, et al: Increased proliferation activity measured by immunoreactive Ki67 is associated with survival improvement in rectal/recto sigmoid cancer. World J Gastroenterol 11:3245-9, 2005

29. Garrity MM, Burgart LJ, Mahoney MR, et al: Prognostic value of proliferation, apoptosis, defective DNA mismatch repair, and p53 overexpression in patients with resected Dukes' B2 or C colon cancer: a North Central Cancer Treatment Group Study. J Clin Oncol 22:1572-82, 2004

30. Allegra CJ, Paik S, Colangelo LH, et al: Prognostic value of thymidylate synthase, Ki-67, and p53 in patients with Dukes' B and C colon cancer: a National Cancer Institute-National Surgical Adjuvant Breast and Bowel Project collaborative study. J Clin Oncol 21:241-50, 2003

31. Palmqvist R, Sellberg P, Oberg A, et al: Low tumour cell proliferation at the invasive margin is associated with a poor prognosis in Dukes' stage B colorectal cancers. Br J Cancer 79:577-81, 1999

32. Paradiso A, Rabinovich M, Vallejo C, et al: p53 and PCNA expression in advanced colorectal cancer: response to chemotherapy and long-term prognosis. Int J Cancer 69:437-41, 1996

33. Neoptolemos JP, Oates GD, Newbold KM, et al: Cyclin/proliferation cell nuclear antigen immunohistochemistry does not improve the prognostic power of Dukes' or Jass' classifications for colorectal cancer. Br J Surg 82:184-7, 1995

34. Compton C, Fenoglio-Preiser CM, Pettigrew N, et al: American joint committee on cancer prognostic factors consensus conference. Colorectal working group. Cancer 88:1739-1757, 2000

35. Colantuoni C, Henry G, Zeger S, et al: SNOMAD (Standardization and Normalization of MicroArray Data): web-accessible gene expression data analysis. Bioinformatics 18:1540-1541, 2002

36. Livak KJ, Schmittgen TD: Analysis of Relative Gene Expression Data Using Real-Time Quantitative PCR and the 2^(-ΔΔCT) Method. Methods 25:402-408, 2001

37. Pocock SJ, Clayton TC, Altman DG: Survival plots of time-to-event outcomes in clinical trials: good practice and pitfalls. Lancet 359:1686-89, 2002

38. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98:5116-21, 2001

39. Hosack DA, Dennis G, Sherman BT, et al: Identifying biological themes within lists of genes with EASE. Genome biology 4:R70, 2003

40. Perou CM, Jeffrey SS, DE Rijn MV: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. USA 96:9212-17, 1999

41. Perou CM: Molecular portraits of human breast tumours. Nature 406:747-752, 2000

42. Welsh JB, Zarrinkar PP, Sapinoso LM, et al: Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc. Natl. Acad. Sci. USA 98:1176-1181, 2001

43. Chen X, Cheung ST, So S, et al: Gene expression patterns in human liver cancers. Mol. Biol. Cell 13:1929-1939, 2002

44. Kirschner-Schwabe R, Lottaz C, Todling J, et al: Expression of late cell cycle genes and an increased proliferative capacity characterize very early relapse of childhood acute lymphoblastic leukemia. Clin Cancer Res 12:4553-61, 2006

45. Krasnoselsky AL, Whiteford CC, Wei JS, et al: Altered expression of cell cycle genes distinguishes aggressive neuroblastoma. Oncogene 24:1533-1541, 2005

46. Inamura K, Fujiwara T, Hoshida Y, et al: Two subclasses of lung squamous cell carcinoma with different gene expression profiles and prognosis identified by hierarchical clustering and non-negative matrix factorization. Oncogene 24:7105-13, 2005

47. Chung CH, Parker JS, Karaca G, et al: Molecular classification of head and neck squamous cell carcinomas using patterns of gene expression. Cancer Cell 5:489-500, 2004

48. LaTulippe E, Satagopan J, Smith A, et al: Comprehensive gene expression analysis of prostate cancer reveals distinct transcriptional programs associated with metastatic disease. Cancer Res 62:4499-4506, 2002

49. Hippo Y, Taniguchi H, Tsutumi S, et al: Global gene expression analysis of gastric cancer by oligonucleotide microarrays. Cancer Res 62:233-40, 2002

50. Whitfield ML, Sherlock G, Saldanha AJ, et al: Identification of genes periodically expressed in the human cell cycle and their expression in tumours. Mol Biol Cell 13:1977-2000, 2002

51. Li JQ, Miki H, Ohmori M, et al: Expression of cyclin E and cyclin-dependent kinase 2 correlates with metastasis and prognosis in colorectal carcinoma. Hum Pathol 32:945-53, 2001

52. Li JQ, Miki H, Wu F, et al: Cyclin A correlates with carcinogenesis and metastasis, and p27 (kip1) correlates with lymphatic invasion, in colorectal neoplasms. Hum Pathol 33:1006-15, 2002

53. Itamochi H, Kigawa J, Sugiyama T, et al: Low proliferation activity may be associated with chemoresistance in clear cell carcinoma of the ovary. Obstet Gynecol 100:281-287, 2002

54. Imdahl A, Jenkner J, Ihling C, et al: Is MIB-1 proliferation index a predictor for response to neoadjuvant therapy in patients with esophageal cancer? Am J Surg 179:514-520, 2000

Claims

1. A prognostic signature for determining progression of gastrointestinal cancer in a patient, comprising one or more genes selected from Table A, Table B, Table C or Table D.

2. The signature of claim 1, wherein the signature comprises one or more genes selected from any one of CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37.

3. A method of predicting the likelihood of long-term survival of a gastrointestinal cancer patient without the recurrence of gastrointestinal cancer, comprising determining the expression level of one or more prognostic RNA transcripts or their expression products in a gastrointestinal sample obtained from the patient, normalized against the expression level of all RNA transcripts or their products in the gastrointestinal cancer tissue sample, or of a reference set of RNA transcripts or their expression products; wherein the prognostic RNA transcript is the transcript of one or more genes selected from Table A, Table B, Table C or Table D; and establishing likelihood of long-term survival without gastrointestinal cancer recurrence.

4. The method of claim 3, wherein at least one prognostic RNA transcript or its expression product is selected from any one of CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37.

5. The method of claim 3 or claim 4 comprising determining the expression level of at least two, at least five, at least 10, or at least 15 of the prognostic RNA transcripts or their expression products.

6. The method according to any one of claims 3 to 5, wherein increased expression of the one or more prognostic RNA transcripts or their expression products indicates an increased likelihood of long-term survival without gastrointestinal cancer recurrence.

7. The method according to any one of claims 3 to 5, wherein a predictive model is applied, established by applying a predictive method to expression levels of the predictive signature in recurrent and non-recurrent tumour samples, to establish the likelihood of long-term survival without gastrointestinal cancer recurrence.

8. The method of claim 7, wherein said predictive method is selected from the group consisting of linear models, support vector machines, neural networks, classification and regression trees, ensemble learning methods, discriminant analysis, nearest neighbor method, Bayesian networks, and independent components analysis.

9. The method of any one of claims 3 to 8 wherein the gastrointestinal cancer is gastric cancer or colorectal cancer.

10. The method of any one of claims 3 to 9 wherein the expression level of one or more prognostic RNA transcripts is determined.

11. The method of any one of claims 3 to 10 wherein the RNA is isolated from a fixed, wax-embedded gastrointestinal cancer tissue specimen of the patient.

12. The method of any one of claims 3 to 10 wherein the RNA is isolated from core biopsy tissue or fine needle aspirate cells.

13. An array comprising polynucleotides hybridizing to two or more genes selected from Table A, Table B, Table C or Table D.

14. The array of claim 13 comprising polynucleotides hybridizing to two or more of the following genes: CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37.

15. The array of claim 13 or claim 14 comprising polynucleotides hybridizing to at least three, at least five, at least 10 or at least 15 of the genes.

16. The array of claim 13 comprising polynucleotides hybridizing to the following genes: CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37.

17. The array of any one of claims 13 to 16 wherein the polynucleotides are cDNAs.

18. The array of claim 17 wherein the cDNAs are about 500 to 5000 bases long.

19. The array of any one of claims 13 to 16 wherein the polynucleotides are oligonucleotides.

20. The array of claim 19 wherein the oligonucleotides are about 20 to 80 bases long.

21. The array of any one of claims 13 to 20 wherein the solid surface is glass.

22. A method of predicting the likelihood of long-term survival of a patient diagnosed with gastrointestinal cancer, without the recurrence of gastrointestinal cancer, comprising the steps of:

(1) determining the expression levels of the RNA transcripts or the expression products of genes or a gene selected from Table A, Table B, Table C or Table D, in a gastrointestinal cancer tissue sample obtained from the patient, normalized against the expression levels of all RNA transcripts or their expression products in the gastrointestinal cancer tissue sample, or of a reference set of RNA transcripts or their products;

(2) subjecting the data obtained in step (1) to statistical analysis; and

(3) determining whether the likelihood of the long-term survival has increased or decreased; and establishing the likelihood of long-term survival without gastrointestinal cancer recurrence.

23. The method of claim 22, wherein at least one prognostic RNA transcript or its expression product is selected from any one of CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37.

24. The method of claim 22 or claim 23 wherein the statistical analysis is performed by using the Cox Proportional Hazards model.

25. A method of preparing a personalized genomics profile for a cancer patient, comprising the steps of: (a) subjecting RNA extracted from a gastrointestinal tissue obtained from the patient to gene expression analysis; (b) determining the expression level of one or more genes selected from the gastrointestinal cancer gene set listed in any one of Table A, Table B, Table C or Table D, wherein the expression level is normalized against a control gene or genes and optionally is compared to the amount found in a gastrointestinal cancer reference tissue set; and (c) creating a report summarizing the data obtained by the gene expression analysis.

25. The method of claim 24, wherein the gastrointestinal tissue comprises gastrointestinal cancer cells.

26. The method of claim 24 wherein the gastrointestinal tissue is obtained from a fixed, paraffin-embedded biopsy sample.

27. The method of claim 26 wherein the RNA is fragmented.

28. The method of any one of claims 22 to 27 wherein the report includes prediction of the likelihood of long-term survival of the patient.

29. The method of any one of claims 22 to 29 wherein the report includes recommendation for a treatment modality of the patient.

30. A prognostic method comprising: (a) subjecting a sample comprising gastrointestinal cancer cells obtained from a patient to quantitative analysis of the levels of RNA transcripts of at least one gene selected from any one of Table A, Table B, Table C or Table D, or its product, and (b) identifying the patient as likely to have an increased likelihood of long-term survival without gastrointestinal cancer recurrence if normalized expression levels of the gene or genes, or their products, are elevated above a defined expression threshold.

31. The method of claim 30, wherein at least one prognostic RNA transcript or its expression product is selected from any one of CDC2, MCM6, RPA3, MCM7, PCNA, G22P1, KPNA2, ANLN, APG7L, TOPK, GMNN, RRM1, CDC45L, MAD2L1, RAN, DUT, RRM2, CDK7, MLH3, SMC4L1, CSPG6, POLD2, POLE2, BCCIP, Pfs2, TREX1, BUB3, FEN1, DRF1, PREI3, CCNE1, RPA1, POLE3, RFC4, MCM3, CHEK1, CCND1, and CDC37.

32. The method of claim 30 or claim 31, wherein the levels of the RNA transcripts of the genes are normalized relative to the mean level of the RNA transcript or the product of two or more housekeeping genes.

33. The method of claim 32 wherein the housekeeping genes are selected from the group consisting of glyceraldehyde-3-phosphate dehydrogenase (GAPDH), Cyp1, albumin, actins, tubulins, cyclophilin, hypoxanthine phosphoribosyltransferase (HPRT), L32, 28S, and 18S.

34. The method of any one of claims 30 to 33 wherein the sample is subjected to global gene expression analysis of all genes present above the limit of detection.

35. The method of any one of claims 30 to 34 wherein the levels of RNA transcripts of the genes are normalized relative to the mean signal of the RNA transcripts or the products of all assayed genes or a subset thereof.

36. The method of any one of claims 30 to 35 wherein the levels of RNA transcripts are determined by quantitative RT-PCR, and the signal is a Ct value.

37. The method of claim 35 wherein the assayed genes include at least 50 or at least 100 cancer related genes.

38. The method of any one of claims 30 to 37 wherein the patient is human.

39. The method of any one of claims 30 to 38 wherein the sample is a fixed, paraffin-embedded tissue (FPET) sample, or fresh or frozen tissue sample.

40. The method of any one of claims 30 to 38 wherein the sample is a tissue sample from fine needle, core, or other types of biopsy.

41. The method of any one of claims 30 to 40 wherein the quantitative analysis is performed by quantitative RT-PCR.

42. The method of any one of claims 30 to 40 wherein the quantitative analysis is performed by quantifying the products of the genes.

43. The method of any one of claims 30 to 40 wherein the products are quantified by immunohistochemistry or by proteomics technology.

44. The method of any one of claims 30 to 43 further comprising the step of preparing a report indicating that the patient has an increased likelihood of long-term survival without gastrointestinal cancer recurrence.

45. A kit comprising one or more of (1) extraction buffer/reagents and protocol; (2) reverse transcription buffer/reagents and protocol; and (3) quantitative RT-PCR buffer/reagents and protocol suitable for performing the method of any one of claims 3, 25, and 30.
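
The normalization and threshold steps recited in claims 30, 32 and 36 can be illustrated with a minimal Python sketch. It is not the applicants' implementation. It assumes that a quantitative RT-PCR Ct value is available for each gene, normalizes each signature gene against the mean Ct of two or more housekeeping genes, and compares the result with a defined threshold. The function names, the averaging of normalized levels across signature genes, the example Ct values and the threshold are all hypothetical choices made for illustration.

```python
from statistics import mean


def normalize_ct(ct_values, signature_genes, housekeeping_genes):
    """Normalize signature-gene Ct values against housekeeping genes.

    Returns (mean housekeeping Ct - gene Ct) for each signature gene, so a
    larger value means more transcript relative to the reference genes.
    """
    if len(housekeeping_genes) < 2:
        raise ValueError("two or more housekeeping genes are required")
    reference_ct = mean(ct_values[g] for g in housekeeping_genes)
    return {g: reference_ct - ct_values[g] for g in signature_genes}


def signature_score(normalized):
    """Hypothetical summary: mean normalized level across signature genes."""
    return mean(normalized.values())


def is_elevated(score, threshold):
    """True if the normalized signature level is above the defined threshold."""
    return score > threshold


# Made-up Ct values for one tumour sample (illustration only).
sample_ct = {"GAPDH": 18.0, "ACTB": 24.0, "MCM6": 22.0, "PCNA": 20.0, "CDC2": 23.0}
normalized = normalize_ct(sample_ct, ["MCM6", "PCNA", "CDC2"], ["GAPDH", "ACTB"])
score = signature_score(normalized)
# The threshold of -1.0 is an arbitrary placeholder, not a value from the specification.
print(normalized, score, is_elevated(score, threshold=-1.0))
```

In practice the threshold would be derived from a reference set of tumour samples with known outcome, and the per-gene values could instead be fed into one of the predictive methods listed in claim 8.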