WO2024008040A1 - Cancer-specific methylation markers and their applications - Google Patents

Cancer-specific methylation markers and their applications

Info

Publication number
WO2024008040A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
cancer
methylation
region
sample
Prior art date
Application number
PCT/CN2023/105537
Other languages
English (en)
French (fr)
Inventor
苏志熙
马成城
谢可辉
苏明扬
刘轶颖
徐敏杰
何其晔
刘蕊
Original Assignee
江苏鹍远生物科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202210787623.2A (CN118127150A)
Priority claimed from CN202210786398.0A (CN117385026A)
Priority claimed from CN202210787502.8A (CN117385028A)
Priority claimed from CN202210787313.0A (CN117344012A)
Priority claimed from CN202210787425.6A (CN117363728A)
Priority claimed from CN202210787412.9A (CN117385027A)
Application filed by 江苏鹍远生物科技股份有限公司
Publication of WO2024008040A1

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61P: SPECIFIC THERAPEUTIC ACTIVITY OF CHEMICAL COMPOUNDS OR MEDICINAL PREPARATIONS
    • A61P35/00: Antineoplastic agents
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • the present invention belongs to the field of molecular auxiliary diagnosis, and specifically relates to cancer-specific methylation markers and their applications, such as colorectal cancer tissue-specific methylation markers and their applications in diagnosing colorectal cancer.
  • Colorectal cancer is one of the most common tumors in humans, ranking third in global incidence and second in mortality among malignant tumors. In China, the incidence of colorectal cancer is also increasing.
  • Cancer screening can detect early-stage cancer patients in a timely manner by detecting early cancer-related signals in high-risk groups. Early-stage cancer patients can be completely cured through surgical resection, so cancer screening can greatly reduce the mortality rate of cancer patients. The 5-year survival rate of early-stage colorectal cancer is over 90%, whereas the 5-year survival rate for patients with advanced colorectal cancer is less than 10%. From 1990 to 2015, the overall cancer death rate in the United States decreased by 25%, with colorectal cancer (47% decrease for men and 44% decrease for women) and breast cancer (39% decrease for women) having the largest decreases. An important reason for the reduction in cancer death rates is the widespread application of cancer screening technology (Byers T et al., 2016).
  • FIT: fecal immunochemical test (an immunoassay for fecal occult blood)
  • tumor markers: carcinoembryonic antigen (CEA), carbohydrate antigen CA19-9, etc.
  • these traditional methods have certain limitations.
  • colonoscopy screening is the "gold standard" for digestive tract cancers, but colonoscopy is an invasive test: the examination process is painful, and patient compliance is poor;
  • FIT has limited diagnostic effect on colorectal precancerous lesions; the performance of tumor markers is generally poor, so they can only be used as a clinical reference and are difficult to apply to large-scale screening.
  • ctDNA: circulating tumor DNA, a component of plasma cell-free DNA (cfDNA)
  • ctDNA can reflect cancer information from many aspects, such as mutations, fragment length distribution, methylation, etc. Methylation of ctDNA has become a hot spot in the research and development of cancer early screening products due to its outstanding performance. There are already many ctDNA methylation early screening applications.
  • PanSeer: a pan-cancer methylation early screening application
  • PanSeer reaches a sensitivity of 88% at a specificity of 96% across 5 cancer types (gastric cancer, esophageal cancer, liver cancer, colorectal cancer, and lung cancer) and can detect cancer 4 years earlier than traditional methods (Xingdong Chen et al., 2020).
  • a machine learning model constructed using only 6 qPCR markers in colorectal cancer can achieve a sensitivity of 86% at a specificity of 92%, results far superior to traditional cancer screening methods (Guo-Xiang Cai et al., 2021).
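  • The sensitivity and specificity figures quoted above follow from confusion-matrix counts. A minimal sketch, assuming a made-up cohort of 100 cancer samples and 100 controls (the counts and the function name are illustrative, not data from the cited studies):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of cancer samples called positive
    specificity = tn / (tn + fp)  # fraction of non-cancer samples called negative
    return sensitivity, specificity


# Hypothetical cohort: 100 cancer samples and 100 controls.
sens, spec = sensitivity_specificity(tp=86, fn=14, tn=92, fp=8)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

  • Reporting sensitivity "at" a given specificity means the decision threshold was fixed so that the stated fraction of non-cancer samples tests negative.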
  • There is a need in the art for tissue-specific methylation markers for colorectal cancer.
  • Lung cancer is the leading cause of cancer death worldwide. Although the combined use of surgery, chemotherapy, targeted therapy and immunotherapy has significantly improved the survival rate of lung cancer, the prognosis of lung cancer patients is still relatively poor compared with other cancers. The main reason is that most lung cancers are diagnosed at an advanced stage, which is related to the lack of universal early-stage lung cancer screening.
  • Cancer screening can detect early-stage cancer patients in a timely manner by detecting early related signals in high-risk groups. Early-stage cancer patients can be completely cured through surgical resection. Cancer screening can greatly reduce the mortality rate of cancer patients. About 85% of lung cancers are non-small cell lung cancer (NSCLC). The five-year survival rate of patients with early-stage carcinoma in situ is as high as 55.6%. However, metastasis is prone to occur in the middle and late stages, and the five-year survival rate of patients after metastasis is only 4.5%.
  • NSCLC: non-small cell lung cancer
  • LDCT: low-dose CT
  • Although LDCT can detect patients with early-stage NSCLC to a certain extent, its specificity is low, and patients with a positive result require long-term follow-up, repeated reexamination or other diagnostic and treatment methods to confirm the diagnosis.
  • ctDNA: circulating tumor DNA, a component of plasma cell-free DNA (cfDNA)
  • ctDNA can reflect cancer information from many aspects, such as mutations, fragment length distribution, methylation, etc. Among them, ctDNA methylation has become a hot spot in the research and development of cancer early screening products due to its outstanding performance.
  • existing ctDNA methylation early screening applications include the pan-cancer methylation early screening application PanSeer
  • PanSeer reaches a sensitivity of 88% at a specificity of 96% across 5 cancer types (gastric cancer, esophageal cancer, liver cancer, colorectal cancer, lung cancer) and can detect cancer 4 years earlier than traditional methods (Xingdong Chen et al., 2020).
  • There is a need in the art for tissue-specific methylation markers for lung cancer.
  • Liver cancer often has no obvious clinical symptoms or signs in the early stage, and the tumor mass grows insidiously and rapidly. Most patients are only diagnosed at an advanced stage, resulting in limited treatment options and a very poor prognosis.
  • Recent survival rate data show that the 5-year survival rate of liver cancer in the Chinese population cancer registry is approximately 9.8%-12.1% (Zeng H M et al., 2018), and the 5-year survival rate of liver cancer in the hospital cancer registry is 11.69% (Chen J G et al., 2018).
  • the 5-year survival rates of patients who underwent surgical resection in 1958-1970, 1971-1982 and 1983-1994 were 4.8%, 11.2% and 45.4% respectively; the mortality rate of patients with small liver cancer resection was 63.8% (Zhou X D et al., 1996).
  • DNA methylation detection technology is considered the most promising non-invasive cancer screening method, and the technology has been shown to be usable for cancer screening and tissue-of-origin tracing (E.A. Klein et al., 2021).
  • a detection method can be designed to detect multiple cancers and perform early detection of multiple cancers at the same time. This greatly expands the scope of screening, from the high-risk group of a single cancer to the high-risk groups of multiple cancers, and tests as wide a range of people as possible within one screening, which increases the compliance of the subjects and expands the number of people available for screening.
  • the difficulty of this type of detection also lies in high-quality detection targets. Finding the most informative detection targets is the focus and difficulty of this type of detection technology.
  • There is a need in the art for tissue-specific methylation markers for liver cancer.
  • Breast cancer is the number one killer of women. About 278,800 people are diagnosed with breast cancer in China every year. With changes in lifestyle, the incidence and mortality of breast cancer in China continue to increase. In European and American countries, the 5-year survival rate of breast cancer can reach 90%, while data from the same period in China show that the 5-year survival rate of breast cancer patients in the economically developed Shanghai area is 78%, and in some areas it is only 58% (Fan L et al., 2014), a gap largely attributable to the intensity of early breast cancer screening. In the United States, the screening rate for women over 40 years old has reached 75%, while in China the screening rate for women is only 21%, and 84% of patients are already in the middle or late stages at the time of diagnosis and have missed the best time for treatment.
  • breast ultrasound, mammography and magnetic resonance imaging are commonly used methods for breast cancer screening.
  • these traditional methods have certain technical limitations, depend heavily on the operating skill of the doctor, and have a high probability of missed diagnosis and misdiagnosis.
  • ctDNA: circulating tumor DNA, a component of plasma cell-free DNA (cfDNA)
  • ctDNA can reflect cancer information from many aspects, such as mutations, fragment length distribution, methylation, etc. Among them, ctDNA methylation has become a hot spot in the research and development of cancer early screening products due to its outstanding performance.
  • existing ctDNA methylation early screening applications include the pan-cancer methylation early screening application PanSeer
  • PanSeer reaches a sensitivity of 88% at a specificity of 96% across 5 cancer types (gastric cancer, esophageal cancer, liver cancer, colorectal cancer, lung cancer) and can detect cancer 4 years earlier than traditional methods (Xingdong Chen et al., 2020); a machine learning model built using only 6 qPCR markers in colorectal cancer can achieve a sensitivity of 86% at a specificity of 92%, far better than traditional cancer screening methods (Guo-Xiang Cai et al., 2021).
  • Cancer screening, especially pan-cancer early screening, not only needs to predict the presence or absence of cancer signals but also requires tissue-of-origin tracing of positive samples. Cancer types at different locations in the human body have different methylation characteristics (Kundaje A et al., 2015), so tissue-of-origin tracing can be achieved by using these tissue-specific methylation signatures.
  • screening tissue-specific methylation markers requires a large amount of methylation sequencing data for multiple cancer types and a strict screening and verification process, which is a relatively challenging task.
  • There is a need in the art for tissue-specific methylation markers for breast cancer.
  • Gastric cancer and esophageal cancer are common digestive tract tumors.
  • China is a country with a high incidence of gastric cancer and esophageal cancer.
  • the incidence and mortality rate of gastric cancer ranked second among malignant tumors in China, and the incidence and mortality rate of esophageal cancer ranked fourth and fifth respectively among malignant tumors.
  • Most early esophageal cancer and precancerous lesions can be cured through minimally invasive endoscopic treatment, with a 5-year survival rate of 95%.
  • the 5-year survival rate of early gastric cancer also exceeds 90% (Sumyama K. et al.
  • the quality of life and prognosis of patients with intermediate and advanced esophageal cancer are poor, the overall 5-year survival rate is less than 20%, and the 5-year survival rate of advanced gastric cancer is less than 30%.
  • the early diagnosis rate of esophageal cancer and gastric cancer in China is relatively low.
  • Patients with early-stage esophageal cancer and gastric cancer lack typical clinical characteristics, and most patients are already in the middle and late stages when they seek treatment. Therefore, the most effective way to improve the survival rate of patients with esophageal cancer and gastric cancer is to conduct early screening of high-risk groups.
  • the screening methods for gastric cancer mainly include serological screening and endoscopic screening.
  • Serological screening includes serum tumor marker detection (carcinoembryonic antigen CEA, carbohydrate antigen CA19-9, etc.), serum pepsinogen (PG) detection, Helicobacter pylori infection detection, etc.
  • the main screening method for esophageal cancer is endoscopy. Endoscopy with biopsy is the gold standard for diagnosing gastric cancer and esophageal cancer. However, endoscopy relies on equipment and endoscopist resources, its cost is relatively high, and it is an invasive test, so patient compliance is poor and it is difficult to use for large-scale population screening.
  • ctDNA: circulating tumor DNA, a component of plasma cell-free DNA (cfDNA)
  • ctDNA can reflect cancer information from many aspects, such as mutations, fragment length distribution, methylation, etc. Among them, ctDNA methylation has become a hot spot in the research and development of cancer early screening products due to its outstanding performance.
  • existing ctDNA methylation early screening applications include the pan-cancer methylation early screening application PanSeer
  • PanSeer reaches a sensitivity of 88% at a specificity of 96% across 5 cancer types (gastric cancer, esophageal cancer, liver cancer, colorectal cancer, lung cancer) and can detect cancer 4 years earlier than traditional methods (Xingdong Chen et al., 2020); a machine learning model built using only 6 qPCR markers in colorectal cancer can achieve a sensitivity of 86% at a specificity of 92%, far better than traditional cancer screening methods (Guo-Xiang Cai et al., 2021).
  • Cancer screening, especially pan-cancer early screening, not only needs to predict the presence or absence of cancer signals but also requires tissue-of-origin tracing of positive samples. Cancer types at different locations in the human body have different methylation characteristics (Kundaje A et al., 2015), so tissue-of-origin tracing can be achieved by using these tissue-specific methylation signatures.
  • screening tissue-specific methylation markers requires a large amount of methylation sequencing data for multiple cancer types and a strict screening and verification process, which is a relatively challenging task.
  • the stomach and esophagus are two adjacent organs in the human body.
  • gastroscopy can be used to confirm lesions of the esophagus and stomach at the same time. Therefore, in the tissue-of-origin tracing stage of the pan-cancer screening process, esophageal cancer and gastric cancer can be grouped into one category: methylation markers specific to the two cancer types are searched for, and a model is built to distinguish esophageal cancer and gastric cancer from other cancer types.
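  • Grouping esophageal and gastric cancer into one category amounts to building one-vs-rest labels before training a classifier. A minimal sketch, assuming cancer types are given as plain strings (the label strings and the function name are hypothetical):

```python
def one_vs_rest_labels(cancer_types, positive_types):
    """Return 1 for samples whose cancer type is in positive_types, else 0."""
    return [1 if cancer in positive_types else 0 for cancer in cancer_types]


# Hypothetical sample annotations; gastric and esophageal form one positive class.
samples = ["gastric", "lung", "esophageal", "liver", "colorectal"]
labels = one_vs_rest_labels(samples, {"gastric", "esophageal"})
print(labels)
```

  • Any binary classifier can then be trained on these labels to separate the merged class from the remaining cancer types.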
  • There is a need in the art for tissue-specific methylation markers for gastric and/or esophageal cancer.
  • pancreatic cancer screening methods mainly include imaging screening (color ultrasound, CT, MRI, etc.) and blood tumor markers (mainly the carbohydrate antigen CA19-9 test). If a pancreatic mass is detected on color ultrasound or CT, or if the tumor marker CA19-9 is significantly elevated, the possibility of pancreatic cancer is considered. However, CA19-9 is elevated in only 65% of patients with resectable pancreatic cancer and is not suitable for early screening in large-scale populations. Color ultrasound can detect tumors with a diameter of more than 2 cm, and CT/MRI can detect pancreatic tumors with a diameter of more than 1 cm. Early pancreatic tumors smaller than 1 cm will be missed, so imaging is also difficult to apply to large-scale population screening.
  • ctDNA: circulating tumor DNA, a component of plasma cell-free DNA (cfDNA)
  • ctDNA can reflect cancer information from many aspects, such as mutations, fragment length distribution, methylation, etc. Among them, ctDNA methylation has become a hot spot in the research and development of cancer early screening products due to its outstanding performance.
  • existing ctDNA methylation early screening applications include the pan-cancer methylation early screening application PanSeer
  • PanSeer reaches a sensitivity of 88% at a specificity of 96% across 5 cancer types (gastric cancer, esophageal cancer, liver cancer, colorectal cancer, lung cancer) and can detect cancer 4 years earlier than traditional methods (Xingdong Chen et al., 2020).
  • a machine learning model constructed using only 6 qPCR markers in colorectal cancer can achieve a sensitivity of 86% at a specificity of 92%, results far superior to traditional cancer screening methods (Guo-Xiang Cai et al., 2021).
  • Cancer screening, especially pan-cancer early screening, not only needs to predict the presence or absence of cancer signals but also requires tissue-of-origin tracing of positive samples. Cancer types at different locations in the human body have different methylation characteristics (Kundaje A et al., 2015), so tissue-of-origin tracing can be achieved by using these tissue-specific methylation signatures.
  • screening tissue-specific methylation markers requires a large amount of methylation sequencing data for multiple cancer types and a strict screening and verification process, which is a very challenging task.
  • Colorectal cancer diagnosis in the prior art has many of the above-mentioned shortcomings.
  • the inventors screened colorectal cancer tissue-specific methylation markers from a large body of next-generation sequencing (NGS) cfDNA methylation targeted sequencing data covering 7 cancer types (lung cancer, liver cancer, colorectal cancer, gastric cancer, esophageal cancer, pancreatic cancer, and breast cancer).
  • NGS: next-generation sequencing
  • the inventors used the screened methylation markers to construct and verify a machine learning model for tracing the tissue origin of colorectal cancer during pan-cancer early screening, so as to better distinguish colorectal cancer from other cancers.
  • the invention provides isolated nucleic acids that are one or more specific methylation markers.
  • the isolated nucleic acid is a colorectal cancer tissue-specific methylation marker.
  • the isolated nucleic acid is one of the following regions, or a site within such a region, each region being the named gene together with the 2.3 kb upstream and 2.3 kb downstream of the gene on the chromosome where it is located: gene SFN; gene GPR3; gene FCGR1B; gene FAM150B; gene RGPD3; gene NUP210; gene LMOD3; gene FOXF2; gene TBXT; gene PRR15; gene ELN; gene TFPI2; gene REPIN1; gene PDLIM2; gene SDC2; gene TRAPPC9; gene TJP2; gene DIP2C; gene DDIT4; gene MRPL23; gene PAX6; gene PLXNC1; gene MLNR; gene MYO16; gene TMEM179; gene GATM; gene CACNA
  • the isolated nucleic acid is isolated from the sample.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the isolated nucleic acid is obtained from a colorectal cancer patient.
  • isolated nucleic acids are obtained from cell-free DNA in plasma.
  • a variant comprises a sequence that is at least 50% identical to the sequence of any of the genes.
  • a variant includes a sequence that is at least 60%, 65%, 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical.
  • the region is the gene and the 2.3 kb upstream and 2.3 kb downstream regions of the gene in the chromosome in which it is located.
  • the upstream region is 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp upstream region.
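  • The region definition above (a gene plus a fixed flank on each side) can be sketched as a coordinate calculation. This is only an illustration under stated assumptions: 1-based inclusive coordinates, a hypothetical `Gene` record, and placeholder gene coordinates that are not real annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int  # 1-based, inclusive, leftmost coordinate
    end: int    # 1-based, inclusive, rightmost coordinate


def marker_region(gene, flank_bp=2300):
    """Return (chrom, start, end) covering the gene plus flank_bp on each side."""
    start = max(1, gene.start - flank_bp)  # clamp at the start of the chromosome
    end = gene.end + flank_bp              # assumption: chromosome end is not checked
    return gene.chrom, start, end


# Placeholder coordinates, for illustration only.
example = Gene(name="GENE_X", chrom="chr1", start=10_000, end=15_000)
print(marker_region(example))        # 2.3 kb flanks
print(marker_region(example, 500))   # a shorter 500 bp flank
```

  • Because the flanks are symmetric, the result does not depend on strand; an asymmetric definition would need the strand to decide which side is upstream.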
  • the length of the site can be 150bp, 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460 bp, 470bp, 480bp, 490bp or 500bp.
  • the isolated nucleic acid comprises the nucleotide sequence set forth in any one or more of the following, or the complement or a variant thereof: SEQ ID Nos. 52-90.
  • the variant is a sequence with at least 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity.
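  • Percent identity thresholds like those above depend on how identity is counted. A minimal sketch of one common convention, matches divided by alignment length for two already-aligned sequences with `-` as the gap character (the convention, function name and example sequences are assumptions, not taken from the patent):

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two pre-aligned sequences of equal length.

    Gap characters ('-') never count as matches; the denominator is the
    full alignment length (one convention among several).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    pairs = zip(aligned_a.upper(), aligned_b.upper())
    matches = sum(1 for a, b in pairs if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Toy sequences differing at one of ten positions.
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
print(identity, identity >= 70.0)
```

  • Real comparisons would first align the sequences with a dedicated alignment tool; this sketch only covers the counting step.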
  • the present invention provides the use of reagents or components in the preparation of a kit or device for (1) distinguishing colorectal cancer patients from patients with other, non-colorectal cancers, (2) diagnosing or assisting in the diagnosis of colorectal cancer, or (3) tracing the tissue origin of colorectal cancer during pan-cancer screening, where the reagents or components include reagents or components for detecting the methylation level of colorectal cancer tissue-specific methylation markers in the genomic DNA of the sample, the methylation marker being one of the following regions or a site within it, each region being the named gene together with the 2.3 kb upstream and 2.3 kb downstream of the gene on the chromosome where it is located: gene SFN; gene GPR3; gene FCGR1B; gene FAM150B; gene RGPD3; gene NUP210; gene LMOD3; gene FOXF2; gene TBXT; gene PRR15; gene ELN; gene TFPI2; gene REPIN1; gene PDLIM2; gene SDC
  • the length of the sites can vary.
  • the site may be 140 bp-510 bp in length.
  • the site may be 200bp-470bp in length.
  • the length of the site can be 150bp, 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • the non-colorectal cancer is lung, liver, gastric, esophageal, pancreatic, and/or breast cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or its complementary sequence or variant sequence: SEQ ID No. 52-90.
  • the reagents or components comprise reagents or components for use in one or more of the following methods of detecting methylation: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease assays, fluorescence quantification, methylation-sensitive high-resolution melting curve methods, chip-based methylation profiling, and mass spectrometry.
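  • Several of the listed methods rest on bisulfite conversion, in which unmethylated cytosines are read as thymine after conversion and amplification while methylated cytosines remain cytosine. A minimal in-silico sketch of that principle on a single strand (the function name, 0-based positions and toy sequence are illustrative assumptions):

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate the read-out of one strand after bisulfite conversion and PCR.

    Unmethylated C is read as T; C at a 0-based index listed in
    methylated_positions is protected and stays C.
    """
    converted = []
    for index, base in enumerate(seq.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


toy = "ACGTCGAC"  # cytosines at indices 1, 4 and 7
print(bisulfite_convert(toy))        # no methylation
print(bisulfite_convert(toy, {1}))   # only the first CpG cytosine methylated
```

  • Comparing the converted read with the reference at each CpG cytosine is what lets sequencing or PCR-based assays call methylation status.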
  • the reagents or components comprise primers and/or probes for detecting methylation markers.
  • the sample is cells, tissue, fine needle aspiration biopsy, and/or plasma.
  • the sample genomic DNA is cell-free DNA in plasma.
  • the present invention provides a method for constructing a prediction model for distinguishing colorectal cancer from other, non-colorectal cancers, which includes: (1) obtaining the methylation levels of methylation markers in the genomic DNA of colorectal cancer samples and non-colorectal cancer samples, the methylation markers being selected from the following regions or sites within them, each region being the named gene together with the 2.3 kb upstream and 2.3 kb downstream of the gene on the chromosome where it is located: gene SFN; gene GPR3; gene FCGR1B; gene FAM150B; gene RGPD3; gene NUP210; gene LMOD3; gene FOXF2; gene TBXT; gene PRR15; gene ELN; gene TFPI2; gene REPIN1; gene PDLIM2; gene SDC2; gene TRAPPC9; gene TJP2; gene DIP2C; gene DDIT4; gene MRPL23; gene PAX6; gene PLXNC1; gene ML
  • the length of the sites can vary.
  • the site may be 140 bp-510 bp in length.
  • the site may be 200bp-470bp in length.
  • the length of the site can be 150bp, 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • the non-colorectal cancer is lung, liver, gastric, esophageal, pancreatic, and/or breast cancer.
  • the method includes (2) constructing a machine learning model of logistic regression using data on methylation levels of methylation markers.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the genomic DNA is cell-free DNA in plasma.
  • step (1) includes obtaining methylation sequencing data of the sample DNA.
  • the methylation sequencing data of the sample DNA is obtained using the MethylTitan method.
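  • How a methylation level is derived from sequencing data is not spelled out here. One common definition, used below purely as an assumption, is the fraction of methylated calls among all calls at the CpG sites of a marker region (the function name and counts are made up, and this is not a description of MethylTitan):

```python
def region_methylation_level(cpg_counts):
    """Pooled methylation level of a marker region.

    cpg_counts: list of (methylated_reads, total_reads), one tuple per CpG
    site in the region. Returns methylated calls / total calls.
    """
    methylated = sum(m for m, _ in cpg_counts)
    total = sum(t for _, t in cpg_counts)
    if total == 0:
        raise ValueError("no read coverage in this region")
    return methylated / total


# Made-up counts for three CpG sites in one marker region.
level = region_methylation_level([(8, 10), (3, 10), (9, 10)])
print(round(level, 3))
```

  • The resulting value per marker and per sample is the kind of feature a downstream classifier would consume.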
  • step (2) includes using a logistic regression model to obtain a model prediction score, training the model with the obtained methylation levels of the methylation markers as the training set, and determining the model's threshold based on the samples of the training set.
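  • The training step can be sketched in plain Python. This is a minimal illustration, not the inventors' implementation: the data are made up, the fit uses simple batch gradient descent, and the midpoint threshold rule is a hypothetical choice because the text does not say how the threshold is derived.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_score(x, w, b):
    """Model prediction score for one sample's marker methylation levels x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit coefficients w and intercept b by batch gradient descent on log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_score(xi, w, b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def midpoint_threshold(scores, labels):
    """Hypothetical rule: midpoint between lowest positive and highest negative score."""
    lowest_pos = min(s for s, l in zip(scores, labels) if l == 1)
    highest_neg = max(s for s, l in zip(scores, labels) if l == 0)
    return (lowest_pos + highest_neg) / 2.0


# Made-up training set: two markers per sample; 1 = colorectal, 0 = other cancer.
X_train = [[0.90, 0.80], [0.80, 0.70], [0.85, 0.90],
           [0.10, 0.20], [0.20, 0.10], [0.15, 0.30]]
y_train = [1, 1, 1, 0, 0, 0]

w, b = train_logistic_regression(X_train, y_train)
train_scores = [predict_score(x, w, b) for x in X_train]
threshold = midpoint_threshold(train_scores, y_train)
print(w, b, threshold)
```

  • In practice a library implementation with regularization and a threshold chosen on held-out data (for example at a target specificity) would be more robust than this toy fit.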
  • in the formula of the model, x is the methylation level value of the sample's target marker, w is the coefficient of the methylation marker, b is the intercept value, and y is the model prediction score.
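  • For a logistic regression with these symbols, the standard form would be as follows. This is stated as an assumption based on the symbol definitions, with x and w read as vectors over the markers:

```latex
y = \frac{1}{1 + e^{-\left(w^{\top} x + b\right)}}
  = \frac{1}{1 + \exp\!\left(-\left(\sum_{i} w_i x_i + b\right)\right)}
```

  • Under this form y lies between 0 and 1, so it can be compared directly against a threshold.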
  • the present invention provides a colorectal cancer prediction model constructed by the method herein.
  • the invention provides an apparatus for diagnosing colorectal cancer, comprising a memory and a processor for processing instructions stored in the memory, the instructions, when executed, carrying out the methods described herein to construct a colorectal cancer prediction model and using the methylation levels of the methylation markers in the genomic DNA of a sample to be tested as the test set to obtain the model prediction score.
  • the prediction score is compared with the threshold to determine whether the sample is colorectal cancer.
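  • Applying a trained model to a test sample reduces to computing the score and comparing it with the threshold. A minimal sketch with hypothetical coefficients, intercept and threshold; treating a score at or above the threshold as positive is also an assumption:

```python
import math


def prediction_score(x, w, b):
    """Logistic-regression score for methylation levels x, coefficients w, intercept b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def is_colorectal(x, w, b, threshold):
    """Call the sample colorectal cancer when its score reaches the threshold."""
    return prediction_score(x, w, b) >= threshold


# Hypothetical model parameters for two markers.
W, B, THRESHOLD = [4.0, 3.0], -3.5, 0.5

print(is_colorectal([0.9, 0.8], W, B, THRESHOLD))  # high methylation at both markers
print(is_colorectal([0.1, 0.2], W, B, THRESHOLD))  # low methylation at both markers
```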
  • the invention provides methods for (1) distinguishing patients with colorectal cancer from patients with cancers other than colorectal cancer, (2) diagnosing or assisting in the diagnosis of colorectal cancer, or (3) tracing the tissue origin of colorectal cancer during pan-cancer screening, the methods including measuring the methylation level of one or more colorectal cancer-specific methylation markers described herein in the genomic DNA of the sample.
  • the invention provides a kit or device for (1) distinguishing colorectal cancer patients from patients with other, non-colorectal cancers, (2) diagnosing or assisting in the diagnosis of colorectal cancer, or (3) tracing the tissue origin of colorectal cancer during pan-cancer screening.
  • the application includes determining the methylation level of one or more colorectal cancer-specific methylation markers described herein in the genomic DNA of a sample.
  • kits or devices for detecting colorectal cancer tissue-specific methylation markers.
  • a kit or device includes reagents or components that detect the status and/or level of one or more colorectal cancer tissue-specific methylation markers described herein in genomic DNA from a sample.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the nucleic acid is cell-free DNA in plasma.
  • the reagents or components comprise reagents or components for use in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve methods, chip-based methylation profiling, and mass spectrometry.
  • the reagents comprise oligonucleotides for detecting colorectal cancer-specific methylation markers.
  • the oligonucleotides are primers and/or probes.
  • the primers are primers for detecting the methylation level/status of a site by methylation sequencing, or PCR primers that amplify one or more methylation sites.
  • the reagents include bisulfite and its derivatives, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or insensitive restriction endonucleases, enzyme digestion buffer, fluorescent dyes, fluorescent quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the controls being the aforementioned specific methylation markers from normal subjects or from cancer patients without colorectal cancer.
  • the non-colorectal cancer is lung, liver, gastric, esophageal, pancreatic, and/or breast cancer.
  • the present invention provides new colorectal cancer-specific methylation markers, which can be used to trace the tissue origin of colorectal cancer during pan-cancer early screening, so as to better distinguish colorectal cancer;
  • the colorectal cancer-specific methylation marker of the present invention can detect colorectal cancer with high sensitivity and specificity.
  • the inventors screened lung cancer tissue-specific methylation markers from a large body of next-generation sequencing (NGS) cfDNA methylation targeted sequencing data covering seven cancer types (lung cancer, liver cancer, colorectal cancer, gastric cancer, esophageal cancer, pancreatic cancer, breast cancer).
  • the inventors used the screened methylation markers to construct and verify a machine learning model for tracing the tissue origin of lung cancer during pan-cancer early screening, so as to better distinguish lung cancer.
  • the invention provides the use of reagents or components in the preparation of kits or devices for (1) distinguishing lung cancer patients from patients with other, non-lung cancers, (2) diagnosing or assisting in the diagnosis of lung cancer, or (3) tracing the tissue origin of lung cancer during pan-cancer screening, where the reagents or components include reagents or components that detect the methylation level of lung cancer tissue-specific methylation markers in the genomic DNA of the sample, the methylation marker being one of the following regions or a site within it, each region being the named gene together with the 2.2 kb upstream and 2.2 kb downstream of the gene on the chromosome where it is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW1
  • the length of the site is 120bp-500bp, preferably 200bp-480bp.
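As a rough illustration of the region definition above, the following Python sketch extends a gene by 2.2kb on each side and checks whether a candidate site lies inside that region with a length of 120bp-500bp. The coordinates, the `Interval` type and the function names are hypothetical and chosen only for illustration; strand orientation is ignored, so the upstream and downstream flanks are treated symmetrically.

```python
from dataclasses import dataclass

FLANK_BP = 2200          # 2.2kb upstream and downstream, as stated in the text
MIN_SITE_BP = 120        # site length range stated in the text
MAX_SITE_BP = 500


@dataclass(frozen=True)
class Interval:
    """Closed interval on one chromosome, 1-based coordinates (an assumption)."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def marker_region(gene: Interval, flank: int = FLANK_BP) -> Interval:
    """Extend the gene body by `flank` bp on both sides, clamped at position 1."""
    return Interval(gene.chrom, max(1, gene.start - flank), gene.end + flank)


def is_valid_site(site: Interval, region: Interval) -> bool:
    """A site qualifies if it lies inside the region and has an allowed length."""
    inside = (site.chrom == region.chrom
              and site.start >= region.start
              and site.end <= region.end)
    return inside and MIN_SITE_BP <= site.length <= MAX_SITE_BP


# Hypothetical coordinates, not taken from the source.
gene = Interval("chr1", 10_000, 20_000)
region = marker_region(gene)
site = Interval("chr1", 8_000, 8_299)   # 300bp site in the upstream flank
```

A site that falls outside the flanks, or that is shorter than 120bp or longer than 500bp, is rejected by `is_valid_site`.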
  • non-lung cancers or pan-cancers include colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or its complement or variant sequence: SEQ ID NOs: 24, 65, 76 and 91-135.
  • the reagents or components comprise reagents or components for use in one or more of the following methods of detecting methylation: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantitation, methylation-sensitive high-resolution melting curve methods, chip-based methylation profiling, and mass spectrometry.
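Several of the listed methods rely on bisulfite conversion, which deaminates unmethylated cytosine to uracil (read as T after PCR) while leaving 5-methylcytosine unchanged. The sketch below simulates that effect on a single strand; the function name, the toy sequence and the methylated positions are hypothetical.

```python
from __future__ import annotations


def bisulfite_convert(seq: str, methylated_positions: set[int]) -> str:
    """Simulate bisulfite conversion followed by PCR on a single strand.

    Unmethylated C is read as T; a C at a methylated position (0-based
    index) is protected and stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


# Toy sequence with CpG cytosines at indices 1 and 5 (hypothetical).
seq = "ACGTCCGA"
methylated = bisulfite_convert(seq, {1, 5})
unmethylated = bisulfite_convert(seq, set())
```

Comparing the converted read with the reference at each CpG then indicates whether that cytosine was methylated in the original molecule.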
  • the reagents or components comprise primers and/or probes for detection of methylation markers, and/or the sample is cells, tissue, fine needle aspiration biopsy and/or plasma; preferably, the genomic DNA of the sample is cell-free DNA in plasma.
  • the present invention provides a method of constructing a prediction model for distinguishing lung cancer from other non-lung cancers, which includes:
  • (1) obtaining the methylation levels of methylation markers in the genomic DNA of lung cancer samples and non-lung cancer samples, where the methylation markers are selected from the following regions or sites in these regions,
  • the region is the following gene and the 2.2kb upstream region and 2.2kb downstream region of the gene in the chromosome where it is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW1; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1; gene SPAG6; Gene WAPAL; Gene BTBD16; Gene DPYSL4; Gene TTC40; Gene ADAM
  • the length of the site is 120bp-500bp, preferably 200bp-480bp.
  • the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or its complement or variant sequence: SEQ ID NOs: 24, 65, 76 and 91-135.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the genomic DNA is cell-free DNA in plasma.
  • step (1) includes obtaining methylation sequencing data of the sample DNA.
  • step (2) includes establishing a logistic regression model to obtain a model prediction score, using the obtained methylation levels of the methylation markers as a training set for training, and determining the relevant threshold of the model based on the samples of the training set.
  • the formula of the model is as follows, where x is the methylation level value of the methylation marker in the sample, w is the coefficient of the methylation marker, b is the intercept value, and y is the model prediction score.
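The formula itself does not survive in this text. For a logistic regression model using the symbols x, w, b and y defined above, the standard form would be the following; it is given here as an assumption consistent with those definitions, not as a quotation of the original formula.

```latex
y = \frac{1}{1 + e^{-(w^{\top} x + b)}}
```

Here $w^{\top} x$ denotes the sum of each marker's coefficient multiplied by its methylation level, so y always lies between 0 and 1.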
  • a lung cancer prediction model constructed according to the method of the present invention is provided.
  • a device for diagnosing lung cancer is provided, which includes a memory and a processor that processes instructions stored in the memory; the instructions execute the method according to the present invention to construct a lung cancer prediction model, and use the methylation levels of the methylation markers in the genomic DNA of the sample to be tested as a test set to obtain the model prediction score.
  • the prediction score is used to judge whether the sample is lung cancer according to the threshold: if the prediction score is greater than the threshold, the sample is predicted to be lung cancer; otherwise it is predicted to be another cancer type.
  • for example, TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the test set data and TestPred is the model prediction score.
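The scoring and thresholding steps can be sketched in plain Python as follows. The score uses the standard logistic form, which is an assumption about the model formula; the coefficients, intercept, threshold and methylation levels are hypothetical values chosen only to make the example concrete.

```python
import math


def prediction_score(x, w, b):
    """Logistic score y = 1 / (1 + exp(-(w.x + b))) for one sample."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def classify(score, threshold):
    """Scores strictly greater than the threshold are called lung cancer."""
    return "lung cancer" if score > threshold else "other cancer type"


# Hypothetical coefficients, intercept, threshold and methylation levels.
w = [2.0, -1.0, 3.0]
b = -1.5
threshold = 0.5
sample_a = [0.9, 0.1, 0.8]   # linear term: 1.8 - 0.1 + 2.4 - 1.5 = 2.6
sample_b = [0.1, 0.9, 0.1]   # linear term: 0.2 - 0.9 + 0.3 - 1.5 = -1.9

call_a = classify(prediction_score(sample_a, w, b), threshold)
call_b = classify(prediction_score(sample_b, w, b), threshold)
```

A score exactly equal to the threshold is not called lung cancer here, following the "greater than the threshold" wording of the text.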
  • a kit or device for detecting lung cancer tissue-specific methylation markers is provided, comprising reagents or components for detecting the state and/or level of one or more lung cancer tissue-specific methylation markers in genomic DNA from a sample, where the lung cancer tissue-specific methylation marker is the following region or a site thereof, the region being the following gene together with its 2.2kb upstream region and 2.2kb downstream region in the chromosome where it is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW1; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1
  • the length of the site is 120bp-500bp, preferably 200bp-480bp.
  • the methylation marker comprises a nucleotide sequence set forth in any one or more of the following, or a complement or variant sequence thereof: SEQ ID NOs: 24, 65, 76, and 91-135.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the nucleic acid is cell-free DNA in plasma.
  • the reagents or components comprise reagents or components for use in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantitative methods, methylation-sensitive high-resolution melting curve methods, chip-based methylation profiling, and mass spectrometry.
  • the reagents comprise oligonucleotides for detecting methylation markers.
  • the oligonucleotides are primers and/or probes.
  • the primers are primers that detect the methylation level/status of a site using methylation sequencing or PCR primers that amplify one or more methylation sites.
  • the reagents include bisulfite and its derivatives, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or insensitive restriction endonucleases, enzyme digestion buffer, fluorescent dyes, fluorescent quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, where the controls are the aforementioned specific methylation markers from normal subjects or non-lung cancer patients.
  • the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  • the invention provides isolated nucleic acids that are one or more specific methylation markers.
  • the isolated nucleic acid is a lung cancer tissue-specific methylation marker.
  • the lung cancer tissue-specific methylation marker is the following region or a site thereof, the region being the following gene together with its 2.2kb upstream region and 2.2kb downstream region in the chromosome where it is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW1; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1; gene SPAG6; gene WAPAL; gene
  • the variant comprises a sequence that is at least 70% identical to the sequence of any one of the genes.
  • a variant includes a sequence that is at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical.
  • the region is the gene and the 2.2 kb upstream and 2.2 kb downstream regions of the gene in the chromosome in which it is located.
  • the upstream region is 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp upstream region.
  • the downstream region is 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp downstream region.
  • the length of the sites may vary.
  • the length of the site may be 120bp-500bp, preferably 200bp-480bp.
  • the length of the site can be 130bp, 140bp, 150bp, 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • the variant is a variant sequence with at least 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity.
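The text does not say how sequence identity is computed. As a simplified sketch, the code below takes identity to be the percentage of matching positions between two sequences that are assumed to be already aligned and of equal length; a real comparison would first perform an alignment that handles gaps. The function names and sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)


def is_variant(candidate: str, reference: str, min_identity: float = 70.0) -> bool:
    """True if the candidate meets the minimum identity to the reference."""
    return percent_identity(candidate, reference) >= min_identity


# Hypothetical toy sequences: 8 of 10 positions match.
reference = "ACGTACGTAC"
candidate = "ACGTACGTTT"
identity = percent_identity(candidate, reference)
```

With the default 70% cut-off this candidate counts as a variant, while a 90% cut-off excludes it.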
  • the present invention provides a method for (1) distinguishing lung cancer patients from non-lung cancer patients, (2) diagnosing or assisting in the diagnosis of lung cancer, or (3) tracing the tissue origin of lung cancer during pan-cancer screening, which includes measuring the methylation level of one or more methylation markers described herein in the genomic DNA of the sample.
  • the method is performed using the lung cancer prediction model of the invention.
  • the present invention provides new lung cancer tissue-specific methylation markers, which can be used to trace the tissue origin of lung cancer during the early screening of pan-cancer species to achieve the purpose of better distinguishing lung cancer;
  • the lung cancer tissue-specific methylation marker of the present invention can detect lung cancer with high sensitivity and specificity.
  • Tissue-specific methylation markers for liver cancer are urgently needed.
  • the inventors screened liver cancer tissue-specific methylation markers from a large amount of next-generation sequencing (NGS) cfDNA methylation targeted sequencing data of 7 cancer types (lung cancer, colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and breast cancer).
  • the inventors used the screened methylation markers to construct and validate a machine learning model, which was used to trace the tissue origin of liver cancer in the early screening process of pan-cancer species, so as to better distinguish liver cancer.
  • the present invention provides the use of reagents or components in the preparation of kits or devices for (1) distinguishing liver cancer patients from patients with non-liver cancers; (2) diagnosing or assisting in the diagnosis of liver cancer; or (3) tracing the tissue origin of liver cancer during pan-cancer screening, where the reagents or components include reagents or components that detect the methylation level of liver cancer tissue-specific methylation markers in the genomic DNA of the sample, and the methylation marker is the following region or a site thereof, the region being the following gene together with its 3kb upstream region and 3kb downstream region in the chromosome where it is located: TAL1 (T-cell acute lymphocytic leukemia protein 1) gene; TRIM58 gene; LBH gene; ABCG5 (ATP Binding Cassette Subfamily G Member 5) gene; PAX8 (Paired Box 8) gene; DLEC1 gene; AMIGO3 gene; RASSF1 gene; CLDN11 gene; SLC2A9 gene; SLC9A3 gene; C
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 7, 18, 23, 29, 41, 90, 94, 104, 117, 120, 125, 128, 132 and 136-159.
  • the reagents or components comprise reagents or components for use in one or more of the following methods of detecting methylation: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantitation, methylation-sensitive high-resolution melting curve methods, chip-based methylation profiling, and mass spectrometry.
  • the reagents or components comprise primers and/or probes for detection of methylation markers, and/or the sample is cells, tissue, fine needle aspiration biopsy and/or plasma; preferably, the genomic DNA of the sample is cell-free DNA in plasma.
  • the present invention provides a method for constructing a prediction model for distinguishing liver cancer from other non-liver cancer, which includes:
  • (1) obtaining the methylation levels of methylation markers in the genomic DNA of liver cancer samples and non-liver cancer samples, where the methylation markers are selected from the following regions or sites in these regions, the region being the following gene together with its 3kb upstream region and 3kb downstream region in the chromosome where it is located: TAL1 gene; TRIM58 gene; LBH gene; ABCG5 gene; PAX8 gene; DLEC1 gene; AMIGO3 gene; RASSF1 gene; CLDN11 gene; SLC2A9 gene; SLC9A3 gene; CXXC5 gene; FOXC1 gene; HIST1H4F gene; TRIM40 gene; HOXA13 gene; CRHR2 gene; AGPAT6 gene; TCF24 gene; OPLAH gene; GPAM gene; ADAM8 gene; GRASP gene; B4GALNT1 gene; STX2 gene; ATL1 gene; ITPKA gene; PIF1 gene; ZFHX3 gene;
  • the site is 100 bp to 550 bp in length. In one embodiment, the loci are 150bp-480bp in length. In one embodiment, the non-liver cancer is colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 7, 18, 23, 29, 41, 90, 94, 104, 117, 120, 125, 128, 132 and 136-159.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the genomic DNA is cell-free DNA in plasma.
  • step (1) includes obtaining methylation sequencing data of the sample DNA.
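The text does not specify how a methylation level is derived from the sequencing data. One common convention, shown here purely as an assumption, is the fraction of methylated CpG calls among all CpG calls from reads covering the marker region. The read encoding and the function name are hypothetical.

```python
def methylation_level(reads):
    """Fraction of methylated CpG calls over all CpG calls in a marker region.

    `reads` is a list of strings, one per read, where each character is the
    call at one CpG: "1" methylated, "0" unmethylated. Returns None when
    there are no calls.
    """
    methylated = sum(read.count("1") for read in reads)
    total = sum(read.count("1") + read.count("0") for read in reads)
    return methylated / total if total else None


# Hypothetical per-read CpG calls for one marker region.
reads = ["1110", "1111", "0000", "1100"]
level = methylation_level(reads)   # 9 methylated calls out of 16
```

One such value per marker and per sample would give the feature matrix used as the training or test set.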
  • step (2) includes establishing a logistic regression model, such as the logistic regression model in the sklearn (V1.0.1) package in Python (V3.9.7), using the obtained methylation levels of the methylation markers as a training set for training, and determining the relevant threshold of the model based on the samples of the training set. For example, use AllModel.fit(TrainData, TrainPheno), where TrainData is the data of the training set, and TrainPheno is the trait of the training set samples, where liver cancer is 1 and other cancer types are 0.
  • a liver cancer prediction model constructed according to the method of the present invention is provided.
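The training call described above can be sketched with scikit-learn as follows. The variable names AllModel, TrainData, TrainPheno, TestData and TestPred follow the text; the methylation values are hypothetical toy data, the model is left at its default settings (the source does not state any hyperparameters), and the `predict_proba` line mirrors the call the text gives for the lung cancer model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical methylation levels: rows are samples, columns are markers.
TrainData = np.array([
    [0.90, 0.80],   # liver cancer
    [0.85, 0.75],   # liver cancer
    [0.80, 0.90],   # liver cancer
    [0.10, 0.20],   # other cancer type
    [0.15, 0.10],   # other cancer type
    [0.20, 0.15],   # other cancer type
])
TrainPheno = np.array([1, 1, 1, 0, 0, 0])   # liver cancer = 1, other = 0

AllModel = LogisticRegression()
AllModel.fit(TrainData, TrainPheno)

# Column 1 of predict_proba is the probability of class 1 (liver cancer).
TestData = np.array([[0.88, 0.82], [0.12, 0.18]])
TestPred = AllModel.predict_proba(TestData)[:, 1]
```

After fitting, `AllModel.coef_` and `AllModel.intercept_` hold the per-marker coefficients and the intercept of the model.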
  • a device for diagnosing liver cancer is provided, which includes a memory and a processor that processes instructions stored in the memory; the instructions execute the method according to the present invention to build a liver cancer prediction model, and use the methylation levels of the methylation markers in the genomic DNA of the sample to be tested as a test set to obtain the model prediction score.
  • the prediction score is used to judge whether the sample is liver cancer according to the threshold: if the prediction score is greater than the threshold, the sample is predicted to be liver cancer; otherwise it is predicted to be another cancer type.
  • a kit or device for detecting liver cancer tissue-specific methylation markers is provided, comprising reagents or components for detecting the state and/or level of one or more liver cancer tissue-specific methylation markers in genomic DNA from a sample, where the liver cancer tissue-specific methylation marker is the following region or a site thereof, the region being the following gene together with its 3kb upstream region and 3kb downstream region in the chromosome where it is located: TAL1 gene; TRIM58 gene; LBH gene; ABCG5 gene; PAX8 gene; DLEC1 gene; AMIGO3 gene; RASSF1 gene; CLDN11 gene; SLC2A9 gene; SLC9A3 gene; CXXC5 gene; FOXC1 gene; HIST1H4F gene; TRIM40 gene; HOXA13 gene; CRHR2 gene; AGPAT6 gene; TCF24 gene; OPLAH gene; GPAM gene; ADAM8 gene; GRASP gene; B4GALNT1 gene; ST
  • the methylation marker includes the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 7, 18, 23, 29, 41, 90, 94, 104, 117, 120, 125, 128, 132 and 136-159.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the nucleic acid is cell-free DNA in plasma.
  • the reagents or components comprise reagents or components for use in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence Quantitative methods, methylation-sensitive high-resolution melting curve methods, and chip-based methylation profiling and mass spectrometry.
  • the reagents comprise oligonucleotides for detecting methylation markers.
  • the oligonucleotides are primers and/or probes;
  • the primers are primers that detect the methylation level/status of a site using methylation sequencing or PCR primers that amplify one or more methylation sites.
  • the reagents include bisulfite and its derivatives, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or insensitive restriction endonucleases, enzyme digestion buffer, fluorescent dyes, fluorescent quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, where the controls are the aforementioned specific methylation markers from normal subjects or from cancer patients without liver cancer.
  • the non-liver cancer is colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  • the invention provides isolated nucleic acids that are one or more specific methylation markers.
  • the isolated nucleic acid is a liver cancer tissue-specific methylation marker.
  • the liver cancer tissue-specific methylation marker is the following region or its site, which region is the following gene and the 3kb upstream region and 3kb downstream region of the gene in the chromosome where it is located: TAL1 gene; TRIM58 gene; LBH gene; ABCG5 gene; PAX8 gene; DLEC1 gene; AMIGO3 gene; RASSF1 gene; CLDN11 gene; SLC2A9 gene; SLC9A3 gene; CXXC5 gene; FOXC1 gene; HIST1H4F gene; TRIM40 gene; HOXA13 gene; CRHR2 gene ; AGPAT6 gene; TCF24 gene; OPLAH gene; GPAM gene; ADAM8 gene; GRASP gene; B4GALNT1 gene; STX2 gene; ATL1 gene; ITPKA gene; P
  • the loci are 100 bp to 550 bp in length. In one embodiment, the loci are 150bp-480bp in length. In one embodiment, the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 7, 18, 23, 29, 41, 90, 94, 104, 117, 120, 125, 128, 132 and 136-159.
  • the isolated nucleic acid is isolated from the sample. In one embodiment, the sample is cells, tissue, fine needle aspiration biopsy, or plasma. In one embodiment, the isolated nucleic acid is obtained from a liver cancer patient. For example, isolated nucleic acids are obtained from cell-free DNA in plasma.
  • the variant comprises a sequence that is at least 60% identical to the sequence of any one of the genes.
  • a variant includes a sequence that is at least 65%, 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical.
  • the region is the gene and the 3 kb upstream and 3 kb downstream regions of the gene in the chromosome in which it is located.
  • the upstream region is 2.9kb, 2.8kb, 2.7kb, 2.6kb, 2.5kb, 2.4kb, 2.3kb, 2.2kb, 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp upstream region.
  • the downstream region is 2.9kb, 2.8kb, 2.7kb, 2.6kb, 2.5kb, 2.4kb, 2.3kb, 2.2kb, 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, or 5bp downstream region.
  • the length of the sites may vary.
  • the loci are 100 bp-550 bp in length.
  • the loci are 150bp-480bp in length.
  • the length of the site can be 110bp, 120bp, 130bp, 140bp, 150bp, 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp, 500bp, 510b
  • the variant is a variant sequence that is at least 60%, 65%, 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical to the nucleotide sequence shown in any one or more of the above.
  • the present invention provides a method for (1) distinguishing liver cancer patients from patients with non-liver cancers, (2) diagnosing or assisting in the diagnosis of liver cancer, or (3) tracing the tissue origin of liver cancer during pan-cancer screening, which includes measuring the methylation level of one or more methylation markers described herein in the genomic DNA of the sample.
  • the method is performed using the liver cancer prediction model of the invention.
  • the present invention provides new methylation markers, which can be used to trace the tissue origin of liver cancer in the early screening process of pan-cancer species, so as to achieve the purpose of better distinguishing liver cancer;
  • the methylation marker of the present invention can detect liver cancer with high sensitivity and specificity.
  • breast ultrasound, mammography and magnetic resonance imaging are commonly used methods for breast cancer screening, but these traditional methods have certain technical limitations and depend heavily on the doctor's operational skill.
  • There is a lack of tissue-specific methylation markers for breast cancer in this field.
  • the inventors screened breast cancer tissue-specific methylation markers from a large amount of next-generation sequencing (NGS) cfDNA methylation targeted sequencing data of 7 cancer types (lung cancer, liver cancer, colorectal cancer, gastric cancer, esophageal cancer, pancreatic cancer, and breast cancer).
  • the inventor used the screened methylation markers to construct and verify the machine learning model, which was used to trace the tissue origin of breast cancer during the early screening of pan-cancer species and achieve the purpose of better distinguishing breast cancer.
  • the breast cancer tissue-specific methylation markers of the invention have not been previously described.
  • the present invention provides the use of reagents or components in the preparation of kits or devices for (1) distinguishing breast cancer patients from non-breast cancer patients; (2) diagnosing or assisting in the diagnosis of breast cancer; or (3) tracing the tissue origin of breast cancer during pan-cancer screening, where the reagents or components include reagents or components for detecting the methylation level of breast cancer tissue-specific methylation markers in the genomic DNA of the sample, and the methylation marker is the following region or a site thereof, the region being the following gene together with its 2kb upstream region and 2kb downstream region in the chromosome where it is located: gene BARHL2; gene ALX3; gene TBX15; gene C2CD4D; gene RYR2; gene LBH; gene SIX3; gene SIX2; gene OTX1; gene EMX1; gene LBX2; gene BCL2L11; gene PAX8; gene HOXD1; gene SATB2; gene VILL; gene CLDN11; gene
  • non-breast cancer or pan-cancer includes colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or lung cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 1-51.
  • the reagents or components comprise reagents or components for use in one or more of the following methods of detecting methylation: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantitation, methylation-sensitive high-resolution melting curve methods, chip-based methylation profiling, and mass spectrometry.
  • the reagents or components comprise primers and/or probes for detection of methylation markers, and/or the sample is cells, tissue, fine needle aspiration biopsy and/or plasma; preferably, the genomic DNA of the sample is cell-free DNA in plasma.
  • the present invention provides a method of constructing a prediction model for distinguishing breast cancer from other non-breast cancer, which includes:
  • the length of the site is 150bp-500bp, preferably 200bp-470bp.
  • the non-breast cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or lung cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 1-51.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the genomic DNA is cell-free DNA in plasma.
  • step (1) includes obtaining methylation sequencing data of the sample DNA.
  • step (2) includes establishing a logistic regression model and using the obtained methylation levels of the methylation markers as a training set for training and determining the relevant threshold of the model based on the samples of the training set.
  • for example, AllModel = LogisticRegression().
  • x is the methylation level value of the methylation marker in the sample
  • w is the coefficient of methylation marker
  • b is the intercept value
  • y is the model prediction score
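The text states that the threshold is determined from the training set samples but does not say how. One common choice, shown here only as an assumption, is the cut-off that maximises Youden's J statistic (sensitivity + specificity - 1) over the training-set prediction scores. The function name, scores and labels are hypothetical.

```python
def best_threshold(scores, labels):
    """Pick the cut-off that maximises sensitivity + specificity - 1.

    Candidate cut-offs are the observed scores; a sample is called positive
    when its score is strictly greater than the cut-off.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both positive and negative training samples")
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cut)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Hypothetical training-set prediction scores; 1 = breast cancer, 0 = other.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0]
cut, j = best_threshold(scores, labels)
```

Other criteria, such as fixing the specificity at a target value, would fit the same description; the source leaves the choice open.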
  • a breast cancer prediction model constructed according to the method of the present invention is provided.
  • a kit or device for detecting breast cancer tissue-specific methylation markers is provided, comprising reagents or components for detecting one or more breast cancer tissue-specific methylation markers in genomic DNA from a sample, where the breast cancer tissue-specific methylation marker is the following region or a site thereof, the region being the following gene together with its 2kb upstream region and 2kb downstream region in the chromosome where it is located: gene BARHL2; gene ALX3; gene TBX15; gene C2CD4D; gene RYR2; gene LBH; gene SIX3; gene SIX2; gene OTX1; gene EMX1; gene LBX2; gene BCL2L11; gene PAX8; gene HOXD1; gene SATB2; gene VILL; gene CLDN11; gene EPHB3; gene NKX3-2; gene KCTD8; gene PITX1; gene CXXC5; gene FOXC1; gene NRN1; gene HOXA9; gene DLX6; gene MOS; gene TCF24; gene CA3; gene GDF6; gene FOXD4; gene PTF1A; gene TLX1; gene INA; gene NKX6-2; gene PAX6; gene BCAT1; gene FAIM2; gene GRASP; gene CCNA1; gene SIX1; gene PRKCB; gene SOX9; gene ST8SIA5; gene NFIX; gene NFIX; gene
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 1-51.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the nucleic acid is cell-free DNA in plasma.
  • the reagents or components comprise reagents or components for use in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantitative methods, methylation-sensitive high-resolution melting curve methods, chip-based methylation profiling, and mass spectrometry.
  • the reagents comprise oligonucleotides for detecting methylation markers.
  • the oligonucleotides are primers and/or probes.
  • the primers are primers that detect the methylation level/status of a site using methylation sequencing, or PCR primers that amplify one or more methylation sites.
  • the invention provides isolated nucleic acids that are one or more specific methylation markers.
  • the isolated nucleic acid is a breast cancer tissue-specific methylation marker.
  • the breast cancer tissue-specific methylation marker is the following region or a site thereof, which region is the following gene and the 2kb upstream region and 2kb downstream region of the gene in the chromosome where it is located : Gene BARHL2; Gene ALX3; Gene TBX15; Gene C2CD4D; Gene RYR2; Gene LBH; SIX3; Gene SIX2; Gene OTX1; Gene EMX1; Gene LBX2; Gene BCL2L11; Gene PAX8; Gene HOXD1; Gene SATB2; Gene VILL; Gene CLDN11 ; Gene EPHB3; Gene NKX3-2; Gene KCTD8; Gene PITX1; Gene CXXC5; Gene FOXC1; Gene NRN1; Gene HOXA9; Gene DLX6; Gene MOS; Gene
  • the loci are 150bp-500bp in length. In one embodiment, the loci are 200bp-470bp in length. In one embodiment, the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 1-51.
  • the isolated nucleic acid is isolated from the sample. In one embodiment, the sample is cells, tissue, fine needle aspiration biopsy, or plasma. In one embodiment, the isolated nucleic acid is obtained from a breast cancer patient. For example, isolated nucleic acids are obtained from cell-free DNA in plasma.
  • the variant comprises a sequence that is at least 70% identical to the sequence of any one of the genes.
  • a variant includes a sequence that is at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical.
  • the region is the gene and the 2 kb upstream and 2 kb downstream regions of the gene in the chromosome in which it is located.
  • the upstream region is the region 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp upstream of the gene.
  • the downstream region is the region 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp downstream of the gene.
  • the length of the sites may vary.
  • the site may be 150 bp-500 bp in length.
  • the site may be 200bp-470bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • the variant is a variant sequence with at least 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity.
  • the present invention provides new methylation markers, which can be used to trace the tissue origin of breast cancer in the early screening process of pan-cancer species, so as to achieve the purpose of better distinguishing breast cancer;
  • the methylation marker of the present invention can detect breast cancer with high sensitivity and specificity.
  • There is a lack of tissue-specific methylation markers for gastric cancer and/or esophageal cancer in this field. The inventors screened gastric cancer and/or esophageal cancer tissue-specific methylation markers from a large amount of next-generation sequencing (NGS) cfDNA methylation targeted sequencing data of 7 cancer types (lung cancer, liver cancer, colorectal cancer, gastric cancer, esophageal cancer, pancreatic cancer, and breast cancer).
  • the inventors used the screened methylation markers to construct and validate a machine learning model, which was used to trace the tissue origin of gastric cancer and/or esophageal cancer in the early screening process of pan-cancer species, so as to better distinguish gastric cancer and/or esophageal cancer.
  • the invention provides isolated nucleic acids that are one or more specific methylation markers.
  • the isolated nucleic acid is a gastric cancer and/or esophageal cancer tissue-specific methylation marker.
  • the isolated nucleic acid is the following region, or a site of that region, the region being the following gene and the 2 kb upstream and 2 kb downstream regions of the gene in the chromosome in which it is located: gene TAL1; gene VAV3; gene PMF1; gene ATP2B4; gene SH3YL1; gene SLC9A3; gene CXXC5; gene PCDHGA11; gene FOXF2; gene ZNF273; gene KLRG2; gene CRB2; gene SEC16A; gene GPAM; gene ASCL2; gene PAX6; gene PTGDR2; gene PLEKHB1; gene TBX5; gene STX2; gene FBRSL1; gene ATP11A; gene BTBD6; gene CRIP2; gene
  • the isolated nucleic acid is isolated from the sample.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the isolated nucleic acid is obtained from a gastric cancer and/or esophageal cancer patient.
  • isolated nucleic acids are obtained from cell-free DNA in plasma.
  • the variant comprises a sequence that is at least 70% identical to the sequence of any gastric and/or esophageal cancer tissue-specific methylation marker gene.
  • a variant includes a sequence that is at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical.
  • the region is the gene and the 2 kb upstream and 2 kb downstream regions of the gene in the chromosome in which it is located.
  • the upstream region is the region 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp upstream of the gene.
  • the downstream region is the region 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp downstream of the gene.
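  • The region definition above (a gene plus a fixed flank on each side) can be sketched as follows. This is a minimal illustration only: the function name `flanked_region`, the 1-based inclusive coordinate convention, the optional `chrom_length` clamp, and the example coordinates are all assumptions, not taken from the source.

```python
def flanked_region(gene_start, gene_end, flank=2000, chrom_length=None):
    """Return the 1-based inclusive interval covering a gene plus `flank` bp
    on each side. Because the flank is symmetric, the resulting interval is
    the same whichever strand the gene lies on."""
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    start = max(1, gene_start - flank)           # do not run off the chromosome start
    end = gene_end + flank
    if chrom_length is not None:
        end = min(end, chrom_length)             # do not run off the chromosome end
    return start, end


# Hypothetical gene coordinates, for illustration only
region = flanked_region(10_000, 15_000)          # 2 kb flanks
narrow = flanked_region(10_000, 15_000, flank=500)
```

  • Passing a smaller `flank` (for example 1900, 500 or 5) corresponds to the narrower upstream and downstream regions listed above.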
  • the length of the sites can vary.
  • the site may be 150 bp-500 bp in length.
  • the site may be 200bp-470bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • the isolated nucleic acid comprises the nucleotide sequence set forth in any one or more of the following, or a complementary sequence or variant thereof: SEQ ID No. 23, 72, 143, 150, 152, 157, and 160-187.
  • the variant is a variant sequence with at least 60%, 65%, 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity.
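  • As a rough illustration of the identity thresholds above, the sketch below computes percent identity between two sequences that are assumed to be already aligned and of equal length. The source does not state which alignment method or identity definition is used, so the simple matches-over-length definition and the helper names `percent_identity` and `is_variant` are assumptions.

```python
def percent_identity(seq_a, seq_b):
    """Percent of matching positions between two pre-aligned,
    equal-length sequences (case-insensitive)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


def is_variant(reference, candidate, min_identity=70.0):
    """True if `candidate` meets the chosen identity threshold against `reference`."""
    return percent_identity(reference, candidate) >= min_identity


# Made-up 10-base sequences differing at one position
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")
```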
  • the invention provides the use of reagents or components in the preparation of a kit or device that is used (1) to distinguish patients with gastric cancer and/or esophageal cancer from patients with cancers other than gastric cancer and esophageal cancer, (2) to diagnose or assist in the diagnosis of gastric cancer and/or esophageal cancer, or (3) to trace the tissue origin of gastric cancer and/or esophageal cancer during pan-cancer screening, wherein the reagents or components include reagents or components for detecting the methylation levels of gastric cancer and/or esophageal cancer tissue-specific methylation markers in the genomic DNA of a sample, the markers being the following regions or sites thereof, and the regions being the following genes and the 2 kb upstream region and 2 kb downstream region of each gene in the chromosome where it is located: gene TAL1; gene VAV3; gene PMF1; gene ATP2B4; gene SH3YL1; gene SLC9A3; gene CXXC5; gene PCDHGA11; gene FOXF
  • the length of the sites can vary.
  • the site may be 150 bp-500 bp in length.
  • the site may be 200bp-470bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • cancers other than gastric and esophageal cancer or pan-cancer include lung cancer, liver cancer, colorectal cancer, pancreatic cancer, and/or breast cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID No. 23, 72, 143, 150, 152, 157 and 160-187.
  • the reagents or components comprise reagents or components for use in one or more of the following methods of detecting methylation: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  • the reagents or components comprise primers and/or probes for detecting methylation markers.
  • the sample is cells, tissue, fine needle aspiration biopsy, and/or plasma.
  • the sample genomic DNA is cell-free DNA in plasma.
  • the invention provides a method for constructing a prediction model that distinguishes gastric cancer and/or esophageal cancer from cancers other than gastric cancer and esophageal cancer, which includes: (1) obtaining the methylation levels of methylation markers in the genomic DNA of gastric cancer and/or esophageal cancer samples and of samples of cancers other than gastric cancer and esophageal cancer; the methylation markers are selected from the following regions or sites thereof, the regions being the following genes and the 2 kb upstream region and 2 kb downstream region of each gene in the chromosome where it is located: gene TAL1; gene VAV3; gene PMF1; gene ATP2B4; gene SH3YL1; gene SLC9A3; gene CXXC5; gene PCDHGA11; gene FOXF2; gene ZNF273; gene KLRG2; gene CRB2; gene SEC16A; gene GPAM; gene ASCL2; gene PAX6; gene PTG
  • the length of the sites can vary.
  • the site may be 150 bp-500 bp in length.
  • the site may be 200bp-470bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • cancers other than gastric and esophageal cancer or pan-cancer include lung cancer, liver cancer, colorectal cancer, pancreatic cancer, and/or breast cancer.
  • the method includes (2) constructing a machine learning model of logistic regression using data on methylation levels of methylation markers.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the genomic DNA is cell-free DNA in plasma.
  • step (1) includes obtaining methylation sequencing data of the sample DNA.
  • methylation sequencing data of sample DNA is obtained by the method of MethylTitan.
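  • The source does not spell out how a methylation level is derived from the sequencing data. As a hedged illustration, one common per-site definition is the fraction of reads that are methylated at a CpG; the sketch below uses that definition and pools counts across the CpGs of a region. The function names and the read counts are hypothetical, and the source's pipeline may use a different measure.

```python
def methylation_level(methylated_reads, unmethylated_reads):
    """Fraction of reads reporting methylation at one CpG.
    Returns None when the site has no coverage."""
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None
    return methylated_reads / total


def region_methylation_level(cpg_counts):
    """Pooled methylation level for a region.
    `cpg_counts` is a sequence of (methylated, unmethylated) read counts,
    one pair per CpG in the region."""
    counts = list(cpg_counts)
    meth = sum(m for m, _ in counts)
    unmeth = sum(u for _, u in counts)
    return methylation_level(meth, unmeth)


# Made-up read counts for a region with two CpGs
level = region_methylation_level([(10, 10), (30, 50)])
```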
  • step (2) includes establishing a logistic regression model to obtain a model prediction score, using the obtained methylation levels of the methylation markers as a training set for training, and determining the model's threshold based on the samples of the training set.
  • a logistic regression model is established, such as the logistic regression model in the sklearn (V1.0.1) package in Python (V3.9.7):
  • AllModel = LogisticRegression(); the model takes the standard logistic form y = 1 / (1 + e^(-(w·x + b))), where x is the methylation level value of the sample's target markers, w is the coefficient of the methylation markers, b is the intercept value, and y is the model prediction score.
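  • To make the roles of x, w, b and y concrete, the following sketch evaluates a logistic score in plain Python. It assumes the standard binary logistic form (the sigmoid of w·x + b), which is what a fitted binary sklearn `LogisticRegression` returns in the second column of `predict_proba`; the function name and the numbers are made up for illustration and are not the source's coefficients.

```python
import math


def prediction_score(x, w, b):
    """Logistic model score y = sigmoid(w . x + b).

    x: methylation levels of the markers for one sample
    w: one coefficient per marker
    b: intercept
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Numerically stable sigmoid: avoids overflow for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical coefficients for two markers
w = [3.0, 2.0]
b = -2.5
high = prediction_score([0.8, 0.7], w, b)   # strongly methylated sample
low = prediction_score([0.1, 0.2], w, b)    # weakly methylated sample
```

  • With positive coefficients, a sample with higher marker methylation receives a higher score, which is then compared with the threshold.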
  • the present invention provides a gastric cancer and/or esophageal cancer prediction model constructed by the method herein.
  • the invention provides a device for diagnosing gastric cancer and/or esophageal cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the methods described herein to construct a gastric cancer and/or esophageal cancer prediction model, and using the methylation levels of the methylation markers in the genomic DNA of the sample to be tested as a test set to obtain the model prediction score; the prediction score is compared with the threshold to judge whether the sample is gastric cancer and/or esophageal cancer: if the score is greater than the threshold, the sample is predicted to be gastric cancer and/or esophageal cancer; otherwise it is predicted to be another cancer type.
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the test set data and TestPred is the model prediction score; the prediction score is compared with the threshold to judge whether the sample is gastric cancer and/or esophageal cancer: if the score is greater than the threshold, the sample is predicted to be gastric cancer and/or esophageal cancer; otherwise it is predicted to be another cancer type.
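  • The decision rule above can be sketched in a few lines. The function name `classify`, the label strings, the example scores and the threshold value are illustrative assumptions; only the rule itself (score greater than the threshold means the target cancer, otherwise another cancer type) comes from the text.

```python
def classify(scores, threshold,
             positive="gastric and/or esophageal cancer",
             negative="other cancer type"):
    """Apply the rule: a score strictly greater than the threshold is the
    target cancer; anything else is another cancer type."""
    return [positive if s > threshold else negative for s in scores]


# Hypothetical prediction scores (e.g. the values held in TestPred)
TestPred = [0.91, 0.12, 0.50, 0.67]
calls = classify(TestPred, threshold=0.5)
```

  • Note that a score exactly equal to the threshold is assigned to the "other cancer type" class, since the text requires the score to be greater than the threshold.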
  • the present invention provides methods that (1) differentiate patients with gastric cancer and/or esophageal cancer from patients with cancers other than gastric cancer and esophageal cancer, (2) diagnose or assist in the diagnosis of gastric cancer and/or esophageal cancer, or (3) trace the tissue origin of gastric cancer and/or esophageal cancer during pan-cancer screening, the methods including measuring the methylation level of one or more methylation markers described herein in the genomic DNA of the sample.
  • the invention provides a kit or device for (1) distinguishing patients with gastric cancer and/or esophageal cancer from patients with cancer other than gastric cancer and esophageal cancer, (2) for diagnosing or assisting in the diagnosis of gastric cancer and/or esophageal cancer; or (3) used in tissue tracing of gastric cancer and/or esophageal cancer during pan-cancer screening.
  • the application includes determining the methylation level of one or more methylation markers described herein in a sample genomic DNA.
  • the present invention provides a kit or device for detecting gastric cancer and/or esophageal cancer tissue-specific methylation markers.
  • the kit or device includes reagents or components for detecting the status and/or level of one or more gastric cancer and/or esophageal cancer tissue-specific methylation markers described herein in genomic DNA from a sample.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the nucleic acid is cell-free DNA in plasma.
  • the reagents or components comprise reagents or components for use in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  • the reagents comprise oligonucleotides for detecting methylation markers.
  • the oligonucleotides are primers and/or probes.
  • the primers are primers that detect the methylation level/status of a site using methylation sequencing or PCR primers that amplify one or more methylation sites.
  • the reagents include bisulfite and its derivatives, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or insensitive restriction endonucleases, enzyme digestion buffer, fluorescent dyes, fluorescent quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the controls being the aforementioned specific methylation markers from normal subjects or from patients with cancers other than gastric cancer and esophageal cancer.
  • cancers other than gastric and esophageal cancer or pan-cancer include lung cancer, liver cancer, colorectal cancer, pancreatic cancer, and/or breast cancer.
  • the present invention provides new gastric cancer and/or esophageal cancer tissue-specific methylation markers, which can be used to trace the tissue origin of gastric cancer and/or esophageal cancer during early pan-cancer screening, so as to better distinguish gastric cancer and/or esophageal cancer;
  • ctDNA is cell-free DNA released by tumor cells into the plasma.
  • the gastric cancer and/or esophageal cancer tissue-specific methylation markers of the present invention can detect gastric cancer and/or esophageal cancer with high sensitivity and specificity.
  • pancreatic cancer tissue-specific methylation markers were screened from next-generation sequencing (NGS) cfDNA methylation targeted sequencing data.
  • the inventor used the screened methylation markers to construct and verify the machine learning model, which was used to trace the tissue origin of pancreatic cancer during the early screening of pan-cancer species and achieve the purpose of better distinguishing pancreatic cancer.
  • the invention provides the use of reagents or components in the preparation of a kit or device that is used (1) to distinguish pancreatic cancer patients from patients with non-pancreatic cancers, (2) to diagnose or assist in the diagnosis of pancreatic cancer, or (3) to trace the tissue origin of pancreatic cancer during pan-cancer screening, wherein the reagents or components include reagents or components for detecting the methylation level of pancreatic cancer tissue-specific methylation markers in the genomic DNA of a sample.
  • the methylation marker is the following region or a site thereof, the region being the following gene and the 2.5 kb upstream region and 2.5 kb downstream region of the gene in the chromosome where it is located: gene PGM1 (Phosphoglucomutase 1); gene CELF3 (CUGBP Elav-Like Family Member 3); gene ATP2B4 (ATPase Plasma Membrane Ca2+ Transporting 4); gene SF3B6 (Splicing Factor 3b Subunit 6); gene CNNM4 (Cyclin And CBS Domain Divalent Metal Cation Transport Mediator 4); gene SP9 (Sp9 Transcription Factor); gene C2orf82 (chromosome 2 open reading frame 82); gene NEU4 (Neuraminidase 4); gene RPL35A (Ribosomal Protein L35a); gene HGFAC; gene EXOC3 (Exocyst Complex Component 3); gene GDNF (Glial cell line-derived neurotrophic factor); gene NEUROG1
  • cancers other than pancreatic cancer or pan-cancer include colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer, and/or lung cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 68, 88, 154, 163, 172, 177 and 188-217.
  • the reagents or components comprise reagents or components for use in one or more of the following methods of detecting methylation: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease Enzymatic assays, fluorescence quantitation, methylation-sensitive high-resolution melting curve methods, and chip-based methylation profiling and mass spectrometry.
  • the reagents or components comprise primers and/or probes for detecting methylation markers, and/or the sample is cells, tissue, fine needle aspiration biopsy and/or plasma; preferably, the sample genomic DNA is cell-free DNA in plasma.
  • the present invention provides a method for constructing a prediction model that distinguishes pancreatic cancer from non-pancreatic cancers, the method including:
  • the site is between 130bp and 530bp in length, preferably between 150bp and 480bp.
  • the non-pancreatic cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer and/or lung cancer.
  • the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 68, 88, 154, 163, 172, 177 and 188-217.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the genomic DNA is cell-free DNA in plasma.
  • step (1) includes obtaining methylation sequencing data of the sample DNA.
  • step (2) includes establishing a logistic regression model to obtain a model prediction score, using the obtained methylation levels of the methylation markers as a training set for training, and determining the model's threshold based on the samples of the training set.
  • the training set is used for training: AllModel.fit(TrainData, TrainPheno), where TrainData is the data of the training set and TrainPheno is the phenotype of the training-set samples (pancreatic cancer is 1 and other cancer types are 0), and the model's threshold is determined based on the samples of the training set.
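  • The training step can be illustrated without any third-party package. The sketch below is not sklearn's implementation: it swaps in plain batch gradient descent on the unregularized logistic loss as a stand-in for `AllModel.fit`, purely to show how TrainData (marker methylation levels) and TrainPheno (1 for pancreatic cancer, 0 for other cancer types) produce coefficients and scores. The data values, learning rate and epoch count are made up.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(train_data, train_pheno, lr=0.5, epochs=2000):
    """Fit w and b by batch gradient descent on the mean logistic loss."""
    n, d = len(train_data), len(train_data[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(train_data, train_pheno):
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_score(x, w, b):
    """Model prediction score for one sample."""
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy methylation levels for two hypothetical markers
TrainData = [[0.80, 0.70], [0.75, 0.65], [0.20, 0.10], [0.15, 0.25]]
TrainPheno = [1, 1, 0, 0]          # 1 = pancreatic cancer, 0 = other cancer type
w, b = fit_logistic(TrainData, TrainPheno)
TrainPred = [predict_score(x, w, b) for x in TrainData]
```

  • In practice the sklearn model named in the text would be used instead; its default solver and L2 regularization will give different coefficients from this stand-in.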
  • the present invention provides a pancreatic cancer prediction model constructed according to the method of the present invention.
  • a device for diagnosing pancreatic cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions executing the method according to the invention to construct a pancreatic cancer prediction model, and using the methylation levels of the methylation markers in the genomic DNA of a sample to be tested as the test set to obtain the model prediction score; the prediction score is compared with the threshold to judge whether the sample is pancreatic cancer: if the score is greater than the threshold, the sample is predicted to be pancreatic cancer; otherwise, it is predicted to be another cancer type.
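  • The text says the threshold is determined from the training-set samples but does not say by which criterion. As one possible choice, labeled here as an assumption, the sketch below picks the threshold that maximizes Youden's J (sensitivity + specificity - 1) over the training scores, using the same "greater than the threshold" rule as the text. The function name and example scores are hypothetical.

```python
def youden_threshold(scores, labels):
    """Return the training score that, used as a threshold with the rule
    `score > threshold => positive`, maximizes sensitivity + specificity - 1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sensitivity = sum(s > t for s in pos) / len(pos)
        specificity = sum(s <= t for s in neg) / len(neg)
        j = sensitivity + specificity - 1.0
        if j > best_j:                 # keep the first (lowest) best threshold
            best_t, best_j = t, j
    return best_t


# Made-up training scores: 1 = pancreatic cancer, 0 = other cancer type
threshold = youden_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```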
  • a kit or device for detecting pancreatic cancer tissue-specific methylation markers, comprising reagents or components for detecting one or more pancreatic cancer tissue-specific methylation markers in genomic DNA from a sample
  • the pancreatic cancer tissue-specific methylation marker is the following region or a site thereof, the region being the following gene and the 2.5 kb upstream region and 2.5 kb downstream region of the gene in the chromosome in which it is located: gene TNFRSF14; gene PGM1; gene CELF3; gene ATP2B4; gene SF3B6; gene CNNM4; gene SP9; gene C2orf82; gene NEU4; gene RPL35A; gene HGFAC; gene EXOC3; gene GDNF; gene NEUROG1; gene HIST1H2BA; gene OSTM1; gene CCR6; gene CCAR2; gene TNFRSF10D; gene TJP2; gene DAB2IP; gene NTMT1; gene MKI67; gene PTGDR2; gene CCDC77; gene MYL2; gene FRY; gene SMEK1; gene BTBD6; gene PIF1; gene SRL; gene SPNS1; gene DNM2; gene ZNF569; gene SDF2L1; or the complementary sequence or variant of any of these genes, as long as the methylation site in the variant is not mutated.
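  • The condition that the methylation site in a variant is not mutated can be checked mechanically. The sketch below interprets "methylation site" as a CpG dinucleotide and assumes the variant is the same length as the reference with no insertions or deletions; both are simplifying assumptions, and the function names and sequences are illustrative only.

```python
def cpg_positions(seq):
    """0-based start positions of every CG dinucleotide in `seq`."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cpg_sites_preserved(reference, variant):
    """True if every CpG in the reference is still a CpG at the same
    position in the variant (substitutions elsewhere are allowed)."""
    if len(reference) != len(variant):
        raise ValueError("this sketch only handles equal-length sequences")
    v = variant.upper()
    return all(v[i:i + 2] == "CG" for i in cpg_positions(reference))


# Made-up sequences: the first variant changes only non-CpG bases,
# the second destroys the CpG at position 5
ok = cpg_sites_preserved("ACGTTCGA", "TCGTACGA")
broken = cpg_sites_preserved("ACGTTCGA", "ACGTTTGA")
```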
  • the loci are 130bp-530bp in length.
  • the methylation marker includes the nucleotide sequence shown in any one or more of the following or its complementary sequence or variant sequence: SEQ ID NO: 68, 88, 154, 163, 172, 177 and 188-217.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the nucleic acid is cell-free DNA in plasma.
  • the reagents or components comprise reagents or components for use in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  • the reagents comprise oligonucleotides for detecting methylation markers.
  • the oligonucleotides are primers and/or probes;
  • the primers are primers that detect the methylation level/status of a site using methylation sequencing or PCR primers that amplify one or more methylation sites.
  • the reagents include bisulfite and its derivatives, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or insensitive restriction endonucleases, enzyme digestion buffer, fluorescent dyes, fluorescent quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the controls being the aforementioned specific methylation markers from a normal subject or from a patient with a cancer other than pancreatic cancer.
  • the non-pancreatic cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer and/or lung cancer.
  • the invention provides isolated nucleic acids that are one or more specific methylation markers.
  • the isolated nucleic acid is a pancreatic cancer tissue-specific methylation marker.
  • the pancreatic cancer tissue-specific methylation marker is the following region or a site thereof, the region being the following gene and the 2.5 kb upstream region and 2.5 kb downstream region of the gene in the chromosome where it is located: gene TNFRSF14; gene PGM1; gene CELF3; gene ATP2B4; gene SF3B6; gene CNNM4; gene SP9; gene C2orf82; gene NEU4; gene RPL35A; gene HGFAC; gene EXOC3; gene GDNF; gene NEUROG1; gene HIST1H2BA; gene OSTM1; gene CCR6; gene CCAR2; gene TNFRSF10D; gene TJP2; gene DAB2IP; gene NTMT1; gene MKI67; gene PTGDR2; gene CCDC77; gene MY
  • the loci are 130bp-530bp in length. In one embodiment, the loci are 150bp-480bp in length. In one embodiment, the methylation marker comprises the nucleotide sequence shown in any one or more of the following or its complement or variant sequence: SEQ ID NO: 68, 88, 154, 163, 172, 177 and 188-217.
  • the isolated nucleic acid is isolated from the sample. In one embodiment, the sample is cells, tissue, fine needle aspiration biopsy, or plasma. In one embodiment, the isolated nucleic acid is obtained from a pancreatic cancer patient. For example, isolated nucleic acids are obtained from cell-free DNA in plasma.
  • the variant comprises a sequence that is at least 70% identical to the sequence of either gene.
  • a variant includes a sequence that is at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical.
  • the region is the gene and the 2.5 kb upstream and 2.5 kb downstream regions of the gene in the chromosome in which it is located.
  • the upstream region is the region 2.4kb, 2.3kb, 2.2kb, 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp upstream of the gene.
  • the downstream region is the region 2.4kb, 2.3kb, 2.2kb, 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp downstream of the gene.
  • the length of the sites may vary.
  • the site may be 130 bp-530 bp in length.
  • the site may be 150bp-480bp in length.
  • the length of the site can be 140bp, 150bp, 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp, 500bp, 510bp or 520bp.
  • the variant is a variant sequence with at least 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity.
  • the present invention provides methods that (1) differentiate pancreatic cancer patients from patients with non-pancreatic cancers, (2) diagnose or assist in the diagnosis of pancreatic cancer, or (3) trace the tissue origin of pancreatic cancer during pan-cancer screening, the methods including measuring the methylation level of one or more methylation markers described herein in the genomic DNA of the sample.
  • the method is performed using the pancreatic cancer prediction model of the invention.
  • the present invention provides new methylation markers, which can be used to trace the tissue origin of pancreatic cancer in the early screening process of pan-cancer species, so as to achieve the purpose of better distinguishing pancreatic cancer;
  • the methylation marker of the present invention can detect pancreatic cancer with high sensitivity and specificity.
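  • For readers less familiar with the two performance terms, the sketch below shows how sensitivity and specificity are computed from true and predicted labels (1 for the target cancer, 0 for other cancer types). The function name and the labels are made-up illustrations and do not reproduce any result from the source.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels,
    where 1 marks the target cancer and 0 marks other cancer types."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 3 target-cancer samples and 5 other-cancer samples
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```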
  • Figure 1 Methylation levels of selected colorectal cancer-specific markers in the training set.
  • Figure 2 Methylation levels of selected colorectal cancer-specific markers in the test set.
  • Figure 3 Methylation levels of colorectal cancer (also called intestinal cancer in the attached figure)-specific Seq ID NO:52 in various cancer types in the training set.
  • Figure 4 Methylation levels of colorectal cancer-specific Seq ID NO:52 in various cancer types in the test set.
  • Figure 5 AllModel model score distribution for colorectal cancer and other cancer types in the training set and test set.
  • Figure 6 ROC curve of AllModel in the training set and test set.
  • Figure 7 Scores of colorectal cancer-specific marker panel 1 model.
  • Figure 8 ROC curve of colorectal cancer-specific marker combination 1 model.
  • Figure 9 Colorectal cancer-specific marker panel 2 model scores.
  • Figure 10 Colorectal cancer specific marker combination 2 model ROC curve.
  • Figure 11 Methylation levels of selected lung cancer tissue-specific methylation markers in the training set.
  • Figure 12 Methylation levels of selected lung cancer tissue-specific methylation markers in the test set.
  • Figure 13 Methylation levels of lung cancer tissue-specific methylation marker Seq ID NO:91 in various cancer types in the training set.
  • Figure 14 Methylation levels of lung cancer tissue-specific methylation marker Seq ID NO:91 in various cancer types in the test set.
  • Figure 15 Distribution of model scores for lung cancer and other cancer types in the training set and test set for all lung cancer tissue-specific methylation markers.
  • Figure 16 ROC curves of all lung cancer tissue-specific methylation markers in the training set and test set.
  • Figure 17 Scores of lung cancer tissue-specific methylation marker combination 1 model.
  • Figure 18 ROC curve of lung cancer tissue-specific methylation marker combination 1 model.
  • Figure 19 Lung cancer tissue-specific methylation marker combination 2 model scores.
  • Figure 20 ROC curve of lung cancer tissue-specific methylation marker combination 2 model.
  • Figure 21 Methylation levels of liver cancer methylation markers in the training set.
  • Figure 22 Methylation levels of liver cancer methylation markers in the test set.
  • Figure 23 Methylation levels of liver cancer methylation marker Seq ID NO:137 in various cancer types in the training set.
  • Figure 24 Methylation levels of liver cancer methylation marker Seq ID NO:137 in various cancer types in the test set.
  • Figure 25 Distribution of model scores for liver cancer and other cancer types in the training set and test set for all liver cancer markers.
  • Figure 26 ROC curves of all liver cancer methylation markers in the training set and test set.
  • Figure 27 Liver cancer methylation marker combination 1 model score.
  • Figure 28 ROC curve of liver cancer methylation marker combination 1 model.
  • Figure 29 Liver cancer methylation marker combination 2 model scores.
  • Figure 30 ROC curve of liver cancer methylation marker combination 2 model.
  • Figure 31 Methylation levels of selected breast cancer methylation markers in the training set.
  • Figure 32 Methylation levels of selected breast cancer methylation markers in the test set.
  • Figure 33 Methylation levels of breast cancer methylation marker Seq ID NO:21 in various cancer types in the training set.
  • Figure 34 Methylation levels of breast cancer methylation marker Seq ID NO:21 in various cancer types in the test set.
  • Figure 35 Model score distribution of breast cancer and other cancer types in the training set and test set for all breast cancer methylation markers.
  • Figure 36 ROC curves of all breast cancer methylation markers in the training set and test set.
  • Figure 37 Breast cancer methylation marker panel 1 model scores.
  • Figure 38 ROC curve of breast cancer methylation marker combination 1 model.
  • Figure 39 Breast cancer methylation marker combination 2 model scores.
  • Figure 40 ROC curve of breast cancer methylation marker combination 2 model.
  • Figure 41 Methylation levels of selected gastric cancer and/or esophageal cancer tissue-specific methylation markers in the training set.
  • Figure 42 Methylation levels of selected gastric cancer and/or esophageal cancer tissue-specific methylation markers in the test set.
  • Figure 43 Methylation levels of gastric cancer and/or esophageal cancer tissue-specific methylation marker Seq ID NO:172 in various cancer types in the training set.
  • Figure 44 Methylation levels of gastric cancer and/or esophageal cancer tissue-specific methylation marker Seq ID NO:172 in various cancer types in the test set.
  • Figure 45 Distribution of model scores for all gastric cancer and/or esophageal cancer tissue-specific methylation markers in the training set and test set for gastric cancer and/or esophageal cancer and other cancer types.
  • Figure 46 ROC curves of all gastric and/or esophageal cancer tissue-specific methylation markers in the training and test sets.
  • Figure 47 Scores of gastric cancer and/or esophageal cancer tissue-specific methylation marker combination 1 model.
  • Figure 48 ROC curve of gastric cancer and/or esophageal cancer tissue-specific methylation marker combination 1 model.
  • Figure 49 Gastric and/or esophageal cancer tissue-specific methylation marker combination 2 model scores.
  • Figure 50 ROC curve of gastric cancer and/or esophageal cancer tissue-specific methylation marker combination 2 model.
  • Figure 51 Methylation levels of pancreatic cancer markers in the training set.
  • Figure 52 Methylation levels of pancreatic cancer markers in the test set.
  • Figure 53 Methylation levels of pancreatic cancer marker Seq ID NO:202 in various cancer types in the training set.
  • Figure 54 Methylation levels of pancreatic cancer marker Seq ID NO:202 in various cancer types in the test set.
  • Figure 55 Distribution of model scores for pancreatic cancer and other cancer types in the training set and test set for all pancreatic cancer markers.
  • Figure 56 ROC curves of all pancreatic cancer markers in the training set and test set.
  • Figure 57 Pancreatic cancer marker panel 1 model scores.
  • Figure 58 ROC curve of pancreatic cancer marker combination 1 model.
  • Figure 59 Pancreatic cancer marker panel 2 model scores.
  • Figure 60 Pancreatic cancer marker combination 2 model ROC curve.
  • the inventors screened out colorectal cancer tissue-specific methylation markers from a large amount of NGS methylation sequencing data of 7 cancer types and achieved good tissue-tracing results in the relevant validation data, which provides important technical support for tracing the tissue origin of colorectal cancer during early pan-cancer screening.
  • the inventors screened out lung cancer tissue-specific methylation markers from a large amount of NGS methylation sequencing data of 7 cancer types and achieved good tissue-tracing results in the relevant validation data, which provides important technical support for tracing the tissue origin of lung cancer during early pan-cancer screening.
  • the present invention screens liver cancer tissue-specific methylation markers from a large amount of NGS methylation sequencing data of 7 cancer types and achieves good tissue-tracing results in the relevant validation data, which provides important technical support for tracing the tissue origin of liver cancer during early pan-cancer screening.
  • the present invention has screened out breast cancer tissue-specific methylation markers from a large amount of NGS methylation sequencing data of 7 cancer types and achieves good tissue-tracing results in the relevant validation data, which provides important technical support for tracing the tissue origin of breast cancer during early pan-cancer screening.
  • the inventors screened gastric cancer and/or esophageal cancer tissue-specific methylation markers from a large amount of NGS methylation sequencing data of 7 cancer types and achieved good tissue traceability in the relevant verification data, which provides important technical support for the tissue traceability of gastric cancer and/or esophageal cancer during pan-cancer early screening.
  • the inventors found that gastric cancer and/or esophageal cancer are associated with methylation levels in the following gene regions: SEQ ID No. 23, 72, 143, 150, 152, 157 and 160-187.
  • the present invention has screened pancreatic cancer tissue-specific methylation markers from a large amount of NGS methylation sequencing data of 7 cancer types and can achieve good tissue traceability in the relevant verification data, which provides important technical support for the tissue traceability of pancreatic cancer during pan-cancer early screening.
  • Machine learning modeling is the process of finding the most appropriate representation of the input data features so that a specific problem, such as a classification problem, can be solved.
  • the modeled data has better discriminating power than any single input data feature.
  • This document shows the best model and the classification performance of each marker in that model. The discriminating power of a model built from any combination of features lies between that of the best model and that of a single feature. As shown herein, each individual marker has discriminating power, and the results of classification with randomly selected markers are also shown in the embodiments of this patent application. Therefore, this patent application protects all marker combination models.
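  • As a hedged illustration of the statement that a combined model can discriminate better than any single marker, the following minimal Python sketch compares the area under the ROC curve (AUC) of two single markers with that of a combined score. The toy methylation values, the function names `auc` and `combined_score`, and the equal-weight average used as a stand-in for a trained model are assumptions made for illustration only; the model type is not specified in this passage.

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    sample scores higher than a randomly chosen negative sample
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def combined_score(marker_values):
    """Assumed stand-in for a trained model: equal-weight average."""
    return sum(marker_values) / len(marker_values)


# Hypothetical toy data: 1 = target cancer type, 0 = other cancer type.
labels = [1, 1, 0, 0]
marker_a = [0.8, 0.4, 0.5, 0.1]   # methylation level of marker A per sample
marker_b = [0.4, 0.8, 0.1, 0.5]   # methylation level of marker B per sample
panel = [combined_score(pair) for pair in zip(marker_a, marker_b)]

print("AUC marker A:", auc(marker_a, labels))
print("AUC marker B:", auc(marker_b, labels))
print("AUC combined:", auc(panel, labels))
```

  • In this toy data each marker misranks one positive/negative pair, while the averaged score ranks every pair correctly. Real marker panels would be fitted and evaluated on separate training and test sets.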
  • colorectal cancer is related to the methylation level of the following gene region (SEQ ID No. 52-90): Chromosome 1, No. 27189993-27190207; Chromosome 1, No. 1
  • Chromosome No. 43331809-43332099. The physical location of the methylation marker was determined with reference to the human genome sequence hg19.
  • lung cancer is related to the methylation level of the following gene regions or their upstream and downstream regions: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW1; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1; gene SPAG6; gene WAPAL; gene BTBD16; gene DPYSL4; gene TTC40; gene ADAM8; gene SLC22A11; gene CPT1A; gene B4GALNT1; gene FBRSL1; gene XPO4; gene TFDP1; gene GCH1; gene TMEM179; gene ITPKA;
  • liver cancer is related to the methylation level of the following gene regions or their upstream and downstream regions: TAL1 gene; TRIM58 gene; LBH gene; ABCG5 gene; PAX8 gene; DLEC1 gene; AMIGO3 gene; RASSF1 gene; CLDN11 gene; SLC2A9 gene; SLC9A3 gene; CXXC5 gene; FOXC1 gene; HIST1H4F gene; TRIM40 gene; HOXA13 gene; CRHR2 gene; AGPAT6 gene; TCF24 gene; OPLAH gene; GPAM gene; ADAM8 gene; GRASP gene; B4GALNT1 gene; STX2 gene; ATL1 gene; ITPKA gene; PIF1 gene; ZFHX3 gene; C1QL1 gene; SEPT-9 gene; KCTD1 gene; PIP5K1C gene; RASAL3 gene; CYP2F1 gene; or WISP2 gene.
  • breast cancer is related to the methylation level of the following gene regions or their upstream and downstream regions: gene BARHL2; gene ALX3; gene TBX15; gene C2CD4D; gene RYR2; gene LBH; gene SIX3; gene SIX2; gene OTX1; gene EMX1; gene LBX2; gene BCL2L11; gene PAX8; gene HOXD1; gene SATB2; gene VILL; gene CLDN11; gene EPHB3; gene NKX3-2; gene KCTD8; gene PITX1; gene CXXC5; gene FOXC1; gene NRN1; gene HOXA9; gene DLX6; gene MOS; gene TCF24; gene CA3; gene GDF6; gene FOXD4; gene PTF1A; gene TLX1; gene INA; gene NKX6-2; gene PAX6; gene BCAT1; gene FAIM2; gene GRASP; gene CCNA1; gene SIX1; gene PR
  • pancreatic cancer is related to the methylation level of the following gene regions or their upstream and downstream regions: gene TNFRSF14; gene PGM1; gene CELF3; gene ATP2B4; gene SF3B6; gene CNNM4; gene SP9; gene C2orf82; gene NEU4; gene RPL35A; gene HGFAC; gene EXOC3; gene GDNF; gene NEUROG1; gene HIST1H2BA; gene OSTM1; gene CCR6; gene CCAR2; gene TNFRSF10D; gene TJP2; gene DAB2IP; gene NTMT1; gene MKI67; gene PTGDR2; gene CCDC77; gene MYL2; gene FRY; gene SMEK1; gene BTBD6; gene PIF1; gene SRL; gene SPNS1; gene DNM2; gene ZNF569; gene SDF2L1.
  • DNA methylation is a mechanism of epigenetic inheritance. It is a common epigenetic modification of eukaryotic cell genomes and can change genetic expression without changing the DNA sequence.
  • DNA methylation refers to the covalent attachment of a methyl group to the carbon-5 position of cytosine in genomic CpG dinucleotides under the action of DNA methyltransferase.
  • DNA methylation plays an important role in cell proliferation and differentiation and is closely related to the occurrence and development of tumors. Its effects include transcription inhibition, chromatin structure regulation, X chromosome inactivation, genomic imprinting, etc.
  • Abnormal DNA methylation can participate in the occurrence and progression of tumors by affecting chromatin structure and the expression of oncogenes and tumor suppressor genes.
  • primer refers to a nucleic acid molecule with a specific nucleotide sequence that guides synthesis at the initiation of nucleotide polymerization. Primers are usually two artificially synthesized oligonucleotide sequences: one primer is complementary to one DNA template strand at one end of the target region, and the other primer is complementary to the other DNA template strand at the other end of the target region. Their function is to serve as the starting point of nucleotide polymerization. Primers designed artificially in vitro are widely used in polymerase chain reaction (PCR), qPCR, sequencing and probe synthesis.
  • primers are designed to amplify products 50-150 bp, 60-140 bp, 70-130 bp, or 80-120 bp in length.
  • the primers contained in the reagents herein can be primers for genome sequencing, such as whole-genome sequencing primers or sequencing primers targeting a certain region of the genome. They can also be PCR primers used to amplify a specific region or to amplify one or more methylation sites.
  • the primers can be whole-genome sequencing primers, which can obtain many amplification products, and these amplification products can include the region or include the region after splicing.
  • the methylation status of each methylation site (CpG) in the region is obtained after sequencing, thereby obtaining the methylation level of the entire region.
  • the primers are complementary or substantially complementary to the gene or region of interest.
  • variant refers to a polynucleotide that has a nucleic acid sequence that is altered by the insertion, deletion, or substitution of one or more nucleotides compared to a reference sequence, while retaining its ability to hybridize to other nucleic acids.
  • Variants according to any embodiment herein have at least 70%, preferably at least 80%, preferably at least 85%, preferably at least 90%, preferably at least 95%, preferably at least 97% sequence identity to a reference sequence or reference gene, and retain the nucleotide sequence of the methylation sites of the reference sequence or reference gene. Sequence identity between two aligned sequences can be calculated using, for example, NCBI's BLASTn.
  • Variants also include nucleotide sequences that have one or more mutations (insertions, deletions, or substitutions) in the nucleotide sequence of the reference sequence while still retaining the methylation sites of the reference sequence. Multiple mutations usually means 1-10 mutations, such as 1-8, 1-5 or 1-3.
  • the substitution may be between purine nucleotides and pyrimidine nucleotides, or between purine nucleotides or between pyrimidine nucleotides. Substitutions are preferably conservative substitutions. For example, in the art, conservative substitutions with nucleotides of similar properties generally do not alter the stability and function of the polynucleotide.
  • Conservative substitutions include the exchange of purine nucleotides (A and G) and the exchange of pyrimidine nucleotides (T or U and C). Therefore, substitution of one or more positions in a polynucleotide of the invention with residues of the same class will not substantially affect its activity. Furthermore, the methylation sites described herein contained in the variants of the invention are not mutated. That is, the method of the present invention detects the methylation status of methylation sites in the corresponding sequence, and mutations can occur in bases other than these sites.
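  • The two conditions on variants described above (conservative purine/purine or pyrimidine/pyrimidine substitutions, and unmutated methylation sites) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, only substitution-only variants of the same length as the reference are handled, and every CpG dinucleotide of the reference is treated as a methylation site.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def is_conservative(base_a, base_b):
    """True when both bases are purines or both are pyrimidines."""
    pair = {base_a.upper(), base_b.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def cpg_positions(seq):
    """0-based indices of the C in every CpG dinucleotide of seq."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i:i + 2] == "CG"]


def retains_methylation_sites(reference, variant):
    """Check that every CpG of the reference is still a CpG in the variant.

    Assumption: substitution-only variant, so positions line up one to one.
    Insertions and deletions would require an alignment step first.
    """
    if len(reference) != len(variant):
        raise ValueError("this sketch only handles substitution-only variants")
    v = variant.upper()
    return all(v[i:i + 2] == "CG" for i in cpg_positions(reference))


reference = "ACGTTCGA"
print(cpg_positions(reference))                          # CpG start indices
print(retains_methylation_sites(reference, "ACGATCGA"))  # CpGs untouched
print(retains_methylation_sites(reference, "ATGTTCGA"))  # first CpG mutated
```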
  • biological sample generally refers to a sample obtained or derived from a biological source of interest, such as a tissue or organism or cell culture.
  • the organism from which the sample is derived is an animal or human, preferably human.
  • the sample is or includes biological tissue or fluid.
  • a biological sample may be or include cells, tissue, or body fluids.
  • the biological sample can be or include blood, blood cells, cell-free DNA, free floating nucleic acids, ascites fluid, biopsy samples, surgical samples, cell-containing body fluids, sputum, saliva, feces, urine, cerebrospinal fluid, peritoneal fluid, pleural fluid, lymph fluid, gynecological fluid, secretions, excretions, skin swabs, vaginal swabs, oral swabs, nasal swabs, washes such as catheter wash or bronchoalveolar wash, aspirates, scrapings, films, etc.
  • a biological sample is or includes cells obtained from a single subject or from multiple subjects.
  • a sample may be a "primary sample" obtained directly from a biological source, or it may be a "processed sample".
  • the term "cancer” is used to refer to a disease or condition in which cells exhibit abnormal, uncontrolled, and/or autonomous growth such that they exhibit an abnormally elevated proliferation rate and/or an abnormal growth phenotype.
  • the cancer of interest may be colorectal cancer.
  • the cancer of interest may be lung cancer.
  • the cancer of interest may be liver cancer.
  • the cancer of interest may be breast cancer.
  • the cancer of interest may be gastric cancer and/or esophageal cancer.
  • the cancer of interest may be pancreatic cancer.
  • diagnosis refers to determining, as a quantitative and/or qualitative probability, whether a subject has or is at risk of developing cancer.
  • the diagnosis may include a determination regarding the risk, type, stage, malignancy, etc. of the cancer.
  • the methylation marker may be or include a locus (e.g., one or more methylation loci) and/or the state of the locus (e.g., the status of one or more methylated loci).
  • the marker may be or include a marker for a specific disease, or may be a marker for a quantitative probability that a specific disease will develop, occur, or recur in a subject.
  • the methylation marker of the present invention can be a marker for the prediction, prognosis and/or diagnosis of one of colorectal cancer, lung cancer, liver cancer, breast cancer, gastric cancer and/or esophageal cancer, and pancreatic cancer.
  • DNA region refers to any contiguous portion of a larger DNA molecule.
  • DNA regions refer to the gene of interest and the regions upstream and downstream of it.
  • Upstream of a gene or region refers to the region relative to the 5' end of the gene or region.
  • Downstream of a gene or region refers to the region relative to the 3' end of the gene or region.
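  • Because upstream and downstream are defined relative to the 5' and 3' ends of the gene, the genomic coordinates of the flanking regions depend on the strand on which the gene lies. The following Python sketch illustrates this under stated assumptions: coordinates are 1-based and inclusive, the function name `flanking_regions` and the example gene coordinates are hypothetical, and the chromosome end is not checked.

```python
def flanking_regions(start, end, strand, flank):
    """Return the upstream and downstream intervals of a gene.

    start, end : 1-based inclusive gene coordinates (start <= end)
    strand     : "+" or "-"
    flank      : flank length in bp (for example 2000 for 2 kb)
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if start > end:
        raise ValueError("start must not exceed end")
    # Interval at lower coordinates, clamped at the chromosome start.
    left = (max(1, start - flank), start - 1)
    # Interval at higher coordinates (chromosome length not checked here).
    right = (end + 1, end + flank)
    if strand == "+":
        # 5' end is at the lower coordinate on the plus strand.
        return {"upstream": left, "downstream": right}
    # On the minus strand the 5' end is at the higher coordinate.
    return {"upstream": right, "downstream": left}


# Hypothetical gene spanning positions 10000-12000 with 2 kb flanks.
print(flanking_regions(10000, 12000, "+", 2000))
print(flanking_regions(10000, 12000, "-", 2000))
```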
  • identity refers to the overall relatedness between nucleic acid molecules (e.g., DNA molecules and/or RNA molecules). Methods for calculating percent identity between two provided sequences are known in the art. For example, the percent identity of two nucleic acids can be calculated as follows: the two sequences are aligned for optimal comparison purposes (e.g., gaps can be introduced in one or both of the first and second sequences for optimal alignment).
  • the nucleotides at the corresponding positions are then compared; when a position in the first sequence is occupied by the same residue (e.g., a nucleotide or amino acid) as the corresponding position in the second sequence, the molecules are identical at that position.
  • the percent identity between two sequences is a function of the number of identical positions shared by the sequences (taking into account the number of gaps introduced for optimal alignment and the length of each gap). Comparison of sequences and determination of percent identity between two sequences can be accomplished using computational algorithms such as BLAST (Basic Local Alignment Search Tool).
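  • The calculation described above can be sketched in Python for two sequences that have already been aligned. This is a minimal illustration, not the BLAST algorithm: the function name `percent_identity` is hypothetical, gaps are written as "-", and the denominator is taken to be the alignment length, which is one common convention among several.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two pre-aligned sequences of equal length.

    Identical, non-gap positions are counted and divided by the alignment
    length, so every gap column lowers the result.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


# One gap introduced in the first sequence for optimal alignment.
print(round(percent_identity("ACGT-A", "ACGTTA"), 2))
```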
  • methylation includes (i) methylation of the C5 position of cytosine; (ii) methylation of the N4 position of cytosine; (iii) methylation of the N6 position of adenine; and (iv) other types of nucleotide methylation.
  • Methylated nucleotides may be referred to as “methylated nucleotides” or “methylated nucleotide bases.”
  • methylation as described herein specifically refers to methylation of cytosine residues. In some cases, methylation refers to methylation of cytosine residues present in CpG sites.
  • methylation analysis refers to any technique that can be used to determine the methylation status or level of a methylated site.
  • methylation marker refers to a marker of at least one methylation site and/or the methylation status of at least one methylation site (eg, a hypermethylation site).
  • a methylation marker is characterized by a change in the methylation state of one or more nucleic acid sites between a first state and a second state (eg, between a cancerous state and a non-cancerous state).
  • methylation status refers to the amount, frequency, or pattern of methylation at a methylation site within a methylation locus. Accordingly, a change in methylation status between the first state and the second state may be or include an increase in the number, frequency or pattern of methylation sites, or may be or include a decrease in the number, frequency or pattern of methylation sites. In various cases, a change in methylation status is a change in methylation value. In this document, methylation status can be expressed as methylation haplotype frequency.
  • methylation value refers to a numerical representation of methylation status, for example, in the form of a number representing the frequency or ratio of methylation at a methylated locus.
  • methylation values can be generated by a method that includes quantifying the amount of intact nucleic acid present in the sample after restriction digestion of the sample with a methylation-dependent restriction enzyme.
  • methylation values may be generated by methods involving comparison of amplification profiles following bisulfite reaction of samples.
  • methylation values can be generated by comparing the sequences of bisulfite-treated and untreated nucleic acids.
  • the methylation value is a quantitative PCR result, includes a quantitative PCR result or is based on a quantitative PCR result.
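  • Comparing bisulfite-treated and untreated sequences works because bisulfite treatment converts unmethylated cytosine so that it is read as T after amplification, while methylated cytosine remains C. The following Python sketch illustrates this comparison under stated assumptions: the function name `call_cpg_methylation` is hypothetical, the treated read is on the same strand as the untreated reference and aligned to it without gaps, and conversion is assumed to be complete.

```python
def call_cpg_methylation(reference, converted_read):
    """Call the methylation state of each CpG cytosine.

    reference      : untreated sequence
    converted_read : bisulfite-treated read, gapless alignment to reference
    Returns a dict mapping the 0-based index of each CpG cytosine to
    True (methylated, still C) or False (unmethylated, read as T).
    Positions showing any other base are left uncalled.
    """
    if len(reference) != len(converted_read):
        raise ValueError("sequences must be aligned to the same length")
    ref = reference.upper()
    read = converted_read.upper()
    calls = {}
    for i in range(len(ref) - 1):
        if ref[i:i + 2] == "CG":
            if read[i] == "C":
                calls[i] = True
            elif read[i] == "T":
                calls[i] = False
    return calls


# The CpG at index 1 stays C (methylated); the CpG at index 4 reads as T.
print(call_cpg_methylation("ACGTCGA", "ACGTTGA"))
```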
  • methylation level represents the proportion of one or more sites that are in a methylated state.
  • the methylation level of a region (or a group of sites) is the average of the methylation levels of all sites in the region (or of all sites in the group). Therefore, an increase or decrease in the methylation level of a region does not mean an increase or decrease in the methylation level at every methylation site in the region.
  • the process of converting results from methods of detecting DNA methylation (e.g., reduced representation bisulfite sequencing) into methylation levels is known in the art.
  • the software Bismark (v0.17.0) can be used to obtain the methylation level of CpG sites.
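  • The relationship between site-level and region-level methylation described above can be sketched in Python. This is a minimal illustration with hypothetical function names and made-up read counts; it does not reproduce the output format of Bismark or any other tool.

```python
def site_level(methylated_reads, total_reads):
    """Methylation level of one CpG site: methylated reads / covering reads."""
    if total_reads <= 0:
        raise ValueError("site has no read coverage")
    return methylated_reads / total_reads


def region_level(site_levels):
    """Methylation level of a region: mean of its site levels."""
    if not site_levels:
        raise ValueError("region contains no sites")
    return sum(site_levels) / len(site_levels)


# Hypothetical region with two CpG sites, measured in two samples.
before = [site_level(2, 10), site_level(6, 10)]   # site levels 0.2 and 0.6
after = [site_level(8, 10), site_level(5, 10)]    # site levels 0.8 and 0.5

print("region level before:", region_level(before))
print("region level after: ", region_level(after))
```

  • In this example the region level rises even though the second site drops from 0.6 to 0.5, which illustrates why a change at the region level does not imply the same change at every site.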
  • Methods for detecting DNA methylation are known in the art, including but not limited to bisulfite conversion-based PCR (e.g., methylation-specific PCR (MSP)), DNA sequencing (e.g., bisulfite sequencing (BS), whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS)), methylation-sensitive restriction endonuclease analysis (Methylation-Sensitive Dependent Restriction Enzymes), fluorescence quantification, methylation-sensitive high-resolution melting (MS-HRM), chip-based methylation profiling, or mass spectrometry (e.g.
  • detecting includes detecting either strand at a gene or locus.
  • DNA methylation can also be detected using reduced representation bisulfite sequencing (RRBS).
  • Reduced representation bisulfite sequencing is a technology that digests the genome with restriction endonucleases and then applies bisulfite treatment to the CpG regions of the genome.
  • reagents used for reduced representation bisulfite sequencing include: plasma nucleic acid purification kit, ligase, bisulfite and its derivatives, dNTPs, polymerase, primers, nuclease-free water and/or magnetic beads, etc.
  • the "specificity" of a marker refers to the percentage of samples characterized by the absence of an event or condition of interest in which the measurement of the marker accurately indicates the absence of that event or condition (true negative rate).
  • characterization of negative samples is independent of the marker and can be accomplished by any relevant measurement, such as any relevant measurement known to those skilled in the art. Specificity therefore reflects the probability that a marker will detect the absence of an event or state of interest when measured in a sample not characterized by that event or state.
  • where the event or state of interest is colorectal cancer, specificity refers to the probability that the marker will detect the absence of colorectal cancer in a subject lacking colorectal cancer.
  • the absence of colorectal cancer can be determined, for example, by histology.
  • specificity refers to the probability that the marker will detect the absence of lung cancer in a subject lacking lung cancer.
  • the absence of lung cancer can be determined, for example, by histology.
  • specificity refers to the probability that the marker will detect the absence of liver cancer in a subject lacking liver cancer.
  • the absence of liver cancer can be determined, for example, by histology.
  • specificity refers to the probability that a marker will detect the absence of breast cancer in a subject lacking breast cancer.
  • the absence of breast cancer can be determined, for example, by histology.
  • specificity refers to the probability that the marker will detect the absence of gastric cancer and/or esophageal cancer in a subject lacking gastric cancer and/or esophageal cancer.
  • the absence of gastric and/or esophageal cancer can be determined, for example, by histology.
  • specificity refers to the probability that the marker will detect the absence of pancreatic cancer in a subject lacking pancreatic cancer. The absence of pancreatic cancer can be determined, for example, by histology.
  • sensitivity of a marker refers to the percentage of samples characterized by the presence of an event or condition of interest in which the measurement of the marker accurately indicates the presence of that event or condition (true positive rate).
  • characterization of positive samples is independent of markers and can be accomplished by any relevant measurement, such as any relevant measurement known to those skilled in the art.
  • sensitivity reflects the probability that a marker will detect the presence of an event or state of interest when measured in a sample characterized by the presence of the event or state of interest.
  • sensitivity refers to the probability that the marker will detect the presence of colorectal cancer in a subject with colorectal cancer.
  • the presence of colorectal cancer can be determined, for example, by histology.
  • sensitivity refers to the probability that a marker will detect the presence of lung cancer in a subject with lung cancer.
  • the presence of lung cancer can be determined, for example, by histology.
  • sensitivity refers to the probability that the marker will detect the presence of liver cancer in a subject with liver cancer.
  • the presence of liver cancer can be determined, for example, by histology.
  • sensitivity refers to the probability that a marker will detect the presence of breast cancer in a subject with breast cancer.
  • the presence of breast cancer can be determined, for example, by histology.
  • sensitivity refers to the probability that the marker will detect the presence of gastric cancer and/or esophageal cancer in a subject with gastric cancer and/or esophageal cancer.
  • the presence of gastric and/or esophageal cancer can be determined, for example, by histology.
  • sensitivity refers to the probability that the marker will detect the presence of pancreatic cancer in a subject with pancreatic cancer.
  • the presence of pancreatic cancer can be determined, for example, by histology.
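  • The definitions of specificity (true negative rate) and sensitivity (true positive rate) given above can be sketched in Python. This is a minimal illustration: the function names and the toy calls are hypothetical, `truth` stands for the marker-independent characterization of each sample (for example by histology), and `predicted` stands for the call made from the marker.

```python
def confusion_counts(truth, predicted):
    """Count true/false positives and negatives from paired boolean lists."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """Fraction of condition-positive samples that the marker detects."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of condition-negative samples that the marker calls negative."""
    return tn / (tn + fp)


# Hypothetical cohort: 3 subjects with the cancer of interest, 4 without.
truth = [True, True, True, False, False, False, False]
predicted = [True, True, False, False, False, False, True]

tp, fn, tn, fp = confusion_counts(truth, predicted)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
```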
  • the term "subject" refers to an organism, typically a mammal (eg, a human).
  • the subject has cancer.
  • the subject has colorectal cancer.
  • the subject has lung cancer.
  • the subject has liver cancer.
  • the subject has breast cancer.
  • the subject has gastric cancer and/or esophageal cancer.
  • the subject has pancreatic cancer.
  • the present invention provides isolated nucleic acids isolated from a sample of a subject.
  • isolated nucleic acids are isolated from cell-free DNA in the plasma of colorectal cancer patients.
  • the isolated nucleic acid is one or more specific methylation markers, preferably colorectal cancer tissue-specific methylation markers.
  • the methylation marker is one of the following regions or a site within that region, each region being one of the following genes together with the 2.3kb upstream region and 2.3kb downstream region of the gene in the chromosome where it is located: gene SFN; gene GPR3; gene FCGR1B; gene FAM150B; gene RGPD3; gene NUP210; gene LMOD3; gene FOXF2; gene TBXT; gene PRR15; gene ELN; gene TFPI2; gene REPIN1; gene PDLIM2; gene SDC2; gene TRAPPC9; gene TJP2; gene DIP2C; gene DDIT4; gene MRPL23; gene PAX6; gene PLXNC1; gene MLNR; gene MYO16; gene TMEM179; gene GATM; gene CACNA1H; gene NLRC5; gene SHISA6; gene KCNJ12; gene PRAC1; gene MYO15B; gene CANT1; gene SALL3; gene THOP1; gene ZBTB
  • This site is a methylated site.
  • genes in the genome can have mutations, so it is conceivable that variants of these genes can also be used as methylation markers, as long as the methylation sites in the variants are not mutated.
  • Variants may comprise sequences that are at least 70% identical to the sequence of any of the genes.
  • the site selected as a marker may contain 1 or more CpGs, such as 2 CpGs, 3 CpGs, 4 CpGs, 5 CpGs, 6 CpGs, 10 CpGs, 20 CpGs, or 30 CpGs. Suitable sites may be 150bp-500bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
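  • The CpG content and length criteria for a candidate site can be checked with a short Python sketch. This is an illustration under stated assumptions: the function names are hypothetical, the default bounds follow the 150bp-500bp range and the minimum of one CpG stated above, and the example sequence is synthetic.

```python
def count_cpg(seq):
    """Number of CpG dinucleotides in a sequence (case-insensitive)."""
    # "CG" cannot overlap with itself, so a plain substring count is exact.
    return seq.upper().count("CG")


def is_candidate_site(seq, min_len=150, max_len=500, min_cpg=1):
    """True when the sequence length and CpG count fall within the bounds."""
    return min_len <= len(seq) <= max_len and count_cpg(seq) >= min_cpg


# Synthetic 180 bp sequence containing 60 CpGs.
synthetic_site = "ACG" * 60
print(len(synthetic_site), count_cpg(synthetic_site))
print(is_candidate_site(synthetic_site))
print(is_candidate_site("ACGCGTTCGA"))   # 3 CpGs, but only 10 bp long
```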
  • a gene has the same or similar methylation level or status as its upstream and downstream regions. Therefore, when the present invention discovers a methylation site within a specific gene, it can be assumed that the gene and the 2.3kb upstream region and 2.3kb downstream region in the original chromosome also have the same or similar methylation level or status.
  • the invention covers the gene described in the invention and the 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp of the gene in the chromosome where it is located.
  • the present invention uses the following nucleotide sequences as methylation markers.
  • the coordinates of the chromosome position are determined with reference to the human genome sequence hg19. Based on the selected colorectal cancer tissue-specific methylation markers and the genes where they are located, those skilled in the art will understand that sites within the following can be used as methylation markers: within the gene SFN region or upstream or downstream of it; within the gene GPR3 region or upstream or downstream of it; within the gene FCGR1B region or upstream or downstream of it; within the gene FAM150B region or upstream or downstream of it; within the gene RGPD3 region or upstream or downstream of it; within the gene NUP210 region or upstream or downstream of it; within the gene LMOD3 region or upstream or downstream of it; within the gene FOXF2 region or upstream or downstream of it; within the gene TBXT region or upstream or downstream of it; within the gene PRR15 region or upstream or downstream of it; within the gene ELN region or upstream or downstream of it; within the gene TFPI2 region or upstream or downstream of it; located in the gene
  • Either a single methylation marker or a combination of multiple methylation markers can be used as colorectal cancer-specific methylation markers.
  • the methylation marker is within 2 kb upstream and 2 kb downstream of any of the genes described above.
  • the present invention provides isolated nucleic acids isolated from a sample of a subject.
  • isolated nucleic acids are isolated from cell-free DNA in the plasma of lung cancer patients.
  • the isolated nucleic acid is one or more specific methylation markers, preferably lung cancer tissue-specific methylation markers.
  • the methylation marker is one of the following regions or a site within that region, each region being one of the following genes together with the 2.2kb upstream region and 2.2kb downstream region of the gene in the chromosome where it is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW1; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1; gene SPAG6; gene WAPAL; gene BTBD16; gene DPYSL4; gene TTC40; gene ADAM8; gene SLC22A11; gene CPT1A; gene B4GALNT1; gene FBRSL1; gene X
  • This site is a methylated site.
  • genes in the genome may have mutations, so it is conceivable that variants of these genes can also be used as methylation markers, as long as the methylation sites in the variants are not mutated.
  • Variants may comprise sequences that are at least 70% identical to the sequence of any of the genes.
  • the site selected as a marker may contain 1 or more CpGs, such as 2 CpGs, 3 CpGs, 4 CpGs, 5 CpGs, 6 CpGs, 10 CpGs, 20 CpGs, or 30 CpGs. Suitable sites may be 150bp-500bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • a gene has the same or similar methylation level or status as its upstream and downstream regions. Therefore, when the inventor discovers a methylation site within a specific gene, it can be assumed that the gene and the 2.2kb upstream region and 2.2kb downstream region in the original chromosome also have the same or similar methylation level or status.
  • the invention covers the gene described in the invention and the 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp of the gene in the chromosome where it is located.
  • the present invention uses the following nucleotide sequences as methylation markers.
  • the coordinates of the chromosome position are determined with reference to the human genome sequence hg19. Based on the screened lung cancer tissue-specific methylation markers and the genes where they are located, those skilled in the art will understand that sites within the following can be used as methylation markers: within the gene ARHGEF16 or its upstream or downstream region; within the gene CASZ1 or its upstream or downstream region; within the gene MAP3K6 or its upstream or downstream region; within the gene TRIM58 or its upstream or downstream region; within the gene ARHGEF33 or its upstream or downstream region; within the gene PSD4 or its upstream or downstream region; within the gene HOXD4 or its upstream or downstream region; within the gene SLC12A8 or
  • the present invention provides isolated nucleic acids isolated from a sample of a subject.
  • isolated nucleic acids are isolated from cell-free DNA in the plasma of liver cancer patients.
  • the isolated nucleic acid is one or more specific methylation markers, preferably liver cancer tissue-specific methylation markers.
  • the methylation marker is one of the following regions or a site within that region, each region being one of the following genes together with the 3kb upstream region and 3kb downstream region of the gene in the chromosome where it is located: TAL1 gene; TRIM58 gene; LBH gene; ABCG5 gene; PAX8 gene; DLEC1 gene; AMIGO3 gene; RASSF1 gene; CLDN11 gene; SLC2A9 gene; SLC9A3 gene; CXXC5 gene; FOXC1 gene; HIST1H4F gene; TRIM40 gene; HOXA13 gene; CRHR2 gene; AGPAT6 gene; TCF24 gene; OPLAH gene; GPAM gene; ADAM8 gene; GRASP gene; B4GALNT1 gene; STX2 gene; ATL1 gene; ITPKA gene; PIF1 gene; ZFHX3 gene; C1QL1 gene; SEPT-9 gene; KCTD1 gene; PIP5K1C gene; RASAL3 gene; CYP2F1 gene
  • This site is a methylated site.
  • genes in the genome may have mutations, so it is conceivable that variants of these genes can also be used as methylation markers, as long as the methylation sites in the variants are not mutated.
  • Variants may comprise sequences that are at least 70% identical to the sequence of any of the genes.
  • the site selected as a marker may contain 1 or more CpGs, such as 2 CpGs, 3 CpGs, 4 CpGs, 5 CpGs, 6 CpGs, 10 CpGs, 20 CpGs, or 30 CpGs. Suitable sites may be 100bp-550bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • a gene has the same or similar methylation level or status as its upstream and downstream regions. Therefore, when the present invention discovers a methylation site within a specific gene, it can be assumed that the gene and the 3kb upstream region and 3kb downstream region in the original chromosome also have the same or similar methylation level or status.
  • the invention covers the gene described in the invention and the 2.9kb, 2.8kb, 2.7kb, 2.6kb, 2.5kb, 2.4kb, 2.3kb, 2.2kb, 2.1kb, 2kb, 1.9 of the gene in the chromosome where it is located.
  • the present invention uses the following nucleotide sequences as methylation markers.
  • the coordinates of the chromosome position are determined with reference to the human genome sequence hg19. Based on the screened liver cancer tissue-specific methylation markers and the genes where they are located, those skilled in the art will understand that sites within the following can be used as methylation markers: within the TAL1 gene or its upstream or downstream region; within the TRIM58 gene or its upstream or downstream region; within the LBH gene or its upstream or downstream region; within the ABCG5 gene or its upstream or downstream region; within the PAX8 gene or its upstream or downstream region; within the DLEC1 gene or its upstream or downstream region; within the AMIGO3 gene or its upstream or downstream region; within the RASSF1 gene or its upstream or downstream region; within the CLDN11 gene or its upstream or downstream region; within the SLC2A9 gene or its upstream or downstream region; within the SLC9A3 gene or its upstream or downstream region; within the CXXC5 gene
  • Either a single methylation marker or a combination of multiple methylation markers can be used as liver cancer-specific methylation markers.
  • the methylation marker is within 3 kb or 2 kb upstream and 3 kb or 2 kb downstream of any of the above genes.
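The flanking-region rule above (a gene plus its 3 kb or 2 kb upstream and downstream regions) can be sketched as a simple interval calculation. This is a minimal illustration, not the applicant's implementation: the `Region` type, the function names, and the coordinates are hypothetical and are not real hg19 positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def marker_window(gene: Region, flank_bp: int = 3000) -> Region:
    """Extend a gene interval by flank_bp on both sides, clamped at position 1."""
    return Region(gene.chrom, max(1, gene.start - flank_bp), gene.end + flank_bp)


def site_in_window(site: Region, window: Region) -> bool:
    """Return True if the site lies entirely inside the window."""
    return (
        site.chrom == window.chrom
        and site.start >= window.start
        and site.end <= window.end
    )


# Hypothetical coordinates, for illustration only.
gene = Region("chr1", 100_000, 120_000)
window = marker_window(gene, flank_bp=3000)
site = Region("chr1", 97_500, 97_700)
print(window, site_in_window(site, window))
```

With a 3 kb flank the example site falls inside the window; with a 2 kb flank it does not, which is why the flank size has to be stated alongside the marker.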
  • the present invention provides isolated nucleic acids isolated from a sample of a subject.
  • isolated nucleic acids are isolated from cell-free DNA in the plasma of breast cancer patients.
  • the isolated nucleic acid is one or more specific methylation markers, preferably breast cancer tissue-specific methylation markers.
  • the methylation marker is the following region or the site of the region, which is the following gene and the 2kb upstream region and 2kb downstream region of the gene in the chromosome where it is located: gene BARHL2; gene ALX3; gene TBX15; gene C2CD4D; gene RYR2; gene LBH; gene SIX3; gene SIX2; gene OTX1; gene EMX1; gene LBX2; gene BCL2L11; gene PAX8; gene HOXD1; gene SATB2; gene VILL; gene CLDN11; gene EPHB3; gene NKX3-2; gene KCTD8; gene PITX1; gene CXXC5; gene FOXC1; gene NRN1; gene HOXA9; gene DLX6; gene MOS; gene TCF24; gene CA3; gene GDF6; gene FOXD4; gene PTF1A; gene TLX1; gene INA; gene NKX6-2; gene PAX6; gene BCAT1
  • This site is a methylated site.
  • genes in the genome may have mutations, so it is conceivable that variants of these genes can also be used as methylation markers, as long as the methylation sites in the variants are not mutated.
  • Variants may comprise sequences that are at least 70% identical to the sequence of either gene.
  • the site selected as a marker may contain 1 or more CpGs, such as 2 CpGs, 3 CpGs, 4 CpGs, 5 CpGs, 6 CpGs, 10 CpGs, 20 CpGs, or 30 CpGs. Suitable sites may be 150bp-500bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • a gene has the same or similar methylation level or status as its upstream and downstream regions. Therefore, when the present invention discovers a methylation site within a specific gene, it can be assumed that the gene and the 2kb upstream region and 2kb downstream region in the original chromosome also have the same or similar methylation level or status.
  • the invention covers the gene described in the invention and the 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp of the gene in the chromosome where it is located.
  • the present invention uses the following nucleotide sequences as methylation markers.
  • the coordinates of the chromosome position are determined with reference to the human genome sequence hg19. Based on the screened breast cancer tissue-specific methylation markers and the genes where they are located, those skilled in the art will understand that sites within the following items can be used as methylation markers: gene BARHL2 and its upstream region or downstream region; gene ALX3 and its upstream region or downstream region; gene TBX15 and its upstream region or downstream region; gene C2CD4D and its upstream region or downstream region; gene RYR2 and its upstream region or downstream region; gene LBH and its upstream region or downstream region; gene SIX3 and its upstream region or downstream region; gene SIX2 and its upstream region or downstream region; gene OTX1 and its upstream region or downstream region; gene EMX1 and its upstream region or downstream region; gene LBX2 and its upstream region or downstream region; gene BCL2L11 and its upstream region or downstream region; gene PAX8 and its upstream region or downstream region; gene HOXD1 and
  • the present invention provides isolated nucleic acids isolated from a sample of a subject.
  • the isolated nucleic acid is isolated from cell-free DNA in the plasma of gastric cancer and/or esophageal cancer patients.
  • the isolated nucleic acid is one or more specific methylation markers, preferably gastric cancer and/or esophageal cancer tissue-specific methylation markers.
  • the methylation marker is the following region or the site of the region, which is the following gene and the 2kb upstream region and 2kb downstream region of the gene in the chromosome where it is located: gene TAL1; gene VAV3; gene PMF1; gene ATP2B4; gene SH3YL1; gene SLC9A3; gene CXXC5; gene PCDHGA11; gene FOXF2; gene ZNF273; gene KLRG2; gene CRB2; gene SEC16A; gene GPAM; gene ASCL2; gene PAX6; gene PTGDR2; gene PLEKHB1; gene TBX5; gene STX2; gene FBRSL1; gene ATP11A; gene BTBD6; gene CRIP2; gene ONECUT1; gene ZNF764; gene IGHV3OR16-17; gene SALL1; gene ACTG1; gene GATA6; gene KCTD1; gene CYP2F1; gene TPTE; gene CLDN5.
  • This site is a methylated site.
  • genes in the genome may have mutations, so it is conceivable that variants of these genes can also be used as methylation markers, as long as the methylation sites in the variants are not mutated.
  • Variants may comprise sequences that are at least 70% identical to the sequence of either gene.
  • the site selected as a marker may contain 1 or more CpGs, such as 2 CpGs, 3 CpGs, 4 CpGs, 5 CpGs, 6 CpGs, 10 CpGs, 20 CpGs, or 30 CpGs. Suitable sites may be 150bp-500bp in length.
  • the length of the site can be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.
  • a gene has the same or similar methylation level or status as its upstream and downstream regions. Therefore, when the present invention discovers a methylation site within a specific gene, it can be assumed that the gene and the 2kb upstream region and 2kb downstream region in the original chromosome also have the same or similar methylation level or status.
  • the invention covers the gene described in the invention and the 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp in the chromosome where it is located.
  • the present invention uses the following nucleotide sequences as methylation markers.
  • the coordinates of the chromosome position are determined with reference to the human genome sequence hg19. Based on the selected gastric cancer and/or esophageal cancer tissue-specific methylation markers and the genes where they are located, those skilled in the art will understand that sites within the following items can be used as methylation markers: within the gene TAL1 region or its upstream region and downstream region; within the gene VAV3 region or its upstream region and downstream region; within the gene PMF1 region or its upstream region and downstream region; within the gene ATP2B4 region or its upstream region and downstream region; within the gene SH3YL1 region or its upstream region and downstream region; within the gene SLC9A3 region or its upstream region and downstream region; within the gene CXXC5 region or its upstream region and downstream region; within the gene PCDHGA11 region or its upstream region and downstream region; within the gene FOXF2 region or its upstream region and downstream region; within the gene ZNF273 region or the upstream region and the downstream
  • a single methylation marker or a combination of multiple methylation markers can be used as a methylation marker specific for gastric cancer and/or esophageal cancer.
  • the methylation marker is within 2 kb upstream and 2 kb downstream of any of the genes described above.
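The site-selection criteria stated above (one or more CpGs per site, and a site length of 150bp-500bp) can be checked mechanically. The following is a minimal sketch under those stated criteria; the function names and the toy sequences are hypothetical and carry no biological meaning.

```python
def count_cpgs(seq: str) -> int:
    """Count CpG dinucleotides (a C immediately followed by a G) on the given strand."""
    s = seq.upper()
    return sum(1 for i in range(len(s) - 1) if s[i:i + 2] == "CG")


def is_candidate_site(seq: str, min_len: int = 150, max_len: int = 500,
                      min_cpgs: int = 1) -> bool:
    """Check the length range and the minimum CpG count for a candidate site."""
    return min_len <= len(seq) <= max_len and count_cpgs(seq) >= min_cpgs


# Toy sequences, for illustration only.
toy_site = "ACGT" * 40    # 160 bp, one CpG per repeat
too_short = "ACGT" * 10   # 40 bp
no_cpg = "AT" * 100       # 200 bp, no CpG

for name, seq in [("toy_site", toy_site), ("too_short", too_short), ("no_cpg", no_cpg)]:
    print(name, len(seq), count_cpgs(seq), is_candidate_site(seq))
```

For the pancreatic cancer markers, which are described with a 130bp-530bp range, the same check would be called with `min_len=130, max_len=530`.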
  • the present invention provides isolated nucleic acids isolated from a sample of a subject.
  • isolated nucleic acids are isolated from cell-free DNA in the plasma of pancreatic cancer patients.
  • the isolated nucleic acid is one or more specific methylation markers, preferably pancreatic cancer tissue-specific methylation markers.
  • the methylation marker is the following region or the site of the region, which is the following gene and the 2.5kb upstream region and 2.5kb downstream region of the gene in the chromosome where it is located: gene TNFRSF14; gene PGM1; gene CELF3; gene ATP2B4; gene SF3B6; gene CNNM4; gene SP9; gene C2orf82; gene NEU4; gene RPL35A; gene HGFAC; gene EXOC3; gene GDNF; gene NEUROG1; gene HIST1H2BA; gene OSTM1; gene CCR6; gene CCAR2; gene TNFRSF10D; gene TJP2; gene DAB2IP; gene NTMT1; gene MKI67; gene PTGDR2; gene CCDC77; gene MYL2; gene FRY; gene SMEK1; gene BTBD6; gene PIF1; gene SRL; gene SPNS1; gene DNM2; gene ZNF569; gene SDF2L1.
  • This site is a methylated site.
  • genes in the genome may have mutations, so it is conceivable that variants of these genes can also be used as methylation markers, as long as the methylation sites in the variants are not mutated.
  • Variants may comprise sequences that are at least 70% identical to the sequence of either gene.
  • the site selected as a marker may contain 1 or more CpGs, such as 2 CpGs, 3 CpGs, 4 CpGs, 5 CpGs, 6 CpGs, 10 CpGs, 20 CpGs, or 30 CpGs. Suitable sites may be 130bp-530bp in length.
  • the length of the site can be 140bp, 150bp, 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp, 500bp, 510bp or 520bp.
  • a gene has the same or similar methylation level or status as its upstream and downstream regions. Therefore, when the present invention discovers a methylation site within a specific gene, it can be assumed that the gene and the 2.5kb upstream region and 2.5kb downstream region in the original chromosome also have the same or similar methylation level or status.
  • the present invention covers the gene described in the invention and the chromosome in which the gene is located.
  • the present invention uses the following nucleotide sequences as methylation markers.
  • the coordinates of the chromosome position are determined with reference to the human genome sequence hg19. Based on the screened pancreatic cancer tissue-specific methylation markers and the genes where they are located, those skilled in the art will understand that the sites in the following items can be used as methylation markers: gene TNFRSF14 and its upstream or downstream region; gene PGM1 and its upstream or downstream region; gene CELF3 and its upstream or downstream region; gene ATP2B4 and its upstream region or downstream region; gene SF3B6 and its upstream region or downstream region; gene CNNM4 and its upstream region or downstream region; gene SP9 and its upstream region or downstream region; gene C2orf82 and its upstream region or downstream region; gene NEU4 and its upstream region or downstream region; gene RPL35A and its upstream region or downstream region; gene HGFAC and its upstream region or downstream region; gene EXOC3 and its upstream region or downstream region; gene GDNF and its upstream region or downstream region; gene NEUROG1 and its
  • Either a single methylation marker or a combination of multiple methylation markers can be used as pancreatic cancer-specific methylation markers.
  • the methylation marker is within 2 kb upstream and 2 kb downstream of any of the genes described above.
  • CpG island shore: Andy Feinberg, a pioneer in the field of epigenetics, once pointed out that most methylation changes in colon cancer occur not only in the promoter, nor only in the CpG island, but in the 2kb sequence upstream of it, which is called the "CpG island shore" (Andy Feinberg et al., 2009).
  • CpG island shore methylation is closely related to gene expression, is highly conserved in mammals, and can differentiate tissue types. In subsequent studies, researchers not only discovered this phenomenon in intestinal cancer types, but also found that the adjacent regions of these target methylation sites are also important in breast cancer, gastric cancer, bladder cancer, and some tissue types.
  • Kits for colorectal cancer, lung cancer, liver cancer, breast cancer, gastric cancer and/or esophageal cancer, or pancreatic cancer tissue
  • those skilled in the art can prepare kits or devices for detecting the methylation level or status of these markers for diagnosing colorectal cancer, or distinguishing colorectal cancer from other pan-cancer species.
  • a kit or device may contain reagents or components for detecting the status and/or levels of one or more colorectal cancer tissue-specific methylation markers in nucleic acids from a sample.
  • those skilled in the art can prepare kits or devices for detecting the methylation level or status of these markers for diagnosing lung cancer, or distinguishing lung cancer from other pan-cancer species.
  • a kit or device may contain reagents or components for detecting the status and/or level of one or more lung cancer tissue-specific methylation markers in nucleic acids from a sample.
  • based on the methylation markers of the present invention, those skilled in the art can prepare kits or devices for detecting the methylation level or status of these markers for diagnosing liver cancer, or distinguishing liver cancer from other pan-cancer species.
  • a kit or device may contain reagents or components for detecting the status and/or level of one or more liver cancer tissue-specific methylation markers in nucleic acids from a sample.
  • those skilled in the art can prepare kits or devices for detecting the methylation level or status of these markers for diagnosing breast cancer, or distinguishing breast cancer from other pan-cancer species.
  • a kit or device may contain reagents or components for detecting the status and/or level of one or more breast cancer tissue-specific methylation markers in nucleic acids from a sample.
  • those skilled in the art can prepare kits or devices for detecting the methylation level or status of these markers for diagnosing gastric cancer and/or esophageal cancer, or distinguishing gastric cancer and/or esophageal cancer from other pan-cancer species.
  • a kit or device may contain reagents or components for detecting the status and/or level of one or more gastric cancer and/or esophageal cancer tissue-specific methylation markers in nucleic acids from a sample.
  • based on the methylation markers of the present invention, those skilled in the art can prepare kits or devices for detecting the methylation level or status of these markers for diagnosing pancreatic cancer, or distinguishing pancreatic cancer from other pan-cancer species.
  • a kit or device may contain reagents or components for detecting the status and/or level of one or more pancreatic cancer tissue-specific methylation markers in nucleic acids from a sample.
  • reagents or components may include those used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme assays, fluorometric quantitation, methylation-sensitive high-resolution melting curve methods, chip-based methylation profiling, and mass spectrometry.
  • Reagents may include oligonucleotides for detecting methylation markers.
  • oligonucleotides are primers and/or probes.
  • the primer is a primer for detecting the methylation level/status of a site using a methylation sequencing method or a PCR primer for amplifying one or more methylation sites.
  • the reagents include bisulfite and its derivatives, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or insensitive restriction enzymes, digestion buffers, fluorescent dyes, fluorescent quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls.
  • the control may be the aforementioned specific methylation marker from a normal subject or a cancer patient other than colorectal cancer.
  • the non-colorectal cancer is lung cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  • the control may be the aforementioned specific methylation markers from normal subjects or non-lung cancer patients.
  • the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  • the control may be the aforementioned specific methylation marker from a normal subject or a cancer patient other than liver cancer.
  • the cancer other than liver cancer is colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  • the control may be the aforementioned specific methylation marker from a normal subject or a cancer patient other than breast cancer.
  • the non-breast cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or lung cancer.
  • the control may be the aforementioned specific methylation marker from a normal subject or a patient with cancer other than gastric cancer and esophageal cancer.
  • cancers other than gastric cancer and esophageal cancer or pan-cancer include lung cancer, liver cancer, colorectal cancer, pancreatic cancer and/or breast cancer.
  • the control may be the aforementioned specific methylation marker from a normal subject or a cancer patient other than pancreatic cancer.
  • the non-pancreatic cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer and/or lung cancer.
  • the present invention provides a method for diagnosing colorectal cancer in a subject, which includes: (1) measuring the methylation status or level of one or more colorectal cancer tissue-specific methylation markers of the present invention in a sample of the subject; and (2) determining colorectal cancer based on the measured methylation status or level.
  • the subject is a cancer patient or a subject at risk for cancer.
  • the non-colorectal cancer is lung, liver, gastric, esophageal, pancreatic, and/or breast cancer.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the method for obtaining the methylation level data may be any suitable method for determining the methylation level of a nucleic acid sequence, such as bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantification, the methylation-sensitive high-resolution melting curve method, chip-based methylation profile analysis, and mass spectrometry.
  • the present invention also provides a method for diagnosing colorectal cancer, comprising: (1) detecting the methylation level of the sequence described herein in a sample of a subject; (2) comparing it with a control sample, or calculating a score; (3) identifying the colorectal cancer of the subject based on the score.
  • before step (1), the method further includes: extracting sample DNA and converting unmethylated cytosine on the DNA into bases that do not bind to guanine.
  • the methylation level of the subject's sample is increased or decreased when compared to a control sample.
  • when the methylation level meets a certain threshold, colorectal cancer is identified.
  • a score is obtained by mathematically analyzing the methylation levels of the tested genes. For the tested sample, when the score is greater than the threshold, the result is colorectal cancer, otherwise it is negative, that is, cancer other than colorectal cancer.
  • the invention also provides methods comprising: (1) obtaining methylation levels of methylation markers described herein in genomic DNA of colorectal cancer samples and non-colorectal cancer samples; and (2) constructing a logistic regression machine learning model using the data on methylation levels of the methylation markers.
  • the sample can be cells, tissue, fine needle aspiration biopsy, or plasma.
  • Genomic DNA can be cell-free DNA in plasma.
  • Step (1) can include MethylTitan's method to obtain methylation sequencing data of sample DNA
  • the method can be used to (1) distinguish colorectal cancer patients from non-colorectal cancer patients; (2) diagnose or assist in the diagnosis of colorectal cancer; or (3) trace the tissue origin of colorectal cancer during pan-cancer screening.
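The pre-treatment step described above (converting unmethylated cytosine into a base that does not bind to guanine) is what bisulfite conversion does: unmethylated C is deaminated to U and read as T after amplification, while methylated C is retained. The sketch below simulates this on a toy sequence and then calls the state of each CpG by comparing the converted read with the reference. It is an illustration only; the function names and the sequence are hypothetical, and real pipelines work on aligned sequencing reads rather than single strings.

```python
from typing import Dict, Iterable


def bisulfite_convert(seq: str, methylated_positions: Iterable[int]) -> str:
    """Simulate conversion: unmethylated C is read as T, methylated C stays C."""
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_cpg_states(reference: str, converted_read: str) -> Dict[int, bool]:
    """For each CpG in the reference, report True if the read kept a C (methylated)."""
    ref = reference.upper()
    states = {}
    for i in range(len(ref) - 1):
        if ref[i:i + 2] == "CG":
            states[i] = converted_read[i] == "C"
    return states


# Toy reference with CpGs at 0-based positions 1 and 5; only position 1 is methylated.
reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(call_cpg_states(reference, read))
```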
  • the present invention provides a method for diagnosing lung cancer in a subject, which includes: (1) determining the methylation status or level of one or more lung cancer tissue-specific methylation markers of the present invention in a sample of the subject; and (2) determining lung cancer based on the measured lung cancer tissue-specific methylation status or level.
  • the subject is a cancer patient or a subject at risk for cancer.
  • the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the method for obtaining the methylation level data may be any suitable method for determining the methylation level of a nucleic acid sequence, such as bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantification, the methylation-sensitive high-resolution melting curve method, chip-based methylation profile analysis, and mass spectrometry.
  • the present invention also provides a method for diagnosing lung cancer, including: (1) detecting the methylation level of the sequence described herein in a subject's sample; (2) comparing it with a control sample, or calculating a score; (3) Identify the subject's lung cancer based on the score.
  • the method further includes: extracting sample DNA and converting unmethylated cytosine on the DNA into bases that are not combined with guanine.
  • the methylation level of the subject's sample is increased or decreased when compared to a control sample.
  • when the methylation level meets a certain threshold, lung cancer is identified.
  • a score is obtained by mathematically analyzing the methylation levels of the tested genes. For the tested sample, when the score is greater than the threshold, the result is lung cancer, otherwise it is negative, that is, cancer other than lung cancer.
  • the invention also provides methods, comprising: (1) obtaining methylation levels of methylation markers described herein in genomic DNA of lung cancer samples and non-lung cancer samples; and (2) constructing a logistic regression machine learning model using the data on methylation levels of the methylation markers.
  • the sample can be cells, tissue, fine needle aspiration biopsy, or plasma.
  • Genomic DNA can be cell-free DNA in plasma.
  • Step (1) can include MethylTitan's method to obtain methylation sequencing data of sample DNA
  • the method can be used to (1) distinguish lung cancer patients from non-lung cancer patients; (2) diagnose or assist in the diagnosis of lung cancer; or (3) trace the tissue origin of lung cancer during pan-cancer screening.
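The scoring rule described above (a score obtained by mathematical analysis of the methylation levels, with the sample called positive when the score exceeds a threshold) can be sketched with a logistic function. The marker names, weights, intercept, and the 0.5 threshold below are hypothetical placeholders chosen for illustration; the source does not disclose fitted values here.

```python
import math
from typing import Dict


def logistic_score(levels: Dict[str, float], weights: Dict[str, float],
                   intercept: float) -> float:
    """Combine per-marker methylation levels into a score between 0 and 1."""
    z = intercept + sum(weights[m] * levels[m] for m in weights)
    return 1.0 / (1.0 + math.exp(-z))


def classify(score: float, threshold: float = 0.5,
             positive: str = "lung cancer",
             negative: str = "cancer other than lung cancer") -> str:
    """Positive only when the score is strictly greater than the threshold."""
    return positive if score > threshold else negative


# Hypothetical weights and methylation levels, for illustration only.
weights = {"marker_1": 2.0, "marker_2": -1.0}
sample = {"marker_1": 0.9, "marker_2": 0.1}
score = logistic_score(sample, weights, intercept=-0.5)
print(round(score, 3), classify(score))
```

A score exactly equal to the threshold is treated as negative, which follows the "greater than the threshold" wording of the text.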
  • the invention provides a method for diagnosing liver cancer in a subject, comprising: (1) determining the methylation status or level of one or more methylation markers of the invention in a sample of the subject; and (2) determining liver cancer based on the measured methylation status or level.
  • the subject is a cancer patient or a subject at risk for cancer.
  • the non-liver cancer is colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the method for obtaining the methylation level data may be any suitable method for determining the methylation level of a nucleic acid sequence, such as bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantification, the methylation-sensitive high-resolution melting curve method, chip-based methylation profile analysis, and mass spectrometry.
  • the present invention also provides a method for diagnosing liver cancer, including: (1) detecting the methylation level of the sequence described herein in a subject's sample; (2) comparing it with a control sample, or calculating a score; (3) Identify the liver cancer of the subject based on the score.
  • the method further includes: extracting sample DNA and converting unmethylated cytosine on the DNA into bases that are not combined with guanine.
  • the methylation level of the subject's sample is increased or decreased when compared to a control sample.
  • when the methylation level meets a certain threshold, liver cancer is identified.
  • a score is obtained by mathematically analyzing the methylation levels of the tested genes. For the tested samples, when the score is greater than the threshold, the result is liver cancer, otherwise it is negative, that is, cancer other than liver cancer.
  • the invention also provides methods, comprising: (1) obtaining methylation levels of methylation markers described herein in genomic DNA of liver cancer samples and non-liver cancer samples; and (2) constructing a logistic regression machine learning model using the data on methylation levels of the methylation markers.
  • the sample can be cells, tissue, fine needle aspiration biopsy, or plasma.
  • Genomic DNA can be cell-free DNA in plasma.
  • Step (1) may include obtaining methylation sequencing data of the sample DNA (e.g. using MethylTitan's method), and step (2) may include using a logistic regression model (e.g.
  • the method can be used to (1) distinguish liver cancer patients from non-liver cancer patients; (2) diagnose or assist in the diagnosis of liver cancer; or (3) trace the tissue origin of liver cancer during pan-cancer screening.
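The model-building step described above (constructing a logistic regression model from the methylation levels of target-cancer and non-target-cancer samples) can be sketched with plain gradient descent. This is a minimal, dependency-free illustration and not the applicant's training procedure: the learning rate, epoch count, and the six toy samples are assumptions, and a real analysis would normally use an established library with regularisation and held-out validation.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x: List[float], w: List[float], b: float) -> float:
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy methylation levels for two hypothetical markers (rows are samples).
X = [[0.9, 0.8], [0.8, 0.7], [0.85, 0.9],   # target cancer samples (label 1)
     [0.1, 0.2], [0.2, 0.1], [0.15, 0.25]]  # other cancer samples (label 0)
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
print([round(predict_proba(x, w, b), 2) for x in X])
```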
  • the invention provides a method for diagnosing breast cancer in a subject, comprising: (1) determining the methylation status or level of one or more methylation markers of the invention in a sample of the subject; and (2) Determine breast cancer based on measured methylation status or levels.
  • the subject is a cancer patient or a subject at risk for cancer.
  • the non-breast cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or lung cancer.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the method for obtaining the methylation level data may be any suitable method for determining the methylation level of a nucleic acid sequence, such as bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantification, the methylation-sensitive high-resolution melting curve method, chip-based methylation profile analysis, and mass spectrometry.
  • the present invention also provides a method for diagnosing breast cancer, comprising: (1) detecting the methylation level of the sequence described herein in a sample of a subject; (2) comparing it with a control sample, or calculating a score ; (3) Identify the subject's breast cancer based on the score.
  • the method further includes: extracting sample DNA and converting unmethylated cytosine on the DNA into bases that are not combined with guanine.
  • the methylation level of the subject's sample is increased or decreased when compared to a control sample.
  • when the methylation level meets a certain threshold, breast cancer is identified.
  • a score is obtained by mathematically analyzing the methylation levels of the tested genes. For the tested sample, when the score is greater than the threshold, the result is judged to be breast cancer, otherwise it is negative, that is, cancer other than breast cancer.
  • the invention also provides methods, comprising: (1) obtaining methylation levels of methylation markers described herein in genomic DNA of breast cancer samples and non-breast cancer samples; and (2) constructing a logistic regression machine learning model using the data on methylation levels of the methylation markers.
  • the sample can be cells, tissue, fine needle aspiration biopsy, or plasma.
  • Genomic DNA can be cell-free DNA in plasma.
  • Step (1) can include MethylTitan's method to obtain methylation sequencing data of sample DNA
  • the method can be used to (1) distinguish breast cancer patients from non-breast cancer patients, (2) be used to diagnose or assist in the diagnosis of breast cancer; or (3) be used to trace the tissue origin of breast cancer during pan-cancer screening.
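The methods above take a methylation level per marker as input. The source does not define how that level is computed from sequencing data, so the sketch below assumes one common convention: the fraction of reads that are methylated at each CpG, averaged over the covered CpGs in the marker region. The function names and read counts are hypothetical.

```python
from typing import List, Optional, Tuple


def methylation_level(methylated_reads: int, unmethylated_reads: int) -> Optional[float]:
    """Fraction of methylated reads at one CpG; None when there is no coverage."""
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None
    return methylated_reads / total


def region_level(cpg_counts: List[Tuple[int, int]]) -> Optional[float]:
    """Mean per-CpG methylation level over the CpGs that have coverage."""
    levels = [methylation_level(m, u) for m, u in cpg_counts]
    covered = [lv for lv in levels if lv is not None]
    return sum(covered) / len(covered) if covered else None


# Hypothetical (methylated, unmethylated) read counts for three CpGs in one marker.
counts = [(30, 10), (0, 0), (5, 15)]
print(region_level(counts))
```

CpGs without coverage are skipped rather than counted as zero, so that missing data does not lower the regional level.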
  • the present invention provides a method for diagnosing gastric cancer and/or esophageal cancer in a subject, which includes: (1) determining the methylation status or level of one or more methylation markers of the present invention in a sample of the subject; and (2) determining gastric cancer and/or esophageal cancer based on the measured methylation status or level.
  • the subject is a cancer patient or a subject at risk for cancer.
  • cancers other than gastric and esophageal cancer or pan-cancer include lung cancer, liver cancer, colorectal cancer, pancreatic cancer, and/or breast cancer.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • the method for obtaining the methylation level data may be any suitable method for determining the methylation level of a nucleic acid sequence, such as bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantification, the methylation-sensitive high-resolution melting curve method, chip-based methylation profile analysis, and mass spectrometry.
  • the present invention also provides a method for diagnosing gastric cancer and/or esophageal cancer, including: (1) detecting the methylation level of the sequence described herein in the subject's sample; (2) comparing it with a control sample, or calculating a score; (3) identifying the subject's gastric cancer and/or esophageal cancer based on the score.
  • the method further includes: extracting sample DNA and converting unmethylated cytosine on the DNA into bases that are not combined with guanine.
  • the methylation level of the subject's sample is increased or decreased when compared to a control sample. When the methylation level meets a certain threshold, it is identified as gastric cancer and/or esophageal cancer.
  • a score is obtained by mathematically analyzing the methylation levels of the tested genes. For the tested sample, when the score is greater than the threshold, the result is judged to be gastric cancer and/or esophageal cancer, otherwise it is negative, that is, cancer other than gastric cancer and esophageal cancer.
  • Methods of conventional mathematical analysis and procedures for determining thresholds are known in the art.
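  • One common way to determine such a threshold is to maximize Youden's J statistic (sensitivity + specificity - 1) over the scores of a labelled training set. The sketch below illustrates that approach; the choice of Youden's J and the function name `youden_threshold` are assumptions, since the text does not state which procedure is used.

```python
def youden_threshold(scores, labels):
    """Pick the cutoff maximizing sensitivity + specificity - 1.

    `labels` are 1 for the target cancer and 0 for other cancers.
    A sample is called positive when its score is strictly greater
    than the cutoff. Both classes are assumed to be present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cutoff)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # keep the lowest cutoff among equal maxima
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j
```

  • The cutoff is chosen on training scores only and then applied unchanged to the test set.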
  • The present invention also provides a method comprising: (1) obtaining the methylation levels of the methylation markers described herein in genomic DNA of gastric cancer and/or esophageal cancer samples and of samples of cancers other than gastric cancer and esophageal cancer; and (2) constructing a logistic regression machine learning model using the methylation level data of the methylation markers.
  • the sample can be cells, tissue, fine needle aspiration biopsy, or plasma.
  • Genomic DNA can be cell-free DNA in plasma.
  • The prediction score is compared with the threshold to determine whether the sample is gastric cancer and/or esophageal cancer: if the score is greater than the threshold, the sample is predicted to be gastric cancer and/or esophageal cancer; otherwise, it is predicted to be another cancer type.
  • The method can be used (1) to distinguish patients with gastric cancer and/or esophageal cancer from patients with cancers other than gastric cancer and esophageal cancer; (2) to diagnose or assist in the diagnosis of gastric cancer and/or esophageal cancer; or (3) to trace the tissue of origin of gastric cancer and/or esophageal cancer during pan-cancer screening.
  • The invention provides a method for diagnosing pancreatic cancer in a subject, comprising: (1) determining the methylation status or level of one or more methylation markers of the invention in a sample of the subject; and (2) determining pancreatic cancer based on the measured methylation status or level.
  • the subject is a cancer patient or a subject at risk for cancer.
  • the non-pancreatic cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer and/or lung cancer.
  • the sample is cells, tissue, fine needle aspiration biopsy, or plasma.
  • The methylation level data may be obtained by any suitable method for determining the methylation level of a nucleic acid sequence, such as bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction endonuclease analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  • The present invention also provides a method for diagnosing pancreatic cancer, comprising: (1) detecting the methylation level of the sequence described herein in a sample of a subject; (2) comparing it with a control sample, or calculating a score; and (3) identifying pancreatic cancer in the subject based on the score.
  • The method further includes: extracting the sample DNA and converting unmethylated cytosine in the DNA into a base that does not pair with guanine.
  • The methylation level of the subject's sample is increased or decreased when compared to a control sample. When the methylation level meets a certain threshold, pancreatic cancer is identified.
  • a score is obtained by mathematically analyzing the methylation levels of the tested genes. For the tested sample, when the score is greater than the threshold, the result is determined to be pancreatic cancer, otherwise it is negative, that is, a cancer other than pancreatic cancer.
  • The invention also provides a method comprising: (1) obtaining the methylation levels of the methylation markers described herein in genomic DNA of pancreatic cancer samples and non-pancreatic cancer samples; and (2) constructing a logistic regression machine learning model using the methylation level data of the methylation markers.
  • the sample can be cells, tissue, fine needle aspiration biopsy, or plasma.
  • Genomic DNA can be cell-free DNA in plasma.
  • Step (1) may include obtaining methylation sequencing data of the sample DNA (e.g. using the MethylTitan method), and step (2) may include using a logistic regression model (e.g. AllModel = LogisticRegression()). In the formula of the model, x is the methylation level value of the sample's target marker, w is the coefficient of the methylation marker, b is the intercept, and y is the model prediction score.
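  • For reference, a logistic regression model with these symbols is conventionally written as shown below. This standard form is an assumption based on the symbol definitions and on the use of LogisticRegression(); the index j and the marker count n are introduced here for illustration and are not named in the text.

```latex
y = \frac{1}{1 + e^{-\left(\sum_{j=1}^{n} w_j x_j + b\right)}}
```

  • With a single marker the sum reduces to w x + b, and y always lies between 0 and 1, which is why it can be compared against a fixed threshold.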
  • the method can be used to (1) distinguish pancreatic cancer patients from non-pancreatic cancer patients, (2) be used to diagnose or assist in the diagnosis of pancreatic cancer; or (3) be used to trace the tissue origin of pancreatic cancer during pan-cancer screening.
  • the invention also provides a system or device.
  • a system or device may include computer-readable storage media or memory for storing programs or instructions.
  • Programs or instructions may be used to execute a predictive model for distinguishing colorectal cancer from other non-colorectal cancers constructed from one or more colorectal cancer tissue-specific methylation markers of the invention, or to execute methods of the invention.
  • the program or instructions may be used to perform the prediction model of the present invention for distinguishing lung cancer from other non-lung cancer, or to perform the method of the present invention.
  • the program or instructions may be used to perform the prediction model of the present invention for distinguishing liver cancer from other non-liver cancer, or to perform the method of the present invention.
  • Programs or instructions are used to perform the prediction model of the present invention for distinguishing breast cancer from other non-breast cancer, or to perform the method of the present invention.
  • Programs or instructions are used to perform the prediction model of the present invention for distinguishing pancreatic cancer from other cancers other than pancreatic cancer, or to perform the methods of the present invention.
  • Computer-readable storage media or memory includes, but is not limited to, tangible storage media, carrier wave media, or physical transmission media.
  • Non-volatile storage media includes, for example, optical or magnetic disks, such as any storage device in any computer or the like, and volatile storage media includes dynamic memory, such as the main memory of such computer platforms.
  • Tangible transmission media include coaxial cables, copper wire and fiber optics, including the wires that make up the buses within computer systems.
  • Carrier transmission media may take the form of electrical or electromagnetic signals, or acoustic or light waves, such as those generated during radio frequency and infrared data communications.
  • Common forms of computer-readable media include, for example: floppy disk, hard drive, magnetic tape, any other magnetic media, CD-ROM, DVD or DVD-ROM, any other optical media, punched cards, paper tape, any other physical storage media with patterns of holes, RAM, ROM, PROM and EPROM, FLASH-EPROM, any other memory chip or cartridge, a carrier wave transmitting data or instructions, a cable or link transmitting such a carrier wave, or any other medium from which a computer can read programming code and/or data.
  • Many of these forms of computer-readable media may be involved in conveying one or more sequences of one or more instructions to a processor for execution.
  • the memory and processor may be physically separate.
  • Wireless connections can use wireless LAN (WLAN) or the Internet. Wired connections are available through optical and non-optical cable connections between units. Cables used for wired connections are further suitable for high-throughput data transmission.
  • The present invention also provides the use of isolated nucleic acids or reagents or components in the preparation of kits or devices for (1) distinguishing colorectal cancer patients from patients with cancers other than colorectal cancer; (2) diagnosing or assisting in the diagnosis of colorectal cancer; or (3) tracing the tissue of origin of colorectal cancer during pan-cancer screening.
  • the non-colorectal cancer is lung cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  • a kit or device may contain reagents for determining methylation levels in a variety of available methods.
  • The present invention also provides the use of isolated nucleic acids or reagents or components in the preparation of kits or devices for (1) distinguishing lung cancer patients from patients with cancers other than lung cancer; (2) diagnosing or assisting in the diagnosis of lung cancer; or (3) tracing the tissue of origin of lung cancer during pan-cancer screening.
  • the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  • a kit or device may contain reagents for determining methylation levels in a variety of available methods.
  • The present invention also provides the use of isolated nucleic acids or reagents or components in the preparation of kits or devices for (1) distinguishing liver cancer patients from patients with cancers other than liver cancer; (2) diagnosing or assisting in the diagnosis of liver cancer; or (3) tracing the tissue of origin of liver cancer during pan-cancer screening.
  • the cancer other than liver cancer is colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  • a kit or device may contain reagents for determining methylation levels in a variety of available methods.
  • The present invention also provides the use of isolated nucleic acids or reagents or components in the preparation of kits or devices for (1) distinguishing breast cancer patients from patients with cancers other than breast cancer; (2) diagnosing or assisting in the diagnosis of breast cancer; or (3) tracing the tissue of origin of breast cancer during pan-cancer screening.
  • The non-breast cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or lung cancer.
  • a kit or device may contain reagents for determining methylation levels in a variety of available methods.
  • The present invention also provides the use of isolated nucleic acids or reagents or components in the preparation of kits or devices for (1) distinguishing gastric cancer and/or esophageal cancer patients from patients with cancers other than gastric cancer and esophageal cancer; (2) diagnosing or assisting in the diagnosis of gastric cancer and/or esophageal cancer; or (3) tracing the tissue of origin of gastric cancer and/or esophageal cancer during pan-cancer screening.
  • cancers other than gastric cancer and esophageal cancer or pan-cancer include lung cancer, liver cancer, colorectal cancer, pancreatic cancer and/or breast cancer.
  • a kit or device may contain reagents for determining methylation levels in a variety of available methods.
  • The present invention also provides the use of isolated nucleic acids or reagents or components in the preparation of kits or devices for (1) distinguishing pancreatic cancer patients from patients with cancers other than pancreatic cancer; (2) diagnosing or assisting in the diagnosis of pancreatic cancer; or (3) tracing the tissue of origin of pancreatic cancer during pan-cancer screening.
  • the non-pancreatic cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer and/or lung cancer.
  • a kit or device may contain reagents for determining methylation levels in a variety of available methods.
  • Example 1.1 Methylation-targeted sequencing to screen colorectal cancer-specific methylation sites
  • The MethylTitan™ method independently developed by the applicant is used to obtain the methylation sequencing data of plasma cfDNA of the target samples and to identify the DNA methylation classification markers.
  • The process is as follows:
  • PEAR (v0.6.0) software merges the paired-end reads of the same fragment, produced by paired-end 150 bp sequencing on an Illumina HiSeq X10/NextSeq 500/NovaSeq sequencer, into one sequence.
  • The minimum overlap length is 20 bp, and the minimum length after merging is 30 bp.
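  • A read-merging call with these two settings might be assembled as in the sketch below. The flags `-f`, `-r`, `-o`, `-v` (minimum overlap) and `-n` (minimum assembled length) are taken from the command-line help of recent PEAR releases and are assumed, not confirmed, to apply to v0.6.0; the file names and the function name `build_pear_command` are hypothetical.

```python
import shlex


def build_pear_command(forward_fastq, reverse_fastq, output_prefix,
                       min_overlap=20, min_length=30):
    """Build a PEAR argument list for merging paired-end reads."""
    return [
        "pear",
        "-f", forward_fastq,     # forward reads (R1)
        "-r", reverse_fastq,     # reverse reads (R2)
        "-o", output_prefix,     # prefix for the output files
        "-v", str(min_overlap),  # minimum overlap length (20 bp in the text)
        "-n", str(min_length),   # minimum merged length (30 bp in the text)
    ]


if __name__ == "__main__":
    cmd = build_pear_command("sample_R1.fastq", "sample_R2.fastq", "sample")
    print(shlex.join(cmd))
    # To execute it, PEAR must be installed and on PATH:
    # subprocess.run(cmd, check=True)
```

  • Building the argument list separately from running it keeps the parameters easy to inspect and to check against the installed PEAR version.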
  • the reference genome data used in this article comes from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
  • the nucleotide numbering of sites herein corresponds to the nucleotide position numbering of HG19.
  • A target methylation region may have multiple methylation haplotypes, and the MHF value needs to be calculated for each methylation haplotype in the target region.
  • An example of the MHF calculation formula is as follows:
  • i represents the target methylation interval
  • h represents the target methylation haplotype
  • N_i represents the number of reads located in the target methylation interval
  • N_{i,h} represents the number of reads containing the target methylation haplotype
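  • Read together, these definitions imply that the MHF is the fraction of reads in the interval that carry the haplotype. The expression below is a reconstruction under that assumption, not a reproduction of the original formula:

```latex
\mathrm{MHF}_{i,h} = \frac{N_{i,h}}{N_i}
```

  • For example, if 40 reads fall in interval i and 10 of them carry haplotype h, the MHF of h in i is 10/40 = 0.25.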
  • the screened colorectal cancer tissue-specific methylation markers are detailed in Table 1.2.
  • The methylation levels of these colorectal cancer tissue-specific methylation markers in colorectal cancer and the other six cancer types are shown in Table 1.2 and Figure 1.
  • Figure 2 shows that these colorectal cancer tissue-specific methylation markers differ significantly between colorectal cancer and other cancer types in both the training set and the test set (U test, p value less than 0.05), and that the methylation levels also differ considerably.
  • Table 1.2 Average methylation levels of methylation markers in colorectal cancer and other six cancer types in the training set and test set
  • Example 1.2 Discriminative performance of individual colorectal cancer tissue-specific methylation markers
  • A logistic regression model was constructed using the methylation level data of a single colorectal cancer tissue-specific methylation marker in the training set divided in Example 1.1. After the threshold was determined, prediction was made on the test set. The specific steps are as follows:
  • AllModel = LogisticRegression().
  • In the model, x is the methylation level value of the sample's target marker, w is the coefficient of each marker, b is the intercept, and y is the model prediction score.
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation site in the test set samples and TestPred is the model prediction score; this prediction score is compared with the above threshold to determine whether the sample is colorectal cancer.
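  • The single-marker workflow above can be sketched without external libraries as follows. This is a minimal stand-in, not the LogisticRegression() implementation named in the text: it fits y = 1/(1 + e^-(wx + b)) by plain gradient descent without regularization, and the function names, learning rate and epoch count are hypothetical choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x_values, labels, lr=0.5, epochs=2000):
    """Fit y = sigmoid(w * x + b) by batch gradient descent on the log-loss.

    `x_values` are methylation levels of one marker (expected in [0, 1]);
    `labels` are 1 for the target cancer and 0 for other cancer types.
    """
    w, b = 0.0, 0.0
    n = len(x_values)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(x_values, labels):
            err = sigmoid(w * x + b) - y  # derivative of log-loss w.r.t. z
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_score(x, w, b):
    """Model prediction score y for one sample."""
    return sigmoid(w * x + b)
```

  • Scores returned by `predict_score` play the role of TestPred and would be compared with the threshold chosen on the training set.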
  • each colorectal cancer tissue-specific methylation marker can reach a score of more than 0.70.
  • the AUC and accuracy reached more than 77%.
  • The lowest AUC of a single colorectal cancer tissue-specific methylation marker in the test set also reached more than 0.70, and the accuracy reached more than 70%. It can be seen that these colorectal cancer tissue-specific methylation markers are good colorectal cancer tissue-specific markers and can distinguish colorectal cancer from other cancer types well.
  • Example 1.3 Machine learning model for all target colorectal cancer tissue-specific methylation markers
  • This example uses the methylation levels of all 39 colorectal cancer tissue-specific methylation markers to construct a logistic regression machine learning model that accurately distinguishes colorectal cancer samples from multiple cancer types. The training set samples of Example 1.1 are used for model training, and the test set samples are then used to test the performance of the model.
  • the specific steps are as follows:
  • AllModel = LogisticRegression().
  • In the model, x is the methylation level value of the sample's target methylation marker, w is the coefficient of each methylation marker, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score.
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the test set data (methylation haplotype frequency) and TestPred is the model prediction score; this prediction score is compared with the above threshold to determine whether the sample is colorectal cancer.
  • the ROC curve is shown in Figure 6.
  • The AUC for distinguishing colorectal cancer from other cancer types reached 0.902. With the threshold set to 0.076, a score greater than this value is predicted to be colorectal cancer and otherwise another cancer type; at a specificity of 85%, the sensitivity reached 66.7% and the overall prediction accuracy reached 84.5%, so colorectal cancer samples can be well distinguished from samples of the 7 cancer types.
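  • Specificity, sensitivity and overall accuracy of this kind follow from counting correct and incorrect calls at the chosen threshold. A minimal sketch, assuming labels of 1 for colorectal cancer and 0 for other cancer types, both classes present, and a hypothetical function name `metrics_at_threshold`:

```python
def metrics_at_threshold(scores, labels, threshold):
    """Sensitivity, specificity and accuracy for calls made as score > threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        called_positive = score > threshold
        if label == 1 and called_positive:
            tp += 1      # target cancer correctly called
        elif label == 1:
            fn += 1      # target cancer missed
        elif called_positive:
            fp += 1      # other cancer wrongly called as target
        else:
            tn += 1      # other cancer correctly rejected
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }
```

  • Because other cancer types usually outnumber the target cancer in the test set, the overall accuracy tends to sit close to the specificity.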
  • Example 1.4 Colorectal cancer tissue-specific marker combination 1 machine learning model
  • This example selected a total of 6 colorectal cancer tissue-specific methylation markers, Seq ID NO:52, Seq ID NO:59, Seq ID NO:62, Seq ID NO:64, Seq ID NO:73 and Seq ID NO:83, from all 39 colorectal cancer tissue-specific methylation markers to construct a new machine learning model.
  • The method of constructing the machine learning model is consistent with Example 1.3.
  • The relevant samples use only the data of the 6 target colorectal cancer tissue-specific methylation sites.
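  • Restricting the model input to a marker combination amounts to selecting the matching columns before training and prediction. A minimal sketch, assuming each sample is stored as a dictionary from marker ID to methylation level; the function name `select_markers` and the numeric values in the example are placeholders, not data from this document.

```python
def select_markers(sample_rows, marker_ids):
    """Keep only the requested marker columns, in the given order.

    `sample_rows` is a list of dicts mapping marker ID -> methylation level.
    Returns one feature vector (list of floats) per sample.
    """
    missing = [m for m in marker_ids
               if any(m not in row for row in sample_rows)]
    if missing:
        raise KeyError(f"markers absent from some samples: {missing}")
    return [[row[m] for m in marker_ids] for row in sample_rows]


if __name__ == "__main__":
    combination = ["Seq ID NO:52", "Seq ID NO:59"]
    samples = [  # placeholder values for illustration only
        {"Seq ID NO:52": 0.10, "Seq ID NO:59": 0.20, "Seq ID NO:60": 0.30},
        {"Seq ID NO:52": 0.40, "Seq ID NO:59": 0.05, "Seq ID NO:60": 0.25},
    ]
    print(select_markers(samples, combination))
```

  • The same selection has to be applied to the training set and the test set so that the coefficients w line up with the same markers.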
  • the model scores of the model in the training set and test set are shown in Figure 7.
  • The AUC of this model in the test set reached 0.931, with the threshold set to 0.055.
  • The specificity is 93.4%, the sensitivity reaches 66.7%, and the overall accuracy reaches 92.5%, indicating good performance of the model constructed from this colorectal cancer tissue-specific marker combination.
  • Example 1.5 Colorectal cancer tissue-specific marker combination 2 machine learning model
  • This example selects another combination of colorectal cancer tissue-specific methylation markers from the 39 colorectal cancer tissue-specific methylation markers: Seq ID NO:52, Seq ID NO:54, Seq ID NO:61, Seq ID NO:64, Seq ID NO:66, Seq ID NO:69, Seq ID NO:71, Seq ID NO:74, Seq ID NO:76 and Seq ID NO:87, a total of 10 colorectal cancer tissue-specific methylation markers, to construct a machine learning model.
  • The model construction method is also consistent with Example 1.3, and the relevant samples use only the data of the 10 target colorectal cancer tissue-specific methylation sites.
  • The AUC of this model in the test set reaches 0.902. With the threshold set to 0.059, the sensitivity reaches 66.7% at a specificity of 90.6%, and the overall accuracy reaches 89.8%. This combination can also distinguish colorectal cancer from other cancer types well.
  • This application screened out 39 colorectal cancer-specific methylation markers from the methylation NGS sequencing data of 7 cancer types. The machine learning model built on the methylation level data of these colorectal cancer tissue-specific methylation markers can distinguish colorectal cancer samples well within the data of the 7 cancer types, providing an important reference for tracing the tissue of origin of colorectal cancer during early pan-cancer screening.
  • Example 2.1 Targeted methylation sequencing to screen lung cancer-specific methylation sites
  • the training set is used to construct the following machine learning model, and the test set is used to test the performance of the model.
  • the sample information is shown in Table 2.1 below.
  • the total number of lung cancer samples in the training set is 51, and the total number of lung cancer samples in the test set is 20.
  • The MethylTitan™ method independently developed by the applicant is used to obtain the methylation sequencing data of plasma cfDNA of the target samples and to identify the DNA methylation classification markers.
  • The process is as follows:
  • PEAR (v0.6.0) software merges the paired-end reads of the same fragment, produced by paired-end 150 bp sequencing on an Illumina HiSeq X10/NextSeq 500/NovaSeq sequencer, into one sequence.
  • The minimum overlap length is 20 bp, and the minimum length after merging is 30 bp.
  • the reference genome data used in this article comes from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
  • the nucleotide numbering of sites herein corresponds to the nucleotide position numbering of HG19.
  • A target methylation region may have multiple methylation haplotypes, and the MHF value needs to be calculated for each methylation haplotype in the target region.
  • An example of the MHF calculation formula is as follows:
  • i represents the target methylation interval
  • h represents the target methylation haplotype
  • N_i represents the number of reads located in the target methylation interval
  • N_{i,h} represents the number of reads containing the target methylation haplotype
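  • A minimal sketch of this calculation follows. It assumes that the MHF of haplotype h in interval i is the ratio of the two read counts defined above, that each read has already been reduced to a haplotype string over the CpG sites of the interval (for example "110" for methylated, methylated, unmethylated), and that a read counts toward h only when its string equals h. The function name `methylation_haplotype_fractions` is hypothetical.

```python
from collections import Counter


def methylation_haplotype_fractions(read_haplotypes):
    """Return {haplotype h: N_ih / N_i} for one target interval i.

    `read_haplotypes` lists one haplotype string per read located in
    the interval.
    """
    n_i = len(read_haplotypes)         # N_i: reads located in the interval
    if n_i == 0:
        return {}                      # no coverage, nothing to report
    counts = Counter(read_haplotypes)  # N_ih for each observed haplotype h
    return {hap: n_ih / n_i for hap, n_ih in counts.items()}


if __name__ == "__main__":
    reads = ["111", "111", "110", "000"]
    print(methylation_haplotype_fractions(reads))
```

  • The resulting per-haplotype values are the kind of methylation level features that are later passed to the logistic regression model.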
  • methylation levels of these methylation markers in lung cancer and six other cancer types are shown in Table 2.2 and Figures 11 and 12 below. These methylation markers have significant differences between lung cancer and other cancer types in the training set and test set (u test, p value is less than 0.05), and the methylation levels also have large differences.
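  • The significance check mentioned above can be illustrated with a two-sided Mann-Whitney U test. The sketch below uses the normal approximation with midranks for ties and no continuity correction; the text only says "u test", so the exact variant used is an assumption, as is the function name `mann_whitney_u`.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of sample_a, approximate p value).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                        # pooled[i:j] share one value
        midrank = (i + 1 + j) / 2.0       # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t            # tie correction for the variance
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    n = n_a + n_b
    mean_u = n_a * n_b / 2.0
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_a, 1.0                   # all values identical
    z = (u_a - mean_u) / math.sqrt(var_u)
    return u_a, math.erfc(abs(z) / math.sqrt(2.0))
```

  • Here `sample_a` would hold a marker's methylation levels in lung cancer samples and `sample_b` the levels in the other cancer types; for small groups an exact test is generally preferable to this approximation.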
  • Table 2.2 Average methylation levels of methylation markers in lung cancer and other six cancer types in the training set and test set
  • Example 2.2 Discrimination performance of single lung cancer tissue-specific methylation markers
  • The methylation level data of a single lung cancer tissue-specific methylation marker in the training set of Example 2.1 was used to train the model, and the test set samples were used to verify the performance of the model.
  • the specific steps are as follows:
  • AllModel = LogisticRegression().
  • In the formula of the model, x is the methylation level value of the sample's target lung cancer tissue-specific methylation marker, w is the coefficient of each marker, b is the intercept, and y is the model prediction score.
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation site in the test set samples and TestPred is the model prediction score; this prediction score is compared with the above threshold to determine whether the sample is lung cancer.
  • The performance of the logistic regression model of each single lung cancer tissue-specific methylation marker is shown in Table 2.3. As the table shows, all lung cancer tissue-specific methylation markers reach an AUC of more than 0.67 and an accuracy of more than 0.58 in both the test set and the training set, and are all good lung cancer tissue-specific markers. Among them, outstanding markers such as Seq ID NO:132, Seq ID NO:111, and Seq ID NO:129 all achieve a sensitivity of over 75% at a specificity of over 80% in the test set, with an overall accuracy of over 80%.
  • Example 2.3 Machine learning model for all target lung cancer tissue-specific methylation markers
  • This embodiment uses the methylation levels of all 48 lung cancer tissue-specific methylation markers to construct a logistic regression machine learning model to accurately distinguish lung cancer samples from multiple cancer types.
  • The specific steps are consistent with Example 2.2, except that the data of all 48 target methylation markers are used for the relevant samples. The details are as follows:
  • AllModel = LogisticRegression().
  • In the model, x is the methylation level value of the sample's target methylation marker, w is the coefficient of each methylation marker, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score.
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the test set data (methylation haplotype frequency) and TestPred is the model prediction score; this prediction score is compared with the above threshold to determine whether the sample is lung cancer.
  • the ROC curve is shown in Figure 16.
  • The AUC for distinguishing lung cancer from other cancer types reached 0.903.
  • The threshold was set to 0.336. A score greater than this value is predicted to be lung cancer; otherwise, the sample is predicted to be another cancer type.
  • At a specificity of 94.7%, the sensitivity reached 80.0% and the overall prediction accuracy reached 85.0%, so lung cancer samples can be well distinguished from samples of the 7 cancer types.
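  • An AUC of this kind can be computed directly from prediction scores as the probability that a randomly chosen lung cancer sample scores higher than a randomly chosen sample of another cancer type. A minimal sketch, with the hypothetical function name `roc_auc` and labels of 1 for lung cancer and 0 otherwise:

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    A tie between a positive and a negative score counts as one half.
    Both classes are assumed to be present.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

  • Unlike sensitivity and specificity, this value does not depend on the threshold, which is why it is reported separately from the threshold-based figures.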
  • Example 2.4 Lung cancer tissue-specific methylation marker combination 1 machine learning model
  • This example randomly selected a total of 10 lung cancer tissue-specific methylation markers from all 48 lung cancer tissue-specific methylation markers, namely Seq ID NO:92, Seq ID NO:95, Seq ID NO:99, Seq ID NO:103, Seq ID NO:112, Seq ID NO:76, Seq ID NO:126, Seq ID NO:128, Seq ID NO:133 and Seq ID NO:135, and used their methylation level data to build a new machine learning model.
  • The method of constructing the machine learning model is also consistent with Example 2.2, except that the relevant samples use only the data of the 10 lung cancer tissue-specific methylation markers in this example.
  • the model scores of the model in the training set and test set are different.
  • The AUC of this model in the test set reached 0.895. With the threshold set to 0.226, a score greater than this value is predicted to be lung cancer, and a score below it is predicted to be another cancer type.
  • The specificity is 88.7%, the sensitivity reaches 80.0%, and the overall accuracy reaches 87.7%, which illustrates the good performance of the combined model.
  • Example 2.5 Lung cancer tissue-specific methylation marker combination 2 machine learning model
  • This example uses another lung cancer tissue-specific methylation marker combination, Seq ID NO:112, Seq ID NO:124, Seq ID NO:128, Seq ID NO:130 and Seq ID NO:133, a total of 5 lung cancer tissue-specific methylation markers, to construct a machine learning model.
  • The model construction method is also consistent with Example 2.2, except that the relevant samples use only the data of the 5 markers in this example.
  • With the threshold set to 0.253, the specificity in the test set is 95.4%, the sensitivity reaches 75.0%, and the overall accuracy reaches 93.0%. This combination can also distinguish lung cancer from other cancer types well.
  • This application screened out 48 lung cancer-specific methylation markers from the methylation NGS sequencing data of 7 cancer types.
  • The machine learning model constructed based on the methylation level data of these methylation markers can distinguish lung cancer samples well within the data of the 7 cancer types.
  • These methylation markers are good lung cancer tissue-specific methylation markers and provide an important reference for tracing the tissue of origin of lung cancer during early pan-cancer screening.
  • Example 3.1 Targeted methylation sequencing to screen liver cancer-specific methylation sites
  • the training set is used to construct the following machine learning model, and the test set is used to test the performance of the model.
  • the sample information is shown in Table 3.1 below.
  • the total number of liver cancer samples in the training set is 104, and the total number of liver cancer samples in the test set is 59.
  • The MethylTitan™ method independently developed by the applicant is used to obtain the methylation sequencing data of plasma cfDNA of the target samples and to identify the DNA methylation classification markers.
  • The process is as follows:
  • PEAR (v0.6.0) software merges the paired-end reads of the same fragment, produced by paired-end 150 bp sequencing on an Illumina HiSeq X10/NextSeq 500/NovaSeq sequencer, into one sequence.
  • The minimum overlap length is 20 bp, and the minimum length after merging is 30 bp.
  • the reference genome data used in this article comes from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
  • the nucleotide numbering of sites herein corresponds to the nucleotide position numbering of HG19.
  • A target methylation region may have multiple methylation haplotypes, and the MHF value needs to be calculated for each methylation haplotype in the target region.
  • An example of the MHF calculation formula is as follows:
  • i represents the target methylation interval
  • h represents the target methylation haplotype
  • N_i represents the number of reads located in the target methylation interval
  • N_{i,h} represents the number of reads containing the target methylation haplotype
  • Liver cancer tissue-specific methylation markers were screened based on the training set samples.
  • The liver cancer tissue-specific methylation markers selected are shown in Table 3.2.
  • The methylation levels of these methylation markers in liver cancer and the other six cancer types are shown in Table 3.2, Figure 21 and Figure 22. Compared with other cancer types, these methylation markers show significant differences in both the training set and the test set (U test, p value less than 0.05), and the methylation levels also differ considerably.
  • Example 3.2 Discriminative performance of a single liver cancer methylation marker
  • The methylation level data of a single liver cancer methylation marker in the training set of Example 3.1 was used to train the model, and the test set samples were used to verify the performance of the model. The specific steps are as follows:
  • AllModel = LogisticRegression().
  • In the model, x is the methylation level value of the sample's target marker, w is the coefficient of each marker, b is the intercept, and y is the model prediction score.
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation site in the test set samples and TestPred is the model prediction score; this prediction score is compared with the above threshold to determine whether the sample is liver cancer.
  • The performance of the logistic regression model of each single liver cancer methylation marker is shown in Table 3.3. As the table shows, all liver cancer methylation markers achieve an AUC of more than 0.76 and an accuracy of more than 0.70 in both the test set and the training set, and are all good liver cancer tissue-specific markers. Among them, markers with excellent performance such as Seq ID NO:156, Seq ID NO:145, and Seq ID NO:150 achieve a sensitivity of more than 83% at a specificity of about 80%, with an overall accuracy of about 80%.
  • Example 3.3 Machine learning model for all target liver cancer methylation markers
  • This embodiment uses the methylation levels of all 37 liver cancer methylation markers to construct a logistic regression machine learning model to accurately distinguish liver cancer samples from multiple cancer types.
  • The specific steps are consistent with Example 3.2, except that the data of all 37 target liver cancer methylation markers are used. The specific steps are as follows:
  • AllModel = LogisticRegression().
  • In the model, x is the methylation level value of the sample's target methylation marker, w is the coefficient of each methylation marker, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score.
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the test set data (methylation haplotype frequency) and TestPred is the model prediction score; this prediction score is compared with the above threshold to determine whether the sample is liver cancer.
  • The ROC curve is shown in Figure 26. In the test set, the AUC for distinguishing liver cancer from other cancer types reached 0.906. The threshold was set to 0.297: a sample scoring above this value is predicted to be liver cancer, and otherwise it is predicted to be another cancer type. At a specificity of 91.5%, the sensitivity reached 76.3% and the overall prediction accuracy reached 86.1%, so the model distinguishes liver cancer samples well among the samples of 7 cancer types.
  • Example 3.4 Machine learning model of liver cancer methylation marker combination 1
  • This example randomly selected a total of 9 liver cancer methylation markers from all 37 liver cancer methylation markers (Seq ID NO:18, Seq ID NO:143, Seq ID NO:23, Seq ID NO:147, Seq ID NO:150, Seq ID NO:117, Seq ID NO:153, Seq ID NO:156, Seq ID NO:157) and used their methylation level data to construct a new machine learning model.
  • The method of constructing the machine learning model is consistent with Example 3.2, except that only the data of the 9 liver cancer methylation markers in this example are used.
  • The model scores in the training set and test set are shown in Figure 27.
  • The AUC of this model on the test set reached 0.955. With the threshold set to 0.265, a sample scoring above this value is predicted to be liver cancer, and a sample scoring below it is predicted to be another cancer type.
  • At a specificity of 93.4%, the sensitivity reaches 76.3% and the overall accuracy reaches 87.3%, which illustrates the good performance of the combined model.
  • Example 3.5 Liver cancer methylation marker combination 2 machine learning model
  • This example uses another liver cancer methylation marker combination: Seq ID NO:138, Seq ID NO:143, Seq ID NO:23, Seq ID NO:145, Seq ID NO:150, Seq ID NO:151, Seq ID NO:152, Seq ID NO:125, Seq ID NO:156, Seq ID NO:132.
  • A total of 10 liver cancer methylation markers were used to construct the machine learning model.
  • The model construction method is consistent with Example 3.2, except that only the data of the 10 liver cancer methylation markers in this example are used.
  • With the threshold set to 0.279, the sensitivity reaches 74.6% at a specificity of 91.5%, and the overall accuracy reaches 85.5%. This combination can likewise distinguish liver cancer from other cancer types well.
  • This application screened out 37 liver cancer-specific methylation markers from the methylation NGS sequencing data of 7 cancer types.
  • The machine learning models built on the methylation level data of these markers distinguish liver cancer samples well within the data of the 7 cancer types.
  • These methylation markers are good liver cancer tissue-specific methylation markers and provide an important reference for the tissue traceability of liver cancer during early pan-cancer screening.
  • Example 4.1 Methylation-targeted sequencing to screen breast cancer-specific methylation sites
  • The training set is used to construct the machine learning models below, and the test set is used to test model performance.
  • The sample information is shown in Table 4.1 below.
  • The training set contains 37 breast cancer samples, and the test set contains 17.
  • The MethylTitan method was used to obtain the methylation sequencing data of the plasma cfDNA of the target samples and to identify the DNA methylation classification markers. The process is as follows:
  • Pear (v0.6.0) software merges the paired-end reads of the same fragment, produced by paired-end 150 bp sequencing on the Illumina Hiseq X10/Nextseq 500/Novaseq sequencers, into one sequence.
  • The minimum overlap length is 20 bp, and the minimum length after merging is 30 bp.
  • The reference genome data used herein come from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
  • The nucleotide numbering of sites herein corresponds to the nucleotide position numbering of HG19.
  • A target methylation region may have multiple methylation haplotypes, and the MHF value must be calculated for each methylation haplotype in the target region.
  • An example of the MHF calculation formula is as follows: $\mathrm{MHF}_{i,h} = N_{i,h} / N_i$, where
  • $i$ represents the target methylation interval
  • $h$ represents the target methylation haplotype
  • $N_i$ represents the number of reads located in the target methylation interval
  • $N_{i,h}$ represents the number of reads containing the target methylation haplotype.
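As a worked illustration of the definitions above, the MHF of haplotype h in interval i can be read as the ratio of N_{i,h} to N_i. The sketch below assumes each read covering the interval has already been reduced to a haplotype string in which 1 marks a methylated CpG and 0 an unmethylated CpG, and that a read "contains" a haplotype when its string matches exactly. The encoding and the function name are assumptions for illustration, not part of the described pipeline.

```python
from collections import Counter

def methylation_haplotype_fractions(read_haplotypes):
    """Map each haplotype h seen in one interval i to N_{i,h} / N_i.

    read_haplotypes: one string per read located in the interval,
    e.g. "1101" (assumed encoding: 1 = methylated CpG, 0 = unmethylated).
    """
    n_i = len(read_haplotypes)  # N_i: reads located in the interval
    if n_i == 0:
        return {}
    counts = Counter(read_haplotypes)  # N_{i,h} for every observed h
    return {h: n_ih / n_i for h, n_ih in counts.items()}

# Five hypothetical reads covering a four-CpG interval.
reads = ["1111", "1111", "1101", "0000", "1111"]
mhf = methylation_haplotype_fractions(reads)
print(mhf["1111"])  # 3 of 5 reads carry the fully methylated haplotype
```

The fractions over all haplotypes in one interval sum to 1, which is a quick sanity check on real data.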
  • The breast cancer tissue-specific methylation markers that were selected are detailed in Table 4.2.
  • The methylation levels of these markers in breast cancer and the six other cancer types are shown in Table 4.2 and Figures 31 and 32 below. In both the training set and the test set, these markers differ significantly between breast cancer and the other cancer types (U test p value below 0.05), and the methylation levels themselves also differ substantially.
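The significance criterion above (U test p value below 0.05) can be illustrated with a small, dependency-free sketch of a two-sided Mann-Whitney U test. The text does not say which variant or software was used, so this version, which uses the normal approximation and omits the tie correction of the variance, is an assumption; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used. The methylation values are invented for illustration.

```python
from statistics import NormalDist

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test with the normal approximation.

    Returns (U statistic of group a, approximate p value). Tied values
    receive average ranks; the variance tie correction is omitted.
    """
    pooled = sorted((v, g) for g, vals in enumerate((a, b)) for v in vals)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = (n1 * n2 * (n1 + n2 + 1) / 12.0) ** 0.5
    z = (u - mean) / sd
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u, p

# Hypothetical methylation levels of one marker in two groups.
target_cancer = [0.80, 0.90, 0.70, 0.85, 0.95]
other_cancers = [0.10, 0.20, 0.15, 0.30, 0.25]
u, p = mann_whitney_u(target_cancer, other_cancers)
print(u, p < 0.05)
```

With samples this small an exact test is preferable; the approximation is shown only to make the criterion concrete.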
  • Table 4.2 Average methylation levels of methylation markers in breast cancer and other six cancer types in the training set and test set
  • Example 4.2 Discrimination performance of single methylation markers
  • The methylation level data of a single methylation marker were used to train a model on the training set of Example 4.1, and the test set samples were used to verify the performance of the model. The specific steps are as follows:
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation site in the test set samples and TestPred is the model prediction score. This score is compared with the above threshold to judge whether a sample is breast cancer.
  • The performance of the single methylation marker logistic regression models in this example is shown in Table 4.3. All methylation markers achieve an AUC above 0.70 and an accuracy above 0.73 in both the test set and the training set, so all are good breast cancer tissue-specific markers.
  • Among them, outstanding markers such as Seq ID NO:31 and Seq ID NO:22 reach a sensitivity of […]% at a specificity of about 70% in the test set, with an AUC of about 0.85 and an overall accuracy of about 80%.
  • Example 4.3 Machine learning model for all target methylation markers
  • This embodiment uses the methylation levels of all 51 methylation markers to construct a logistic regression machine learning model to accurately distinguish breast cancer samples from multiple cancer types.
  • The specific steps are consistent with Example 4.2, except that the model uses the data of all 51 target methylation markers. The specific steps are as follows:
  • AllModel = LogisticRegression().
  • x is the methylation level value of the sample's target methylation markers
  • w is the coefficient of the different methylation markers
  • b is the intercept value (the parameters are obtained by training the logistic regression model)
  • y is the model prediction score:
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the test set data (methylation haplotype frequency) and TestPred is the model prediction score. This score is compared with the above threshold to judge whether a sample is breast cancer.
  • The ROC curve is shown in Figure 36.
  • The AUC for distinguishing breast cancer from other cancer types reached 0.921.
  • The threshold was set to 0.178: a sample scoring above this value is predicted to be breast cancer, and otherwise it is predicted to be another cancer type.
  • At a specificity of 90.4%, the sensitivity reached 85.7% and the overall prediction accuracy reached 89.8%.
  • Breast cancer samples can thus be distinguished well among the samples of 7 cancer types.
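The sensitivity, specificity, and accuracy quoted in these examples all follow from applying one threshold to the model scores. The sketch below shows that bookkeeping under the rule stated above (a score greater than the threshold is called the target cancer type). The scores and labels are invented, and only the threshold 0.178 is taken from this example; the function name is hypothetical.

```python
def threshold_metrics(scores, labels, threshold):
    """Compute sensitivity, specificity and accuracy at one threshold.

    labels: 1 = target cancer type, 0 = other cancer types.
    A sample is called positive when its score is greater than the threshold.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s > threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s <= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s <= threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s > threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical prediction scores for 3 target and 5 non-target samples.
scores = [0.90, 0.40, 0.10, 0.05, 0.30, 0.02, 0.15, 0.60]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
print(threshold_metrics(scores, labels, threshold=0.178))
```

Raising the threshold trades sensitivity for specificity, which is why each example reports the sensitivity at a stated specificity.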
  • This example randomly selected a total of 8 methylation markers (Seq ID NO: 16, Seq ID NO: 20, Seq ID …) from all 51 methylation markers.
  • The method of constructing the machine learning model is consistent with Example 4.2, except that only the data of the 8 markers in this example are used.
  • The model scores in the training set and test set are shown in Figure 37.
  • The AUC of this model on the test set reached 0.893. With the threshold set to 0.143, a sample scoring above this value is predicted to be breast cancer, and a sample scoring below it is predicted to be another cancer type.
  • At a specificity of 88.6%, the sensitivity reaches 66.7% and the overall accuracy reaches 86.1%, which illustrates the good performance of the combined model.
  • This example uses another methylation marker combination: Seq ID NO:5, Seq ID NO:11, Seq ID NO:14, Seq ID NO:27, Seq ID NO:28, Seq ID NO:32, Seq ID NO:45, Seq ID NO:49, Seq ID NO:51. A total of 9 methylation markers were used to construct the machine learning model.
  • The model construction method is consistent with Example 4.2, except that only the data of the 9 markers in this example are used.
  • This patent screened out 51 breast cancer-specific methylation markers from the methylation NGS sequencing data of 7 cancer types.
  • The machine learning models built on the methylation level data of these markers distinguish breast cancer samples well within the data of the 7 cancer types.
  • These methylation markers are good breast cancer tissue-specific methylation markers and provide an important reference for the tissue traceability of breast cancer during early pan-cancer screening.
  • Example 5.1 Targeted methylation sequencing to screen esophageal cancer/gastric cancer-specific methylation sites
  • The training set is used to construct the machine learning models below, and the test set is used to test model performance.
  • The sample information is shown in Table 5.1 below. Esophageal cancer and gastric cancer are classified into one category; the training set contains 71 samples of this category, and the test set contains 40.
  • The MethylTitan™ method independently developed by the applicant is used to obtain the methylation sequencing data of the plasma cfDNA of the target samples and to identify the DNA methylation classification markers.
  • The process is as follows:
  • Pear (v0.6.0) software merges the paired-end reads of the same fragment, produced by paired-end 150 bp sequencing on the Illumina Hiseq X10/Nextseq 500/Novaseq sequencers, into one sequence.
  • The minimum overlap length is 20 bp, and the minimum length after merging is 30 bp.
  • The reference genome data used herein come from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
  • The nucleotide numbering of sites herein corresponds to the nucleotide position numbering of HG19.
  • A target methylation region may have multiple methylation haplotypes, and the MHF value must be calculated for each methylation haplotype in the target region.
  • An example of the MHF calculation formula is as follows: $\mathrm{MHF}_{i,h} = N_{i,h} / N_i$, where
  • $i$ represents the target methylation interval
  • $h$ represents the target methylation haplotype
  • $N_i$ represents the number of reads located in the target methylation interval
  • $N_{i,h}$ represents the number of reads containing the target methylation haplotype.
  • The gastric cancer and/or esophageal cancer tissue-specific methylation markers that were screened are shown in Table 5.2.
  • The methylation levels of these markers in gastric cancer and/or esophageal cancer and the other five cancer types are shown in Table 5.2 and Figure 41.
  • In both the training set and the test set, these markers differ significantly between gastric cancer and/or esophageal cancer and the other cancer types (U test p value below 0.05), and the methylation levels themselves also differ substantially.
  • Example 5.2 Discrimination performance of single methylation markers
  • The methylation level data of a single methylation marker were used to train a model on the training set of Example 5.1, and the test set samples were used to verify the performance of the model.
  • The specific steps are as follows:
  • AllModel = LogisticRegression().
  • x is the methylation level value of the sample's target marker
  • w is the coefficient of the different markers
  • b is the intercept value
  • y is the model prediction score:
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation site in the test set samples and TestPred is the model prediction score. This score is compared with the above threshold to judge whether a sample is esophageal cancer/gastric cancer.
  • The performance of the single-marker logistic regression models in this example is shown in Table 5.3. All markers achieve an AUC above 0.59 and an accuracy above 0.56 in both the test set and the training set, so all are good tissue-specific markers for esophageal cancer and gastric cancer. Among them, the best-performing markers, such as Seq ID NO:172, Seq ID NO:173, and Seq ID NO:184, achieve a sensitivity of 60% at a specificity above 70%, with an accuracy of about 70%.
  • Example 5.3 Machine learning model for all target methylation markers
  • This embodiment uses the methylation levels of all 34 methylation markers to construct a logistic regression machine learning model to accurately distinguish gastric cancer and/or esophageal cancer samples from multiple cancer types.
  • The specific steps are consistent with Example 5.2, except that the model uses the data of all 34 target methylation markers. The specific steps are as follows:
  • AllModel = LogisticRegression().
  • x is the methylation level value of the sample's target methylation markers
  • w is the coefficient of the different methylation markers
  • b is the intercept value (the parameters are obtained by training the logistic regression model)
  • y is the model prediction score:
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the test set data (methylation haplotype frequency) and TestPred is the model prediction score. This score is compared with the above threshold to judge whether a sample is esophageal cancer/gastric cancer.
  • The ROC curve is shown in Figure 46.
  • The AUC for distinguishing gastric cancer and/or esophageal cancer from other cancer types reached 0.922. The threshold was set to 0.346: a sample scoring above this value is predicted to be gastric cancer and/or esophageal cancer, and otherwise it is predicted to be another cancer type.
  • At a specificity of 95.2%, the sensitivity reaches 75% and the overall prediction accuracy reaches 89.7%, so the model distinguishes gastric cancer and/or esophageal cancer well among the samples of 7 cancer types.
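The AUC values reported with each ROC curve have a simple rank interpretation: the probability that a randomly chosen target-cancer sample receives a higher score than a randomly chosen sample of another cancer type, with ties counted as one half. A minimal sketch of that computation follows; the scores and labels are invented for illustration, and the function name is hypothetical rather than something named in the text.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 = target cancer type, 0 = other cancer types.
    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical prediction scores for 3 target and 5 non-target samples.
scores = [0.90, 0.40, 0.10, 0.05, 0.30, 0.02, 0.15, 0.60]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
print(round(roc_auc(scores, labels), 3))
```

Unlike sensitivity and specificity, this quantity does not depend on the chosen threshold, which is why it is reported separately.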
  • This example randomly selected a total of 7 methylation markers from all 34 methylation markers (Seq ID NO:165, Seq ID NO:167, Seq ID NO:169, Seq ID NO:150, Seq ID NO:172, Seq ID NO:174, Seq ID NO:179) and used their methylation level data to build a new machine learning model.
  • The method of constructing the machine learning model is consistent with Example 5.2, except that only the data of the 7 markers in this example are used.
  • The model scores in the training set and test set are shown in Figure 47.
  • The AUC of this model on the test set reached 0.917. With the threshold set to 0.30, a sample scoring above this value is predicted to be gastric cancer and/or esophageal cancer, and a sample scoring below it is predicted to be another cancer type.
  • At a specificity of 91.4%, the sensitivity reaches 70% and the overall accuracy reaches 85.5%, indicating that the combined model performs well.
  • This example uses another methylation marker combination: Seq ID NO:143, Seq ID NO:23, Seq ID NO:172, Seq ID NO:174, Seq ID NO:177, Seq ID NO:178, Seq ID NO:180, Seq ID NO:183, Seq ID NO:186. A total of 9 methylation markers were used to construct the machine learning model.
  • The model construction method is consistent with Example 5.2, except that only the data of the 9 markers in this example are used.
  • With the threshold set to 0.285, the sensitivity reaches 62.5% at a specificity of 91.4%, and the overall accuracy reaches 83.4%. This combination can likewise distinguish gastric cancer and/or esophageal cancer from other cancer types well.
  • This application screened out 34 esophageal cancer and gastric cancer-specific methylation markers from the methylation NGS sequencing data of 7 cancer types.
  • Machine learning models were constructed from the methylation level data of these methylation markers.
  • The models distinguish gastric cancer and/or esophageal cancer samples well within the data of the 7 cancer types.
  • These methylation markers are good gastric cancer and/or esophageal cancer tissue-specific methylation markers and provide an important reference for the tissue traceability of gastric cancer and/or esophageal cancer during early pan-cancer screening.
  • Example 6.1 Methylation-targeted sequencing to screen pancreatic cancer-specific methylation sites
  • The training set is used to construct the machine learning models below, and the test set is used to test model performance.
  • The sample information is shown in Table 6.1 below.
  • The training set contains 37 pancreatic cancer samples, and the test set contains 17.
  • The MethylTitan™ method independently developed by the applicant is used to obtain the methylation sequencing data of the plasma cfDNA of the target samples and to identify the DNA methylation classification markers.
  • The process is as follows:
  • Pear (v0.6.0) software merges the paired-end reads of the same fragment, produced by paired-end 150 bp sequencing on the Illumina Hiseq X10/Nextseq 500/Novaseq sequencers, into one sequence.
  • The minimum overlap length is 20 bp, and the minimum length after merging is 30 bp.
  • The reference genome data used herein come from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
  • The nucleotide numbering of sites herein corresponds to the nucleotide position numbering of HG19.
  • A target methylation region may have multiple methylation haplotypes, and the MHF value must be calculated for each methylation haplotype in the target region.
  • An example of the MHF calculation formula is as follows: $\mathrm{MHF}_{i,h} = N_{i,h} / N_i$, where
  • $i$ represents the target methylation interval
  • $h$ represents the target methylation haplotype
  • $N_i$ represents the number of reads located in the target methylation interval
  • $N_{i,h}$ represents the number of reads containing the target methylation haplotype.
  • The pancreatic cancer tissue-specific methylation markers that were selected are shown in Table 6.2.
  • The relevant methylation markers are located within the target gene or in its upstream or downstream regions, and one methylation marker or a combination of several can be used as pancreatic cancer-specific methylation markers.
  • The methylation levels of these markers in pancreatic cancer and the six other cancer types are shown in Table 6.2 and Figures 51 and 52. In both the training set and the test set, these markers differ significantly between pancreatic cancer and the other cancer types (U test p value below 0.05), and the methylation levels themselves also differ substantially.
  • Example 6.2 Discriminative performance of single pancreatic cancer methylation marker
  • The methylation level data of a single pancreatic cancer methylation marker were used to train a model on the training set of Example 6.1, and the test set samples were used to verify the performance of the model.
  • The specific steps are as follows:
  • AllModel = LogisticRegression().
  • x is the methylation level value of the sample's target marker
  • w is the coefficient of the different pancreatic cancer markers
  • b is the intercept value
  • y is the model prediction score:
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation site in the test set samples and TestPred is the model prediction score. This score is compared with the above threshold to judge whether a sample is pancreatic cancer.
  • The pancreatic cancer markers achieve an AUC above 0.60 and an accuracy above 0.68 in both the test set and the training set, so they are good pancreatic cancer tissue-specific markers.
  • Among them, the best-performing pancreatic cancer markers, such as Seq ID NO:194 and Seq ID NO:189, reach a sensitivity above 40% at a specificity above 75% in the test set, with an overall accuracy above 73%.
  • Example 6.3 Machine learning model for all target pancreatic cancer methylation markers
  • This embodiment uses the methylation levels of all 36 pancreatic cancer methylation markers to construct a logistic regression machine learning model to accurately distinguish pancreatic cancer samples from multiple cancer types.
  • The specific steps are consistent with Example 6.2, except that the model uses the data of all 36 target pancreatic cancer methylation markers.
  • AllModel = LogisticRegression(). The formula of this model is as follows, where x is the methylation level value of the sample's target pancreatic cancer methylation markers, w is the coefficient of the different pancreatic cancer methylation markers, b is the intercept value (the parameters are obtained by training the logistic regression model), and y is the model prediction score:
  • TestPred = AllModel.predict_proba(TestData)[:,1], where TestData is the test set data (methylation haplotype frequency) and TestPred is the model prediction score. This score is compared with the above threshold to judge whether a sample is pancreatic cancer.
  • The ROC curve is shown in Figure 56.
  • The AUC for distinguishing pancreatic cancer from other cancer types reached 0.921.
  • The threshold was set to 0.124: a sample scoring above this value is predicted to be pancreatic cancer, and otherwise it is predicted to be another cancer type.
  • At a specificity of 93.5%, the sensitivity reached 70.6% and the overall prediction accuracy reached 91.4%, so the model distinguishes pancreatic cancer samples well among the samples of 7 cancer types.
  • Example 6.4 Pancreatic cancer methylation marker combination 1 machine learning model
  • This example randomly selected a total of 11 pancreatic cancer methylation markers from all 36 pancreatic cancer methylation markers (Seq ID NO:190, Seq ID NO:195, Seq ID NO:202, Seq ID NO:203, Seq ID NO:206, Seq ID NO:172, Seq ID NO:210, Seq ID NO:211, Seq ID NO:213, Seq ID NO:154, Seq ID NO:214) and used their methylation level data to build a new machine learning model.
  • The method of constructing the machine learning model is consistent with Example 6.3, except that only the data of the 11 pancreatic cancer markers in this example are used.
  • The model scores in the training set and test set are shown in Figure 57.
  • The AUC of this model on the test set reached 0.931. With the threshold set to 0.114, a sample scoring above this value is predicted to be pancreatic cancer, and a sample scoring below it is predicted to be another cancer type.
  • The sensitivity reached 64.7% and the overall accuracy reached 89.8%, indicating the good performance of the combined model.
  • Example 6.5 Pancreatic cancer methylation marker combination 2 machine learning model
  • This example uses another pancreatic cancer methylation marker panel: Seq ID NO:195, Seq ID NO:196, Seq ID NO:199, Seq ID NO:202, Seq ID NO:203, Seq ID NO:210, Seq ID NO:211, Seq ID NO:213, Seq ID NO:154, Seq ID NO:216. A total of 10 pancreatic cancer methylation markers were used to construct the machine learning model.
  • The model construction method is consistent with Example 6.3, except that only the data of the 10 markers in this example are used.
  • This application screened out 36 pancreatic cancer-specific methylation markers from the methylation NGS sequencing data of 7 cancer types. The machine learning models constructed from the methylation level data of these markers distinguish pancreatic cancer samples well within the data of the 7 cancer types. These methylation markers are good pancreatic cancer tissue-specific methylation markers and provide an important reference for the tissue traceability of pancreatic cancer during early pan-cancer screening.
  • pancreatic cancer methylation markers is as follows:

Abstract

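Provided are specific methylation markers for particular cancers, such as colorectal cancer, and their applications. The disclosure relates to the use of a reagent or component in the preparation of a kit or device, where the kit or device is used (1) to distinguish patients with a particular cancer, such as colorectal cancer, from patients with other cancers, such as non-colorectal cancers; (2) to diagnose or assist in diagnosing cancer; or (3) to trace the tissue of origin of a particular cancer during pan-cancer screening. For example, the reagent or component includes a reagent or component for detecting the methylation level of colorectal cancer tissue-specific methylation markers, such as the gene SFN, for example SEQ ID No. 52-90, and is used for tissue-of-origin tracing of cancers such as colorectal cancer during early pan-cancer screening, so as to better distinguish cancers such as colorectal cancer.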

Description

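Cancer-specific methylation markers and their applications
This application claims the benefit of priority of the following patent applications: the Chinese invention patent application filed on July 4, 2022, application number 202210787502.8, entitled "Colorectal cancer-specific methylation markers and their applications"; the Chinese invention patent application filed on July 4, 2022, application number 202210787412.9, entitled "Lung cancer-specific methylation markers and their application in diagnosing lung cancer"; the Chinese invention patent application filed on July 4, 2022, application number 202210787425.6, entitled "Liver cancer tissue-specific methylation markers and their application in diagnosing liver cancer"; the Chinese invention patent application filed on July 4, 2022, application number 202210786398.0, entitled "Breast cancer-specific methylation markers and their application in diagnosing breast cancer"; the Chinese invention patent application filed on July 4, 2022, application number 202210787313.0, entitled "Gastric cancer and/or esophageal cancer-specific methylation markers and their applications"; and the Chinese invention patent application filed on July 4, 2022, application number 202210787623.2, entitled "Pancreatic cancer-specific methylation markers and their application in diagnosing pancreatic cancer". The contents of these applications are incorporated herein by reference.
Technical Field
The present invention belongs to the field of molecular auxiliary diagnosis and specifically relates to cancer-specific methylation markers and their applications, for example colorectal cancer tissue-specific methylation markers and their application in diagnosing colorectal cancer.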
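Background
Colorectal cancer is one of the most common human tumors; worldwide it ranks third among malignant tumors in incidence and second in mortality. In China, the incidence of colorectal cancer is also rising steadily.
Cancer screening detects early cancer-related signals in high-risk populations so that early-stage patients are found in time. Early-stage cancer patients can be completely cured by surgical resection, so cancer screening can greatly reduce cancer mortality. The 5-year survival rate of early-stage colorectal cancer is above 90%, whereas that of patients with advanced colorectal cancer is below 10%. From 1990 to 2015, overall cancer mortality in the United States fell by 25%, with the largest declines in colorectal cancer (47% in men and 44% in women) and breast cancer (39% in women). An important part of the reason for the decline in cancer mortality is the widespread use of cancer screening technologies (Byers T et al., 2016).
Traditional colorectal cancer screening methods include the fecal immunochemical test (FIT), colonoscopy, and tumor marker testing (carcinoembryonic antigen CEA, carbohydrate antigen CA19-9), but each has limitations. Although colonoscopy is the "gold standard" for digestive tract cancers, it is an invasive and rather painful examination, so patient compliance is poor. FIT has limited diagnostic power for precancerous colorectal lesions. Tumor markers generally perform poorly, can serve only as a clinical reference, and are difficult to apply to large-scale screening.
Liquid biopsy, a hot research topic in recent years, is based on the cell-free DNA that tumor cells release into plasma (ctDNA). Compared with traditional methods, it offers convenient, non-invasive sampling, enables pan-cancer early screening, and overcomes tumor heterogeneity, and it has been widely applied. ctDNA reflects cancer information in several ways, such as mutations, fragment length distribution, and methylation. With its outstanding performance, ctDNA methylation has become a focus of research and development for early cancer screening products. Many ctDNA methylation early-screening applications already exist. For example, the pan-cancer methylation early-screening assay PanSeer achieves 88% sensitivity across 5 cancer types (gastric, esophageal, liver, colorectal, and lung cancer) at 96% specificity and can detect cancer 4 years earlier than traditional methods (Xingdong Chen et al., 2020). In colorectal cancer, a machine learning model built from only 6 qPCR markers achieves 86% sensitivity at 92% specificity, far better than traditional cancer screening methods (Guo-Xiang Cai et al., 2021).
Cancer screening, especially pan-cancer early screening, must not only predict whether a cancer signal is present but also trace the tissue of origin of positive samples. Cancers at different sites of the body have different methylation signatures (Kundaje A et al., 2015), and these tissue-specific signatures make tissue-of-origin tracing possible. However, discovering tissue-specific methylation markers requires large amounts of methylation sequencing data from multiple cancer types and a rigorous screening and validation process, which makes it a highly challenging task. There is a need in the art for colorectal cancer tissue-specific methylation markers.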
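Lung cancer is the leading cause of cancer death worldwide. Although the combined use of surgery, chemotherapy, targeted therapy, and immunotherapy has significantly improved lung cancer survival, the prognosis of lung cancer patients remains relatively poor compared with other cancers. The main reason is that most lung cancers are diagnosed at an advanced stage, which is related to the lack of widely available early lung cancer screening.
Cancer screening detects early cancer-related signals in high-risk populations so that early-stage patients are found in time. Early-stage cancer patients can be completely cured by surgical resection, so cancer screening can greatly reduce cancer mortality. About 85% of lung cancers are non-small cell lung cancer (NSCLC). The 5-year survival rate of patients with early carcinoma in situ is as high as 55.6%, whereas middle and advanced stage disease metastasizes readily, and the 5-year survival rate after metastasis is only 4.5%. Early-stage NSCLC patients have no obvious symptoms, and more than 80% of NSCLC patients are already at a middle or advanced stage when diagnosed, with lymph node spread or distant metastasis and a low survival rate (Weichert W et al., 2014). From 1990 to 2015, overall cancer mortality in the United States fell by 25%, and the decline among male lung cancer patients was as high as 45%. An important part of the reason for the decline in cancer mortality is the widespread use of cancer screening technologies (Byers T et al., 2016).
Traditional cancer screening methods include endoscopy, imaging (CT, MRI, etc.), and tumor marker testing (such as alpha-fetoprotein, used clinically to assist in diagnosing primary liver cancer; the fairly broad-spectrum tumor marker carcinoembryonic antigen; and cytokeratin 19 fragment Cyfra21-1, a tumor marker for lung cancer), but each has limitations. For example, the most widely used clinical measure for early lung cancer screening is low-dose CT (LDCT). Although LDCT can detect early-stage NSCLC patients to some extent, its specificity is low, and patients who test positive need long-term follow-up with repeated re-examination or other diagnostic procedures for confirmation. These measures significantly increase patient suffering and waste medical resources through overdiagnosis and overtreatment. Current tumor markers generally perform poorly, can serve only as a clinical reference, and are difficult to apply to large-scale screening.
Liquid biopsy, a hot research topic in recent years, is based on the cell-free DNA that tumor cells release into plasma (ctDNA). Compared with traditional methods, it offers convenient, non-invasive sampling, enables pan-cancer early screening, and overcomes tumor heterogeneity, and it has been widely applied. ctDNA reflects cancer information in several ways, such as mutations, fragment length distribution, and methylation. With its outstanding performance, ctDNA methylation has become a focus of research and development for early cancer screening products, and many ctDNA methylation early-screening applications already exist. For example, the pan-cancer methylation early-screening assay PanSeer achieves 88% sensitivity across 5 cancer types (gastric, esophageal, liver, colorectal, and lung cancer) at 96% specificity and can detect cancer 4 years earlier than traditional methods (Xingdong Chen et al., 2020).
Cancer screening, especially pan-cancer early screening, must not only predict whether a cancer signal is present but also trace the tissue of origin of positive samples. Cancers at different sites of the body have different methylation signatures (Kundaje A et al., 2015), and these tissue-specific signatures make tissue-of-origin tracing possible. However, discovering tissue-specific methylation markers requires large amounts of methylation sequencing data from multiple cancer types and a rigorous screening and validation process, which makes it a highly challenging task. There is a need in the art for lung cancer tissue-specific methylation markers.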
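Liver cancer often has no obvious clinical symptoms or signs at an early stage, and tumor growth can be either slow or rapid. Most patients are diagnosed only at an advanced stage, which leaves limited treatment options and a very poor prognosis.
Recent survival data show that the 5-year survival rate of liver cancer in Chinese population-based cancer registries is about 9.8%-12.1% (Zeng H M et al., 2018), and the 5-year survival rate in hospital-based cancer registries is 11.69% (Chen J G et al., 2018). In addition, the 5-year survival rates of patients who underwent surgical resection in 1958-1970, 1971-1982, and 1983-1994 were 4.8%, 11.2%, and 45.4%, respectively; the mortality of patients who underwent resection of small liver cancers was 63.8% (Zhou X D et al., 1996). Over the past 4-50 years, results on the practical value of AFP and the screening benefit of early detection have remained unclear (Chen JG et al., 2003; Bruix J et al., 2005; Amarapurkar D et al., 2009; Santi V et al., 2010; Kubota H et al., 2002). So far there is no internationally recognized liver cancer screening program, and the academic community has not reached a scientific consensus. However, case reports and research reports provide evidence that screening is an effective way to achieve early detection, early diagnosis, and early treatment of liver cancer. Screening is of positive and important significance for improving prognosis and reducing mortality, especially in areas where hepatitis B and liver cancer are endemic.
DNA methylation detection is considered the most promising approach to non-invasive cancer screening, and existing techniques have been shown to be usable for cancer screening and tissue-of-origin tracing (E.A. Klein et al., 2021). This makes it possible to design a single assay that detects multiple cancers early and simultaneously. Such an assay greatly broadens the scope of screening from the high-risk population of one cancer to the high-risk populations of many cancers, tests as wide a population as possible in one screening, improves the compliance of the people tested, and enlarges the population available for screening. However, the difficulty of this kind of assay lies in finding high-quality detection targets; identifying the most informative targets is both the focus and the main challenge of such technologies.
There is a need in the art for liver cancer tissue-specific methylation markers.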
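Breast cancer is the leading killer of women. In China about 278,800 people are diagnosed with breast cancer each year, and with changing lifestyles, the incidence and mortality of breast cancer in China keep rising. In Europe and the United States the 5-year survival rate of breast cancer reaches 90%, whereas Chinese data from the same period show a 5-year survival rate of 78% for breast cancer patients in economically developed Shanghai and only 58% in some regions (Fan L et al., 2014). This is largely attributable to differences in the intensity of early breast cancer screening. In the United States, the screening rate among women over 40 reaches 75%, whereas in China the screening rate among women is only 21%, and 84% of patients are already at a middle or advanced stage at diagnosis and have missed the best time for treatment. The World Health Organization has listed early breast cancer as a curable disease; the 5-year survival rate of early-stage breast cancer patients is as high as 100%, versus only 21% for stage IV patients (Li T et al., 2016). Early screening is therefore crucial for improving the survival of breast cancer patients.
Breast ultrasound, mammography, and magnetic resonance imaging are commonly used breast cancer screening methods, but these traditional methods all have technical limitations, depend heavily on the operating skill of the physician, and have a relatively high probability of missed diagnosis and misdiagnosis.
Liquid biopsy, a hot research topic in recent years, is based on the cell-free DNA that tumor cells release into plasma (ctDNA). Compared with traditional methods, it offers convenient, non-invasive sampling, enables pan-cancer early screening, and overcomes tumor heterogeneity, and it has been widely applied. ctDNA reflects cancer information in several ways, such as mutations, fragment length distribution, and methylation. With its outstanding performance, ctDNA methylation has become a focus of research and development for early cancer screening products, and many ctDNA methylation early-screening applications already exist. For example, the pan-cancer methylation early-screening assay PanSeer achieves 88% sensitivity across 5 cancer types (gastric, esophageal, liver, colorectal, and lung cancer) at 96% specificity and can detect cancer 4 years earlier than traditional methods (Xingdong Chen et al., 2020). In colorectal cancer, a machine learning model built from only 6 qPCR markers achieves 86% sensitivity at 92% specificity, far better than traditional cancer screening methods (Guo-Xiang Cai et al., 2021).
Cancer screening, especially pan-cancer early screening, must not only predict whether a cancer signal is present but also trace the tissue of origin of positive samples. Cancers at different sites of the body have different methylation signatures (Kundaje A et al., 2015), and these tissue-specific signatures make tissue-of-origin tracing possible. However, discovering breast cancer tissue-specific methylation markers requires large amounts of methylation sequencing data from multiple cancer types and a rigorous screening and validation process, which makes it a highly challenging task. There is a need in the art for breast cancer tissue-specific methylation markers.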
胃癌和食管癌都是常见的消化道肿瘤。我国是胃癌和食管癌的高发国家。根据2015年中国癌症数据报告,我国胃癌发病率和致死率都在恶性肿瘤中排第二位,食管癌发病率和致死率在恶性肿瘤中分别排第四位和第五位。早期食管癌和癌前病变大部分可通过内镜下微创治疗达到根治效果,5年生存率可达到95%,早期胃癌的5年生存率也超过了90%(Sumyama K.等人2017),中晚期食管癌患者生存质量和预后都较差,总体5年生存率不足20%,进展期胃癌的5年生存率低于30%。目前我国食管癌和胃癌早诊率都比较低,早期食管癌和胃癌患者都缺乏典型的临床性状,大多数患者就诊时已是中晚期。因此,要想提高食管癌和胃癌患者的生存率,最有效的方法就是对高风险人群进行早期筛查。
胃癌的筛查方法主要有血清学筛查和内镜筛查,其中血清学筛查包括血清肿瘤标志物检测(癌胚抗原CEA,糖类抗原CA19-9等),血清胃蛋白酶原(pepsinogen,PG)检测,幽门螺旋杆菌感染检测等,但是血清学相关方法灵敏度和特异性都比较低,难以大规模人群筛查使用。食管癌的筛查方法以内镜为主。内镜及其活检是诊断胃癌和食管癌的金标准,但是内镜检查依赖设备和内镜医师资源,检查费用相对较高,且为侵入性检测,患者依从性较 差,难以大规模人群筛查使用。
近年来研究火热的液体活检,以肿瘤细胞释放到血浆中的游离DNA(ctDNA)为基础,相比传统方法具有取样方便,非侵入性,可实现泛癌种早筛以及克服了肿瘤异质性等优点,得到了大量的应用。ctDNA可以从多方面反映癌症的信息,如突变,片段化长度分布,甲基化等,其中ctDNA的甲基化以其出众的性能已经成为癌症早筛产品研究和开发的热点,已经有众多ctDNA甲基化早筛的应用,如泛癌种甲基化早筛应用PanSeer在96%的特异性下,在5个癌种(胃癌,食管癌,肝癌,结直肠癌,肺癌)中可以达到88%的敏感性,相比传统方法可以提前4年发现癌症(Xingdong Chen等人,2020);结直肠癌中仅使用6个qPCR标志物构建的机器学习模型就可以在92%的特异性下达到86%的敏感性,达到远优于传统癌症筛查方法的效果(Guo-Xiang Cai等人,2021)。
癌症筛查尤其是泛癌种早筛不仅需要预测癌症信号的有无,还需要对阳性的样本进行组织溯源,而人体不同的位置的癌种具有不同的甲基化特征(Kundaje A等人,2015),利用这些组织特异的甲基化特征可以实现组织溯源。但是,组织特异性甲基化标志物的发现需要多个癌种大量的甲基化测序数据以及严格的筛选验证过程,是一项具有较大挑战性的工作。
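The stomach and the esophagus are adjacent organs, and for samples that test positive, gastroscopy can confirm lesions of the esophagus and the stomach at the same time. In the tissue-of-origin stage of pan-cancer screening, esophageal cancer and gastric cancer can therefore be grouped into one class: methylation markers specific to the two cancer types are sought, and a model is built to distinguish esophageal and gastric cancer from other cancer types.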
本领域中需要用于胃癌和/或食管癌组织特异性甲基化标志物。
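Cancer screening detects early cancer-related signals in high-risk populations so that patients with early-stage cancer are identified in time. Early-stage cancer can be completely cured by surgical resection, so screening can greatly reduce cancer mortality. Pancreatic cancer is the most malignant tumor of the digestive system, and early detection followed by surgical resection is the only route to a cure. According to 2018 global cancer epidemiology data, pancreatic cancer accounts for 2.7% of all tumors and ranks ninth. The overall 5-year survival rate for pancreatic cancer is currently only about 5%, mainly because the disease is difficult to diagnose early and is usually advanced by the time it is confirmed. By contrast, the 5-year survival rate of patients with stage I disease, or with early pancreatic cancer less than 1 cm in diameter, can reach 75%. Survival in pancreatic cancer can be improved only by screening such patients early.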
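Conventional screening methods for pancreatic cancer are mainly imaging (color Doppler ultrasound, CT, magnetic resonance imaging, etc.) and blood tumor markers (chiefly the carbohydrate antigen CA199 test). Pancreatic cancer is suspected when ultrasound or CT reveals a pancreatic mass or when the tumor marker CA199 is markedly elevated. However, CA199 is elevated in only 65% of patients with resectable pancreatic cancer and is not suitable for large-scale early screening. Ultrasound can detect tumors of 2 cm or more in diameter, and CT or magnetic resonance imaging can detect pancreatic tumors of 1 cm or more; early pancreatic tumors smaller than 1 cm are missed, so these methods are likewise difficult to apply to large-scale population screening.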
近年来研究火热的液体活检,以肿瘤细胞释放到血浆中的游离DNA(ctDNA)为基础,相比传统方法具有取样方便,非侵入性,可实现泛癌种早筛以及克服了肿瘤异质性等优点,得到了大量的应用。ctDNA可以从多方面反映癌症的信息,如突变,片段化长度分布,甲基化等,其中ctDNA的甲基化以其出众的性能已经成为癌症早筛产品研究和开发的热点,已经有众多ctDNA甲基化早筛的应用,如泛癌种甲基化早筛应用PanSeer在96%的特异性下,在5个癌种(胃癌,食管癌,肝癌,结直肠癌,肺癌)中可以达到88%的敏感性,相比传统方法可以提前4年发现癌症(Xingdong Chen等人,2020)。结直肠癌中仅使用6个qPCR标志物构建的机器学习模型就可以在92%的特异性下达到86%的敏感性,达到远优于传统癌症筛查方法的效果(Guo-Xiang Cai等人,2021)。
癌症筛查尤其是泛癌种早筛不仅需要预测癌症信号的有无,还需要对阳性的样本进行组织溯源,而人体不同的位置的癌种具有不同的甲基化特征(Kundaje A等人,2015),利用这些组织特异的甲基化特征可以实现组织溯源。但是组织特异性甲基化标志物的发现需要多个癌种大量的甲基化测序数据以及严格的筛选验证过程,是一项具有较大挑战性的工作。
本领域中需要用于胰腺癌组织特异性甲基化标志物。
发明内容
现有技术中结直肠癌诊断存在上述诸多缺陷。针对本领域中缺乏针对结直肠癌组织特异性甲基化标志物的现状,本发明人从7个癌种(肺癌,肝癌,结直肠癌,胃癌,食管癌,胰腺癌,乳腺癌)的大量下一代测序(NGS)cfDNA甲基化靶向测序数据中筛选到结直肠癌组织特异性的甲基化标志物。发明人使用筛选得到的甲基化标志物进行机器学习模型的构建和验证,用于泛癌种早期筛查过程中对结直肠癌的组织溯源,达到更好的区分结直肠癌的目的。
一方面,本发明提供了分离的核酸,其是一种或多种特异性甲基化标志物。在一个实施方案中,分离的核酸是结直肠癌组织特异性甲基化标志物。在一个实施方案中,分离的核酸是以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2.3kb上游区和2.3kb下游区:基因SFN;基因GPR3;基因FCGR1B;基因FAM150B;基因RGPD3;基因NUP210;基因LMOD3;基因FOXF2;基因TBXT;基因PRR15;基因ELN;基因TFPI2;基因REPIN1;基因PDLIM2;基因SDC2;基因TRAPPC9;基因TJP2;基因DIP2C;基因DDIT4;基因MRPL23;基因PAX6;基因PLXNC1;基因MLNR;基因MYO16;基因TMEM179;基因GATM;基因CACNA1H;基因NLRC5;基因SHISA6;基因KCNJ12;基因PRAC1;基因MYO15B;基因CANT1;基因SALL3;基因THOP1;基因ZBTB7A;基因DNM2;基因LGALS4;基因WISP2;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,分离的核酸从样品分离。在一个实施方案中,样品是细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,分离的核酸是从结直肠癌患者获得的。例如,分离的核酸是从血浆中的游离DNA中获得的。在一个实施方案中,变体包含与任一种基因的序列具有至少50%同一性的序列。例如,变体包含与任一种基因的序列具有至少60%、65%、70%、75%、76%、77%、78%、79%、80%、81%、82%、83%、84%、85%、86%、87%、88%、89%、90%、91%、92%、93%、94%、95%、96%、97%、98%或99%同一性的序列。在一个实施方案中,所述区域是所述基因以及该基因在其所处的染色体中的2.3kb上游区和2.3kb下游区。在一个实施方案中,上游区是基因上游的2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区。下游区是基因下游的2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp下游区。在一个实施方案中,位点的长度可以有所变化。在一个实施方案中,位点的长度可以是140bp-510bp。在一个实施方案中,位点的长度可以是200bp-470bp。在一个实施方案中,位点的长度可以是150bp、160bp、170bp、180bp、190bp、200bp、210bp、 220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。在一个实施方案中,分离的核酸包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体:SEQ ID No.52-90。在一个实施方案中,变体是与上述任一项或多项所示的核苷酸序列具有至少70%、75%、76%、77%、78%、79%、80%、81%、82%、83%、84%、85%、86%、87%、88%、89%、90%、91%、92%、93%、94%、95%、96%、97%、98%或99%同一性的变体序列。
在一个方面,本发明提供了试剂或组件在制备试剂盒或装置中的用途,所述试剂盒或装置用于(1)区分结直肠癌患者与非结直肠癌的癌症患者,(2)用于诊断或辅助诊断结直肠癌;或者(3)用于泛癌筛查过程中对结直肠癌的组织溯源,其中试剂或组件包含检测样品基因组DNA中结直肠癌组织特异性甲基化标志物的甲基化水平的试剂或组件,所述甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的2.3kb上游区和2.3kb下游区:基因SFN;基因GPR3;基因FCGR1B;基因FAM150B;基因RGPD3;基因NUP210;基因LMOD3;基因FOXF2;基因TBXT;基因PRR15;基因ELN;基因TFPI2;基因REPIN1;基因PDLIM2;基因SDC2;基因TRAPPC9;基因TJP2;基因DIP2C;基因DDIT4;基因MRPL23;基因PAX6;基因PLXNC1;基因MLNR;基因MYO16;基因TMEM179;基因GATM;基因CACNA1H;基因NLRC5;基因SHISA6;基因KCNJ12;基因PRAC1;基因MYO15B;基因CANT1;基因SALL3;基因THOP1;基因ZBTB7A;基因DNM2;基因LGALS4;基因WISP2;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度可以有所变化。在一个实施方案中,位点的长度可以是140bp-510bp。在一个实施方案中,位点的长度可以是200bp-470bp。在一个实施方案中,位点的长度可以是150bp、160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。在一个实施方案中,非结直肠癌的癌症是肺癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。在一个实施方案中,甲基化标志物包含以下任一项或多项所示的核苷酸序列或者 其互补序列或变体序列:SEQ ID No.52-90。在一个实施方案中,试剂或组件包含以下一种或多种检测甲基化的方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。在一个实施方案中,试剂或组件包含用于检测甲基化标志物的引物和/或探针。在一个实施方案中,样品为细胞、组织、细针穿刺活检物和/或血浆。在一个实施方案中,样品基因组DNA是血浆中的游离DNA。
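In one aspect, the present invention provides a method of constructing a prediction model that distinguishes colorectal cancer from other, non-colorectal cancers, comprising: (1) obtaining the methylation levels of methylation markers in genomic DNA from colorectal cancer samples and from samples of cancers other than colorectal cancer, the methylation markers being selected from the following regions or sites within those regions, each region being one of the following genes together with the 2.3 kb upstream region and the 2.3 kb downstream region of that gene on its chromosome: the genes SFN; GPR3; FCGR1B; FAM150B; RGPD3; NUP210; LMOD3; FOXF2; TBXT; PRR15; ELN; TFPI2; REPIN1; PDLIM2; SDC2; TRAPPC9; TJP2; DIP2C; DDIT4; MRPL23; PAX6; PLXNC1; MLNR; MYO16; TMEM179; GATM; CACNA1H; NLRC5; SHISA6; KCNJ12; PRAC1; MYO15B; CANT1; SALL3; THOP1; ZBTB7A; DNM2; LGALS4; WISP2; or the complementary sequence or a variant of any of these genes, provided that the methylation sites in the variant are not mutated. In one embodiment, the length of the site may vary. In one embodiment, the length of the site may be 140 bp to 510 bp. In one embodiment, the length of the site may be 200 bp to 470 bp. In one embodiment, the length of the site may be 150 bp, 160 bp, 170 bp, 180 bp, 190 bp, 200 bp, 210 bp, 220 bp, 230 bp, 240 bp, 250 bp, 260 bp, 270 bp, 280 bp, 290 bp, 300 bp, 310 bp, 320 bp, 330 bp, 340 bp, 350 bp, 360 bp, 370 bp, 380 bp, 390 bp, 400 bp, 410 bp, 420 bp, 430 bp, 440 bp, 450 bp, 460 bp, 470 bp, 480 bp, 490 bp or 500 bp. In one embodiment, the cancer other than colorectal cancer is lung cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer. In one embodiment, the method comprises (2) constructing a logistic regression machine learning model using the methylation level data of the methylation markers. In one embodiment, the sample is a cell, a tissue, a fine-needle aspiration biopsy or plasma. In one embodiment, the genomic DNA is cell-free DNA in plasma. In one embodiment, step (1) comprises obtaining methylation sequencing data of the sample DNA. In one embodiment, the methylation sequencing data of the sample DNA are obtained by the MethylTitan method. In one embodiment, step (2) comprises using a logistic regression model to obtain a model prediction score, training the model with the obtained methylation levels of the methylation markers as the training set, and determining the relevant threshold of the model from the training set samples. For example, the logistic regression model in the sklearn (V1.0.1) package of python (V3.9.7) may be used: AllModel=LogisticRegression(). The model has the following formula, where x is the methylation level of the target marker in the sample, w is the coefficient of the methylation marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(wx + b)}}$$
The obtained methylation levels of the methylation markers may be used as the training set for training: AllModel.fit(TrainData,TrainPheno), where TrainData is the training set data and TrainPheno is the phenotype of the training set samples, with colorectal cancer coded as 1 and other cancer types as 0. The relevant threshold of the model can be determined from the training set samples.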
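The training step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the inventors' pipeline: the marker count, the simulated methylation levels, and the use of Youden's J statistic to choose the threshold are all assumptions, since the text states only that the threshold is determined from the training set samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic stand-in for real data: rows are samples, columns are markers,
# values are methylation levels in [0, 1]. Sizes and distributions are
# illustrative assumptions only.
rng = np.random.default_rng(0)
n_markers = 5
target_cancer = rng.beta(5, 2, size=(40, n_markers))  # phenotype 1
other_cancers = rng.beta(2, 5, size=(60, n_markers))  # phenotype 0
TrainData = np.vstack([target_cancer, other_cancers])
TrainPheno = np.array([1] * 40 + [0] * 60)

# Fit the logistic regression model as named in the text.
AllModel = LogisticRegression()
AllModel.fit(TrainData, TrainPheno)

# Score the training samples and pick a cutoff on those scores.
# Youden's J (sensitivity + specificity - 1) is one common rule; the
# source does not say which rule was actually used.
train_scores = AllModel.predict_proba(TrainData)[:, 1]
fpr, tpr, thresholds = roc_curve(TrainPheno, train_scores)
threshold = float(thresholds[np.argmax(tpr - fpr)])
print(f"threshold chosen on the training set: {threshold:.3f}")
```

With real data, TrainData would hold one methylation level per marker per sample, and a different cutoff rule (for example a fixed target specificity) could be substituted without changing the rest of the sketch.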
在一个方面,本发明提供了本文的方法构建的结直肠癌预测模型。
在一个方面,本发明提供了诊断结直肠癌的装置,其包含存储器和处理存储器存储的指令的处理器,所述指令执行本文所述的方法以构建结直肠癌预测模型;并且使用待测样品的基因组DNA中的甲基化标志物的甲基化水平作为测试集以得到模型预测分值,使用预测分值并根据阈值对样本是否是结直肠癌进行判断。可以使用待测样品的基因组DNA中的甲基化标志物的甲基化水平作为测试集:TestPred=AllModel.predict_proba(TestData)[:,1],其中TestData为测试集数据,TestPred为模型预测分值,使用预测分值并根据阈值对样本是否是结直肠癌进行判断,大于阈值预测为结直肠癌,反之预测为其它癌种。
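The decision rule stated above (a score above the threshold is called colorectal cancer, otherwise another cancer type) can be sketched in a few lines. The function name, the example scores, and the threshold value of 0.5 are hypothetical and are used only for illustration.

```python
def call_samples(scores, threshold,
                 positive="colorectal cancer",
                 negative="other cancer type"):
    """Label each prediction score; only scores strictly above the
    threshold are called positive, as in the text."""
    return [positive if s > threshold else negative for s in scores]


# Hypothetical scores, standing in for
# TestPred = AllModel.predict_proba(TestData)[:, 1]
TestPred = [0.91, 0.12, 0.50, 0.67]
calls = call_samples(TestPred, threshold=0.5)
print(calls)
```

A score exactly equal to the threshold is treated as "other cancer type" here, because the text says "greater than the threshold"; whether ties are handled this way in practice is not stated in the source.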
在一个方面,本发明提供了方法,其(1)区分结直肠癌患者与非结直肠癌的癌症患者,(2)用于诊断或辅助诊断结直肠癌;或者(3)用于泛癌筛查过程中对结直肠癌的组织溯源,包括测定样品基因组DNA中的本文中所述的一种或多种结直肠癌特异性甲基化标志物的甲基化水平。
在一个方面,本发明提供了一种试剂盒或装置,其在(1)区分结直肠癌患者与非结直肠癌的癌症患者,(2)用于诊断或辅助诊断结直肠癌;或者(3)用于泛癌筛查过程中对结直肠癌的组织溯源中应用。在一个实施方案中,该应用包括测定样品基因组DNA中的本文中所述的一种或多种结直肠癌特异性甲基化标志物的甲基化水平。
在另一个方面,本发明提供了一种用于检测结直肠癌组织特异性甲基化标志物的试剂盒或装置。在一个实施方案中,试剂盒或装置包含检测来自样品的基因组DNA中的本文所述的一种或多种结直肠癌组织特异性甲基化标志物状态和/或水平的试剂或组件。在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,核酸是血浆中的游离DNA。在一个实施方案中,试剂或组件包含以下一种或多种方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。在一个实施方案中,试剂包含用于检测结直肠癌特异性甲基化标志物的寡核苷酸。在一个实施方案中,寡核苷酸是引物和/或探针。在一个实施方案中,引物是利用甲基化测序法检测位点的甲基化水平/状态的引物或用于扩增一个或多个甲基化位点的PCR引物。在一个实施方案中,试剂包含重亚硫酸盐及其衍生物、PCR缓冲液、聚合酶、dNTP、引物、探针、甲基化敏感或不敏感的限制性内切酶、酶切缓冲液、荧光染料、荧光淬灭剂、荧光报告剂、外切核酸酶、碱性磷酸酶、内标和/或对照物,所述对照物是来自正常受试者或非结直肠癌的癌症患者的前述特异性甲基化标志物。在一个实施方案中,非结直肠癌的癌症是肺癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。
本发明的结直肠癌特异性甲基化标志物的优势包括:
1.本发明提供了新的结直肠癌特异性甲基化标志物,可以用于泛癌种早期筛查过程中对结直肠癌的组织溯源,达到更好的区分结直肠癌的目的;
2.以结直肠癌肿瘤细胞释放到血浆中的游离DNA(ctDNA)为基础,为非侵入性方法,可实现结直肠癌早筛;
3.本发明的结直肠癌特异性甲基化标志物可以以高的敏感性和特异性检出结直肠癌。
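In view of the lack of lung cancer tissue-specific methylation markers in the art, the inventors screened lung cancer tissue-specific methylation markers from a large body of next-generation sequencing (NGS) cfDNA targeted methylation sequencing data covering seven cancer types (lung, liver, colorectal, gastric, esophageal, pancreatic, and breast cancer). The inventors used the selected methylation markers to build and validate a machine learning model for tracing lung cancer as the tissue of origin during pan-cancer early screening, so that lung cancer is distinguished more effectively.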
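In one aspect, the present invention provides use of a reagent or component in the preparation of a kit or device, the kit or device being for (1) distinguishing lung cancer patients from patients with cancers other than lung cancer, (2) diagnosing or assisting in the diagnosis of lung cancer, or (3) tracing lung cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of lung cancer tissue-specific methylation markers in the genomic DNA of a sample, the methylation markers being the following regions or sites thereof, each region being one of the following genes together with the 2.2 kb upstream region and the 2.2 kb downstream region of that gene on its chromosome: the genes ARHGEF16; CASZ1; MAP3K6; TRIM58; ARHGEF33; PSD4; HOXD4; SLC12A8; DGKG; TERT; NR2F1; PCDHGC5; KCNMB1; FOXC1; HIST1H4F; TYW1; LRRC4; DGKI; PDLIM2; RHOBTB2; TMEM75; OPLAH; NR5A1; SPAG6; WAPAL; BTBD16; DPYSL4; TTC40; ADAM8; SLC22A11; CPT1A; B4GALNT1; FBRSL1; XPO4; TFDP1; GCH1; TMEM179; ITPKA; SOX8; SLC9A3R2; SEPT-9; MBP; NFATC1; DNM2; RASAL3; TAF4; NTSR1; SLC17A9; or the complementary sequence or a variant of any of these genes, provided that the methylation sites in the variant are not mutated. In one embodiment, the length of the site is 120 bp to 500 bp, preferably 200 bp to 480 bp. In one embodiment, the non-lung cancer or pan-cancer includes colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer. In one embodiment, the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 24, 65, 76 and 91-135. In one embodiment, the reagent or component comprises a reagent or component used in one or more of the following methylation detection methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry. In one embodiment, the reagent or component comprises primers and/or probes for detecting the methylation markers, and/or the sample is a cell, a tissue, a fine-needle aspiration biopsy and/or plasma; preferably, the genomic DNA of the sample is cell-free DNA in plasma.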
在另一个方面,本发明提供了一种构建区分肺癌与其他非肺癌的癌症的预测模型的方法,其包括:
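(1) obtaining, as a training set, the methylation levels of methylation markers in genomic DNA from lung cancer samples and from samples of cancers other than lung cancer, the methylation markers being selected from the following regions or sites within those regions, each region being one of the following genes together with the 2.2 kb upstream region and the 2.2 kb downstream region of that gene on its chromosome: the genes ARHGEF16; CASZ1; MAP3K6; TRIM58; ARHGEF33; PSD4; HOXD4; SLC12A8; DGKG; TERT; NR2F1; PCDHGC5; KCNMB1; FOXC1; HIST1H4F; TYW1; LRRC4; DGKI; PDLIM2; RHOBTB2; TMEM75; OPLAH; NR5A1; SPAG6; WAPAL; BTBD16; DPYSL4; TTC40; ADAM8; SLC22A11; CPT1A; B4GALNT1; FBRSL1; XPO4; TFDP1; GCH1; TMEM179; ITPKA; SOX8; SLC9A3R2; SEPT-9; MBP; NFATC1; DNM2; RASAL3; TAF4; NTSR1; SLC17A9; or the complementary sequence or a variant of any of these genes, provided that the methylation sites in the variant are not mutated; and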
(2)使用甲基化标志物的甲基化水平数据构建逻辑回归的机器学习模型。
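In one embodiment, the length of the site is 120 bp to 500 bp, preferably 200 bp to 480 bp. In one embodiment, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer. In one embodiment, the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 24, 65, 76 and 91-135. In one embodiment, the sample is a cell, a tissue, a fine-needle aspiration biopsy or plasma. In one embodiment, the genomic DNA is cell-free DNA in plasma. In one embodiment, step (1) comprises obtaining methylation sequencing data of the sample DNA. In one embodiment, step (2) comprises establishing a logistic regression model to obtain a model prediction score, training the model with the obtained methylation levels of the methylation markers as the training set, and determining the relevant threshold of the model from the training set samples. For example, the logistic regression model in the sklearn (V1.0.1) package of python (V3.9.7) may be used: AllModel=LogisticRegression(). The model has the following formula, where x is the methylation level of the methylation marker in the sample, w is the coefficient of the methylation marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(wx + b)}}$$
The obtained methylation levels of the methylation markers may be used as the training set for training: AllModel.fit(TrainData,TrainPheno), where TrainData is the training set data and TrainPheno is the phenotype of the training set samples, with lung cancer coded as 1 and other cancer types as 0. The relevant threshold of the model can be determined from the training set samples.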
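To make the roles of x, w, b and y concrete, the score of a fitted logistic regression model can be computed by hand. The sketch below uses the standard logistic function, which is what scikit-learn's LogisticRegression applies for a two-class problem under its default settings; the coefficient and methylation values are hypothetical, and with a fitted model w and b would come from AllModel.coef_[0] and AllModel.intercept_[0].

```python
import math


def prediction_score(x, w, b):
    """Logistic function of the weighted sum of methylation levels.

    x: methylation level of each marker in one sample
    w: one coefficient per marker
    b: intercept
    """
    if len(x) != len(w):
        raise ValueError("x and w must have one entry per marker")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients and methylation levels for three markers.
w = [2.0, -1.0, 0.5]
b = -0.25
x = [0.8, 0.1, 0.3]
y = prediction_score(x, w, b)
print(round(y, 3))
```

A positive coefficient raises the score as the marker's methylation level increases, and a negative coefficient lowers it; a weighted sum of zero gives a score of exactly 0.5.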
在另一个方面,提供了根据本发明的方法构建的肺癌预测模型。
在另一个方面,提供了诊断肺癌的装置,其包含存储器和处理存储器存储的指令的处理器,所述指令执行根据本发明的方法以构建肺癌预测模型;并且使用待测样品的基因组DNA中的甲基化标志物的甲基化水平作为测试集以得到模型预测分值,使用预测分值并根据阈值对样本是否是肺癌进行判断,大于阈值预测为肺癌,反之预测为其它癌种。可以使用待测样品的基因组DNA中的甲基化标志物的甲基化水平作为测试集:TestPred=AllModel.predict_proba(TestData)[:,1],其中TestData为测试集数据,TestPred为模型预测分值。
在另一个方面,提供了用于检测肺癌组织特异性甲基化标志物的试剂盒或装置,其包含检测来自样品的基因组DNA中的一种或多种肺癌组织特异性甲基化标志物状态和/或水平的试剂或组件,所述肺癌组织特异性甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的2.2kb上游区和2.2kb下游区:基因ARHGEF16;位于基因CASZ1;基因MAP3K6;基因TRIM58;基因ARHGEF33;基因PSD4;基因HOXD4;基因SLC12A8;基因DGKG;基因TERT;基因NR2F1;基因PCDHGC5;基因KCNMB1;基因FOXC1;基因HIST1H4F;基因TYW1;基因LRRC4;基因DGKI;基因PDLIM2;基因RHOBTB2;基因TMEM75;基因OPLAH;基因NR5A1;基因SPAG6;基因WAPAL;基因BTBD16;基因DPYSL4;基因TTC40;基因ADAM8;基因SLC22A11;基因CPT1A;基因B4GALNT1;基因FBRSL1;基因XPO4;基因TFDP1;基因GCH1;基因TMEM179;基因ITPKA;基因SOX8;基因SLC9A3R2;基因SEPT-9;基因MBP;基因NFATC1;基因DNM2;基因RASAL3;基因TAF4;基因NTSR1;基因SLC17A9;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度为120bp-500bp,优选200bp-480bp。在一个实施方案中,甲基化标志物包含以下中任一项或多项所示的核苷酸序列或其互补序列或者变体序列:SEQ ID NO:24、65、76和91-135。在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,核酸是血浆中的游离DNA。在一个实施方案中,试剂或组件包含以下一种或多种方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的 甲基化图谱分析和质谱法。在一个实施方案中,试剂包含用于检测甲基化标志物的寡核苷酸。在一个实施方案中,寡核苷酸是引物和/或探针。在一个实施方案中,引物是利用甲基化测序法检测位点的甲基化水平/状态的引物或用于扩增一个或多个甲基化位点的PCR引物。在一个实施方案中,试剂包含重亚硫酸盐及其衍生物、PCR缓冲液、聚合酶、dNTP、引物、探针、甲基化敏感或不敏感的限制性内切酶、酶切缓冲液、荧光染料、荧光淬灭剂、荧光报告剂、外切核酸酶、碱性磷酸酶、内标和/或对照物,所述对照物是来自正常受试者或非肺癌的癌症患者的前述特异性甲基化标志物。在一个实施方案中,所述非肺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。
本发明提供了分离的核酸,其是一种或多种特异性甲基化标志物。在一个实施方案中,分离的核酸是肺癌组织特异性甲基化标志物。在一个实施方案中,所述肺癌组织特异性甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的2.2kb上游区和2.2kb下游区:基因ARHGEF16;位于基因CASZ1;基因MAP3K6;基因TRIM58;基因ARHGEF33;基因PSD4;基因HOXD4;基因SLC12A8;基因DGKG;基因TERT;基因NR2F1;基因PCDHGC5;基因KCNMB1;基因FOXC1;基因HIST1H4F;基因TYW1;基因LRRC4;基因DGKI;基因PDLIM2;基因RHOBTB2;基因TMEM75;基因OPLAH;基因NR5A1;基因SPAG6;基因WAPAL;基因BTBD16;基因DPYSL4;基因TTC40;基因ADAM8;基因SLC22A11;基因CPT1A;基因B4GALNT1;基因FBRSL1;基因XPO4;基因TFDP1;基因GCH1;基因TMEM179;基因ITPKA;基因SOX8;基因SLC9A3R2;基因SEPT-9;基因MBP;基因NFATC1;基因DNM2;基因RASAL3;基因TAF4;基因NTSR1;基因SLC17A9;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度为120bp-500bp,优选200bp-480bp。在一个实施方案中,甲基化标志物包含以下中任一项或多项所示的核苷酸序列或其互补序列或者变体序列:SEQ ID NO:24、65、76和91-135。在一个实施方案中,分离的核酸从样品分离。在一个实施方案中,样品是细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,分离的核酸是从肺癌患者获得的。例如,分离的核酸是从血浆中的游离DNA中获得的。
在本发明的各方面的实施方案中,变体包含与任一种基因的序列具有至少70%同一性的序列。例如,变体包含与任一种基因的序列具有至少75%、76%、77%、78%、79%、80%、81%、82%、83%、84%、85%、86%、87%、88%、89%、90%、91%、92%、93%、94%、95%、96%、97%、98%或99%同一性的序列。
在本发明的各方面的实施方案中,所述区域是所述基因以及该基因在其所处的染色体中的2.2kb上游区和2.2kb下游区。在一个实施方案中,上游区是基因上游的2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区。下游区是基因下游的2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp下游区。
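The region definition above (a gene plus a fixed-length upstream and downstream flank on its chromosome) can be expressed as a small coordinate calculation. This is an illustrative sketch only: the function name, the 1-based inclusive coordinate convention, and the example coordinates are assumptions and do not correspond to any gene in the lists.

```python
def marker_region(gene_start, gene_end, strand, upstream=2200, downstream=2200):
    """Return (start, end) of a gene plus its flanks, 1-based inclusive.

    Upstream means 5' of the gene, so for a minus-strand gene the
    upstream flank lies at higher chromosome coordinates.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    if strand == "+":
        left, right = upstream, downstream
    else:
        left, right = downstream, upstream
    # Clamp at position 1 so the region never runs off the chromosome start.
    return max(1, gene_start - left), gene_end + right


# Hypothetical gene spanning positions 10,000 to 20,000.
print(marker_region(10_000, 20_000, "+"))
print(marker_region(10_000, 20_000, "-", upstream=2200, downstream=500))
```

With equal flanks the strand makes no difference to the result; it matters only when a shorter upstream or downstream flank, such as those enumerated above, is chosen on one side.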
在本发明的各方面的实施方案中,位点的长度可以有所变化。在一个实施方案中,位点的长度可以是120bp-500bp,优选200bp-480bp。在一个实施方案中,位点的长度可以是130bp、140bp、150bp、160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
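In embodiments of each aspect of the present invention, the variant is a variant sequence having at least 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity to the nucleotide sequence shown in any one or more of the above.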
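The two conditions placed on a variant (a minimum percent identity, and no mutation at the methylation sites) can be illustrated with a simple check. The sketch below makes strong simplifying assumptions that are not in the source: the two sequences are already aligned and of equal length, identity is counted position by position without gaps, and "methylation sites" are taken to be the CpG dinucleotides of the reference sequence.

```python
def percent_identity(reference, variant):
    """Ungapped percent identity between two equal-length sequences."""
    if len(reference) != len(variant) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(reference.upper(), variant.upper()))
    return 100.0 * matches / len(reference)


def cpg_sites_preserved(reference, variant):
    """True if every CpG in the reference is still a CpG in the variant."""
    ref, var = reference.upper(), variant.upper()
    return all(var[i:i + 2] == "CG"
               for i in range(len(ref) - 1) if ref[i:i + 2] == "CG")


def is_acceptable_variant(reference, variant, min_identity=70.0):
    """Combine both conditions; min_identity is a percentage."""
    return (percent_identity(reference, variant) >= min_identity
            and cpg_sites_preserved(reference, variant))


# Hypothetical toy sequences, far shorter than a real marker site.
reference = "ACGTACGT"
print(is_acceptable_variant(reference, "ACGTTCGT"))  # change outside CpG
print(is_acceptable_variant(reference, "ATGTACGT"))  # change inside a CpG
```

Real sequence comparison would normally use a gapped alignment, so this check should be read as an illustration of the two criteria rather than as a way of computing identity for actual marker sequences.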
在一个方面,本发明提供了方法,其(1)区分肺癌患者与非肺癌的癌症患者,(2)用于诊断或辅助诊断肺癌;或者(3)用于泛癌筛查过程中对肺癌的组织溯源,包括测定样品基因组DNA中的本文中所述的一种或多种甲基化标志物的甲基化水平。在一个实施方案中,利用本发明的肺癌预测模型进行该方法。
本发明的肺癌组织特异性甲基化标志物的优势包括:
1.本发明提供了新的肺癌组织特异性甲基化标志物,可以用于泛癌种早期筛查过程中对肺癌的组织溯源,达到更好的区分肺癌的目的;
2.以肿瘤细胞释放到血浆中的游离DNA(ctDNA)为基础,为非侵入性方法,可实现肺癌早筛;
3.本发明的肺癌组织特异性甲基化标志物可以以高的敏感性和特异性检出肺癌。
急需用于针对肝癌的组织特异性甲基化标志物。本发明人从7个癌种(肺癌,结直肠癌,肝癌,胃癌,食管癌,胰腺癌,乳腺癌)的大量下一代测序(NGS)cfDNA甲基化靶向测序数据中筛选到肝癌组织特异性的甲基化标志物。发明人使用筛选得到的甲基化标志物进行机器学习模型的构建和验证,用于泛癌种早期筛查过程中对肝癌的组织溯源,达到更好的区分肝癌的目的。
一方面,本发明提供了试剂或组件在制备试剂盒或装置中的用途,所述试剂盒或装置用于(1)区分肝癌患者与非肝癌的癌症患者,(2)用于诊断或辅助诊断肝癌;或者(3)用于泛癌筛查过程中对肝癌的组织溯源,其中试剂或组件包含检测样品基因组DNA中肝癌组织特异性甲基化标志物的甲基化水平的试剂或组件,所述甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的3kb上游区和3kb下游区:TAL1(T-cell acute lymphocytic leukemia protein 1)基因;TRIM58基因;LBH基因;ABCG5(ATP Binding Cassette Subfamily G Member 5)基因;PAX8(Paired Box 8)基因;DLEC1基因;AMIGO3基因;RASSF1基因;CLDN11基因;SLC2A9基因;SLC9A3基因;CXXC5基因;FOXC1基因;HIST1H4F基因;TRIM40基因;HOXA13基因;CRHR2基因;AGPAT6基因;TCF24基因;OPLAH基因;GPAM基因;ADAM8基因;GRASP基因;B4GALNT1基因;STX2基因;ATL1基因;ITPKA基因;PIF1基因;ZFHX3基因;C1QL1基因;SEPT-9基因;KCTD1基因;PIP5K1C基因;RASAL3基因;CYP2F1基因;WISP2基因;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度为100bp-550bp。在一个实施方案中,位点的长度为150bp-480bp。
在一个实施方案中,非肝癌的癌症或泛癌包括结直肠癌、肺癌、胃癌、食管癌、胰腺癌和/或乳腺癌。
在一个实施方案中,甲基化标志物包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体序列:SEQ ID NO:7、18、23、29、41、90、94、104、117、120、125、128、132和136-159。
在一个实施方案中,试剂或组件包含以下一种或多种检测甲基化的方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
在一个实施方案中,试剂或组件包含用于检测甲基化标志物的引物和/或探针,和/或样品为细胞、组织、细针穿刺活检物和/或血浆,优选地,样品基因组DNA是血浆中的游离DNA。
在另一个方面,本发明提供了一种构建区分肝癌与其他非肝癌的预测模型的方法,其包括:
(1)获得肝癌样品和非肝癌的癌症样品的基因组DNA中甲基化标志物的甲基化水平作为训练集;所述甲基化标志物选自以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的3kb上游区和3kb下游区:TAL1基因;TRIM58基因;LBH基因;ABCG5基因;PAX8基因;DLEC1基因;AMIGO3基因;RASSF1基因;CLDN11基因;SLC2A9基因;SLC9A3基因;CXXC5基因;FOXC1基因;HIST1H4F基因;TRIM40基因;HOXA13基因;CRHR2基因;AGPAT6基因;TCF24基因;OPLAH基因;GPAM基因;ADAM8基因;GRASP基因;B4GALNT1基因;STX2基因;ATL1基因;ITPKA基因;PIF1基因;ZFHX3基因;C1QL1基因;SEPT-9基因;KCTD1基因;PIP5K1C基因;RASAL3基因;CYP2F1基因;WISP2基因;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变;和
(2)使用甲基化标志物的甲基化水平数据构建逻辑回归的机器学习模型。
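In one embodiment, the length of the site is 100 bp to 550 bp. In one embodiment, the length of the site is 150 bp to 480 bp. In one embodiment, the non-liver cancer is colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.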
在一个实施方案中,甲基化标志物包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体序列:SEQ ID NO:7、18、23、29、41、90、94、104、117、120、125、128、132和136-159。
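In one embodiment, the sample is a cell, a tissue, a fine-needle aspiration biopsy or plasma. In one embodiment, the genomic DNA is cell-free DNA in plasma.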
在一个实施方案中,步骤(1)包括获得样品DNA的甲基化测序数据。
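As an illustration of how a methylation level might be derived from methylation sequencing data, the sketch below pools per-CpG read counts across one marker region. This section does not define how the methylation level is calculated, so the pooled-fraction definition, the function name, and the input layout are all assumptions.

```python
def region_methylation_level(cpg_counts):
    """Fraction of methylated observations across all CpGs of a region.

    cpg_counts: iterable of (methylated_reads, total_reads), one pair per
    CpG site in the region. Returns None when the region has no coverage.
    """
    counts = list(cpg_counts)
    methylated = sum(m for m, _ in counts)
    covered = sum(n for _, n in counts)
    if covered == 0:
        return None
    return methylated / covered


# Hypothetical counts for a region with two CpG sites.
level = region_methylation_level([(8, 10), (2, 10)])
print(level)
```

One such value per marker and per sample would form one row of the training or test matrix used by the logistic regression model.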
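In one embodiment, step (2) comprises establishing a logistic regression model (for example, the logistic regression model in the sklearn (V1.0.1) package of python (V3.9.7)), for example AllModel=LogisticRegression(). The model has the following formula, where x is the methylation level of the methylation marker in the sample, w is the coefficient of the methylation marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(wx + b)}}$$
and training the model with the obtained methylation levels of the methylation markers as the training set and determining the relevant threshold of the model from the training set samples. For example, AllModel.fit(TrainData,TrainPheno) is used, where TrainData is the training set data and TrainPheno is the phenotype of the training set samples, with liver cancer coded as 1 and other cancer types as 0.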
在另一个方面,提供了根据本发明的方法构建的肝癌预测模型。
在另一个方面,提供了诊断肝癌的装置,其包含存储器和处理存储器存储的指令的处理器,所述指令执行根据本发明的方法以构建肝癌预测模型;并且使用待测样品的基因组DNA中的甲基化标志物的甲基化水平作为测试集以得到模型预测分值,使用预测分值并根据阈值对样本是否是肝癌进行判断,大于阈值预测为肝癌,反之预测为其它癌种。模型预测分值可以使用TestPred=AllModel.predict_proba(TestData)[:,1],其中TestData为测试集数据,TestPred为模型预测分值。
在另一个方面,提供了用于检测肝癌组织特异性甲基化标志物的试剂盒或装置,其包含检测来自样品的基因组DNA中的一种或多种肝癌组织特异性甲基化标志物状态和/或水平的试剂或组件,所述肝癌组织特异性甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的3kb上游区和3kb下游区:TAL1基因;TRIM58基因;LBH基因;ABCG5基因;PAX8基因;DLEC1基因;AMIGO3基因;RASSF1基因;CLDN11基因;SLC2A9基因;SLC9A3基因;CXXC5基因;FOXC1基因;HIST1H4F基因;TRIM40基因;HOXA13基因;CRHR2基因;AGPAT6基因;TCF24基因;OPLAH基因;GPAM基因;ADAM8基因;GRASP基因;B4GALNT1基因;STX2基因;ATL1基因;ITPKA基因;PIF1基因;ZFHX3基因;C1QL1基因;SEPT-9基因;KCTD1基因;PIP5K1C基因;RASAL3基因;CYP2F1基因;WISP2基因;或任一种基因的互补序列或变体,只要变体中的甲基化位点未 发生突变。在一个实施方案中,位点的长度为100bp-550bp。在一个实施方案中,位点的长度为150bp-480bp。
在一个实施方案中,甲基化标志物包含以下中任一项或多项所示的核苷酸序列或其互补序列或者变体序列:SEQ ID NO:7、18、23、29、41、90、94、104、117、120、125、128、132和136-159。
在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,核酸是血浆中的游离DNA。
在一个实施方案中,试剂或组件包含以下一种或多种方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
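In one embodiment, the reagent comprises an oligonucleotide for detecting the methylation marker. In one embodiment, the oligonucleotide is a primer and/or a probe.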
在一个实施方案中,引物是利用甲基化测序法检测位点的甲基化水平/状态的引物或用于扩增一个或多个甲基化位点的PCR引物。
在一个实施方案中,试剂包含重亚硫酸盐及其衍生物、PCR缓冲液、聚合酶、dNTP、引物、探针、甲基化敏感或不敏感的限制性内切酶、酶切缓冲液、荧光染料、荧光淬灭剂、荧光报告剂、外切核酸酶、碱性磷酸酶、内标和/或对照物,所述对照物是来自正常受试者或非肝癌的癌症患者的前述特异性甲基化标志物。在一个实施方案中,所述非肝癌的癌症是结直肠癌、肺癌、胃癌、食管癌、胰腺癌和/或乳腺癌。
本发明提供了分离的核酸,其是一种或多种特异性甲基化标志物。在一个实施方案中,分离的核酸是肝癌组织特异性甲基化标志物。在一个实施方案中,所述肝癌组织特异性甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的3kb上游区和3kb下游区:TAL1基因;TRIM58基因;LBH基因;ABCG5基因;PAX8基因;DLEC1基因;AMIGO3基因;RASSF1基因;CLDN11基因;SLC2A9基因;SLC9A3基因;CXXC5基因;FOXC1基因;HIST1H4F基因;TRIM40基因;HOXA13基因;CRHR2基因;AGPAT6基因;TCF24基因;OPLAH基因;GPAM基因;ADAM8基因;GRASP基因;B4GALNT1基因;STX2基因;ATL1基因;ITPKA基因;PIF1基因;ZFHX3基因;C1QL1基因;SEPT-9基因;KCTD1基因;PIP5K1C 基因;RASAL3基因;CYP2F1基因;WISP2基因;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度为100bp-550bp。在一个实施方案中,位点的长度为150bp-480bp。在一个实施方案中,甲基化标志物包含以下中任一项或多项所示的核苷酸序列或其互补序列或者变体序列:SEQ ID NO:7、18、23、29、41、90、94、104、117、120、125、128、132和136-159。在一个实施方案中,分离的核酸从样品分离。在一个实施方案中,样品是细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,分离的核酸是从肝癌患者获得的。例如,分离的核酸是从血浆中的游离DNA中获得的。
在本发明的各方面的实施方案中,变体包含与任一种基因的序列具有至少60%同一性的序列。例如,变体包含与任一种基因的序列具有至少65%、70%、75%、76%、77%、78%、79%、80%、81%、82%、83%、84%、85%、86%、87%、88%、89%、90%、91%、92%、93%、94%、95%、96%、97%、98%或99%同一性的序列。
在本发明的各方面的实施方案中,所述区域是所述基因以及该基因在其所处的染色体中的3kb上游区和3kb下游区。在一个实施方案中,上游区是基因上游的2.9kb、2.8kb、2.7kb、2.6kb、2.5kb、2.4kb、2.3kb、2.2kb、2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区。下游区是基因下游的2.9kb、2.8kb、2.7kb、2.6kb、2.5kb、2.4kb、2.3kb、2.2kb、2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp下游区。
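In embodiments of each aspect of the present invention, the length of the site may vary. In one embodiment, the length of the site is 100 bp to 550 bp. In one embodiment, the length of the site is 150 bp to 480 bp. In one embodiment, the length of the site may be 110 bp, 120 bp, 130 bp, 140 bp, 150 bp, 160 bp, 170 bp, 180 bp, 190 bp, 200 bp, 210 bp, 220 bp, 230 bp, 240 bp, 250 bp, 260 bp, 270 bp, 280 bp, 290 bp, 300 bp, 310 bp, 320 bp, 330 bp, 340 bp, 350 bp, 360 bp, 370 bp, 380 bp, 390 bp, 400 bp, 410 bp, 420 bp, 430 bp, 440 bp, 450 bp, 460 bp, 470 bp, 480 bp, 490 bp, 500 bp, 510 bp, 520 bp, 530 bp or 540 bp.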
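In embodiments of each aspect of the present invention, the variant is a variant sequence having at least 60%, 65%, 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity to the nucleotide sequence shown in any one or more of the above.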
在一个方面,本发明提供了方法,其(1)区分肝癌患者与非肝癌的癌症患者,(2)用于诊断或辅助诊断肝癌;或者(3)用于泛癌筛查过程中对肝癌的组织溯源,包括测定样品基因组DNA中的本文中所述的一种或多种甲基化标志物的甲基化水平。在一个实施方案中,利用本发明的肝癌预测模型进行该方法。
本发明的肝癌甲基化标志物的优势包括:
1.本发明提供了新的甲基化标志物,可以用于泛癌种早期筛查过程中对肝癌的组织溯源,达到更好的区分肝癌的目的;
2.以肿瘤细胞释放到血浆中的游离DNA(ctDNA)为基础,为非侵入性方法,可实现肝癌早筛;
3.本发明的甲基化标志物可以以高的敏感性和特异性检出肝癌。
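Breast ultrasound, mammography (molybdenum-target X-ray) and magnetic resonance imaging are commonly used methods for breast cancer screening, but these conventional methods have technical limitations and depend heavily on the operating skill of the physician. Tissue-specific methylation markers for breast cancer are lacking in the art. To address these technical problems, the inventors screened breast cancer tissue-specific methylation markers from a large body of next-generation sequencing (NGS) cfDNA targeted methylation sequencing data covering seven cancer types (lung, liver, colorectal, gastric, esophageal, pancreatic, and breast cancer). The inventors used the selected methylation markers to build and validate a machine learning model for tracing breast cancer as the tissue of origin during pan-cancer early screening, so that breast cancer is distinguished more effectively. The breast cancer tissue-specific methylation markers of the present invention have not been described previously.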
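In one aspect, the present invention provides use of a reagent or component in the preparation of a kit or device, the kit or device being for (1) distinguishing breast cancer patients from patients with cancers other than breast cancer, (2) diagnosing or assisting in the diagnosis of breast cancer, or (3) tracing breast cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of breast cancer tissue-specific methylation markers in the genomic DNA of a sample, the methylation markers being the following regions or sites thereof, each region being one of the following genes together with the 2 kb upstream region and the 2 kb downstream region of that gene on its chromosome: the genes BARHL2; ALX3; TBX15; C2CD4D; RYR2; LBH; SIX3; SIX2; OTX1; EMX1; LBX2; BCL2L11; PAX8; HOXD1; SATB2; VILL; CLDN11; EPHB3; NKX3-2; KCTD8; PITX1; CXXC5; FOXC1; NRN1; HOXA9; DLX6; MOS; TCF24; CA3; GDF6; FOXD4; PTF1A; TLX1; INA; NKX6-2; PAX6; BCAT1; FAIM2; GRASP; CCNA1; SIX1; PRKCB; SOX9; ST8SIA5; NFIX; EPS8L1; ZIK1; KAL1; ZNF81; or the complementary sequence or a variant of any of these genes, provided that the methylation sites in the variant are not mutated. In one embodiment, the length of the site is 150 bp to 500 bp. In one embodiment, the length of the site is 200 bp to 470 bp.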
在一个实施方案中,非乳腺癌的癌症或泛癌包括结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或肺癌。
在一个实施方案中,甲基化标志物包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体序列:SEQ ID NO:1-51。
在一个实施方案中,试剂或组件包含以下一种或多种检测甲基化的方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
在一个实施方案中,试剂或组件包含用于检测甲基化标志物的引物和/或探针,和/或样品为细胞、组织、细针穿刺活检物和/或血浆,优选地,样品基因组DNA是血浆中的游离DNA。
在另一个方面,本发明提供了一种构建区分乳腺癌与其他非乳腺癌的预测模型的方法,其包括:
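(1) obtaining, as a training set, the methylation levels of methylation markers in genomic DNA from breast cancer samples and from samples of cancers other than breast cancer, the methylation markers being selected from the following regions or sites within those regions, each region being one of the following genes together with the 2 kb upstream region and the 2 kb downstream region of that gene on its chromosome: the genes BARHL2; ALX3; TBX15; C2CD4D; RYR2; LBH; SIX3; SIX2; OTX1; EMX1; LBX2; BCL2L11; PAX8; HOXD1; SATB2; VILL; CLDN11; EPHB3; NKX3-2; KCTD8; PITX1; CXXC5; FOXC1; NRN1; HOXA9; DLX6; MOS; TCF24; CA3; GDF6; FOXD4; PTF1A; TLX1; INA; NKX6-2; PAX6; BCAT1; FAIM2; GRASP; CCNA1; SIX1; PRKCB; SOX9; ST8SIA5; NFIX; EPS8L1; ZIK1; KAL1; ZNF81; or the complementary sequence or a variant of any of these genes, provided that the methylation sites in the variant are not mutated; and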
(2)使用甲基化标志物的甲基化水平数据构建逻辑回归的机器学习模型。
在一个实施方案中,位点的长度为150bp-500bp,优选200bp-470bp。在一个实施方案中,非乳腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或肺癌。
在一个实施方案中,甲基化标志物包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体序列:SEQ ID NO:1-51。
在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,基因组DNA是血浆中的游离DNA。
在一个实施方案中,步骤(1)包括获得样品DNA的甲基化测序数据。
在一个实施方案中,步骤(2)包括建立逻辑回归模型以及使用获得的甲基化标志物的甲基化水平作为训练集进行训练并根据训练集的样本确定模型的相关阈值。
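For example, the logistic regression model in the sklearn (V1.0.1) package of python (V3.9.7) is used: AllModel=LogisticRegression(). The model has the following formula, where x is the methylation level of the methylation marker in the sample, w is the coefficient of the methylation marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(wx + b)}}$$
and the obtained methylation levels of the methylation markers are used as the training set for training: AllModel.fit(TrainData,TrainPheno), where TrainData is the training set data and TrainPheno is the phenotype of the training set samples, with breast cancer coded as 1 and other cancer types as 0, and the relevant threshold of the model is determined from the training set samples.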
在另一个方面,提供了根据本发明的方法构建的乳腺癌预测模型。
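In another aspect, there is provided a device for diagnosing breast cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method according to the present invention to construct a breast cancer prediction model, and using the methylation levels of the methylation markers in the genomic DNA of a sample to be tested as the test set to obtain a prediction score and to judge against a threshold whether the sample is breast cancer. For example, TestPred=AllModel.predict_proba(TestData)[:,1] is used, where TestData is the test set data and TestPred is the model prediction score; the prediction score is compared with the threshold, a score above the threshold being predicted as breast cancer and any other score as another cancer type.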
在另一个方面,提供了用于检测乳腺癌组织特异性甲基化标志物的试剂盒或装置,其包含检测来自样品的基因组DNA中的一种或多种乳腺癌组织特异性甲基化标志物状态和/或水平的试剂或组件,所述乳腺癌组织特异性甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的2kb上游区和2kb下游区:基因BARHL2;基因ALX3;基因TBX15;基因C2CD4D;基因RYR2;基因LBH;SIX3;基因SIX2;基因OTX1;基因EMX1;基因LBX2;基因BCL2L11;基因PAX8;基因HOXD1;基因SATB2;基因VILL;基因CLDN11;基因EPHB3;基因NKX3-2;基因KCTD8;基因PITX1;基因CXXC5;基因FOXC1;基因NRN1;基因HOXA9;基因DLX6;基因MOS;基因TCF24;基因CA3;基因GDF6;基因FOXD4;基因PTF1A;基因TLX1;基因INA;基因NKX6-2;基因PAX6;基因BCAT1;基因FAIM2;基因GRASP;基因CCNA1;基因SIX1;基因PRKCB;基因SOX9;基因ST8SIA5;基因NFIX;基因EPS8L1;基因ZIK1;基因KAL1;基因ZNF81;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度为150bp-500bp。在一个实施方案中,位点的长度为200bp-470bp。
在一个实施方案中,甲基化标志物包含以下中任一项或多项所示的核苷酸序列或其互补序列或者变体序列:SEQ ID NO:1-51。
在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,核酸是血浆中的游离DNA。
在一个实施方案中,试剂或组件包含以下一种或多种方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
在一个实施方案中,试剂包含用于检测甲基化标志物的寡核苷酸。在一个实施方案中,寡核苷酸是引物和/或探针。
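In one embodiment, the primer is a primer for detecting the methylation level/status of a site by methylation sequencing, or a PCR primer for amplifying one or more methylation sites.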
在一个实施方案中,试剂包含重亚硫酸盐及其衍生物、PCR缓冲液、聚合酶、dNTP、引物、探针、甲基化敏感或不敏感的限制性内切酶、酶切缓冲液、荧光染料、荧光淬灭剂、荧光报告剂、外切核酸酶、碱性磷酸酶、内标和/或对照物,所述对照物是来自正常受试者或非乳腺癌的癌症患者的前述特异性甲基化标志物。在一个实施方案中,所述非乳腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或肺癌。
本发明提供了分离的核酸,其是一种或多种特异性甲基化标志物。在一个实施方案中,分离的核酸是乳腺癌组织特异性甲基化标志物。在一个实施方案中,所述乳腺癌组织特异性甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的2kb上游区和2kb下游区:基因BARHL2;基因ALX3;基因TBX15;基因C2CD4D;基因RYR2;基因LBH;SIX3;基因SIX2;基因OTX1;基因EMX1;基因LBX2;基因BCL2L11;基因PAX8;基因HOXD1;基因SATB2;基因VILL;基因CLDN11;基因EPHB3;基因NKX3-2;基因KCTD8;基因PITX1;基因CXXC5;基因FOXC1;基因NRN1;基因HOXA9;基因DLX6;基因MOS;基因TCF24;基因CA3;基因GDF6;基因FOXD4;基因PTF1A;基因TLX1;基因INA;基因NKX6-2;基因PAX6;基因BCAT1;基因FAIM2;基因GRASP;基因CCNA1;基因SIX1;基因PRKCB;基因SOX9;基因ST8SIA5;基因NFIX;基因EPS8L1;基因ZIK1;基因KAL1;基因ZNF81;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度为150bp-500bp。在一个实施方案中,位点的长度为200bp-470bp。在一个实施方案中,甲基化标志物包含以下中任一项或多项所示的核苷酸序列或其互补序列或者变体序列:SEQ ID NO:1-51。在一个实施方案中,分离的核酸从样品分离。在一个实施方案中,样品是细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,分离的核酸是从乳腺癌患者获得的。例如,分离的核酸是从血浆中的游离DNA中获得的。
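In embodiments of each aspect of the present invention, the variant comprises a sequence having at least 70% identity to the sequence of any of the genes. For example, the variant comprises a sequence having at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity to the sequence of any of the genes.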
在本发明的各方面的实施方案中,所述区域是所述基因以及该基因在其所处的染色体中的2kb上游区和2kb下游区。在一个实施方案中,上游区是基因上游的1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区。下游区是基因下游的1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp下游区。
在本发明的各方面的实施方案中,位点的长度可以有所变化。在一个实施方案中,位点的长度可以是150bp-500bp。在一个实施方案中,位点的长度可以是200bp-470bp。在一个实施方案中,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
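In embodiments of each aspect of the present invention, the variant is a variant sequence having at least 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity to the nucleotide sequence shown in any one or more of the above.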
在一个方面,本发明提供了方法,其(1)区分乳腺癌患者与非乳腺癌的癌症患者,(2)用于诊断或辅助诊断乳腺癌;或者(3)用于泛癌筛查过程中对乳腺癌的组织溯源,包括测定样品基因组DNA中的本文中所述的一种或多种甲基化标志物的甲基化水平。在一个实施方案中,利用本发明的乳腺癌预测模型进行该方法。
本发明的优势包括:
1.本发明提供了新的甲基化标志物,可以用于泛癌种早期筛查过程中对乳腺癌的组织溯源,达到更好的区分乳腺癌的目的;
2.以肿瘤细胞释放到血浆中的游离DNA(ctDNA)为基础,为非侵入性方法,可实现乳腺癌早筛;
3.本发明的甲基化标志物可以以高的敏感性和特异性检出乳腺癌。
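Sensitivity and specificity, as referred to in the advantages above, can be computed from prediction scores once a threshold is fixed. The sketch below is a generic calculation on hypothetical scores and labels; it does not reproduce any result from this document.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) for 'score > threshold is positive'.

    labels use 1 for the target cancer and 0 for other cancer types.
    Either value is None when the corresponding class is absent.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        called_positive = score > threshold
        if label == 1:
            if called_positive:
                tp += 1
            else:
                fn += 1
        else:
            if called_positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical test-set scores and true phenotypes.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
print(sens, spec)
```

Raising the threshold generally trades sensitivity for specificity, which is why both figures are normally reported together at a stated cutoff.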
针对本领域中缺乏针对胃癌和/或食管癌组织特异性甲基化标志物的现状,本发明人从7个癌种(肺癌,肝癌,结直肠癌,胃癌,食管癌,胰腺癌,乳腺癌)的大量下一代测序(NGS)cfDNA甲基化靶向测序数据中筛选到胃癌和/或食管癌组织特异性的甲基化标志物。发明人使用筛选得到的甲基化标志物进行机器学习模型的构建和验证,用于泛癌种早期筛查过程中对胃癌和/或食管癌的组织溯源,达到更好的区分胃癌和/或食管癌的目的。
一方面,本发明提供了分离的核酸,其是一种或多种特异性甲基化标志物。在一个实施方案中,分离的核酸是胃癌和/或食管癌组织特异性甲基化标志物。在一个实施方案中,分离的核酸是以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2kb上游区和2kb下游区:基因TAL1;基因VAV3;基因PMF1;基因ATP2B4;基因SH3YL1;基因SLC9A3;基因CXXC5;基因PCDHGA11;基因FOXF2;基因ZNF273;基因KLRG2;基因CRB2;基因SEC16A;基因GPAM;基因ASCL2;基因PAX6;基因PTGDR2;基因PLEKHB1;基因TBX5;基因STX2;基因FBRSL1;基因ATP11A;基因BTBD6;基因CRIP2;基因ONECUT1;基因ZNF764;基因IGHV3OR16-17;基因SALL1;基因ACTG1;基因GATA6;基因KCTD1;基因CYP2F1;基因TPTE;基因CLDN5;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,分离的核酸从样品分离。在一个实施方案中,样品是细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,分离的核酸是从胃癌和/或食管癌患者获得的。例如,分离的核酸是从血浆中的游离DNA中获得的。
在一个实施方案中,变体包含与任一种胃癌和/或食管癌组织特异性甲基化标志物基因的序列具有至少70%同一性的序列。例如,变体包含与任一种基因的序列具有至少75%、76%、77%、78%、79%、80%、81%、82%、83%、84%、85%、86%、87%、88%、89%、90%、91%、92%、93%、94%、95%、96%、97%、98%或99%同一性的序列。
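In one embodiment, the region is the gene together with the 2 kb upstream region and the 2 kb downstream region of that gene on its chromosome. In one embodiment, the upstream region is the 1.9 kb, 1.8 kb, 1.7 kb, 1.6 kb, 1.5 kb, 1.4 kb, 1.3 kb, 1.2 kb, 1.1 kb, 1 kb, 900 bp, 800 bp, 700 bp, 600 bp, 500 bp, 400 bp, 300 bp, 200 bp, 100 bp, 90 bp, 80 bp, 70 bp, 60 bp, 50 bp, 40 bp, 30 bp, 20 bp, 10 bp or 5 bp region upstream of the gene. The downstream region is the 1.9 kb, 1.8 kb, 1.7 kb, 1.6 kb, 1.5 kb, 1.4 kb, 1.3 kb, 1.2 kb, 1.1 kb, 1 kb, 900 bp, 800 bp, 700 bp, 600 bp, 500 bp, 400 bp, 300 bp, 200 bp, 100 bp, 90 bp, 80 bp, 70 bp, 60 bp, 50 bp, 40 bp, 30 bp, 20 bp, 10 bp or 5 bp region downstream of the gene.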
在一个实施方案中,位点的长度可以有所变化。在一个实施方案中,位点的长度可以是150bp-500bp。在一个实施方案中,位点的长度可以是200bp-470bp。在一个实施方案中,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
在一个实施方案中,分离的核酸包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体:SEQ ID No.23、72、143、150、152、157和160-187。
在一个实施方案中,变体是与上述任一项或多项所示的核苷酸序列具有至少60%、65%、70%、75%、76%、77%、78%、79%、80%、81%、82%、83%、84%、85%、86%、87%、88%、89%、90%、91%、92%、93%、94%、95%、96%、97%、98%或99%同一性的变体序列。
在一个方面,本发明提供了试剂或组件在制备试剂盒或装置中的用途,所述试剂盒或装置用于(1)区分胃癌和/或食管癌患者与除胃癌和食管癌以外的癌症患者,(2)用于诊断或辅助诊断胃癌和/或食管癌;或者(3)用于泛癌筛查过程中对胃癌和/或食管癌的组织溯源,其中试剂或组件包含检测样品基因组DNA中胃癌和/或食管癌组织特异性甲基化标志物的甲基化水平的试剂或组件,所述甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的2kb上游区和2kb下游区:基因TAL1;基因VAV3;基因PMF1;基因ATP2B4;基因SH3YL1;基因SLC9A3;基因CXXC5;基因PCDHGA11;基因FOXF2;基因ZNF273;基因KLRG2;基因CRB2;基因SEC16A;基因GPAM;基因ASCL2;基因PAX6;基因PTGDR2;基因PLEKHB1;基因TBX5;基因STX2;基因FBRSL1;基因ATP11A;基因BTBD6;基因CRIP2;基因ONECUT1;基因ZNF764;基因IGHV3OR16-17;基因SALL1;基因ACTG1;基因GATA6;基因KCTD1;基因CYP2F1;基因TPTE;基因CLDN5;或任一种基因的互补序列或变体,只要变体中的甲基化位点未 发生突变。
在一个实施方案中,位点的长度可以有所变化。在一个实施方案中,位点的长度可以是150bp-500bp。在一个实施方案中,位点的长度可以是200bp-470bp。在一个实施方案中,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
在一个实施方案中,除胃癌和食管癌以外的癌症或泛癌包括肺癌、肝癌、结直肠癌、胰腺癌和/或乳腺癌。
在一个实施方案中,甲基化标志物包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体序列:SEQ ID No.23、72、143、150、152、157和160-187。
在一个实施方案中,试剂或组件包含以下一种或多种检测甲基化的方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
在一个实施方案中,试剂或组件包含用于检测甲基化标志物的引物和/或探针。在一个实施方案中,样品为细胞、组织、细针穿刺活检物和/或血浆。在一个实施方案中,样品基因组DNA是血浆中的游离DNA。
在一个方面,本发明提供了一种构建区分胃癌和/或食管癌与除胃癌和食管癌以外的癌症的预测模型的方法,其包括:(1)获得胃癌和/或食管癌样品和除胃癌和食管癌以外的癌症样品的基因组DNA中甲基化标志物的甲基化水平;所述甲基化标志物选自以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2kb上游区和2kb下游区:基因TAL1;基因VAV3;基因PMF1;基因ATP2B4;基因SH3YL1;基因SLC9A3;基因CXXC5;基因PCDHGA11;基因FOXF2;基因ZNF273;基因KLRG2;基因CRB2;基因SEC16A;基因GPAM;基因ASCL2;基因PAX6;基因PTGDR2;基因PLEKHB1;基因TBX5;基因STX2;基因FBRSL1;基因ATP11A;基因BTBD6;基因CRIP2;基因ONECUT1;基因ZNF764;基因IGHV3OR16-17;基因SALL1;基因ACTG1;基因GATA6;基因KCTD1;基因CYP2F1;基因 TPTE;基因CLDN5;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。
在一个实施方案中,位点的长度可以有所变化。在一个实施方案中,位点的长度可以是150bp-500bp。在一个实施方案中,位点的长度可以是200bp-470bp。在一个实施方案中,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
在一个实施方案中,除胃癌和食管癌以外的癌症或泛癌包括肺癌、肝癌、结直肠癌、胰腺癌和/或乳腺癌。
在一个实施方案中,方法包括(2)使用甲基化标志物甲基化水平的数据构建逻辑回归的机器学习模型。
在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。
在一个实施方案中,基因组DNA是血浆中的游离DNA。
在一个实施方案中,步骤(1)包括获得样品DNA的甲基化测序数据。在一个实施方案中,通过MethylTitan的方法获得样品DNA的甲基化测序数据。
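In one embodiment, step (2) comprises establishing a logistic regression model to obtain a model prediction score, training the model with the obtained methylation levels of the methylation markers as the training set, and determining the relevant threshold of the model from the training set samples. For example, a logistic regression model (such as the logistic regression model in the sklearn (V1.0.1) package of python (V3.9.7)) may be used: AllModel=LogisticRegression(). The model has the following formula, where x is the methylation level of the target marker in the sample, w is the coefficient of the methylation marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(wx + b)}}$$
The obtained methylation levels of the methylation markers may be used as the training set for training: AllModel.fit(TrainData,TrainPheno), where TrainData is the training set data and TrainPheno is the phenotype of the training set samples, with gastric cancer and/or esophageal cancer coded as 1 and other cancer types as 0, and the relevant threshold of the model is determined from the training set samples.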
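Because gastric and esophageal cancer are treated as a single positive class, the phenotype vector passed to the model has to merge the two labels. A minimal sketch of that encoding follows; the cancer-type strings and the function name are hypothetical.

```python
# Cancer types coded as 1; every other cancer type is coded as 0.
POSITIVE_TYPES = frozenset({"gastric", "esophageal"})


def encode_phenotype(cancer_types, positive_types=POSITIVE_TYPES):
    """Map each sample's cancer type to the 1/0 phenotype used for training."""
    return [1 if cancer in positive_types else 0 for cancer in cancer_types]


# Hypothetical training-set annotations, one entry per sample.
sample_types = ["gastric", "lung", "esophageal", "liver", "breast"]
TrainPheno = encode_phenotype(sample_types)
print(TrainPheno)
```

The same helper covers the single-cancer models described elsewhere in the document by passing a one-element set as positive_types.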
在一个方面,本发明提供了本文的方法构建的胃癌和/或食管癌预测模型。
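In one aspect, the present invention provides a device for diagnosing gastric cancer and/or esophageal cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method described herein to construct a prediction model for gastric cancer and/or esophageal cancer, and using the methylation levels of the methylation markers in the genomic DNA of a sample to be tested as the test set to obtain a model prediction score. The methylation levels of the methylation markers in the genomic DNA of the sample to be tested may be used as the test set as follows: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test set data and TestPred is the model prediction score. The prediction score is compared with the threshold to judge whether the sample is gastric cancer and/or esophageal cancer: a score above the threshold is predicted as gastric cancer and/or esophageal cancer, and any other score as another cancer type.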
在一个方面,本发明提供了方法,其(1)区分胃癌和/或食管癌患者与除胃癌和食管癌以外的癌症患者,(2)用于诊断或辅助诊断胃癌和/或食管癌;或者(3)用于泛癌筛查过程中对胃癌和/或食管癌的组织溯源,包括测定样品基因组DNA中的本文中所述的一种或多种甲基化标志物的甲基化水平。
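In one aspect, the present invention provides a kit or device for use in (1) distinguishing patients with gastric cancer and/or esophageal cancer from patients with cancers other than gastric and esophageal cancer, (2) diagnosing or assisting in the diagnosis of gastric cancer and/or esophageal cancer, or (3) tracing gastric cancer and/or esophageal cancer as the tissue of origin during pan-cancer screening. In one embodiment, the use comprises determining the methylation level of one or more of the methylation markers described herein in the genomic DNA of a sample.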
在另一个方面,本发明提供了一种用于检测胃癌和/或食管癌组织特异性甲基化标志物的试剂盒或装置。
在一个实施方案中,试剂盒或装置包含检测来自样品的基因组DNA中的本文所述的一种或多种胃癌和/或食管癌组织特异性甲基化标志物状态和/或水平的试剂或组件。
在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,核酸是血浆中的游离DNA。
在一个实施方案中,试剂或组件包含以下一种或多种方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
在一个实施方案中,试剂包含用于检测甲基化标志物的寡核苷酸。在一个实施方案中,寡核苷酸是引物和/或探针。
在一个实施方案中,引物是利用甲基化测序法检测位点的甲基化水平/状态的引物或用于扩增一个或多个甲基化位点的PCR引物。
在一个实施方案中,试剂包含重亚硫酸盐及其衍生物、PCR缓冲液、聚合酶、dNTP、引物、探针、甲基化敏感或不敏感的限制性内切酶、酶切缓冲液、荧光染料、荧光淬灭剂、荧光报告剂、外切核酸酶、碱性磷酸酶、内标和/或对照物,所述对照物是来自正常受试者或除胃癌和食管癌以外的癌症患者的前述特异性甲基化标志物。在一个实施方案中,除胃癌和食管癌以外的癌症或泛癌包括肺癌、肝癌、结直肠癌、胰腺癌和/或乳腺癌。
本发明的优势包括:
1.本发明提供了新的胃癌和/或食管癌组织特异性甲基化标志物,可以用于泛癌种早期筛查过程中对胃癌和/或食管癌的组织溯源,达到更好的区分胃癌和/或食管癌的目的;
2.以肿瘤细胞释放到血浆中的游离DNA(ctDNA)为基础,为非侵入性方法,可实现胃癌和/或食管癌早筛;
3.本发明的胃癌和/或食管癌组织特异性甲基化标志物可以以高的敏感性和特异性检出胃癌和/或食管癌。
针对本领域中缺乏针对胰腺癌组织特异性甲基化标志物的现状,本发明人从7个癌种(肺癌,肝癌,胃癌,食管癌,胰腺癌,乳腺癌,结直肠癌)的大量下一代测序(NGS)cfDNA甲基化靶向测序数据中筛选到胰腺癌组织特异性的甲基化标志物。发明人使用筛选得到的甲基化标志物进行机器学习模型的构建和验证,用于泛癌种早期筛查过程中对胰腺癌的组织溯源,达到更好的区分胰腺癌的目的。
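In one aspect, the present invention provides use of a reagent or component in the preparation of a kit or device, the kit or device being for (1) distinguishing pancreatic cancer patients from patients with cancers other than pancreatic cancer, (2) diagnosing or assisting in the diagnosis of pancreatic cancer, or (3) tracing pancreatic cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of pancreatic cancer tissue-specific methylation markers in the genomic DNA of a sample, the methylation markers being the following regions or sites thereof, each region being one of the following genes together with the 2.5 kb upstream region and the 2.5 kb downstream region of that gene on its chromosome: the genes PGM1 (Phosphoglucomutase 1); CELF3 (CUGBP Elav-Like Family Member 3); ATP2B4 (ATPase Plasma Membrane Ca2+ Transporting 4); SF3B6 (Splicing Factor 3b Subunit 6); CNNM4 (Cyclin And CBS Domain Divalent Metal Cation Transport Mediator 4); SP9 (Sp9 Transcription Factor); C2orf82 (chromosome 2 open reading frame 82); NEU4 (Neuraminidase 4); RPL35A (Ribosomal Protein L35a); HGFAC; EXOC3 (Exocyst Complex Component 3); GDNF (Glial cell line-derived neurotrophic factor); NEUROG1 (Neurogenin 1); HIST1H2BA; OSTM1 (Osteoclastogenesis Associated Transmembrane Protein 1); CCR6 (C-C Motif Chemokine Receptor); CCAR2; TNFRSF10D (TNF Receptor Superfamily Member 10d); TJP2 (Tight Junction Protein 2); DAB2IP (DAB2 Interacting Protein); NTMT1 (N-Terminal Xaa-Pro-Lys N-Methyltransferase 1); MKI67 (Marker Of Proliferation Ki-67); PTGDR2 (Prostaglandin D2 Receptor 2); CCDC77 (Coiled-Coil Domain Containing 77); MYL2 (Myosin Light Chain 2); FRY; SMEK1; BTBD6 (BTB Domain Containing 6); PIF1; SRL; SPNS1; DNM2 (Dynamin 2); ZNF569 (Zinc Finger Protein 569); SDF2L1 (Stromal Cell Derived Factor 2 Like 1); or the complementary sequence or a variant of any of these genes, provided that the methylation sites in the variant are not mutated. In one embodiment, the length of the site is 130 bp to 530 bp. In one embodiment, the length of the site is 150 bp to 480 bp.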
在一个实施方案中,非胰腺癌的癌症或泛癌包括结直肠癌、肝癌、胃癌、食管癌、乳腺癌和/或肺癌。
在一个实施方案中,甲基化标志物包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体序列:SEQ ID NO:68、88、154、163、172、177和188-217。
在一个实施方案中,试剂或组件包含以下一种或多种检测甲基化的方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
在一个实施方案中,试剂或组件包含用于检测甲基化标志物的引物和/或探针,和/或样品为细胞、组织、细针穿刺活检物和/或血浆,优选地,样品基因组DNA是血浆中的游离DNA。
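In another aspect, the invention provides a method of constructing a prediction model for distinguishing pancreatic cancer from other, non-pancreatic cancers, comprising: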
(1)获得胰腺癌样品和非胰腺癌的癌症样品的基因组DNA中甲基化标志物的甲基化水平作为训练集;所述甲基化标志物选自以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2.5kb上游区和2.5kb下游区:基因TNFRSF14;基因PGM1;基因CELF3;基因ATP2B4;基因SF3B6;基因CNNM4;基因SP9;基因C2orf82;基因NEU4;基因RPL35A;基因HGFAC;基因EXOC3;基因GDNF;基因NEUROG1;基因HIST1H2BA;基因OSTM1;基因CCR6;基因CCAR2;基因TNFRSF10D;基因TJP2;基因DAB2IP;基因NTMT1;基因MKI67;基因PTGDR2;基因CCDC77;基因MYL2;基因FRY;基因SMEK1;基因BTBD6;基因PIF1;基因SRL;基因SPNS1;基因DNM2;基因ZNF569;基因SDF2L1;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变;和
(2)使用甲基化标志物的甲基化水平数据构建逻辑回归的机器学习模型。
在一个实施方案中,位点的长度为130bp-530bp,优选150bp-480bp。在一个实施方案中,非胰腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、乳腺癌和/或肺癌。
在一个实施方案中,甲基化标志物包含以下任一项或多项所示的核苷酸序列或者其互补序列或变体序列:SEQ ID NO:68、88、154、163、172、177和188-217。
在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,基因组DNA是血浆中的游离DNA。
在一个实施方案中,步骤(1)包括获得样品DNA的甲基化测序数据。
在一个实施方案中,步骤(2)包括建立逻辑回归模型以得到模型预测分值;以及使用获得的甲基化标志物的甲基化水平作为训练集进行训练并根据训练集的样本确定模型的阈值。
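In one embodiment, step (2) comprises using a logistic regression model (the logistic regression model in the sklearn (V1.0.1) package in python (V3.9.7)): AllModel=LogisticRegression(). The formula of the model is as follows, where x is the methylation level value of the methylation markers in the sample, w is the coefficient of the methylation markers, b is the intercept value, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
and training with the obtained methylation levels of the methylation markers as the training set: AllModel.fit(TrainData,TrainPheno), where TrainData is the training set data and TrainPheno is the phenotype of the training set samples, with pancreatic cancer coded as 1 and other cancer types as 0, and determining the relevant threshold of the model from the training set samples.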
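The scoring step can be sketched in plain Python. This is a minimal illustration that assumes the standard logistic (sigmoid) form used by logistic regression; the function name `predict_score` and the example coefficients are hypothetical and are not values from the application.

```python
import math

def predict_score(x, w, b):
    """Return y = 1 / (1 + exp(-(w.x + b))) for one sample.

    x: methylation levels of the markers in the sample
    w: one coefficient per marker
    b: intercept
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for three markers (illustration only).
w = [2.0, -1.0, 0.5]
b = -0.25
score = predict_score([0.8, 0.1, 0.3], w, b)
print(score)
```

The score always lies strictly between 0 and 1, which is why a single cutoff on it can separate the two classes.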
在另一个方面,提供了根据本发明的方法构建的胰腺癌预测模型。
在另一个方面,提供了诊断胰腺癌的装置,其包含存储器和处理存储器存储的指令的处理器,所述指令执行根据本发明的方法以构建胰腺癌预测模型;并且使用待测样品的基因组DNA中的甲基化标志物的甲基化水平作为测试集以获得模型预测分值,使用预测分值并根据阈值对样本是否是胰腺癌进行判断。在一个实施方案中,使用待测样品的基因组DNA中的甲基化标志物的甲基化水平作为测试集:TestPred=AllModel.predict_proba(TestData)[:,1],其中TestData为测试集数据,TestPred为模型预测分值,使用预测分值并根据阈值对样本是否是胰腺癌进行判断,大于阈值预测为胰腺癌,反之预测为其它癌种。
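The training and prediction calls named in the text can be put together as follows. This is only a sketch under stated assumptions: the methylation matrix is synthetic (randomly generated, not real data), the train/test split is arbitrary, and the threshold rule (maximizing sensitivity + specificity - 1 on the training set) is an assumption, since the text says only that the threshold is determined from the training samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic methylation levels (samples x markers); NOT real data.
# Label 1 = pancreatic cancer, 0 = other cancer types.
n_markers = 5
pos = rng.normal(0.7, 0.1, size=(40, n_markers))
neg = rng.normal(0.3, 0.1, size=(40, n_markers))
data = np.clip(np.vstack([pos, neg]), 0.0, 1.0)
pheno = np.array([1] * 40 + [0] * 40)

# Hold out every fourth sample as a stand-in for the test set.
test_mask = np.arange(len(pheno)) % 4 == 0
TrainData, TrainPheno = data[~test_mask], pheno[~test_mask]
TestData, TestPheno = data[test_mask], pheno[test_mask]

AllModel = LogisticRegression()
AllModel.fit(TrainData, TrainPheno)

# Assumed threshold rule: pick the training-set score that maximizes
# sensitivity + specificity - 1 when used as a strict cutoff.
TrainPred = AllModel.predict_proba(TrainData)[:, 1]

def youden(t):
    called = TrainPred > t
    sens = np.mean(called[TrainPheno == 1])
    spec = np.mean(~called[TrainPheno == 0])
    return sens + spec - 1

threshold = max(np.unique(TrainPred), key=youden)

# Score the test samples and call each one against the threshold.
TestPred = AllModel.predict_proba(TestData)[:, 1]
TestCall = (TestPred > threshold).astype(int)  # 1 = pancreatic cancer
print(threshold, TestCall)
```

Column 1 of `predict_proba` is the probability of the class labelled 1, which is why the text indexes it with `[:,1]`.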
在另一个方面,提供了用于检测胰腺癌组织特异性甲基化标志物的试剂盒或装置,其包含检测来自样品的基因组DNA中的一种或多种胰腺癌组织特异性甲基化标志物状态和/或水平的试剂或组件,所述胰腺癌组织特异性甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的2.5kb上游区和2.5kb下游区:基因TNFRSF14;基因PGM1;基因CELF3;基因ATP2B4;基因SF3B6;基因CNNM4;基因SP9;基因C2orf82;基因NEU4;基因RPL35A;基因HGFAC;基因EXOC3;基因GDNF;基因NEUROG1;基因HIST1H2BA;基因OSTM1;基因CCR6;基因CCAR2;基因TNFRSF10D;基因TJP2;基因DAB2IP;基因NTMT1;基因MKI67;基因PTGDR2;基因CCDC77;基因MYL2;基因FRY;基因SMEK1;基因BTBD6;基因PIF1;基因SRL;基因SPNS1;基因DNM2;基因ZNF569;基因SDF2L1;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度为130bp-530bp。在一个实施方案中,位点的长度为150bp-480bp。
在一个实施方案中,甲基化标志物包含以下中任一项或多项所示的核苷酸序列或其互补序列或者变体序列:SEQ ID NO:68、88、154、163、172、177和188-217。
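In one embodiment, the sample is a cell, tissue, fine-needle aspiration biopsy, or plasma. In one embodiment, the nucleic acid is cell-free DNA in plasma.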
在一个实施方案中,试剂或组件包含以下一种或多种方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
在一个实施方案中,试剂包含用于检测甲基化标志物的寡核苷酸。在一个实施方案中,寡核苷酸是引物和/或探针;
在一个实施方案中,引物是利用甲基化测序法检测位点的甲基化水平/状态的引物或用于扩增一个或多个甲基化位点的PCR引物。
在一个实施方案中,试剂包含重亚硫酸盐及其衍生物、PCR缓冲液、聚合酶、dNTP、引物、探针、甲基化敏感或不敏感的限制性内切酶、酶切缓冲液、荧光染料、荧光淬灭剂、荧光报告剂、外切核酸酶、碱性磷酸酶、内标和/或对照物,所述对照物是来自正常受试者或非胰腺癌的癌症患者的前述特异性甲基化标志物。在一个实施方案中,所述非胰腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、乳腺癌和/或肺癌。
本发明提供了分离的核酸,其是一种或多种特异性甲基化标志物。在一个实施方案中,分离的核酸是胰腺癌组织特异性甲基化标志物。在一个实施方案中,所述胰腺癌组织特异性甲基化标志物是以下区域或其位点,所述区域是以下基因以及该基因在其所处的染色体中的2.5kb上游区和2.5kb下游区:基因TNFRSF14;基因PGM1;基因CELF3;基因ATP2B4;基因SF3B6;基因CNNM4;基因SP9;基因C2orf82;基因NEU4;基因RPL35A;基因HGFAC;基因EXOC3;基因GDNF;基因NEUROG1;基因HIST1H2BA;基因OSTM1;基因CCR6;基因CCAR2;基因TNFRSF10D;基因TJP2;基因DAB2IP;基因NTMT1;基因MKI67;基因PTGDR2;基因CCDC77;基因MYL2;基因FRY;基因SMEK1;基因BTBD6;基因PIF1;基因SRL;基因SPNS1;基因DNM2;基因ZNF569;基因SDF2L1;或任一种基因的互补序列或变体,只要变体中的甲基化位点未发生突变。在一个实施方案中,位点的长度为130bp-530bp。在一个实施方案中,位点的长度为150bp-480bp。在一个实施方案中,甲基化标志物包含以下中任一项或多项所示的核苷酸序列或其互补序列或者变体序列:SEQ ID NO:68、88、154、163、172、177和188-217。在一个实施方案中,分离的核酸从样品分离。在一个实施方案中, 样品是细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,分离的核酸是从胰腺癌患者获得的。例如,分离的核酸是从血浆中的游离DNA中获得的。
在本发明的各方面的实施方案中,变体包含与任一种基因的序列具有至少70%同一性的序列。例如,变体包含与任一种基因的序列具有至少75%、76%、77%、78%、79%、80%、81%、82%、83%、84%、85%、86%、87%、88%、89%、90%、91%、92%、93%、94%、95%、96%、97%、98%或99%同一性的序列。
在本发明的各方面的实施方案中,所述区域是所述基因以及该基因在其所处的染色体中的2.5kb上游区和2.5kb下游区。在一个实施方案中,上游区是基因上游的2.4kb、2.3kb、2.2kb、2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区。下游区是基因下游的2.4kb、2.3kb、2.2kb、2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp下游区。
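The region definition (a gene plus a fixed upstream and downstream flank) can be illustrated with a short helper. This is a sketch assuming 1-based inclusive coordinates and that "upstream" means toward the 5' end of the gene; the function name and the example coordinates are hypothetical.

```python
def flanked_region(gene_start, gene_end, upstream_bp=2500,
                   downstream_bp=2500, strand="+"):
    """Return (start, end) of a gene plus its flanks, 1-based inclusive.

    On the minus strand the 5' (upstream) side lies at higher coordinates,
    so the two flank sizes are swapped.
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    if strand == "+":
        left, right = upstream_bp, downstream_bp
    elif strand == "-":
        left, right = downstream_bp, upstream_bp
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at position 1 so the region never runs off the chromosome start.
    return max(1, gene_start - left), gene_end + right

# Hypothetical gene at positions 10000-12000 with the default 2.5 kb flanks.
print(flanked_region(10000, 12000))
```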
在本发明的各方面的实施方案中,位点的长度可以有所变化。在一个实施方案中,位点的长度可以是130bp-530bp。在一个实施方案中,位点的长度可以是150bp-480bp。在一个实施方案中,位点的长度可以是140bp、150bp、160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp、500bp、510bp或520bp。
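In embodiments of the various aspects of the invention, the variant is a variant sequence having at least 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% identity to any one or more of the nucleotide sequences shown above.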
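In one aspect, the invention provides a method for (1) distinguishing pancreatic cancer patients from patients with non-pancreatic cancers, (2) diagnosing or assisting in the diagnosis of pancreatic cancer, or (3) tissue-of-origin tracing of pancreatic cancer during pan-cancer screening, the method comprising determining the methylation level of one or more of the methylation markers described herein in the genomic DNA of a sample. In one embodiment, the method is performed using the pancreatic cancer prediction model of the invention.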
本发明的优势包括:
1.本发明提供了新的甲基化标志物,可以用于泛癌种早期筛查过程中对胰腺癌的组织溯源,达到更好的区分胰腺癌的目的;
2.以肿瘤细胞释放到血浆中的游离DNA(ctDNA)为基础,为非侵入性方法,可实现胰腺癌早筛;
3.本发明的甲基化标志物可以以高的敏感性和特异性检出胰腺癌。
附图说明
图1:所选结直肠癌特异性标志物在训练集中甲基化水平。
图2:所选结直肠癌特异性标志物在测试集中甲基化水平。
图3:结直肠癌(附图中也称肠癌)特异性Seq ID NO:52在训练集各个癌种中的甲基化水平。
图4:结直肠癌特异性Seq ID NO:52在测试集各个癌种中的甲基化水平。
图5:AllModel在训练集和测试集中结直肠癌和其它癌种模型分值分布。
图6:AllModel在训练集和测试集中的ROC曲线。
图7:结直肠癌特异性标志物组合1模型的分值。
图8:结直肠癌特异性标志物组合1模型的ROC曲线。
图9:结直肠癌特异性标志物组合2模型分值。
图10:结直肠癌特异性标志物组合2模型ROC曲线。
图11:所选肺癌组织特异性甲基化标志物在训练集中甲基化水平。
图12:所选肺癌组织特异性甲基化标志物在测试集中甲基化水平。
图13:肺癌组织特异性甲基化标志物Seq ID NO:91在训练集各个癌种中的甲基化水平。
图14:肺癌组织特异性甲基化标志物Seq ID NO:91在测试集各个癌种中的甲基化水平。
图15:所有肺癌组织特异性甲基化标志物在训练集和测试集中肺癌和其它癌种模型分值分布。
Figure 16: ROC curves of all lung cancer tissue-specific methylation markers in the training set and the test set.
图17:肺癌组织特异性甲基化标志物组合1模型的分值。
图18:肺癌组织特异性甲基化标志物组合1模型的ROC曲线。
图19:肺癌组织特异性甲基化标志物组合2模型分值。
图20:肺癌组织特异性甲基化标志物组合2模型ROC曲线。
图21:肝癌甲基化标志物在训练集中甲基化水平。
图22:肝癌甲基化标志物在测试集中甲基化水平。
图23:肝癌甲基化标志物Seq ID NO:137在训练集各个癌种中的甲基化水平。
图24:肝癌甲基化标志物Seq ID NO:137在测试集各个癌种中的甲基化水平。
图25:所有肝癌标志物在训练集和测试集中肝癌和其它癌种模型分值分布。
图26:所有肝癌甲基化标志物在训练集和测试集中的ROC曲线。
图27:肝癌甲基化标志物组合1模型分值。
图28:肝癌甲基化标志物组合1模型的ROC曲线。
图29:肝癌甲基化标志物组合2模型分值。
图30:肝癌甲基化标志物组合2模型ROC曲线。
图31:所选乳腺癌甲基化标志物在训练集中甲基化水平。
图32:所选乳腺癌甲基化标志物在测试集中甲基化水平。
图33:乳腺癌甲基化标志物Seq ID NO:21在训练集各个癌种中的甲基化水平。
图34:乳腺癌甲基化标志物Seq ID NO:21在测试集各个癌种中的甲基化水平。
图35:所有乳腺癌甲基化标志物在训练集和测试集中乳腺癌和其它癌种模型分值分布。
图36:所有乳腺癌甲基化标志物在训练集和测试集中的ROC曲线。
图37:乳腺癌甲基化标志物组合1模型分值。
图38:乳腺癌甲基化标志物组合1模型的ROC曲线。
图39:乳腺癌甲基化标志物组合2模型分值。
图40:乳腺癌甲基化标志物组合2模型ROC曲线。
图41:所选胃癌和/或食管癌组织特异性甲基化标志物在训练集中甲基化水平。
图42:所选胃癌和/或食管癌组织特异性甲基化标志物在测试集中甲基化水平。
图43:胃癌和/或食管癌组织特异性甲基化标志物Seq ID NO:172在训练集各个癌种中的甲基化水平。
图44:胃癌和/或食管癌组织特异性甲基化标志物Seq ID NO:172在测试集各个癌种中的甲基化水平。
图45:所有胃癌和/或食管癌组织特异性甲基化标志物在训练集和测试集中胃癌和/或食管癌和其它癌种模型分值分布。
图46:所有胃癌和/或食管癌组织特异性甲基化标志物在训练集和测试集中的ROC曲线。
图47:胃癌和/或食管癌组织特异性甲基化标志物组合1模型的分值。
图48:胃癌和/或食管癌组织特异性甲基化标志物组合1模型的ROC曲线。
图49:胃癌和/或食管癌组织特异性甲基化标志物组合2模型分值。
图50:胃癌和/或食管癌组织特异性甲基化标志物组合2模型ROC曲线。
图51:胰腺癌标志物在训练集中甲基化水平。
图52:胰腺癌标志物在测试集中甲基化水平。
图53:胰腺癌标志物Seq ID NO:202在训练集的各个癌种中的甲基化水平。
图54:胰腺癌标志物Seq ID NO:202在测试集的各个癌种中的甲基化水平。
图55:所有胰腺癌标志物在训练集和测试集中胰腺癌和其它癌种模型分值分布。
图56:所有胰腺癌标志物在训练集和测试集中的ROC曲线。
图57:胰腺癌标志物组合1模型分值。
图58:胰腺癌标志物组合1模型的ROC曲线。
图59:胰腺癌标志物组合2模型分值。
图60:胰腺癌标志物组合2模型ROC曲线。
具体实施方式
本发明人从7个癌种大量的NGS甲基化测序数据中筛选到了结直肠癌组织特异性的甲基化标志物,并且在相关验证数据中能达到很好的组织溯源效果,为泛癌种早筛过程中结直肠癌的组织溯源提供了重要的技术支持。
本发明人从7个癌种大量的NGS甲基化测序数据中筛选到了肺癌组织特异性的甲基化标志物,并且在相关验证数据中能达到很好的组织溯源效果,为泛癌种早筛过程中肺癌的组织溯源提供了重要的技术支持。
本发明从7个癌种的大量NGS甲基化测序数据中筛选到了肝癌组织特异性的甲基化标志物,并且在相关验证数据中能达到很好的组织溯源效果,为泛癌种早筛过程中肝癌的组织溯源提供了重要的技术支持。
本发明从7个癌种的大量NGS甲基化测序数据中筛选到了乳腺癌组织特异性的甲基化标志物,并且在相关验证数据中能达到很好的组织溯源效果,为泛癌种早筛过程中乳腺癌的组织溯源提供了重要的技术支持。
本发明人从7个癌种的大量的NGS甲基化测序数据中筛选到了胃癌和/或食管癌组织特异性的甲基化标志物,并且在相关验证数据中能达到很好的组织溯源效果,为泛癌种早筛过程中胃癌和/或食管癌的组织溯源提供了重要的技术支持。发明人发现,胃癌和/或食管癌与以下基因区域的甲基化水平相关:SEQ ID No.23、72、143、150、152、157和160-187。
本发明从7个癌种大量的NGS甲基化测序数据中筛选到了胰腺癌组织特异性的甲基化标志物,并且在相关验证数据中能达到很好的组织溯源效果,为泛癌种早筛过程中胰腺癌的组织溯源提供了重要的技术支持。
机器学习建模是为输入的数据特征寻找最合适的表现形式的过程,使其能够解决具体问题,例如分类问题。经过建模之后的数据要比每一个输入的单个数据特征具备更佳的区分能力。本文展示了最佳模型以及模型中每个标志物的分类效果,选择任意的特征组合进行建模的区分效果介于最优模型与单个特征之间。如本文中所示,每一个单独的标志物都具备区分效果,在本专利申请实施例中也展示了随机选择标志物进行分类的结果。因此,本专利申请对全部标志物组合模型进行保护。
The inventors found that colorectal cancer is associated with the methylation level of the following genomic regions (SEQ ID No. 52-90): chr1:27189993-27190207; chr1:27732194-27732394; chr1:121260989-121261197; chr2:469568-469933; chr2:106959197-106959397; chr3:13323366-13323566; chr3:69230395-69230599; chr6:1393206-1393469; chr6:166580183-166580476; chr7:29605610-29605810; chr7:73407894-73408161; chr7:93519986-93520213; chr7:150069569-150069875; chr8:22438141-22438341; chr8:97506340-97506540; chr8:141231103-141231303; chr9:71788926-71789126; chr10:518081-518444; chr10:74069147-74069510; chr11:1955139-1955372; chr11:31848632-31848877; chr12:94605804-94606004; chr13:49795241-49795441; chr13:109147964-109148164; chr14:105102434-105102644; chr15:45670805-45671005; chr16:1202353-1202553; chr16:57025884-57026193; chr17:11143843-11144043; chr17:21300616-21300930; chr17:46796372-46796572; chr17:73607909-73608115; chr17:76991129-76991518; chr18:76150778-76150991; chr19:2790947-2791147; chr19:4059528-4059746; chr19:10823485-10823947; chr19:39306255-39306455; chr20:43331809-43332099, where the physical positions of the methylation markers are determined with reference to the human whole-genome sequence hg19.
The inventors found that lung cancer is associated with the methylation level of the following genes or their upstream and downstream regions: ARHGEF16; CASZ1; MAP3K6; TRIM58; ARHGEF33; PSD4; HOXD4; SLC12A8; DGKG; TERT; NR2F1; PCDHGC5; KCNMB1; FOXC1; HIST1H4F; TYW1; LRRC4; DGKI; PDLIM2; RHOBTB2; TMEM75; OPLAH; NR5A1; SPAG6; WAPAL; BTBD16; DPYSL4; TTC40; ADAM8; SLC22A11; CPT1A; B4GALNT1; FBRSL1; XPO4; TFDP1; GCH1; TMEM179; ITPKA; SOX8; SLC9A3R2; SEPT-9; MBP; NFATC1; DNM2; RASAL3; TAF4; NTSR1; SLC17A9.
发明人发现,肝癌与以下基因区域或其上下游区域的甲基化水平相关:TAL1基因;TRIM58基因;LBH基因;ABCG5基因;PAX8基因;DLEC1基因;AMIGO3基因;RASSF1基因;CLDN11基因;SLC2A9基因;SLC9A3基因;CXXC5基因;FOXC1基因;HIST1H4F基因;TRIM40基因;HOXA13基因;CRHR2基因;AGPAT6基因;TCF24基因;OPLAH基因;GPAM基因;ADAM8基因;GRASP基因;B4GALNT1基因;STX2基因;ATL1基因;ITPKA基因;PIF1基因;ZFHX3基因;C1QL1基因;SEPT-9基因;KCTD1基因;PIP5K1C基因;RASAL3基因;CYP2F1基因;或WISP2基因。
The inventors found that breast cancer is associated with the methylation level of the following genes or their upstream and downstream regions: BARHL2; ALX3; TBX15; C2CD4D; RYR2; LBH; SIX3; SIX2; OTX1; EMX1; LBX2; BCL2L11; PAX8; HOXD1; SATB2; VILL; CLDN11; EPHB3; NKX3-2; KCTD8; PITX1; CXXC5; FOXC1; NRN1; HOXA9; DLX6; MOS; TCF24; CA3; GDF6; FOXD4; PTF1A; TLX1; INA; NKX6-2; PAX6; BCAT1; FAIM2; GRASP; CCNA1; SIX1; PRKCB; SOX9; ST8SIA5; NFIX; EPS8L1; ZIK1; KAL1; ZNF81.
发明人发现,胰腺癌与以下基因区域或其上下游区域的甲基化水平相关:基因TNFRSF14;基因PGM1;基因CELF3;基因ATP2B4;基因SF3B6;基因CNNM4;基因SP9;基因C2orf82;基因NEU4;基因RPL35A;基因HGFAC;基因EXOC3;基因GDNF;基因NEUROG1;基因HIST1H2BA;基因OSTM1;基因CCR6;基因CCAR2;基因TNFRSF10D;基因TJP2;基因DAB2IP;基因NTMT1;基因MKI67;基因PTGDR2;基因CCDC77;基因MYL2;基因FRY;基因SMEK1;基因BTBD6;基因PIF1;基因SRL;基因SPNS1;基因DNM2;基因ZNF569;基因SDF2L1。
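DNA methylation is an epigenetic mechanism and a common epigenetic modification of eukaryotic genomes; it can alter genetic expression without changing the DNA sequence. DNA methylation refers to the covalent attachment of a methyl group to the carbon-5 position of the cytosine in genomic CpG dinucleotides, catalyzed by DNA methyltransferases. DNA methylation plays an important role in cell proliferation, differentiation, and development, and is closely related to the occurrence and progression of tumors; its effects include transcriptional repression, regulation of chromatin structure, X-chromosome inactivation, and genomic imprinting. Aberrant DNA methylation can contribute to tumor initiation and progression by affecting chromatin structure and the expression of oncogenes and tumor suppressor genes.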
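As used herein, “primer” refers to a nucleic acid molecule with a specific nucleotide sequence that directs synthesis at the start of nucleotide polymerization. Primers are usually a pair of synthetic oligonucleotides: one primer is complementary to one DNA template strand at one end of the target region, and the other primer is complementary to the other DNA template strand at the other end of the target region; they serve as the starting points of nucleotide polymerization. Primers designed in vitro are widely used in polymerase chain reaction (PCR), qPCR, sequencing, probe synthesis, and the like. Typically, primers are designed so that the amplified product is 50-150 bp, 60-140 bp, 70-130 bp, or 80-120 bp in length. The primers contained in the reagents herein may be genome sequencing primers, such as whole-genome sequencing primers or sequencing primers for a particular genomic region, or may be PCR primers for amplifying a specific region or for amplifying one or more methylation sites within a region. When whole-genome sequencing primers are used, they yield many amplification products, which may contain the region or contain it after assembly. From the whole-genome sequencing results, the methylation status of each methylation site (CpG) in the region is obtained, giving the methylation level of the entire region. The primers are complementary or substantially complementary to the gene or region of interest.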
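As used herein, the term “variant” refers to a polynucleotide whose nucleic acid sequence differs from a reference sequence by the insertion, deletion, or substitution of one or more nucleotides, while retaining its ability to hybridize with other nucleic acids. A variant according to any embodiment herein includes a nucleotide sequence that has at least 70%, preferably at least 80%, preferably at least 85%, preferably at least 90%, preferably at least 95%, preferably at least 97% sequence identity to the reference sequence or reference gene and retains the methylation sites of the reference sequence or reference gene. Sequence identity between two aligned sequences can be calculated using, for example, NCBI BLASTn. Variants also include nucleotide sequences that carry one or more mutations (insertions, deletions, or substitutions) in the nucleotide sequence of the reference sequence while still retaining the methylation sites of the reference sequence. “Multiple mutations” usually means no more than 1-10, for example 1-8, 1-5, or 1-3. A substitution may be between a purine nucleotide and a pyrimidine nucleotide, or between purine nucleotides or between pyrimidine nucleotides. Substitutions are preferably conservative. For example, it is known in the art that conservative substitution with nucleotides of similar properties usually does not alter the stability or function of a polynucleotide. Conservative substitutions include, for example, exchanges between purine nucleotides (A and G) and exchanges between pyrimidine nucleotides (T or U and C). Therefore, replacing one or several positions in a polynucleotide of the invention with residues of the same class will not substantially affect its activity. In addition, the methylation sites described herein are not mutated in the variants of the invention. That is, the methods of the invention detect the methylation of the methylation sites in the corresponding sequences, and bases outside these sites may be mutated.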
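As used herein, the term “biological sample” or “sample” generally refers to a sample obtained or derived from a biological source of interest (e.g., a tissue, organism, or cell culture). In some embodiments, the organism from which the sample is derived is an animal or a human, preferably a human. In some embodiments, the sample is or includes biological tissue or fluid. In some embodiments, the biological sample may be or include cells, tissue, or body fluid. In some embodiments, the biological sample may be or include blood, blood cells, cell-free DNA, free-floating nucleic acids, ascites, biopsy samples, surgical samples, cell-containing body fluids, sputum, saliva, feces, urine, cerebrospinal fluid, peritoneal fluid, pleural fluid, lymph, gynecological fluid, secretions, excretions, skin swabs, vaginal swabs, oral swabs, nasal swabs, washings or lavages such as ductal lavage or bronchoalveolar lavage, aspirates, scrapings, and the like. In some embodiments, the biological sample is or includes cells obtained from a single subject or from multiple subjects. The sample may be a “primary sample” obtained directly from the biological source, or may be a “processed sample”.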
如本文所用,术语“癌症”用于指细胞表现出异常、失控和/或自主生长,使得它们表现出异常升高的增殖速率和/或异常生长表型的疾病或病症。在本发明中,感兴趣的癌症可以是结直肠癌。在本发明中,感兴趣的癌症可以是肺癌。在本发明中,感兴趣的癌症可以是肝癌。在本发明中,感兴趣的癌症可以是乳腺癌。在本发明中,感兴趣的癌症可以是胃癌和/或食管癌。在本发明中,感兴趣的癌症可以是胰腺癌。
如本文所用,术语“诊断”是指确定受试者是否患有或有风险形成癌症的定量概率和/或定性概率。例如,在癌症的诊断中,诊断可包括关于癌症的风险、类型、阶段、恶性等的确定。
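As used herein, the term “marker” is used consistently with its use in the art and refers to an entity whose presence, level, or form is associated with a particular biological event or state of interest, so that it is regarded as a “marker” of that event or state. Those skilled in the art will recognize that, in the context of methylation markers, a methylation marker may be or include a locus (e.g., one or more methylation loci) and/or the status of a locus (e.g., the status of one or more methylation loci). A marker may be or include a marker of a particular disease, or may be a marker of the quantitative probability that a particular disease will develop, occur, or recur in a subject. The methylation marker of the invention may be a predictive, prognostic, and/or diagnostic marker of one of colorectal cancer, lung cancer, liver cancer, breast cancer, gastric and/or esophageal cancer, and pancreatic cancer.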
如本文所用,“DNA区域”或“区域”是指较大DNA分子的任何连续部分。在本文中,DNA区域是指感兴趣的基因以及其上游和下游的区域。基因或区域的“上游”是指相对于基因或区域5’端的区域。基因或区域的“下游”是指相对于基因或区域3’端的区域。
如本文所用,术语“同一性”是指核酸分子(例如DNA分子和/或RNA分子)之间的总体相关性。用于计算两个提供的序列之间的同一性百分比的方法是本领域已知的。例如,可以如下计算两个核酸的同一性百分比:比对两个序列以达到最佳的比较目的(例如,可以在第一和第二序列中的一个或两个序列中引入缺口以进行最佳比对,并且为了比较的目的可以忽略不相同的序列);然后比较相应位置的核苷酸;当第一序列中的位置被与第二序列中的相应位置相同的残基(例如核苷酸或氨基酸)占据时,那么分子在该位置是相同的。两个序列之间的同一性百分数是序列共享的相同位置的数目的函数(考虑到为了最佳比对引入的缺口的数目和每个缺口的长度)。序列的比较和两个序列之间同一性百分比的确定可以使用诸如BLAST(基本局部比对搜索工具)之类的计算算法来完成。
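The identity calculation described above can be sketched for two sequences that have already been aligned. This is a minimal illustration assuming one common convention (identical columns divided by alignment length, with gap columns counted in the denominator); tools such as BLAST perform the alignment itself, which this sketch does not. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two pre-aligned sequences ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both sequences carry a gap.
    columns = [(a, b) for a, b in zip(aligned_a.upper(), aligned_b.upper())
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no aligned positions to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)

# Six alignment columns, one of which is a gap opposite a T.
print(percent_identity("ACGT-A", "ACGTTA"))
```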
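As used herein, the term “methylation” includes methylation at (i) any C5 position of cytosine; (ii) the N4 position of cytosine; (iii) the N6 position of adenine; and (iv) other types of nucleotide methylation. A methylated nucleotide may be referred to as a “methylated nucleotide” or a “methylated nucleotide base”. In certain embodiments, methylation as described herein refers specifically to the methylation of cytosine residues. In some cases, methylation refers to the methylation of cytosine residues present in CpG sites.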
如本文所用,术语“甲基化分析”指可用于确定甲基化位点的甲基化状态或水平的任何技术。
如本文所用,术语“甲基化标志物”指至少一个甲基化位点和/或至少一个甲基化位点的甲基化状态(例如超甲基化位点)的标志物。特别地,甲基化标志物的特征在于一个或多个核酸位点的甲基化状态在第一状态和第二状态(例如,在癌变状态和非癌变状态之间)之间变化。
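As used herein, “methylation status” refers to the number, frequency, or pattern of methylation at methylation sites within a methylation locus. Accordingly, a change in methylation status between a first state and a second state may be or include an increase in the number, frequency, or pattern of methylated sites, or may be or include a decrease in the number, frequency, or pattern of methylated sites. In various cases, a change in methylation status is a change in methylation value. Herein, methylation status may be expressed as methylation haplotype frequency.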
如本文所用,术语“甲基化值”是指甲基化状态的数字表示,例如,以表示甲基化基因座的甲基化频率或比率的数字的形式。在一些情况下,甲基化值可以通过如下的方法产生,该方法包括在用甲基化依赖性限制性内切酶限制性消化样品之后定量样品中存在的完整核酸的量。在一些情况下,甲基化值可以通过包括比较样品的亚硫酸氢盐反应后的扩增概况的方法产生。在一些情况下,可以通过比较亚硫酸氢盐处理和未处理核酸的序列来产生甲基化值。在一些情况下,甲基化值是定量PCR结果,包括定量PCR结果或基于定量PCR结果。本文中,甲基化水平代表一个或多个位点处于甲基化状态的比例。一个区域(或一组位点)的甲基化水平是该区域中所有位点(或组中所有位点)的甲基水平的均值。因此,区域的甲基化水平上升或下降并不表示区域中所有甲基化位点的甲基化水平都上升或下降。本领域知晓将检测DNA甲基化的方法(例如简化甲基化测序)所得结果转化为甲基化水平的过程。例如,可以利用软件Bismark(v0.17.0)获得CpG位点的甲基化水平。检测DNA甲基化的方法在本领域中是已知的,包括但不限于基于重亚硫酸盐转化的PCR(例如甲基化特异性PCR(Methylation-specific PCR,MSP))、DNA测序(如亚硫酸氢盐测序(Bisulfite sequencing,BS)、全基因组甲基化测序(Whole-genome bisulfite sequencing,WGBS)、简化甲基化测序(Reduced Representation Bisulfite Sequencing,RRBS))、甲基化敏感的限制性内切酶分析法(Methylation-Sensitive Dependent Restriction Enzymes)、荧光定量法、甲基化敏感性高分辨率熔解曲线法(Methylation-sensitivity High-resolution Melting,MS-HRM)、基于芯片的甲基化图谱分析或质谱(例如飞行质谱)、大规模平行测序技术(例如下一代测序技术),例如合成测序、实时(例如单分子)测序、珠粒乳液测序、纳米孔测序等。在一个或多个实施方案中,检测包括检测基因或位点处的任一条链。也可以使用简化基因组甲基化测序(RRBS)检测DNA甲基化。简化基因组甲基化测序是利用限制性内切酶对基因组进行酶切,经亚硫酸氢盐处理,对基因组CpG区域进行测序的技术。例如,简化基因组甲基化测序所用试剂包括:血浆核酸纯化试剂盒、连接酶、重亚硫酸盐及其衍生物、dNTP、聚合酶、引物、无核酸酶水和/或磁珠等。
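The statement that a region's methylation level is the mean of the levels of its sites can be illustrated as follows. This is a sketch assuming per-CpG counts of methylated and unmethylated reads are available (for example from a methylation caller's per-site report); the function names and the counts are hypothetical.

```python
def site_level(methylated, unmethylated):
    """Fraction of reads supporting methylation at one CpG site."""
    total = methylated + unmethylated
    if total == 0:
        return None  # site not covered by any read
    return methylated / total

def region_level(site_counts):
    """Mean of the per-site methylation levels across a region.

    site_counts: iterable of (methylated_reads, unmethylated_reads) per CpG.
    Uncovered sites are skipped; returns None if no site is covered.
    """
    levels = [site_level(m, u) for m, u in site_counts]
    levels = [v for v in levels if v is not None]
    if not levels:
        return None
    return sum(levels) / len(levels)

# Three CpG sites with levels 0.8, 0.5 and 0.0 (made-up counts).
print(region_level([(8, 2), (5, 5), (0, 10)]))
```

Because the region value is an average, it can rise even when some individual sites fall, which matches the caveat in the text.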
如本文所用,标志物的“特异性”是指以不存在感兴趣的事件或状态为特征的样品的百分比,其中标志物的测量精确地指示不存在感兴趣的事件或状态(真实阴性率)。在各种实施方案中,阴性样品的表征不依赖于标志物,并且可以通过任何相关的测量,例如本领域技术人员已知的任何相关测量来实现。因此,特异性反映当在不表征感兴趣的事件或状态的样品中测量时标志物将检测到感兴趣的事件或状态的不存在的概率。在感兴趣的事件或状态是结直肠癌的特定实施方案中,特异性指标志物将检测缺乏结直肠癌的受试者中结直肠癌的不存在的概率。结直肠癌的不存在可以例如通过组织学来确定。在感兴趣的事件或状态是肺癌的特定实施方案中,特异性指标志物将检测缺乏肺癌的受试者中肺癌的不存在的概率。肺癌的不存在可以例如通过组织学来确定。在感兴趣的事件或状态是肝癌的特定实施方案中,特异性指标志物将检测缺乏肝癌的受试者中肝癌的不存在的概率。肝癌的不存在可以例如通过组织学来确定。在感兴趣的事件或状态是乳腺癌的特定实施方案中,特异性指标志物将检测缺乏乳腺癌的受试者中乳腺癌的不存在的概率。乳腺癌的不存在可以例如通过组织学来确定。在感兴趣的事件或状态是胃癌和/或食管癌的特定实施方案中,特异性指标志物将检测缺乏胃癌和/或食管癌的受试者中胃癌和/或食管癌的不存在的概率。胃癌和/或食管癌的不存在可以例如通过组织学来确定。在感兴趣的事件或状态是胰腺癌的特定实施方案中,特异性指标志物将检测缺乏胰腺癌的受试者中胰腺癌的不存在的概率。胰腺癌的不存在可以例如通过组织学来确定。
如本文所用,标志物的“敏感性”是指以存在感兴趣的事件或状态为特征的样品的百分比,其中标志物的测量精确地指示存在感兴趣的事件或状态(真实阳性率)。在各种实施方案中,阳性样品的表征不依赖于标志物,并且可以通过任何相关的测量,例如本领域技术人员已知的任何相关测量来实现。因此,敏感性反映了当在以感兴趣事件或状态的存在为特征的样品中测量时标志物将检测到感兴趣的事件或状态的存在的概率。在感兴趣的事件或状态是结直肠癌的特定实施方案中,敏感性指标志物将检测患有结直肠癌的受试者中结直肠癌的存在的概率。结直肠癌的存在可以例如通过组织学来确定。在感兴趣的事件或状态是肺癌的特定实施方案中,敏感性指标志物将检测患有肺癌的受试者中肺癌的存在的概率。肺癌的存在可以例如通过组织学来确定。在感兴趣的事件或状态是肝癌的特定实施方案中,敏感性指标志物 将检测患有肝癌的受试者中肝癌的存在的概率。肝癌的存在可以例如通过组织学来确定。在感兴趣的事件或状态是乳腺癌的特定实施方案中,敏感性指标志物将检测患有乳腺癌的受试者中乳腺癌的存在的概率。乳腺癌的存在可以例如通过组织学来确定。在感兴趣的事件或状态是胃癌和/或食管癌的特定实施方案中,敏感性指标志物将检测患有胃癌和/或食管癌的受试者中胃癌和/或食管癌的存在的概率。胃癌和/或食管癌的存在可以例如通过组织学来确定。在感兴趣的事件或状态是胰腺癌的特定实施方案中,敏感性指标志物将检测患有胰腺癌的受试者中胰腺癌的存在的概率。胰腺癌的存在可以例如通过组织学来确定。
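Specificity (true negative rate) and sensitivity (true positive rate) as defined above can be computed from reference labels and marker-based calls. This is a minimal sketch; the labels below are made up for illustration, with 1 meaning the cancer of interest is present and 0 meaning it is absent.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for binary labels (1 = present)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)

# Made-up reference labels (e.g. from histology) and marker-based calls.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
calls = [1, 0, 1, 0, 0, 1, 0, 1]
print(sensitivity_specificity(truth, calls))
```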
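The term “subject” as used herein refers to an organism, typically a mammal (e.g., a human). In one embodiment, the subject has cancer. In one embodiment, the subject has colorectal cancer. In one embodiment, the subject has lung cancer. In one embodiment, the subject has liver cancer. In one embodiment, the subject has breast cancer. In one embodiment, the subject has gastric and/or esophageal cancer. In one embodiment, the subject has pancreatic cancer.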
从结直肠癌患者分离的核酸
本发明提供了分离的核酸,其是从受试者的样品分离的。例如,分离的核酸是从结直肠癌患者血浆中的游离DNA分离的。分离的核酸是一种或多种特异性甲基化标志物,优选结直肠癌组织特异性甲基化标志物。甲基化标志物是以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2.3kb上游区和2.3kb下游区:基因SFN;基因GPR3;基因FCGR1B;基因FAM150B;基因RGPD3;基因NUP210;基因LMOD3;基因FOXF2;基因TBXT;基因PRR15;基因ELN;基因TFPI2;基因REPIN1;基因PDLIM2;基因SDC2;基因TRAPPC9;基因TJP2;基因DIP2C;基因DDIT4;基因MRPL23;基因PAX6;基因PLXNC1;基因MLNR;基因MYO16;基因TMEM179;基因GATM;基因CACNA1H;基因NLRC5;基因SHISA6;基因KCNJ12;基因PRAC1;基因MYO15B;基因CANT1;基因SALL3;基因THOP1;基因ZBTB7A;基因DNM2;基因LGALS4;基因WISP2。该位点是甲基化的位点。本领域技术人员应当理解基因组的基因可以存在突变,因此可以想到这些基因的变体也可以作为甲基化标志物,只要变体中的甲基化 位点未发生突变。变体可以包含与任一种基因的序列具有至少70%同一性的序列。选择作为标志物的位点可以包含1个或多个CpG,例如2个CpG、3个CpG、4个CpG、5个CpG、6个CpG、10个CpG、20个CpG或30个CpG。合适的位点的长度可以是150bp-500bp。例如,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
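The site criteria stated above (at least one CpG and a length of 150 bp-500 bp) can be checked with a short helper. This is a sketch; the function names are hypothetical and the example sequences are artificial.

```python
def count_cpg(sequence):
    """Number of CpG dinucleotides (a C followed directly by a G)."""
    return sequence.upper().count("CG")

def is_candidate_site(sequence, min_len=150, max_len=500, min_cpg=1):
    """True if the sequence meets the length and CpG-count criteria."""
    return (min_len <= len(sequence) <= max_len
            and count_cpg(sequence) >= min_cpg)

# Artificial 180 bp sequence containing 60 CpG dinucleotides.
example = "ACG" * 60
print(len(example), count_cpg(example), is_candidate_site(example))
```

The default bounds can be changed for the other ranges given in the text, for example `min_len=130, max_len=530` for the pancreatic cancer markers.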
本领域技术人员理解基因与其上游和下游的区域具备相同或相似的甲基化水平或状态。因此,当本发明发现特定基因内的甲基化位点后可以设想该基因以及在染色体原位的2.3kb上游区和2.3kb下游区也具备相同或相似的甲基化水平或状态。本发明涵盖本发明所述的基因以及该基因在其所处的染色体中的1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区和下游区。
在本文中,本发明使用了以下核苷酸序列作为甲基化标志物。

The chromosomal position coordinates are determined with reference to the human whole-genome sequence hg19. Based on the colorectal cancer tissue-specific methylation markers identified and the genes in which they are located, those skilled in the art will understand that sites within any of the following genes, or within their upstream or downstream regions, can be used as methylation markers: SFN; GPR3; FCGR1B; FAM150B; RGPD3; NUP210; LMOD3; FOXF2; TBXT; PRR15; ELN; TFPI2; REPIN1; PDLIM2; SDC2; TRAPPC9; TJP2; DIP2C; DDIT4; MRPL23; PAX6; PLXNC1; MLNR; MYO16; TMEM179; GATM; CACNA1H; NLRC5; SHISA6; KCNJ12; PRAC1; MYO15B; CANT1; SALL3; THOP1; ZBTB7A; DNM2; LGALS4; WISP2. A single methylation marker or a combination of multiple methylation markers can be used as a colorectal cancer-specific methylation marker. In one embodiment, the methylation marker is within the 2 kb upstream and 2 kb downstream regions of any of the above genes.
从肺癌患者分离的核酸
本发明提供了分离的核酸,其是从受试者的样品分离的。例如,分离的核酸是从肺癌患者血浆中的游离DNA分离的。分离的核酸是一种或多种特异性甲基化标志物,优选肺癌组织特异性甲基化标志物。甲基化标志物是以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2.2kb上游区和2.2kb下游区:基因ARHGEF16;位于基因CASZ1;基因MAP3K6;基因TRIM58;基因ARHGEF33;基因PSD4;基因HOXD4;基因SLC12A8;基因DGKG;基因TERT;基因NR2F1;基因PCDHGC5;基因KCNMB1;基因FOXC1;基因HIST1H4F;基因TYW1;基因LRRC4;基因DGKI;基因PDLIM2;基因RHOBTB2;基因TMEM75;基因OPLAH;基因NR5A1;基因SPAG6;基因WAPAL;基因BTBD16;基因DPYSL4;基因TTC40; 基因ADAM8;基因SLC22A11;基因CPT1A;基因B4GALNT1;基因FBRSL1;基因XPO4;基因TFDP1;基因GCH1;基因TMEM179;基因ITPKA;基因SOX8;基因SLC9A3R2;基因SEPT-9;基因MBP;基因NFATC1;基因DNM2;基因RASAL3;基因TAF4;基因NTSR1;基因SLC17A9。该位点是甲基化的位点。本领域技术人员应当理解基因组的基因可以存在突变,因此可以想到这些基因的变体也可以作为甲基化标志物,只要变体中的甲基化位点未发生突变。变体可以包含与任一种基因的序列具有至少70%同一性的序列。选择作为标志物的位点可以包含1个或多个CpG,例如2个CpG、3个CpG、4个CpG、5个CpG、6个CpG、10个CpG、20个CpG或30个CpG。合适的位点的长度可以是150bp-500bp。例如,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
本领域技术人员理解基因与其上游和下游的区域具备相同或相似的甲基化水平或状态。因此,当本发明人发现特定基因内的甲基化位点后可以设想该基因以及在染色体原位的2.2kb上游区和2.2kb下游区也具备相同或相似的甲基化水平或状态。本发明涵盖本发明所述的基因以及该基因在其所处的染色体中的1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区和下游区。
在本文中,本发明使用了以下核苷酸序列作为甲基化标志物。


The chromosomal position coordinates are determined with reference to the human whole-genome sequence hg19. Based on the lung cancer tissue-specific methylation markers identified and the genes in which they are located, those skilled in the art will understand that sites within any of the following genes, or within their upstream or downstream regions, can be used as methylation markers: ARHGEF16; CASZ1; MAP3K6; TRIM58; ARHGEF33; PSD4; HOXD4; SLC12A8; DGKG; TERT; NR2F1; PCDHGC5; KCNMB1; FOXC1; HIST1H4F; TYW1; LRRC4; DGKI; PDLIM2; RHOBTB2; TMEM75; OPLAH; NR5A1; SPAG6; WAPAL; BTBD16; DPYSL4; TTC40; ADAM8; SLC22A11; CPT1A; B4GALNT1; FBRSL1; XPO4; TFDP1; GCH1; TMEM179; ITPKA; SOX8; SLC9A3R2; SEPT-9; MBP; NFATC1; DNM2; RASAL3; TAF4; NTSR1; SLC17A9. A single methylation marker or a combination of multiple methylation markers can be used as a lung cancer-specific methylation marker. In one embodiment, the methylation marker is within the 2 kb upstream and 2 kb downstream regions of any of the above genes.
从肝癌患者分离的核酸
本发明提供了分离的核酸,其是从受试者的样品分离的。例如,分离的核酸是从肝癌患者血浆中的游离DNA分离的。分离的核酸是一种或多种特异性甲基化标志物,优选肝癌组织特异性甲基化标志物。甲基化标志物是以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的3kb上游区和3kb下游区:TAL1基因;TRIM58基因;LBH基因;ABCG5基因;PAX8基因;DLEC1基因;AMIGO3基因;RASSF1基因;CLDN11基因;SLC2A9基因;SLC9A3基因;CXXC5基因;FOXC1基因;HIST1H4F基因;TRIM40基因;HOXA13基因;CRHR2基因;AGPAT6基因;TCF24基因;OPLAH基因;GPAM基因;ADAM8基因;GRASP基因;B4GALNT1基因;STX2基因;ATL1基因;ITPKA基因;PIF1基因;ZFHX3基因;C1QL1基因;SEPT-9基因;KCTD1基因;PIP5K1C基因;RASAL3基因;CYP2F1基因;或WISP2基因。该位点是甲基化的位点。本领域技术人员应当理解基因组的基因可以存在突变,因此可以想到这些基因的变体也可以作为甲基化标志物,只要变体中的甲基化位点未发生突变。变体可以包含与任一种基因的序列具有至少70%同一性的序列。选择作为标志物的位点可以包含1个或多个CpG,例如2个CpG、3个CpG、4个CpG、5个CpG、6个CpG、10个CpG、20 个CpG或30个CpG。合适的位点的长度可以是100bp-550bp。例如,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
本领域技术人员理解基因与其上游和下游的区域具备相同或相似的甲基化水平或状态。因此,当本发明发现特定基因内的甲基化位点后可以设想该基因以及在染色体原位的3kb上游区和3kb下游区也具备相同或相似的甲基化水平或状态。本发明涵盖本发明所述的基因以及该基因在其所处的染色体中的2.9kb、2.8kb、2.7kb、2.6kb、2.5kb、2.4kb、2.3kb、2.2kb、2.1kb、2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区和下游区。
在本文中,本发明使用了以下核苷酸序列作为甲基化标志物。

The chromosomal position coordinates are determined with reference to the human whole-genome sequence hg19. Based on the liver cancer tissue-specific methylation markers identified and the genes in which they are located, those skilled in the art will understand that sites within any of the following genes, or within their upstream or downstream regions, can be used as methylation markers: TAL1; TRIM58; LBH; ABCG5; PAX8; DLEC1; AMIGO3; RASSF1; CLDN11; SLC2A9; SLC9A3; CXXC5; FOXC1; HIST1H4F; TRIM40; HOXA13; CRHR2; AGPAT6; TCF24; OPLAH; GPAM; ADAM8; GRASP; B4GALNT1; STX2; ATL1; ITPKA; PIF1; ZFHX3; C1QL1; SEPT-9; KCTD1; PIP5K1C; RASAL3; CYP2F1; WISP2. A single methylation marker or a combination of multiple methylation markers can be used as a liver cancer-specific methylation marker. In one embodiment, the methylation marker is within the 3 kb or 2 kb upstream and 3 kb or 2 kb downstream regions of any of the above genes.
从乳腺癌患者分离的核酸
本发明提供了分离的核酸,其是从受试者的样品分离的。例如,分离的核酸是从乳腺癌患者血浆中的游离DNA分离的。分离的核酸是一种或多种特异性甲基化标志物,优选乳腺癌组织特异性甲基化标志物。甲基化标志物是以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2kb上游区和2kb下游区:基因BARHL2;基因ALX3;基因TBX15;基因C2CD4D;基因RYR2;基因LBH;SIX3;基因SIX2;基因OTX1;基因EMX1;基因LBX2;基因BCL2L11;基因PAX8;基因HOXD1;基因SATB2;基因VILL;基因CLDN11;基因EPHB3;基因NKX3-2;基因KCTD8;基因PITX1;基因CXXC5;基因FOXC1;基因NRN1;基因HOXA9;基因DLX6;基因MOS;基因TCF24;基因CA3;基因GDF6;基因FOXD4;基因PTF1A;基因TLX1;基因INA;基因NKX6-2;基因PAX6;基因BCAT1;基因FAIM2;基因GRASP;基因CCNA1;基因SIX1;基因PRKCB;基因SOX9;基因ST8SIA5;基因NFIX;基因EPS8L1;基因ZIK1;基因KAL1;基因ZNF81。 该位点是甲基化的位点。本领域技术人员应当理解基因组的基因可以存在突变,因此可以想到这些基因的变体也可以作为甲基化标志物,只要变体中的甲基化位点未发生突变。变体可以包含与任一种基因的序列具有至少70%同一性的序列。选择作为标志物的位点可以包含1个或多个CpG,例如2个CpG、3个CpG、4个CpG、5个CpG、6个CpG、10个CpG、20个CpG或30个CpG。合适的位点的长度可以是150bp-500bp。例如,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
本领域技术人员理解基因与其上游和下游的区域具备相同或相似的甲基化水平或状态。因此,当本发明发现特定基因内的甲基化位点后可以设想该基因以及在染色体原位的2kb上游区和2kb下游区也具备相同或相似的甲基化水平或状态。本发明涵盖本发明所述的基因以及该基因在其所处的染色体中的1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区和下游区。
在本文中,本发明使用了以下核苷酸序列作为甲基化标志物。


The chromosomal position coordinates are determined with reference to the human whole-genome sequence hg19. Based on the breast cancer tissue-specific methylation markers identified and the genes in which they are located, those skilled in the art will understand that sites within any of the following genes, or within their upstream or downstream regions, can be used as methylation markers: BARHL2; ALX3; TBX15; C2CD4D; RYR2; LBH; SIX3; SIX2; OTX1; EMX1; LBX2; BCL2L11; PAX8; HOXD1; SATB2; VILL; CLDN11; EPHB3; NKX3-2; KCTD8; PITX1; CXXC5; FOXC1; NRN1; HOXA9; DLX6; MOS; TCF24; CA3; GDF6; FOXD4; PTF1A; TLX1; INA; NKX6-2; PAX6; BCAT1; FAIM2; GRASP; CCNA1; SIX1; PRKCB; SOX9; ST8SIA5; NFIX; EPS8L1; ZIK1; KAL1; ZNF81. A single methylation marker or a combination of multiple methylation markers can be used as a breast cancer-specific methylation marker. In one embodiment, the methylation marker is within the 2 kb upstream and 2 kb downstream regions of any of the above genes.
从胃癌和/或食管癌患者分离的核酸
本发明提供了分离的核酸,其是从受试者的样品分离的。例如,分离的核酸是从胃癌和/或食管癌患者血浆中的游离DNA分离的。分离的核酸是一种或多种特异性甲基化标志物,优选胃癌和/或食管癌组织特异性甲基化标志物。甲基化标志物是以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2kb上游区和2kb下游区:基因TAL1;基因VAV3;基因PMF1;基因ATP2B4;基因SH3YL1;基因SLC9A3;基因CXXC5;基因PCDHGA11;基因FOXF2;基因ZNF273;基因KLRG2;基因CRB2;基因SEC16A;基因GPAM;基因ASCL2;基因PAX6;基因PTGDR2;基因PLEKHB1;基因TBX5;基因STX2;基因FBRSL1;基因ATP11A;基因BTBD6;基因CRIP2;基因ONECUT1;基因ZNF764;基因IGHV3OR16-17;基因SALL1;基因ACTG1;基因GATA6;基因KCTD1;基因CYP2F1;基因TPTE;基因CLDN5。该位点是甲基化的位点。本领域技术人员应当理解基因组的基因可以存在突变,因此可以想到这些基因的变体也可以作为甲基化标志物,只要变体中的甲基化位点未发生突变。变体可以包含与任一种基因的序列具有至少70%同一性的序列。选择作为标志物的位点可以包含1个或多个CpG,例如2个CpG、3个CpG、4个CpG、5个CpG、6个CpG、10个CpG、20个CpG或30个CpG。合适的位点的长度可以是150bp-500bp。例如,位点的长度可以是160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp或500bp。
本领域技术人员理解基因与其上游和下游的区域具备相同或相似的甲 基化水平或状态。因此,当本发明发现特定基因内的甲基化位点后可以设想该基因以及在染色体原位的2kb上游区和2kb下游区也具备相同或相似的甲基化水平或状态。本发明涵盖本发明所述的基因以及该基因在其所处的染色体中的1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区和下游区。
在本文中,本发明使用了以下核苷酸序列作为甲基化标志物。

The chromosomal position coordinates are determined with reference to the human whole-genome sequence hg19. Based on the gastric and/or esophageal cancer tissue-specific methylation markers identified and the genes in which they are located, those skilled in the art will understand that sites within any of the following genes, or within their upstream and downstream regions, can be used as methylation markers: TAL1; VAV3; PMF1; ATP2B4; SH3YL1; SLC9A3; CXXC5; PCDHGA11; FOXF2; ZNF273; KLRG2; CRB2; SEC16A; GPAM; ASCL2; PAX6; PTGDR2; PLEKHB1; TBX5; STX2; FBRSL1; ATP11A; BTBD6; CRIP2; ONECUT1; ZNF764; IGHV3OR16-17; SALL1; ACTG1; GATA6; KCTD1; CYP2F1; TPTE; CLDN5. A single methylation marker or a combination of multiple methylation markers can be used as a gastric and/or esophageal cancer-specific methylation marker. In one embodiment, the methylation marker is within the 2 kb upstream and 2 kb downstream regions of any of the above genes.
从胰腺癌患者分离的核酸
本发明提供了分离的核酸,其是从受试者的样品分离的。例如,分离的核酸是从胰腺癌患者血浆中的游离DNA分离的。分离的核酸是一种或多种特异性甲基化标志物,优选胰腺癌组织特异性甲基化标志物。甲基化标志物是以下区域或该区域的位点,所述区域是以下基因以及该基因在其所处的染色体中的2.5kb上游区和2.5kb下游区:基因TNFRSF14;基因PGM1;基因CELF3;基因ATP2B4;基因SF3B6;基因CNNM4;基因SP9;基因C2orf82;基因NEU4;基因RPL35A;基因HGFAC;基因EXOC3;基因GDNF;基因NEUROG1;基因HIST1H2BA;基因OSTM1;基因CCR6;基因CCAR2;基因TNFRSF10D;基因TJP2;基因DAB2IP;基因NTMT1;基因MKI67;基因PTGDR2;基因CCDC77;基因MYL2;基因FRY;基因SMEK1;基因BTBD6;基因PIF1;基因SRL;基因SPNS1;基因DNM2;基因ZNF569;基因SDF2L1。该位点是甲基化的位点。本领域技术人员应当理解基因组的基因可以存在突变,因此可以想到这些基因的变体也可以作为甲基化标志物,只要变体中的甲基化位点未发生突变。变体可以包含与任一种基因的序列具有至少70%同一性的序列。选择作为标志物的位点可以包含1个或多个CpG,例如2个CpG、3个CpG、4个CpG、5个CpG、6个CpG、10个CpG、20个CpG或30个CpG。合适的位点的长度可以是130bp-530bp。例如,位点的长度可以是140bp、150bp、160bp、170bp、180bp、190bp、200bp、210bp、220bp、230bp、240bp、250bp、260bp、270bp、280bp、290bp、300bp、310bp、320bp、330bp、340bp、350bp、360bp、370bp、380bp、390bp、400bp、410bp、420bp、430bp、440bp、450bp、460bp、470bp、480bp、490bp、500bp、510bp或520bp。
本领域技术人员理解基因与其上游和下游的区域具备相同或相似的甲基化水平或状态。因此,当本发明发现特定基因内的甲基化位点后可以设想该基因以及在染色体原位的2.5kb上游区和2.5kb下游区也具备相同或相似的甲基化水平或状态。本发明涵盖本发明所述的基因以及该基因在其所处的染 色体中的2kb、1.9kb、1.8kb、1.7kb、1.6kb、1.5kb、1.4kb、1.3kb、1.2kb、1.1kb、1kb、900bp、800bp、700bp、600bp、500bp、400bp、300bp、200bp、100bp、90bp、80bp、70bp、60bp、50bp、40bp、30bp、20bp、10bp或5bp上游区和下游区。
在本文中,本发明使用了以下核苷酸序列作为甲基化标志物。
The chromosomal position coordinates are determined with reference to the human whole-genome sequence hg19. Based on the pancreatic cancer tissue-specific methylation markers identified and the genes in which they are located, those skilled in the art will understand that sites within any of the following genes, or within their upstream or downstream regions, can be used as methylation markers: TNFRSF14; PGM1; CELF3; ATP2B4; SF3B6; CNNM4; SP9; C2orf82; NEU4; RPL35A; HGFAC; EXOC3; GDNF; NEUROG1; HIST1H2BA; OSTM1; CCR6; CCAR2; TNFRSF10D; TJP2; DAB2IP; NTMT1; MKI67; PTGDR2; CCDC77; MYL2; FRY; SMEK1; BTBD6; PIF1; SRL; SPNS1; DNM2; ZNF569; SDF2L1. A single methylation marker or a combination of multiple methylation markers can be used as a pancreatic cancer-specific methylation marker. In one embodiment, the methylation marker is within the 2 kb upstream and 2 kb downstream regions of any of the above genes.
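Andy Feinberg, a pioneer in epigenetics, pointed out that most methylation changes in colon cancer occur not only in promoters and not only in CpG islands, but in the sequences 2 kb upstream of them, termed “CpG island shores” (Andy Feinberg et al., 2009). CpG island shore methylation is closely related to gene expression, is highly conserved in mammals, and can distinguish tissue types. In subsequent studies, researchers observed this phenomenon not only in colorectal cancer but also in breast cancer, gastric cancer, bladder cancer, and some tissue typing studies, finding that the regions adjacent to these target methylation sites are likewise important (Guo YL et al., 2016; Rao X et al., 2013; Dudziec E et al., 2011; Chae H et al., 2016). Therefore, protecting these adjacent regions is as important as protecting the target regions.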
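Kit for diagnosing cancer tissue (one of colorectal cancer, lung cancer, liver cancer, breast cancer, gastric and/or esophageal cancer, or pancreatic cancer)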
根据本发明的甲基化标志物,本领域技术人员可以制备用于检测这些标志物的甲基化水平或状态的试剂盒或装置,用于诊断结直肠癌,或区分结直肠癌与其他泛癌种。试剂盒或装置可以包含检测来自样品的核酸中的一种或多种结直肠癌组织特异性甲基化标志物状态和/或水平的试剂或组件。根据本发明的甲基化标志物,本领域技术人员可以制备用于检测这些标志物的甲基化水平或状态的试剂盒或装置,用于诊断肺癌,或区分肺癌与其他泛癌种。试剂盒或装置可以包含检测来自样品的核酸中的一种或多种肺癌组织特异性甲基化标志物状态和/或水平的试剂或组件。根据本发明的甲基化标志物,本领域技术人员可以制备用于检测这些标志物的甲基化水平或状态的试剂盒或装置,用于诊断肝癌,或区分肝癌与其他泛癌种。试剂盒或装置可以包含检测来自样品的核酸中的一种或多种肝癌组织特异性甲基化标志物状态和/或水平的试剂或组件。根据本发明的甲基化标志物,本领域技术人员可以制备用于检测这些标志物的甲基化水平或状态的试剂盒或装置,用于诊断乳腺癌,或区分乳腺癌与其他泛癌种。试剂盒或装置可以包含检测来自样品的核酸中的一种或多种乳腺癌组织特异性甲基化标志物状态和/或水平的试剂或组件。根据本发明的甲基化标志物,本领域技术人员可以制备用于检测这些标志物的甲基化水平或状态的试剂盒或装置,用于诊断胃癌和/或食管癌,或区分胃癌和/或食管癌与其他泛癌种。试剂盒或装置可以包含检测来自样品的核酸中的一种或多种胃癌和/或食管癌组织特异性甲基化标志物状态和/或水平的试剂或组件。根据本发明的甲基化标志物,本领域技术人员可以制备用于检测这些标志物的甲基化水平或状态的试剂盒或装置,用于诊断胰腺癌,或区分胰腺癌与其他泛癌种。试剂盒或装置可以包含检测来自样品的核酸中的一种或多种胰腺癌组织特异性甲基化标志物状态和/或水平的试剂或组件。例如,试剂或组件可以包含以下一种或多种方法中使用的试剂或组件:基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。试剂可以包含用于检测甲基化标志物的寡核苷酸。例如,寡核苷酸是引物和/或探针。优选地,引物是利用甲基化测序法检测位点的甲基化水平/状态的引物或用于扩增一个或多个甲基化位点的PCR引物。优选地,试剂包含重亚硫酸盐及其衍生物、PCR缓冲液、聚合酶、dNTP、引物、探针、 甲基化敏感或不敏感的限制性内切酶、酶切缓冲液、荧光染料、荧光淬灭剂、荧光报告剂、外切核酸酶、碱性磷酸酶、内标和/或对照物。对照物可以是来自正常受试者或非结直肠癌的癌症患者的前述特异性甲基化标志物。优选地,非结直肠癌的癌症是肺癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。对照物可以来自正常受试者或非肺癌的癌症患者的前述特异性甲基化标志物。优选地,非肺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。对照物可以是来自正常受试者或非肝癌的癌症患者的前述特异性甲基化标志物。优选地,非肝癌的癌症是结直肠癌、肺癌、胃癌、食管癌、胰腺癌和/或乳腺癌。对照物可以是来自正常受试者或非乳腺癌的癌症患者的前述特异性甲基化标志物。优选地,非乳腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或肺癌。对照物可以是来自正常受试者或除胃癌和食管癌以外的癌症患者的前述特异性甲基化标志物。优选地,除胃癌和食管癌以外的癌症或泛癌包括肺癌、肝癌、结直肠癌、胰腺癌和/或乳腺癌。对照物可以是来自正常受试者或非胰腺癌的癌症患者的前述特异性甲基化标志物。优选地,所述非胰腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、乳腺癌和/或肺癌。
用于诊断结直肠癌组织的方法
本发明提供了诊断受试者的结直肠癌的方法,其包括:(1)在受试者的样品中测定本发明的一种或多种结直肠癌组织特异性甲基化标志物的甲基化状态或水平;和(2)基于测定的甲基化状态或水平确定结直肠癌。在一个实施方案中,受试者是癌症患者或有癌症风险的受试者。在一个实施方案中,非结直肠癌的癌症是肺癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,获得所述甲基化水平数据方法可以是测定核酸序列的甲基化水平的任何合适的方法,例如基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
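The invention also provides a method for diagnosing colorectal cancer, comprising: (1) detecting the methylation level of the sequences described herein in a sample from a subject; (2) comparing with a control sample, or calculating a score; and (3) identifying colorectal cancer in the subject based on the score. Typically, before step (1) the method further comprises extracting DNA from the sample and converting unmethylated cytosines in the DNA into a base that does not pair with guanine. In one or more embodiments, the methylation level of the subject's sample is increased or decreased compared with the control sample. When the methylation level satisfies a certain threshold, colorectal cancer is identified. The methylation levels of the tested genes are analyzed mathematically to obtain a score. For a tested sample, a score greater than the threshold gives a result of colorectal cancer; otherwise the result is negative, i.e., a cancer other than colorectal cancer. Conventional methods of mathematical analysis and procedures for determining thresholds are known in the art.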
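The conversion step mentioned above can be simulated on a toy sequence. This is a sketch of the general principle of bisulfite treatment (unmethylated cytosine is converted to uracil and read as thymine after amplification, while methylated cytosine is retained), restricted to one strand; the function names and the sequence are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite conversion followed by amplification, one strand.

    Unmethylated C is read as T; methylated C (0-based positions given in
    methylated_positions) stays C. Other bases are unchanged.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence.upper())
    )

def cpg_methylation_calls(reference, converted):
    """Map each reference CpG position to True if its C was retained."""
    ref = reference.upper()
    return {
        i: converted[i] == "C"
        for i in range(len(ref) - 1)
        if ref[i:i + 2] == "CG"
    }

reference = "ACGTCGAC"  # CpG sites at positions 1 and 4 (0-based)
converted = bisulfite_convert(reference, methylated_positions=[1])
print(converted, cpg_methylation_calls(reference, converted))
```

Comparing the converted read with the reference is what allows each CpG to be called methylated or unmethylated before the levels are summarized into a score.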
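The invention also provides a method comprising: (1) obtaining the methylation levels of the methylation markers described herein in the genomic DNA of colorectal cancer samples and non-colorectal cancer samples; and (2) constructing a logistic regression machine learning model using the methylation level data of the methylation markers, for example AllModel=LogisticRegression(), whose formula is as follows, where x is the methylation level value of the methylation markers in the sample, w is the coefficient of the methylation markers, b is the intercept value, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
and training with the obtained methylation levels of the methylation markers as the training set: AllModel.fit(TrainData,TrainPheno), where TrainData is the training set data and TrainPheno is the phenotype of the training set samples, with colorectal cancer coded as 1 and other cancer types as 0, and determining the relevant threshold of the model from the training set samples. The method further comprises using the methylation levels of the methylation markers in the genomic DNA of a sample to be tested as the test set: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test set data and TestPred is the model prediction score, and judging from the prediction score and the threshold whether the sample is colorectal cancer: a score above the threshold is predicted as colorectal cancer, otherwise as another cancer type. The method can be used for (1) distinguishing colorectal cancer patients from patients with non-colorectal cancers, (2) diagnosing or assisting in the diagnosis of colorectal cancer, or (3) tissue-of-origin tracing of colorectal cancer during pan-cancer screening.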
用于诊断肺癌的方法
本发明提供了诊断受试者的肺癌的方法,其包括:(1)在受试者的样品中测定本发明的一种或多种肺癌组织特异性甲基化标志物的甲基化状态或水平;和(2)基于测定的肺癌组织特异性甲基化状态或水平确定肺癌。在一个实施方案中,受试者是癌症患者或有癌症风险的受试者。在一个实施方案中,非肺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。 在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,获得所述甲基化水平数据方法可以是测定核酸序列的甲基化水平的任何合适的方法,例如基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
本发明还提供一种用于诊断肺癌的方法,包括:(1)检测受试者的样品中本文所述序列的甲基化水平;(2)与对照样品比较,或者通过计算得出评分;(3)根据评分鉴定对象的肺癌。通常,所述方法在步骤(1)之前还包括:样品DNA的提取和将DNA上未甲基化的胞嘧啶转化为不与鸟嘌呤结合的碱基。在一个或多个实施方案中,与对照样品比较时,受试者样品的甲基化水平升高或降低。当甲基化水平满足某一阈值时,则鉴定为肺癌。对所测基因的甲基化水平进行数学分析,获得得分。对于检测的样品而言,当得分大于阈值,则判定结果为肺癌,否则为阴性,即除肺癌外的癌症。本领域知晓常规数学分析的方法以及确定阈值的过程。
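The invention further provides a method comprising: (1) obtaining the methylation levels of the methylation markers described herein in genomic DNA from lung cancer samples and from samples of cancers other than lung cancer; and (2) using the methylation-level data of the markers to construct a logistic regression machine learning model with the formula
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
where x is the methylation level of the methylation markers, w is the coefficient of each methylation marker, b is the intercept, and y is the model prediction score;
and training the model with the obtained methylation levels as the training set: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data and TrainPheno is the phenotype of the training-set samples (lung cancer is 1, other cancer types are 0), and determining the model threshold from the training-set samples. The method further comprises using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as the test set: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data and TestPred is the model prediction score. The prediction score is compared with the threshold: a sample scoring above the threshold is predicted as lung cancer, otherwise as another cancer type. The method can be used (1) to distinguish lung cancer patients from patients with other cancers, (2) to diagnose or assist in diagnosing lung cancer, or (3) to trace lung cancer as the tissue of origin during pan-cancer screening.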
用于诊断肝癌的方法
本发明提供了诊断受试者的肝癌的方法,其包括:(1)在受试者的样品中测定本发明的一种或多种甲基化标志物的甲基化状态或水平;和(2)基于测定的甲基化状态或水平确定肝癌。在一个实施方案中,受试者是癌症患者或有癌症风险的受试者。在一个实施方案中,非肝癌的癌症是结直肠癌、肺癌、胃癌、食管癌、胰腺癌和/或乳腺癌。在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,获得所述甲基化水平数据方法可以是测定核酸序列的甲基化水平的任何合适的方法,例如基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
本发明还提供一种用于诊断肝癌的方法,包括:(1)检测受试者的样品中本文所述序列的甲基化水平;(2)与对照样品比较,或者通过计算得出评分;(3)根据评分鉴定对象的肝癌。通常,所述方法在步骤(1)之前还包括:样品DNA的提取和将DNA上未甲基化的胞嘧啶转化为不与鸟嘌呤结合的碱基。在一个或多个实施方案中,与对照样品比较时,受试者样品的甲基化水平升高或降低。当甲基化水平满足某一阈值时,则鉴定为肝癌。对所测基因的甲基化水平进行数学分析,获得得分。对于检测的样品而言,当得分大于阈值,则判定结果为肝癌,否则为阴性,即除肝癌外的癌症。本领域知晓常规数学分析的方法以及确定阈值的过程。
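The invention further provides a method comprising: (1) obtaining the methylation levels of the methylation markers described herein in genomic DNA from liver cancer samples and from samples of cancers other than liver cancer; and (2) using the methylation-level data of the markers to construct a logistic regression machine learning model with the formula
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
where x is the methylation level of the methylation markers, w is the coefficient of each methylation marker, b is the intercept, and y is the model prediction score;
and training the model with the obtained methylation levels as the training set: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data and TrainPheno is the phenotype of the training-set samples (liver cancer is 1, other cancer types are 0), and determining the model threshold from the training-set samples. The method further comprises using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as the test set: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data and TestPred is the model prediction score. The prediction score is compared with the threshold: a sample scoring above the threshold is predicted as liver cancer, otherwise as another cancer type. The method can be used (1) to distinguish liver cancer patients from patients with other cancers, (2) to diagnose or assist in diagnosing liver cancer, or (3) to trace liver cancer as the tissue of origin during pan-cancer screening.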
诊断受试者的乳腺癌的方法
本发明提供了诊断受试者的乳腺癌的方法,其包括:(1)在受试者的样品中测定本发明的一种或多种甲基化标志物的甲基化状态或水平;和(2)基于测定的甲基化状态或水平确定乳腺癌。在一个实施方案中,受试者是癌症患者或有癌症风险的受试者。在一个实施方案中,非乳腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或肺癌。在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,获得所述甲基化水平数据方法可以是测定核酸序列的甲基化水平的任何合适的方法,例如基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
本发明还提供一种用于诊断乳腺癌的方法,包括:(1)检测受试者的样品中本文所述序列的甲基化水平;(2)与对照样品比较,或者通过计算得出评分;(3)根据评分鉴定受试者的乳腺癌。通常,所述方法在步骤(1)之前还包括:样品DNA的提取和将DNA上未甲基化的胞嘧啶转化为不与鸟嘌呤结合的碱基。在一个或多个实施方案中,与对照样品比较时,受试者样品的甲基化水平升高或降低。当甲基化水平满足某一阈值时,则鉴定为乳腺癌。对所测基因的甲基化水平进行数学分析,获得得分。对于检测的样品而言,当得分大于阈值,则判定结果为乳腺癌,否则为阴性,即除乳腺癌外的癌症。本领域知晓常规数学分析的方法以及确定阈值的过程。
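The invention further provides a method comprising: (1) obtaining the methylation levels of the methylation markers described herein in genomic DNA from breast cancer samples and from samples of cancers other than breast cancer; and (2) using the methylation-level data of the markers to construct a logistic regression machine learning model with the formula
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
where x is the methylation level of the methylation markers, w is the coefficient of each methylation marker, b is the intercept, and y is the model prediction score;
and training the model with the obtained methylation levels as the training set: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data and TrainPheno is the phenotype of the training-set samples (breast cancer is 1, other cancer types are 0), and determining the model threshold from the training-set samples. The method further comprises using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as the test set: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data and TestPred is the model prediction score. The prediction score is compared with the threshold: a sample scoring above the threshold is predicted as breast cancer, otherwise as another cancer type. The method can be used (1) to distinguish breast cancer patients from patients with other cancers, (2) to diagnose or assist in diagnosing breast cancer, or (3) to trace breast cancer as the tissue of origin during pan-cancer screening.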
诊断受试者的胃癌和/或食管癌的方法
本发明提供了诊断受试者的胃癌和/或食管癌的方法,其包括:(1)在受试者的样品中测定本发明的一种或多种甲基化标志物的甲基化状态或水平;和(2)基于测定的甲基化状态或水平确定胃癌和/或食管癌。在一个实施方案中,受试者是癌症患者或有癌症风险的受试者。在一个实施方案中,除胃癌和食管癌以外的癌症或泛癌包括肺癌、肝癌、结直肠癌、胰腺癌和/或乳腺癌。在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,获得所述甲基化水平数据方法可以是测定核酸序列的甲基化水平的任何合适的方法,例如基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
本发明还提供一种用于诊断胃癌和/或食管癌的方法,包括:(1)检测受 试者的样品中本文所述序列的甲基化水平;(2)与对照样品比较,或者通过计算得出评分;(3)根据评分鉴定受试者的胃癌和/或食管癌。通常,所述方法在步骤(1)之前还包括:样品DNA的提取和将DNA上未甲基化的胞嘧啶转化为不与鸟嘌呤结合的碱基。在一个或多个实施方案中,与对照样品比较时,受试者样品的甲基化水平升高或降低。当甲基化水平满足某一阈值时,则鉴定为胃癌和/或食管癌。对所测基因的甲基化水平进行数学分析,获得得分。对于检测的样品而言,当得分大于阈值,则判定结果为胃癌和/或食管癌,否则为阴性,即除胃癌和食管癌外的癌症。本领域知晓常规数学分析的方法以及确定阈值的过程。
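The invention further provides a method comprising: (1) obtaining the methylation levels of the methylation markers described herein in genomic DNA from gastric and/or esophageal cancer samples and from samples of cancers other than gastric and esophageal cancer; and (2) using the methylation-level data of the markers to construct a logistic regression machine learning model with the formula
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
where x is the methylation level of the methylation markers, w is the coefficient of each methylation marker, b is the intercept, and y is the model prediction score;
and training the model with the obtained methylation levels as the training set: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data and TrainPheno is the phenotype of the training-set samples (gastric and/or esophageal cancer is 1, other cancer types are 0), and determining the model threshold from the training-set samples. The method further comprises using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as the test set: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data and TestPred is the model prediction score. The prediction score is compared with the threshold: a sample scoring above the threshold is predicted as gastric and/or esophageal cancer, otherwise as another cancer type. The method can be used (1) to distinguish gastric and/or esophageal cancer patients from patients with cancers other than gastric and esophageal cancer, (2) to diagnose or assist in diagnosing gastric and/or esophageal cancer, or (3) to trace gastric and/or esophageal cancer as the tissue of origin during pan-cancer screening.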
用于诊断胰腺癌的方法
本发明提供了诊断受试者的胰腺癌的方法,其包括:(1)在受试者的样品中测定本发明的一种或多种甲基化标志物的甲基化状态或水平;和(2)基于测定的甲基化状态或水平确定胰腺癌。在一个实施方案中,受试者是癌症患者或有癌症风险的受试者。在一个实施方案中,非胰腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、乳腺癌和/或肺癌。在一个实施方案中,样品为细胞、组织、细针穿刺活检物或血浆。在一个实施方案中,获得所述甲基化水平数据方法可以是测定核酸序列的甲基化水平的任何合适的方法,例如基于重亚硫酸盐转化的PCR、DNA测序、甲基化敏感的限制性内切酶分析法、荧光定量法、甲基化敏感性高分辨率熔解曲线法和基于芯片的甲基化图谱分析和质谱法。
本发明还提供一种用于诊断胰腺癌的方法,包括:(1)检测受试者的样品中本文所述序列的甲基化水平;(2)与对照样品比较,或者通过计算得出评分;(3)根据评分鉴定受试者的胰腺癌。通常,所述方法在步骤(1)之前还包括:样品DNA的提取和将DNA上未甲基化的胞嘧啶转化为不与鸟嘌呤结合的碱基。在一个或多个实施方案中,与对照样品比较时,受试者样品的甲基化水平升高或降低。当甲基化水平满足某一阈值时,则鉴定为胰腺癌。对所测基因的甲基化水平进行数学分析,获得得分。对于检测的样品而言,当得分大于阈值,则判定结果为胰腺癌,否则为阴性,即非胰腺癌的癌症。本领域知晓常规数学分析的方法以及确定阈值的过程。
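The invention further provides a method comprising: (1) obtaining the methylation levels of the methylation markers described herein in genomic DNA from pancreatic cancer samples and from samples of cancers other than pancreatic cancer; and (2) using the methylation-level data of the markers to construct a logistic regression machine learning model with the formula
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
where x is the methylation level of the methylation markers, w is the coefficient of each methylation marker, b is the intercept, and y is the model prediction score;
and training the model with the obtained methylation levels as the training set: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data and TrainPheno is the phenotype of the training-set samples (pancreatic cancer is 1, other cancer types are 0), and determining the model threshold from the training-set samples. The method further comprises using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as the test set: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data and TestPred is the model prediction score. The prediction score is compared with the threshold: a sample scoring above the threshold is predicted as pancreatic cancer, otherwise as another cancer type. The method can be used (1) to distinguish pancreatic cancer patients from patients with other cancers, (2) to diagnose or assist in diagnosing pancreatic cancer, or (3) to trace pancreatic cancer as the tissue of origin during pan-cancer screening.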
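System or device
The invention also provides a system or device. The system or device may comprise a computer-readable storage medium or memory for storing programs or instructions. The programs or instructions may be used to execute a prediction model, constructed from one or more colorectal cancer tissue-specific methylation markers of the invention, that distinguishes colorectal cancer from other non-colorectal cancers, or to carry out the methods of the invention. The programs or instructions may be used to execute the prediction model of the invention that distinguishes lung cancer from other non-lung cancers, or to carry out the methods of the invention. The programs or instructions may be used to execute the prediction model of the invention that distinguishes liver cancer from other non-liver cancers, or to carry out the methods of the invention. The programs or instructions may be used to execute the prediction model of the invention that distinguishes breast cancer from other non-breast cancers, or to carry out the methods of the invention. The programs or instructions may be used to execute a prediction model, constructed from one or more methylation markers of the invention, that distinguishes gastric and/or esophageal cancer from cancers other than gastric and esophageal cancer, or to carry out the methods of the invention. The programs or instructions may be used to execute the prediction model of the invention that distinguishes pancreatic cancer from other non-pancreatic cancers, or to carry out the methods of the invention. Computer-readable storage media or memory include, but are not limited to, tangible storage media, carrier-wave media and physical transmission media. Non-volatile storage media include, for example, optical or magnetic disks, such as any storage device in any computer; volatile storage media include dynamic memory, such as the main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire and optical fiber, including the wires that make up the bus within a computer system. Carrier-wave transmission media may take the form of electrical or electromagnetic signals, or of acoustic or light waves, such as those generated during radio-frequency and infrared data communication. Common forms of computer-readable media therefore include, for example: floppy disks, flexible disks, hard disks, magnetic tape, any other magnetic medium, CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards and paper tape, any other physical storage medium with patterns of holes, RAM, ROM, PROM and EPROM, FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution. The memory and the processor may be physically separate. In that case, the operative connection may be implemented through wired and wireless connections between the units that allow data transfer. Wireless connections may use wireless LAN (WLAN) or the Internet. Wired connections may be implemented through optical and non-optical cable connections between the units. The cables used for wired connections are further suited to high-throughput data transfer.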
诊断结直肠癌的用途
本发明还提供了分离的核酸或试剂或组件在制备试剂盒或装置中的用途,所述试剂盒或装置用于(1)区分结直肠癌患者与非结直肠癌的癌症患者;(2)用于诊断或辅助诊断结直肠癌;或者(3)用于泛癌筛查过程中对结直肠癌的组织溯源。优选地,非结直肠癌的癌症是肺癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。试剂盒或装置可以包含用于以各种可用的方法测定甲基化水平的试剂。
用于诊断肺癌的用途
本发明还提供了分离的核酸或试剂或组件在制备试剂盒或装置中的用途,所述试剂盒或装置用于(1)区分肺癌患者与非肺癌的癌症患者;(2)用于诊断或辅助诊断肺癌;或者(3)用于泛癌筛查过程中对肺癌的组织溯源。优选地,非肺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、胰腺癌和/或乳腺癌。试剂盒或装置可以包含用于以各种可用的方法测定甲基化水平的试剂。
用于诊断肝癌的用途
本发明还提供了分离的核酸或试剂或组件在制备试剂盒或装置中的用途,所述试剂盒或装置用于(1)区分肝癌患者与非肝癌的癌症患者;(2)用于诊断或辅助诊断肝癌;或者(3)用于泛癌筛查过程中对肝癌的组织溯源。优选地,非肝癌的癌症是结直肠癌、肺癌、胃癌、食管癌、胰腺癌和/或乳腺癌。试剂盒或装置可以包含用于以各种可用的方法测定甲基化水平的试剂。
用于诊断乳腺癌的用途
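The invention also provides the use of an isolated nucleic acid, or of a reagent or component, in the preparation of a kit or device, the kit or device being for (1) distinguishing breast cancer patients from patients with cancers other than breast cancer; (2) diagnosing or assisting in diagnosing breast cancer; or (3) tracing breast cancer as the tissue of origin during pan-cancer screening. Preferably, the cancer other than breast cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or lung cancer. The kit or device may contain reagents for determining methylation levels by any of the available methods.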
诊断胃癌和/或食管癌的用途
本发明还提供了分离的核酸或试剂或组件在制备试剂盒或装置中的用途,所述试剂盒或装置用于(1)区分胃癌和/或食管癌患者与除胃癌和食管癌以外的癌症患者;(2)用于诊断或辅助诊断胃癌和/或食管癌;或者(3)用于泛癌筛查过程中对胃癌和/或食管癌的组织溯源。优选地,除胃癌和食管癌以外的癌症或泛癌包括肺癌、肝癌、结直肠癌、胰腺癌和/或乳腺癌。试剂盒或装置可以包含用于以各种可用的方法测定甲基化水平的试剂。
用于诊断胰腺癌的用途
本发明还提供了分离的核酸或试剂或组件在制备试剂盒或装置中的用途,所述试剂盒或装置用于(1)区分胰腺癌患者与非胰腺癌的癌症患者;(2)用于诊断或辅助诊断胰腺癌;或者(3)用于泛癌筛查过程中对胰腺癌的组织溯源。优选地,非胰腺癌的癌症是结直肠癌、肝癌、胃癌、食管癌、乳腺癌和/或肺癌。试剂盒或装置可以包含用于以各种可用的方法测定甲基化水平的试剂。
实施例
下面结合附图和具体实施例对本发明作进一步详细的说明。下列实施例中,未注明具体条件的实验方法,通常按常规条件中所述的方法进行。
实施例1.1:甲基化靶向测序筛选结直肠癌特异性的甲基化位点
发明人收集了总计539个各个癌种的患者,所有入组患者签署知情同意书。将这些样本按照一定的比例分为训练集和测试集,其中训练集用于下述机器学习模型的构建,测试集用于模型的性能测试,样本信息见下表1.1。
表1.1:各个癌种血浆样本数量统计表

通过申请人自主研发的MethylTitanTM的方法获得目标样本血浆cfDNA的甲基化测序数据,鉴别出其中的DNA甲基化分类标志物。过程如下:
1、血浆cfDNA样本的提取
采用streck血液收集管收集患者2ml全血样本,及时离心分离血浆(3天内),转运至实验室后,采用QIAGEN QIAamp Circulating Nucleic Acid Kit试剂盒根据说明书提取cfDNA。
2、Illumina常规测序及数据预处理
a)文库用Illumina Nextseq 500测序仪进行双端测序。
b)Pear(v0.6.0)软件将Illumina Hiseq X10/Nextseq 500/Novaseq测序仪下机的双端150bp测序的同一片段双端测序数据合并成一条序列,最短重叠长度20bp,合并之后最短30bp。
c)使用Trim_galore v 0.6.0、cutadapt v1.8.1软件对合并后的测序数据进行去接头处理。在序列的5’端去除接头序列为“AGATCGGAAGAGCAC”,并去除两端测序质量值低于20的碱基。
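3. Alignment of sequencing data
The reference genome used herein is from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
a) First, Bismark is used to convert HG19 in silico from cytosine to thymine (CT) and from guanine to adenine (GA), and a Bowtie2 index is built for each converted genome.
b) The output data of the Illumina Nextseq 500 sequencer are likewise CT- and GA-converted.
c) Bowtie2 is used to align the converted reads to the converted HG19 reference genomes, with a minimum seed length of 20 and no mismatches allowed in the seed.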
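The CT and GA conversions in the alignment steps can be illustrated with a short sketch. This is only a toy illustration of the idea (every C becomes T, or every G becomes A, so that bisulfite-induced changes no longer count as mismatches during alignment); it is not Bismark's implementation, and the function names are hypothetical.

```python
def convert_ct(seq: str) -> str:
    """Replace every C with T (forward-strand bisulfite conversion)."""
    return seq.upper().replace("C", "T")


def convert_ga(seq: str) -> str:
    """Replace every G with A (the same conversion seen from the reverse strand)."""
    return seq.upper().replace("G", "A")


read = "ACGTTCGA"
ct_read = convert_ct(read)  # "ATGTTTGA"
ga_read = convert_ga(read)  # "ACATTCAA"
```

The methylation state is recovered afterwards by comparing the original read with the unconverted reference at each aligned cytosine position.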
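4. Calculation of methylation haplotype frequency (MHF)
For each CpG site in each HG19 target region, the methylation state of the site is obtained from the alignment results above. Nucleotide numbering of sites herein corresponds to HG19 nucleotide positions. A target methylation region may contain several methylation haplotypes, and the value must be calculated for every methylation haplotype in the target region. An example of the MHF formula is:
$$MHF_{i,h} = \frac{N_{i,h}}{N_{i}}$$
where i denotes the target methylation region, h denotes the target methylation haplotype, $N_{i}$ denotes the number of reads located in the target methylation region, and $N_{i,h}$ denotes the number of reads containing the target methylation haplotype.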
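The MHF definition above (reads carrying haplotype h divided by all reads in region i) can be sketched in a few lines. The string encoding of a haplotype, with 1 for a methylated CpG and 0 for an unmethylated one, is an assumption made for this example and is not specified in the text.

```python
from collections import Counter


def methylation_haplotype_frequencies(haplotypes):
    """Return {haplotype: N_ih / N_i} for the reads covering one target region.

    `haplotypes` holds one string per read, e.g. "1101", where 1 marks a
    methylated CpG and 0 an unmethylated CpG (hypothetical encoding).
    """
    n_i = len(haplotypes)  # N_i: reads located in the target region
    if n_i == 0:
        return {}
    counts = Counter(haplotypes)  # N_ih for every observed haplotype h
    return {h: n_ih / n_i for h, n_ih in counts.items()}


reads = ["1111", "1111", "1101", "0000"]
mhf = methylation_haplotype_frequencies(reads)
```

For the four toy reads, the fully methylated haplotype "1111" has an MHF of 0.5 and the frequencies of all haplotypes in the region sum to 1.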
5、甲基化数据矩阵
a)将训练集和测试集的各个样本的甲基化测序数据(甲基化单倍型频率)分别合并成数据矩阵,对每个深度低于100的位点做缺失值处理。
b)去除缺失值比例高于10%的位点。
c)对于数据矩阵的缺失值,利用KNN算法进行缺失数据插补。
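The three matrix-preparation steps above (mask low-depth sites, drop sites with too many missing values, impute the remainder with KNN) can be sketched as follows. The depth cutoff of 100 and the 10% limit come from the text; the use of scikit-learn's `KNNImputer`, the choice of 5 neighbours and the toy data are assumptions of this sketch.

```python
import numpy as np
from sklearn.impute import KNNImputer


def build_matrix(mhf, depth, min_depth=100, max_missing=0.10, n_neighbors=5):
    """mhf and depth are arrays of shape (n_samples, n_sites)."""
    # a) treat every value measured at insufficient depth as missing
    x = np.where(depth < min_depth, np.nan, mhf.astype(float))
    # b) drop sites whose missing fraction exceeds the limit
    keep = np.isnan(x).mean(axis=0) <= max_missing
    x = x[:, keep]
    # c) fill the remaining gaps from the most similar samples
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(x)
    return imputed, keep


rng = np.random.default_rng(0)
mhf = rng.uniform(0, 1, (20, 3))
depth = np.full((20, 3), 500)
depth[0, 0] = 50    # site 0: 5% missing, kept and imputed
depth[:4, 1] = 50   # site 1: 20% missing, dropped
matrix, keep = build_matrix(mhf, depth)
```

Returning the `keep` mask makes it possible to apply the same site selection to another matrix, for example the test set.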
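6. Identifying colorectal cancer tissue-specific methylation markers from the training-set samples
a) For each methylation haplotype marker, the AUC for colorectal cancer versus other cancer types is computed in the training set and the markers are ranked from highest to lowest; markers that separate colorectal cancer from other cancer types well are selected as candidate markers;
b) A logistic regression model is built on the training set from the methylation markers selected in the previous step, and its performance is then validated on the test-set samples. This step is based mainly on the LogisticRegression function in the linear_model module of the python3 sklearn package. The specific steps are:
1. Standardize the training-set data with StandardScaler and save the fitted transformation, x*=(x-μ)/σ, where μ is the mean of all sample data and σ is the standard deviation of all sample data;
2. Feed the standardized data to the LogisticRegression function to train the logistic regression model;
3. Apply the saved standardization to the test-set data;
4. Apply the trained logistic regression model to the test-set samples.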
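Steps 1 to 4 above can be sketched as follows, on synthetic data whose values and seed are assumptions of this example. The point of the sketch is that the scaler is fitted on the training set only and then reused unchanged on the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0.7, 0.1, (30, 4)), rng.normal(0.2, 0.1, (60, 4))])
y_train = np.array([1] * 30 + [0] * 60)
X_test = np.vstack([rng.normal(0.7, 0.1, (10, 4)), rng.normal(0.2, 0.1, (20, 4))])

# Step 1: learn the per-marker mean and standard deviation on the training set.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)  # x* = (x - mu) / sigma

# Step 2: train the logistic regression model on the standardized data.
model = LogisticRegression().fit(X_train_std, y_train)

# Step 3: reuse the training-set mu and sigma on the test set.
X_test_std = scaler.transform(X_test)

# Step 4: score the test-set samples.
test_scores = model.predict_proba(X_test_std)[:, 1]
```

Refitting the scaler on the test set would leak test-set statistics into the evaluation, which is why the saved transformation is applied instead.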
筛选出的结直肠癌组织特异性的甲基化标志物具体见表1.2。
这些结直肠癌组织特异性甲基化标志物在结直肠癌与其他6种癌种中的甲基化水平如下表1.2和图1。图2显示了这些结直肠癌组织特异性甲基化标志物在训练集和测试集中结直肠癌与其它癌种相比都具有显著性的差异(u检验p值小于0.05),且甲基化水平也具有较大差别。
表1.2在训练集和测试集中甲基化标志物在结直肠癌和其他6种癌种中的甲基化水平均值

以单个结直肠癌组织特异性甲基化标志物Seq ID NO:52为例,查看该结直肠癌组织特异性标志物在七个癌种中甲基化水平在训练集和测试集中的分布分别如图3和图4所示,可看出该结直肠癌组织特异性标志物的甲基化水平在结直肠癌中和其他癌种相比具有显著性的差异(wilcox检验:P<=0.05),是良好的结直肠癌组织特异性甲基化标志物。
实施例1.2:单个结直肠癌组织特异性甲基化标志物的判别性能
为了验证单个结直肠癌组织特异性甲基化标志物的判别性能,在实施例1.1划分的训练集中使用单个结直肠癌组织特异性甲基化标志物甲基化水平的数据构建逻辑回归模型,并确定阈值后,然后在测试集进行预测。具体步骤如下:
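1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target marker, w is the coefficient of each marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the data of the target methylation sites in the training-set samples and TrainPheno is the phenotype of the training-set samples (colorectal cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation sites in the test-set samples and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is colorectal cancer.
4. Compute the model's AUC and, at the determined threshold, the sensitivity, specificity, accuracy and other metrics.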
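The metrics named in step 4 can be computed as in the sketch below. The helper name `evaluate` and the toy labels and scores are assumptions of this example; `roc_auc_score` is scikit-learn's standard AUC function.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def evaluate(y_true, scores, threshold):
    """AUC plus sensitivity, specificity and accuracy at a fixed threshold."""
    y_true = np.asarray(y_true)
    calls = np.asarray(scores) > threshold  # True = predicted target cancer
    tp = int(np.sum(calls & (y_true == 1)))
    tn = int(np.sum(~calls & (y_true == 0)))
    fp = int(np.sum(calls & (y_true == 0)))
    fn = int(np.sum(~calls & (y_true == 1)))
    return {
        "auc": roc_auc_score(y_true, scores),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.7, 0.3, 0.1, 0.1, 0.05]
metrics = evaluate(y_true, scores, threshold=0.5)
```

On this toy input the threshold of 0.5 yields 2 true positives, 1 false negative, 1 false positive and 4 true negatives, so sensitivity is 2/3, specificity 0.8 and accuracy 0.75, while the threshold-independent AUC is 0.8.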
39个结直肠癌组织特异性的甲基化标志物在训练集和测试集中的表现如表1.3所示,在训练集中每个结直肠癌组织特异性甲基化标志物都可以达到0.70以上的AUC,准确率达到了77%以上,在测试集中单个结直肠癌组织特异性甲基化标志物最低AUC也达到了0.70以上,准确率达到了70%以上,可看出这些结直肠癌组织特异性甲基化标志物都是较好的结直肠癌组织特异性的标志物,可以较好地区分结直肠癌与其它癌种。
表1.3单个结直肠癌组织特异性甲基化标志物的判别性能


实施例1.3:所有目标结直肠癌组织特异性甲基化标志物的机器学习模型
本实施例使用所有的39个结直肠癌组织特异性甲基化标志物的甲基化水平构建了逻辑回归的机器学习模型,用以从多个癌种数据中准确区分结直肠癌的样本。使用实施例1.1中训练集的样本进行模型训练,再使用测试集的样本对模型的效果进行测试,具体步骤如下:
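1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target methylation markers, w is the coefficient of each methylation marker, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data (methylation haplotype frequencies) and TrainPheno is the phenotype of the training-set samples (colorectal cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data (methylation haplotype frequencies) and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is colorectal cancer.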
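The distributions of model prediction scores in the training and test sets are shown in Figure 5; the scores of colorectal cancer samples differ significantly from those of other cancer types (Wilcoxon test: P<=0.05). The ROC curve is shown in Figure 6. In the test set, the AUC for distinguishing colorectal cancer from other cancer types reached 0.902. With the threshold set to 0.076 (scores above it are predicted as colorectal cancer, otherwise as another cancer type), sensitivity reached 66.7% at a specificity of 85% and the overall prediction accuracy reached 84.5%, so the model distinguishes colorectal cancer well among samples of the 7 cancer types.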
实施例1.4:结直肠癌组织特异性标志物组合1机器学习模型
为了验证相关结直肠癌组织特异性甲基化标志物组合的效果,本实施例从所有39个结直肠癌组织特异性甲基化标志物中选取了Seq ID NO:52,Seq ID NO:59,Seq ID NO:62,Seq ID NO:64,Seq ID NO:73,Seq ID NO:83,一共6个结直肠癌组织特异性甲基化标志物构建新的机器学习模型。
机器学习模型构建的方法同实施例1.3一致,相关样本只选用了目标的6个结直肠癌组织特异性甲基化位点的数据,该模型在训练集和测试集中的模型得分见图7,该模型ROC曲线见图8。可看出该模型在训练集和测试集中,结直肠癌样本分值同其他癌种分值具有显著差异(wilcox test:P<=0.05),该模型测试集AUC达到了0.931,阈值设成0.055时,大于该值预测为结直肠癌,小于该值预测为其他癌种,在特异性为93.4%时,敏感性达到了66.7%,整体的准确率达到了92.5%,说明了该结直肠癌组织特异性标志物组合构建模型良好的性能。
实施例1.5:结直肠癌组织特异性标志物组合2机器学习模型
该实施例从39个结直肠癌组织特异性甲基化标志物中选择了另一个结直肠癌组织特异性甲基化标志物的组合:Seq ID NO:52,Seq ID NO:54,Seq ID NO:61,Seq ID NO:64,Seq ID NO:66,Seq ID NO:69,Seq ID NO:71,Seq ID NO:74,Seq ID NO:76,Seq ID NO:87,一共10个结直肠癌组织特异性甲基化标志物进行机器学习模型的构建。
该模型构建方法同样与实施例1.3一致,相关样本只使用了目标10个结直肠癌组织特异性甲基化位点的数据。该模型在训练集和测试集中的模型得分见图9,ROC曲线见图10。从图中可看出该模型在训练集和测试集中,结直肠癌样本得分显著高于其它癌种得分(wilcox test:P<=0.05),该模型测试集的AUC达到了0.902,阈值设置为0.059时,在特异性为90.6%时,敏感性达到了66.7%,整体的准确性可达到89.8%,同样可以较好的区分结直肠癌和其它癌种。
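From methylation NGS sequencing data covering 7 cancer types, the present application screened out 39 colorectal cancer-specific methylation markers. Machine learning models built from the methylation-level data of these colorectal cancer tissue-specific methylation markers distinguish colorectal cancer samples well among data from the 7 cancer types, providing an important reference for tracing colorectal cancer as the tissue of origin in pan-cancer early screening.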




实施例2.1:甲基化靶向测序筛选肺癌特异性的甲基化位点
发明人收集了总计490例各个癌种的患者,所有入组患者签署知情同意书。将这些样本按照一定的比例分为训练集和测试集,其中训练集用于下述机器学习模型的构建,测试集用于模型的性能测试,样本信息见下表2.1,训练集中肺癌样本总数为51个,测试集中肺癌样本总数为20个。
表2.1各个癌种血浆样本数量统计表

通过申请人自主研发的MethylTitanTM的方法获得目标样本血浆cfDNA的甲基化测序数据,鉴别出其中的DNA甲基化分类标志物。过程如下:
1、血浆cfDNA样本的提取
采用streck血液收集管收集患者2ml全血样本,及时离心分离血浆(3天内),转运至实验室后,采用QIAGEN QIAamp Circulating Nucleic Acid Kit试剂盒根据说明书提取cfDNA。
2、测序及数据预处理
a)文库用Illumina Nextseq 500测序仪进行双端测序。
b)Pear(v0.6.0)软件将Illumina Hiseq X10/Nextseq 500/Novaseq测序仪下机的双端150bp测序的同一片段双端测序数据合并成一条序列,最短重叠长度20bp,合并之后最短30bp。
c)使用Trim_galore v 0.6.0、cutadapt v1.8.1软件对合并后的测序数据进行去接头处理。在序列的5’端去除接头序列为“AGATCGGAAGAGCAC”,并去除两端测序质量值低于20的碱基。
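3. Alignment of sequencing data
The reference genome used herein is from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
a) First, Bismark is used to convert HG19 in silico from cytosine to thymine (CT) and from guanine to adenine (GA), and a Bowtie2 index is built for each converted genome.
b) The preprocessed output data of the Illumina Nextseq 500 sequencer are likewise CT- and GA-converted.
c) Bowtie2 is used to align the converted reads to the converted HG19 reference genomes, with a minimum seed length of 20 and no mismatches allowed in the seed.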
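4. Calculation of methylation haplotype frequency (MHF)
For each CpG site in each HG19 target region, the methylation state of the site is obtained from the alignment results above. Nucleotide numbering of sites herein corresponds to HG19 nucleotide positions. A target methylation region may contain several methylation haplotypes, and the value must be calculated for every methylation haplotype in the target region. An example of the MHF formula is:
$$MHF_{i,h} = \frac{N_{i,h}}{N_{i}}$$
where i denotes the target methylation region, h denotes the target methylation haplotype, $N_{i}$ denotes the number of reads located in the target methylation region, and $N_{i,h}$ denotes the number of reads containing the target methylation haplotype.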
5、甲基化数据矩阵
a)将训练集和测试集的各个样本的甲基化测序数据(甲基化单倍型频率)分别合并成数据矩阵,对每个深度低于200的位点做缺失值处理。
b)去除缺失值比例高于10%的位点。
c)对于数据矩阵的缺失值,利用KNN算法进行缺失数据插补。
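6. Identifying lung cancer tissue-specific methylation markers from the training-set samples
a) For each methylation haplotype marker, the AUC for lung cancer versus other cancer types is computed in the training set and the markers are ranked from highest to lowest; markers that separate lung cancer from other cancer types well are selected as candidate markers;
b) A logistic regression model is built on the training set from the methylation markers selected in the previous step, and its performance is then validated on the test-set samples. This step is based mainly on the LogisticRegression function in the linear_model module of the python3 sklearn package. The specific steps are:
1. Standardize the training-set data with StandardScaler and save the fitted transformation, x*=(x-μ)/σ, where μ is the mean of all sample data and σ is the standard deviation of all sample data;
2. Feed the standardized data to the LogisticRegression function to train the logistic regression model;
3. Apply the saved standardization to the test-set data;
4. Apply the trained logistic regression model to the test-set samples.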
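The marker ranking in step a) above, where each marker is scored by its own AUC and the list is sorted from highest to lowest, can be sketched as follows. The function name, marker names and toy values are hypothetical; how many top-ranked markers are retained is not stated in the text and is left open here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def rank_markers_by_auc(X, y, names):
    """Return (name, auc) pairs sorted from highest to lowest single-marker AUC."""
    ranked = [(name, roc_auc_score(y, X[:, j])) for j, name in enumerate(names)]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Rows are training samples; label 1 = target cancer, 0 = other cancer types.
y = np.array([1, 1, 1, 0, 0, 0])
X = np.array([
    [0.9, 0.5, 0.1],
    [0.8, 0.1, 0.2],
    [0.7, 0.6, 0.3],
    [0.1, 0.4, 0.9],
    [0.2, 0.2, 0.8],
    [0.3, 0.3, 0.7],
])
ranking = rank_markers_by_auc(X, y, ["marker_A", "marker_B", "marker_C"])
```

Note that a marker that is consistently lower in the target cancer, such as `marker_C` here, gets an AUC near 0 under this plain ranking even though it is informative, so a real screen may want to treat hypo- and hypermethylated markers separately.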
这些甲基化标志物在肺癌与其他6种癌种中的甲基化水平如下表2.2和图11和图12所示。这些甲基化标志物在训练集和测试集中肺癌与其它癌种相比都具有显著性的差异(u检验,p值小于0.05),且甲基化水平也具有较大差别。
表2.2在训练集和测试集中甲基化标志物在肺癌与其他6种癌种中的甲基化水平均值


以单个肺癌组织特异性甲基化标志物Seq ID NO:91为例查看该肺癌组织特异性标志物在七个癌种中甲基化水平在训练集和测试集中的分布分别如图13和图14所示,可看出该肺癌组织特异性标志物的甲基化水平在肺癌中相比其它6个癌种都具有显著性的差异(wilcox test:P<=0.05),是良好的肺癌组织特异性甲基化标志物。
实施例2.2:单个肺癌组织特异性甲基化标志物判别性能
为了验证单个肺癌组织特异性甲基化标志物的区分肺癌与其它6个癌种的潜力,使用单个肺癌组织特异性甲基化标志物的甲基化水平数据在实施例2.1训练集数据中训练模型,并使用测试集样本对模型的性能进行验证,具体步骤如下:
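1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target lung cancer tissue-specific methylation marker, w is the coefficient of each marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the data of the target methylation sites in the training-set samples and TrainPheno is the phenotype of the training-set samples (lung cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation sites in the test-set samples and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is lung cancer.
4. Compute the model's AUC and, at the determined threshold, the sensitivity, specificity, accuracy and other metrics.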
本实施例中单个肺癌组织特异性甲基化标志物逻辑回归模型的效果见表2.3,从该表中可看出,所有的肺癌组织特异性甲基化标志物在测试集和训练集都可以达到0.67以上的AUC和0.58以上的准确率,都是较好的肺癌组织特异性标志物,其中表现优异的标志物如Seq ID NO:132,Seq ID NO:111,Seq ID NO:129都可以在测试集中80%以上的特异性下达到75%以上的敏感性,整体准确性达到80%以上。
表2.3单个肺癌组织特异性甲基化标志物逻辑回归模型的表现


实施例2.3:所有目标肺癌组织特异性甲基化标志物的机器学习模型
本实施例使用所有的48个肺癌组织特异性甲基化标志物的甲基化水平构建了逻辑回归的机器学习模型,用以从多个癌种数据中准确区分出肺癌的样本。具体的步骤与实施例2.2一致,只是相关样本带入了所有48个目标甲基化标志物的数据。具体如下:
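1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target methylation markers, w is the coefficient of each methylation marker, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data (methylation haplotype frequencies) and TrainPheno is the phenotype of the training-set samples (lung cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data (methylation haplotype frequencies) and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is lung cancer.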
训练集和测试集中模型预测分值分布见图15,从图中可看出肺癌和其它癌种样本模型分值都具有显著的差异(wilcox test:P<=0.05)。ROC曲线见图16,在测试集中,肺癌与其它癌种区分的AUC达到了0.903,设置阈值为0.336,大于该值则预测为肺癌,反之预测为其它癌种,在特异性为94.7%时,敏感性达到了80.0%,样本整体预测的准确率达到了85.0%,可以很好地从7种癌症样本中区分出肺癌样本。
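The text states that the threshold is determined from the training-set samples but does not say how. One common choice, shown here purely as an assumption, is to take the lowest training-set score at which a target specificity is reached; the function name and the toy scores are hypothetical.

```python
import numpy as np


def threshold_at_specificity(scores, labels, target_specificity):
    """Smallest observed score t such that calling `score > t` positive
    keeps at least `target_specificity` of the negatives negative."""
    scores = np.asarray(scores, dtype=float)
    negatives = scores[np.asarray(labels) == 0]
    for candidate in np.sort(np.unique(scores)):
        specificity = np.mean(negatives <= candidate)
        if specificity >= target_specificity:
            return float(candidate)
    return float(scores.max())


train_scores = [0.8, 0.7, 0.5, 0.1, 0.2, 0.3, 0.4, 0.6]
train_labels = [1, 1, 1, 0, 0, 0, 0, 0]
threshold = threshold_at_specificity(train_scores, train_labels, 0.8)
```

With these toy scores, four of the five negatives lie at or below 0.4, so 0.4 is the first candidate reaching 80% specificity, and all three positives score above it. Other criteria, such as maximizing Youden's index, would be equally compatible with the description.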
实施例2.4:肺癌组织特异性甲基化标志物组合1机器学习模型
为了验证相关肺癌组织特异性甲基化标志物组合的效果,本实施例从所有48个肺癌组织特异性甲基化标志物中随机选取了一共10个肺癌组织特异性甲基化标志物Seq ID NO:92,Seq ID NO:95,Seq ID NO:99,Seq ID NO:103,Seq ID NO:112,Seq ID NO:76,Seq ID NO:126,Seq ID NO:128,Seq ID NO:133,Seq ID NO:135的甲基化水平的数据构建新的机器学习模型。
机器学习模型构建的方法也同实施例2.2一致,但相关样本只使用了该实施例中的10个肺癌组织特异性甲基化标志物的数据,该模型在训练集和测试集中的模型得分见图17,该模型ROC曲线见图18。可看出该模型在训练集和测试集中,肺癌样本分值同其他癌种分值具有显著差异(wilcox test:P<=0.05),该模型测试集AUC达到了0.895,阈值设成0.226时,大于该预测值为肺癌,小于该预测值为其他癌种,特异性为88.7%时,敏感性达到了80.0%,整体的准确率达到了87.7%,说明了该组合模型良好的性能。
实施例2.5:肺癌组织特异性甲基化标志物组合2机器学习模型
该实施例使用另一肺癌组织特异性甲基化标志物组合:Seq ID NO:112,Seq ID NO:124,Seq ID NO:128,Seq ID NO:130,Seq ID NO:133一共5个肺癌组织特异性甲基化标志物进行机器学习模型的构建。
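The model is built in the same way as in Example 2.2, except that only the data of the 5 markers of this example are used. The model scores in the training and test sets are shown in Figure 19 and the ROC curve in Figure 20. In both sets, lung cancer samples score significantly higher than other cancer types (Wilcoxon test: P<=0.05). With the threshold set to 0.253, sensitivity in the test set reached 75.0% at a specificity of 95.4% and overall accuracy reached 93.0%, so this model also distinguishes lung cancer from other cancer types well.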
本申请从7个癌种的甲基化NGS测序数据中筛选出了48个肺癌特异性的甲基化标志物,根据这些甲基化标志物的甲基化水平数据构建的机器学习模型可以从7个癌种的数据中很好地区分出肺癌的样本,这些甲基化标志物都是良好的肺癌组织特异性的甲基化标志物,对泛癌种早筛过程中肺癌的组织溯源提供了重要的参考。
本文中使用的序列:





实施例3.1:甲基化靶向测序筛选肝癌特异性的甲基化位点
发明人收集了总计494个各个癌种的患者,所有入组患者签署知情同意书。将这些样本按照一定的比例分为训练集和测试集,其中训练集用于下述机器学习模型的构建,测试集用于模型的性能测试,样本信息见下表3.1,训练集中肝癌样本总数为104个,测试集中肝癌样本总数为59个。
表3.1各个癌种血浆样本数量统计表
通过申请人自主研发的MethylTitanTM的方法获得目标样本血浆cfDNA的甲基化测序数据,鉴别出其中的DNA甲基化分类标志物。过程如下:
1、血浆cfDNA样本的提取
采用streck血液收集管收集患者2ml全血样本,及时离心分离血浆(3天内),转运至实验室后,采用QIAGEN QIAamp Circulating Nucleic Acid Kit试剂盒根据说明书提取cfDNA。
2.Illumina测序及数据预处理
a)文库用Illumina Nextseq 500测序仪进行双端测序。
b)Pear(v0.6.0)软件将Illumina Hiseq X10/Nextseq 500/Novaseq测序仪下机的双端150bp测序的同一片段双端测序数据合并成一条序列,最短重叠长度20bp,合并之后最短30bp。
c)使用Trim_galore v 0.6.0、cutadapt v1.8.1软件对合并后的测序数据进行去接头处理。在序列的5’端去除接头序列为“AGATCGGAAGAGCAC”,并去除两端测序质量值低于20的碱基。
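The quality-trimming part of step c) can be illustrated with a deliberately simplified sketch that strips bases with a quality below 20 from both ends of a read. Trim_galore and cutadapt use a more elaborate running-sum algorithm and also handle adapter removal, so the function below is a hypothetical stand-in for illustration only.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Remove leading and trailing bases whose quality is below min_q."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


trimmed_seq, trimmed_quals = trim_low_quality_ends(
    "ACGTAC", [10, 30, 35, 40, 15, 5]
)
```

Here the first base and the last two bases fall below the cutoff, so the read is reduced to "CGT"; low-quality bases in the interior of a read are left untouched by this simple rule.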
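3. Alignment of sequencing data
The reference genome used herein is from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
a) First, Bismark is used to convert HG19 in silico from cytosine to thymine (CT) and from guanine to adenine (GA), and a Bowtie2 index is built for each converted genome.
b) The output data of the Illumina Nextseq 500 sequencer are likewise CT- and GA-converted.
c) Bowtie2 is used to align the converted reads to the converted HG19 reference genomes, with a minimum seed length of 20 and no mismatches allowed in the seed.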
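4. Calculation of methylation haplotype frequency (MHF)
For each CpG site in each HG19 target region, the methylation state of the site is obtained from the alignment results above. Nucleotide numbering of sites herein corresponds to HG19 nucleotide positions. A target methylation region may contain several methylation haplotypes, and the value must be calculated for every methylation haplotype in the target region. An example of the MHF formula is:
$$MHF_{i,h} = \frac{N_{i,h}}{N_{i}}$$
where i denotes the target methylation region, h denotes the target methylation haplotype, $N_{i}$ denotes the number of reads located in the target methylation region, and $N_{i,h}$ denotes the number of reads containing the target methylation haplotype.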
5、甲基化数据矩阵
a)将训练集和测试集的各个样本的甲基化测序数据(甲基化单倍型频率)分别合并成数据矩阵,对每个深度低于200的位点做缺失值处理。
b)去除缺失值比例高于10%的位点。
c)对于数据矩阵的缺失值,利用KNN算法进行缺失数据插补。
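6. Identifying liver cancer tissue-specific methylation markers from the training-set samples
a) For each methylation haplotype marker, the AUC for liver cancer versus other cancer types is computed in the training set and the markers are ranked from highest to lowest; markers that separate liver cancer from other cancer types well are selected as candidate markers;
b) A logistic regression model is built on the training set from the methylation markers selected in the previous step, and its performance is then validated on the test-set samples. This step is based mainly on the LogisticRegression function in the linear_model module of the python3 sklearn package. The specific steps are:
1. Standardize the training-set data with StandardScaler and save the fitted transformation, x*=(x-μ)/σ, where μ is the mean of all sample data and σ is the standard deviation of all sample data;
2. Feed the standardized data to the LogisticRegression function to train the logistic regression model;
3. Apply the saved standardization to the test-set data;
4. Apply the trained logistic regression model to the test-set samples.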
筛选出的肝癌组织特异性的甲基化标志物具体见表3.2。
这些甲基化标志物在肝癌与其他6种癌种中的甲基化水平如下表3.2和图21,图22所示:这些甲基化标志物在训练集和测试集中肝癌与其它癌种相比都具有显著性的差异(u检验p值小于0.05),且甲基化水平也具有较大差别。
表3.2在训练集和测试集中甲基化标志物在肝癌与其他6种癌种中的甲基化水平均值


根据上表可知,以单个肝癌甲基化标志物Seq ID NO:137为例查看该标志物在七个癌种中甲基化水平在训练集和测试集中的分布分别如图23和图24所示,可看出该肝癌标志物的甲基化水平在肝癌中相比其它癌种都具有显著性的差异(wilcox test:P<=0.05),是良好的肝癌组织特异性甲基化标志物。类似地,其他肝癌甲基化标志物也是良好的肝癌组织特异性甲基化标志物。
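The significance statements above rest on a rank-based two-group comparison (referred to in the text as the U test or Wilcoxon test). A minimal sketch using SciPy's `mannwhitneyu` is given below; the synthetic methylation levels, group sizes and seed are assumptions of this example.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)

# Hypothetical methylation levels of one marker in the two groups.
target_cancer = rng.normal(0.7, 0.05, 20)   # e.g. liver cancer samples
other_cancers = rng.normal(0.3, 0.05, 40)   # samples of the other cancer types

stat, p_value = mannwhitneyu(target_cancer, other_cancers, alternative="two-sided")
significant = p_value < 0.05
```

When many markers are screened this way, a multiple-testing correction would normally be considered; the text does not state whether one was applied.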
实施例3.2:单个肝癌甲基化标志物判别性能
为了验证单个肝癌甲基化标志物的区分肝癌与其它6个癌种的潜力,使用单个肝癌甲基化标志物的甲基化水平数据在实施例3.1训练集数据中训练模型,并使用测试集样本对模型的性能进行验证,具体步骤如下:
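1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target marker, w is the coefficient of each marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the data of the target methylation sites in the training-set samples and TrainPheno is the phenotype of the training-set samples (liver cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation sites in the test-set samples and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is liver cancer.
4. Compute the model's AUC and, at the determined threshold, the sensitivity, specificity, accuracy and other metrics.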
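The performance of the logistic regression model for each single liver cancer methylation marker in this example is shown in Table 3.3. All liver cancer methylation markers reach an AUC above 0.76 and an accuracy above 0.70 in both the test and training sets, and all are good liver cancer tissue-specific markers. The best-performing liver cancer markers, such as Seq ID NO:156, Seq ID NO:145 and Seq ID NO:150, reach a sensitivity above 83% at a specificity of about 80%, with an overall accuracy of about 80%.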
表3.3单个肝癌甲基化标志物逻辑回归模型的表现


实施例3.3:所有目标肝癌甲基化标志物的机器学习模型
本实施例使用所有的37个肝癌甲基化标志物的甲基化水平构建了逻辑回归的机器学习模型,用以从多个癌种数据中准确区分出肝癌的样本。具体的步骤与实施例3.2一致,只是相关数据带入了所有37个目标肝癌甲基化标志物的数据。具体步骤如下:
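1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target methylation markers, w is the coefficient of each methylation marker, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data (methylation haplotype frequencies) and TrainPheno is the phenotype of the training-set samples (liver cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data (methylation haplotype frequencies) and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is liver cancer.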
训练集和测试集中模型预测分值分布见图25,从图中可看出肝癌和其它癌种样本模型分值都具有显著的差异(wilcox test:P<=0.05)。ROC曲线见图26,在测试集中,肝癌与其它癌种区分的AUC达到了0.906,设置阈值为0.297,大于该值则预测为肝癌,反之预测为其它癌种,在特异性为91.5%时,敏感性达到了76.3%,样本整体预测的准确率达到了86.1%,可以很好地从7种癌症样本中区分出肝癌样本。
实施例3.4:肝癌甲基化标志物组合1机器学习模型
为了验证相关标志物组合的效果,本实施例从所有37个肝癌甲基化标志物中随机选取了一共9个肝癌甲基化标志物Seq ID NO:18,Seq ID NO:143,Seq ID NO:23,Seq ID NO:147,Seq ID NO:150,Seq ID NO:117,Seq ID NO:153,Seq ID NO:156,Seq ID NO:157的甲基化水平的数据构建新的机器学习模型。
机器学习模型构建的方法也同实施例3.2一致,但相关样本只使用了该实施例中的9个肝癌甲基化标志物的数据,该模型在训练集和测试集中的模型得分见图27,该模型ROC曲线见图28。可看出该模型在训练集和测试集中,肝癌样本分值同其他癌种分值具有显著差异(wilcox test:P<=0.05),该模型测试集AUC达到了0.955,阈值设成0.265时,大于该值预测为肝癌,小于该值预测为其他癌种,特异性为93.4%时,敏感性达到了76.3%,整体的准确率达到了87.3%,说明了该组合模型良好的性能。
实施例3.5:肝癌甲基化标志物组合2机器学习模型
该实施例使用另一肝癌甲基化标志物组合:Seq ID NO:138,Seq ID NO:143,Seq ID NO:23,Seq ID NO:145,Seq ID NO:150,Seq ID NO:151,Seq ID NO:152,Seq ID NO:125,Seq ID NO:156,Seq ID NO:132一共10个肝癌甲基化标志物进行机器学习模型的构建。
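The model is built in the same way as in Example 3.2, except that only the data of the 10 liver cancer methylation markers of this example are used. The model scores in the training and test sets are shown in Figure 29 and the ROC curve in Figure 30. In both sets, liver cancer samples score significantly higher than other cancer types (Wilcoxon test: P<=0.05). With the threshold set to 0.279, sensitivity reached 74.6% at a specificity of 91.5% and overall accuracy reached 85.5%, so this model also distinguishes liver cancer from other cancer types well.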
本申请从7个癌种的甲基化NGS测序数据中筛选出了37个肝癌特异性的甲基化标志物,根据这些甲基化标志物的甲基化水平数据构建的机器学习模型可以从7个癌种的数据中很好地区分出肝癌的样本,这些甲基化标志物都是良好的肝癌组织特异性的甲基化标志物,对泛癌种早筛过程中肝癌的组织溯源提供了重要的参考。




实施例4.1:甲基化靶向测序筛选乳腺癌特异性的甲基化位点
发明人收集了总计541个各个癌种的患者,所有入组患者签署知情同意书。将这些样本按照一定的比例分为训练集和测试集,其中训练集用于下述机器学习模型的构建,测试集用于模型的性能测试,样本信息见下表4.1,训练集中乳腺癌样本总数为37个,测试集中乳腺癌样本总数为17个。
表4.1各个癌种血浆样本数量统计表

通过MethylTitan的方法获得目标样本血浆cfDNA的甲基化测序数据,鉴别出其中的DNA甲基化分类标志物。过程如下:
1、血浆cfDNA样本的提取
采用streck血液收集管收集患者2ml全血样本,及时离心分离血浆(3天内),转运至实验室后,采用QIAGEN QIAamp Circulating Nucleic Acid Kit试剂盒根据说明书提取cfDNA。
2、测序及数据预处理
a)文库用Illumina Nextseq 500测序仪进行双端测序。
b)Pear(v0.6.0)软件将Illumina Hiseq X10/Nextseq 500/Novaseq测序仪下机的双端150bp测序的同一片段双端测序数据合并成一条序列,最短重叠长度20bp,合并之后最短30bp。
c)使用Trim_galore v 0.6.0、cutadapt v1.8.1软件对合并后的测序数据进行去接头处理。在序列的5’端去除接头序列为“AGATCGGAAGAGCAC”,并去除两端测序质量值低于20的碱基。
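3. Alignment of sequencing data
The reference genome used herein is from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
a) First, Bismark is used to convert HG19 in silico from cytosine to thymine (CT) and from guanine to adenine (GA), and a Bowtie2 index is built for each converted genome.
b) The output data of the Illumina Nextseq 500 sequencer are likewise CT- and GA-converted.
c) Bowtie2 is used to align the converted reads to the converted HG19 reference genomes, with a minimum seed length of 20 and no mismatches allowed in the seed.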
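4. Calculation of methylation haplotype frequency (MHF)
For each CpG site in each HG19 target region, the methylation state of the site is obtained from the alignment results above. Nucleotide numbering of sites herein corresponds to HG19 nucleotide positions. A target methylation region may contain several methylation haplotypes, and the value must be calculated for every methylation haplotype in the target region. An example of the MHF formula is:
$$MHF_{i,h} = \frac{N_{i,h}}{N_{i}}$$
where i denotes the target methylation region, h denotes the target methylation haplotype, $N_{i}$ denotes the number of reads located in the target methylation region, and $N_{i,h}$ denotes the number of reads containing the target methylation haplotype.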
5、甲基化数据矩阵
a)将训练集和测试集的各个样本的甲基化测序数据(甲基化单倍型频率)分别合并成数据矩阵,对每个深度低于200的位点做缺失值处理。
b)去除缺失值比例高于10%的位点。
c)对于数据矩阵的缺失值,利用KNN算法进行缺失数据插补。
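6. Identifying breast cancer tissue-specific methylation markers from the training-set samples
a) For each methylation haplotype marker, the AUC for breast cancer versus other cancer types is computed in the training set and the markers are ranked from highest to lowest; markers that separate breast cancer from other cancer types well are selected as candidate markers;
b) A logistic regression model is built on the training set from the methylation markers selected in the previous step, and its performance is then validated on the test-set samples. This step is based mainly on the LogisticRegression function in the linear_model module of the python3 sklearn package. The specific steps are:
1. Standardize the training-set data with StandardScaler and save the fitted transformation (x*=(x-μ)/σ, where μ is the mean of all sample data and σ is the standard deviation of all sample data);
2. Feed the standardized data to the LogisticRegression function to train the logistic regression model;
3. Apply the saved standardization to the test-set data;
4. Apply the trained logistic regression model to the test-set samples.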
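The breast cancer tissue-specific methylation markers screened out are listed in Table 4.2. The methylation levels of these markers in breast cancer and in the other 6 cancer types are shown in Table 4.2 below and in Figures 31 and 32. In both the training and test sets, these methylation markers differ significantly between breast cancer and other cancer types (U test, p value below 0.05), and their methylation levels also differ considerably.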
表4.2在训练集和测试集中甲基化标志物在乳腺癌与其他6种癌种中的甲基化水平均值


以单个甲基化标志物Seq ID NO:21为例查看该标志物在七个癌种中甲基化水平在训练集和测试集中的分布分别如图33和图34所示,可看出该标志物的甲基化水平在乳腺癌中相比其它6个癌种都具有显著性的差异(wilcox test:P<=0.05),是良好的乳腺癌组织特异性甲基化标志物。
实施例4.2:单个甲基化标志物判别性能
为了验证单个甲基化标志物的区分乳腺癌与其它6个癌种的潜力,使用单个甲基化标志物的甲基化水平数据在实施例4.1训练集数据中训练模型,并使用测试集样本对模型的性能进行验证,具体步骤如下:
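1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target marker, w is the coefficient of each marker, b is the intercept, and y is the model prediction score ($w^{T}x$ is the product of each marker's methylation level and its corresponding coefficient; it is a matrix operation, so the transpose T is taken first):
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the data of the target methylation sites in the training-set samples and TrainPheno is the phenotype of the training-set samples (breast cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation sites in the test-set samples and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is breast cancer.
4. Compute the model's AUC and, at the determined threshold, the sensitivity, specificity, accuracy and other metrics.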
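The matrix product mentioned in step 1 can be checked numerically: for a fitted binary scikit-learn `LogisticRegression`, applying the logistic function to the weighted sum of marker levels plus the intercept reproduces `predict_proba(...)[:, 1]`. The synthetic data and seed below are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.6, 0.1, (25, 3)), rng.normal(0.4, 0.1, (25, 3))])
y = np.array([1] * 25 + [0] * 25)
model = LogisticRegression().fit(X, y)

w = model.coef_[0]        # one coefficient per marker
b = model.intercept_[0]   # intercept

# Weighted sum of marker levels plus intercept, passed through the logistic function.
manual = 1.0 / (1.0 + np.exp(-(X @ w + b)))
from_sklearn = model.predict_proba(X)[:, 1]
```

The two arrays agree to floating-point precision, which confirms that the reported prediction score is the logistic transform of the linear combination of marker levels.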
本实施例中单个甲基化标志物逻辑回归模型的效果见表4.3,从该表中可看出,所有的甲基化标志物的在测试集和训练集都可以达到0.70以上的AUC和0.73以上的准确率,都是较好的乳腺癌组织特异性标志物,其中表现优异的标志物如Seq ID NO:31,Seq ID NO:22都可以在测试集中80%左右的特异性下达到70%以上的敏感性,AUC达到了0.85左右,整体准确性达到80%左右。
表4.3单个甲基化标志物逻辑回归模型的表现


实施例4.3:所有目标甲基化标志物的机器学习模型
本实施例使用所有的51个甲基化标志物的甲基化水平构建了逻辑回归的机器学习模型,用以从多个癌种数据中准确区分出乳腺癌的样本。具体的步骤与实施例4.2一致,只是相关样本带入了所有51个目标甲基化标志物的数据。具体步骤如下:
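1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target methylation markers, w is the coefficient of each methylation marker, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data (methylation haplotype frequencies) and TrainPheno is the phenotype of the training-set samples (breast cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data (methylation haplotype frequencies) and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is breast cancer.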
训练集和测试集中模型预测分值分布见图35,从图中可看出乳腺癌和其它癌种样本模型分值都具有显著的差异(wilcox test:P<=0.05)。ROC曲线见图36,在测试集中,乳腺癌与其它癌种区分的AUC达到了0.921,设置阈值为0.178,大于该值则预测为乳腺癌,反之预测为其它癌种,在特异性为90.4%时,敏感性达到了85.7%,样本整体预测的准确率达到了89.8%,可以很好地从7种癌症样本中区分出乳腺癌样本。
实施例4.4:甲基化标志物组合1机器学习模型
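To verify the performance of marker combinations, this example randomly selected 8 of the 51 methylation markers, namely Seq ID NO:16, Seq ID NO:20, Seq ID NO:22, Seq ID NO:31, Seq ID NO:32, Seq ID NO:36, Seq ID NO:48 and Seq ID NO:51, and used their methylation-level data to build a new machine learning model.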
机器学习模型构建的方法也同实施例4.2一致,但相关样本只使用了该实施例中的8个标志物的数据,该模型在训练集和测试集中的模型得分见图37,该模型ROC曲线见图38。可看出该模型在训练集和测试集中,乳腺癌样本分值同其他癌种分值具有显著差异(wilcox test:P<=0.05),该模型测试集AUC达到了0.893,阈值设成0.143时,大于该值预测为乳腺癌,小于该值预测为其他癌种,特异性为88.6%时,敏感性达到了66.7%,整体的准确率达到了86.1%,说明了该组合模型良好的性能。
实施例4.5:甲基化标志物组合2机器学习模型
该实施例使用另一甲基化标志物组合:Seq ID NO:5,Seq ID NO:11,Seq ID NO:14,Seq ID NO:27,Seq ID NO:28,Seq ID NO:32,Seq ID NO:45,Seq ID NO:49,Seq ID NO:51一共9个甲基化标志物进行机器学习模型的构建。
该模型构建方法同样与实施例4.2一致,但相关样本只使用了该实施例中的9个标志物的数据。该模型在训练集和测试集中的模型得分见图39,ROC曲线见图40。从图中可看出该模型在训练集和测试集中,乳腺癌样本得分显著高于其它癌种得分(wilcox test:P<=0.05)。测试集中,AUC达到了0.894,阈值设置为0.135时,测试集中在特异性为86.7%时,敏感性达到了90.5%,整体的准确性可达到87.1%,同样可以较好的区分乳腺癌与其它癌种。
本专利从7个癌种的甲基化NGS测序数据中筛选出了51个乳腺癌特异性的甲基化标志物,根据这些甲基化标志物的甲基化水平数据构建的机器学习模型可以从7个癌种的数据中很好地区分出乳腺癌的样本,这些甲基化标志物都是良好的乳腺癌组织特异性的甲基化标志物,对泛癌种早筛过程中乳腺癌的组织溯源提供了重要的参考。






实施例5.1:甲基化靶向测序筛选食管癌/胃癌特异性的甲基化位点
发明人收集了总计424个各个癌种的患者,所有入组患者签署知情同意书。将这些样本按照一定的比例分为训练集和测试集,其中训练集用于下述机器学习模型的构建,测试集用于模型的性能测试,样本信息见下表5.1,将其中食管癌和胃癌归为一类,训练集中该类样本总数为71个,测试集中该类样本总数为40个。
表5.1各个癌种血浆样本数量统计表

通过申请人自主研发的MethylTitanTM的方法获得目标样本血浆cfDNA的甲基化测序数据,鉴别出其中的DNA甲基化分类标志物。过程如下:
1、血浆cfDNA样本的提取
采用streck血液收集管收集患者2ml全血样本,及时离心分离血浆(3天内),转运至实验室后,采用QIAGEN QIAamp Circulating Nucleic Acid Kit试剂盒根据说明书提取cfDNA。
2、Illumina常规测序及数据预处理
a)文库用Illumina Nextseq 500测序仪进行双端测序。
b)Pear(v0.6.0)软件将Illumina Hiseq X10/Nextseq 500/Novaseq测序仪下机的双端150bp测序的同一片段双端测序数据合并成一条序列,最短重叠长度20bp,合并之后最短30bp。
c)使用Trim_galore v 0.6.0、cutadapt v1.8.1软件对合并后的测序数据进行去接头处理。在序列的5’端去除接头序列为“AGATCGGAAGAGCAC”,并去除两端测序质量值低于20的碱基。
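3. Alignment of sequencing data
The reference genome used herein is from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
a) First, Bismark is used to convert HG19 in silico from cytosine to thymine (CT) and from guanine to adenine (GA), and a Bowtie2 index is built for each converted genome.
b) The output data of the Illumina Nextseq 500 sequencer are likewise CT- and GA-converted.
c) Bowtie2 is used to align the converted reads to the converted HG19 reference genomes, with a minimum seed length of 20 and no mismatches allowed in the seed.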
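4. Calculation of methylation haplotype frequency (MHF)
For each CpG site in each HG19 target region, the methylation state of the site is obtained from the alignment results above. Nucleotide numbering of sites herein corresponds to HG19 nucleotide positions. A target methylation region may contain several methylation haplotypes, and the value must be calculated for every methylation haplotype in the target region. An example of the MHF formula is:
$$MHF_{i,h} = \frac{N_{i,h}}{N_{i}}$$
where i denotes the target methylation region, h denotes the target methylation haplotype, $N_{i}$ denotes the number of reads located in the target methylation region, and $N_{i,h}$ denotes the number of reads containing the target methylation haplotype.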
5、甲基化数据矩阵
a)将训练集和测试集的各个样本的甲基化测序数据(甲基化单倍型频率)分别合并成数据矩阵,对每个深度低于200的位点做缺失值处理。
b)去除缺失值比例高于10%的位点。
c)对于数据矩阵的缺失值,利用KNN算法进行缺失数据插补。
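6. Identifying gastric and/or esophageal cancer tissue-specific methylation markers from the training-set samples
a) For each methylation haplotype marker, the AUC for gastric and/or esophageal cancer versus other cancer types is computed in the training set and the markers are ranked from highest to lowest; markers that separate gastric and/or esophageal cancer from other cancer types well are selected as candidate markers;
b) A logistic regression model is built on the training set from the methylation markers selected in the previous step, and its performance is then validated on the test-set samples. This step is based mainly on the LogisticRegression function in the linear_model module of the python3 sklearn package. The specific steps are:
1. Standardize the training-set data with StandardScaler and save the fitted transformation, x*=(x-μ)/σ, where μ is the mean of all sample data and σ is the standard deviation of all sample data;
2. Feed the standardized data to the LogisticRegression function to train the logistic regression model;
3. Apply the saved standardization to the test-set data;
4. Apply the trained logistic regression model to the test-set samples.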
筛选出的胃癌和/或食管癌组织特异性的甲基化标志物见表5.2。这些甲基化标志物在胃癌和/或食管癌与其他5种癌种中的甲基化水平如下表5.2和图41。如图42所示,这些甲基化标志物在训练集和测试集中胃癌和/或食管癌与其它癌种相比都具有显著性的差异(u检验p值小于0.05),且甲基化水平也具有较大差别。
表5.2在训练集和测试集中甲基化标志物在胃癌和/或食管癌与其他5种癌种中的甲基化水平

以单个甲基化标志物Seq ID NO:172为例查看该标志物在七个癌种中甲基化水平在训练集和测试集中的分布分别如图43和图44所示,可看出该标志物的甲基化水平在食管癌和胃癌中相比其它5个癌种都具有显著性的差异(wilcox test:P<=0.05),是良好的食管癌和胃癌组织特异性甲基化标志物。
实施例5.2:单个甲基化标志物判别性能
为了验证单个甲基化标志物的区分食管癌和胃癌与其它5个癌种的潜力,使用单个甲基化标志物的甲基化水平数据在实施例5.1训练集数据中训练模型,并使用测试集样本对模型的性能进行验证,具体步骤如下:
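1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target marker, w is the coefficient of each marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the data of the target methylation sites in the training-set samples and TrainPheno is the phenotype of the training-set samples (esophageal/gastric cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the data of the target methylation sites in the test-set samples and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is esophageal/gastric cancer.
4. Compute the model's AUC and, at the determined threshold, the sensitivity, specificity, accuracy and other metrics.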
本实施例中单个标志物的逻辑回归模型的效果见表5.3。从该表中可看出,所有的标志物在测试集和训练集中都可以达到0.59以上的AUC和0.56以上的准确率,都是较好的食管癌和胃癌组织特异性标志物,其中表现优异的标志物如Seq ID NO:172,Seq ID NO:173,Seq ID NO:184都可以在70%以上的特异性下达到60%的敏感性,准确性达到70%左右。
表5.3单个标志物逻辑回归模型的表现

实施例5.3:所有目标甲基化标志物的机器学习模型
本实施例使用所有的34个甲基化标志物的甲基化水平构建了逻辑回归的机器学习模型,用以从多个癌种数据中准确区分出胃癌和/或食管癌的样本。具体的步骤与实施例5.2一致,只是相关数据带入了所有34个目标甲基化标志物的数据。具体步骤如下:
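1. Use the logistic regression model from the sklearn (V1.0.1) package in Python (V3.9.7): AllModel=LogisticRegression(). The model formula is as follows, where x is the methylation level of the sample's target methylation markers, w is the coefficient of each methylation marker, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$
2. Train with the training-set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the training-set data (methylation haplotype frequencies) and TrainPheno is the phenotype of the training-set samples (esophageal/gastric cancer is 1, other cancer types are 0), and determine the model threshold from the training-set samples.
3. Test with the test-set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test-set data (methylation haplotype frequencies) and TestPred is the model prediction score; the prediction score and the threshold above are used to judge whether a sample is esophageal/gastric cancer.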
训练集和测试集中模型预测分值分布见图45,从图中可看出胃癌和/或食管癌和其它癌种样本模型分值都具有显著的差异(wilcox test:P<=0.05)。ROC曲线见图46。在测试集中,胃癌和/或食管癌与其它癌种区分的AUC达到了0.922,设置阈值为0.346,大于该值则预测为胃癌和/或食管癌,反之预测为其它癌种。在特异性为95.2%时,敏感性达到了75%,样本整体预测的准确率达到了89.7%,可以较好地从7种癌症样本中区分出胃癌和/或食管癌。
实施例5.4:甲基化标志物组合1机器学习模型
为了验证相关标志物组合的效果,本实施例从所有34个甲基化标志物中随机选取了一共7个甲基化标志物Seq ID NO:165,Seq ID NO:167,Seq ID NO:169,Seq ID NO:150,Seq ID NO:172,Seq ID NO:174,Seq ID NO:179的甲基化水平的数据构建新的机器学习模型。
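The machine learning model is built in the same way as in Example 5.2, except that only the data of the 7 markers of this example are used. The model scores in the training and test sets are shown in Figure 47 and the ROC curve in Figure 48. In both sets, the scores of gastric and/or esophageal cancer samples differ significantly from those of other cancer types (Wilcoxon test: P<=0.05). The test-set AUC reached 0.917. With the threshold set to 0.30 (scores above it are predicted as gastric and/or esophageal cancer, scores below it as another cancer type), sensitivity reached 70% at a specificity of 91.4% and overall accuracy reached 85.5%, demonstrating the good performance of this combination model.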
实施例5.5:甲基化标志物组合2机器学习模型
该实施例使用另一甲基化标志物组合:Seq ID NO:143,Seq ID NO:23,Seq ID NO:172,Seq ID NO:174,Seq ID NO:177,Seq ID NO:178,Seq ID NO:180,Seq ID NO:183,Seq ID NO:186一共9个甲基化标志物进行机器学习模型的构建。
该模型构建方法同样与实施例5.2一致,但相关样本只使用了该实施例中的9个标志物的数据。该模型在训练集和测试集中的模型得分见图49,ROC曲线见图50。从图中可看出该模型在训练集和测试集中,胃癌和/或食管癌样本得分显著高于其它癌种得分(wilcox test:P<=0.05),阈值设置为0.285时,在特异性为91.4%时,敏感性达到了62.5%,整体的准确性可达到83.4%,同样可以较好的区分胃癌和/或食管癌与其它癌种。
本申请从7个癌种的甲基化NGS测序数据中筛选出了34个食管癌和胃癌特异性的甲基化标志物,根据这些甲基化标志物的甲基化水平数据构建的机器学习模型可以从7个癌种的数据中较好地区分出胃癌和/或食管癌的样本,这些甲基化标志物都是良好的胃癌和/或食管癌组织特异性的甲基化标志物,对泛癌种早筛过程中胃癌和/或食管癌的组织溯源提供了重要的参考。
本文中使用的标志物的序列:




实施例6.1:甲基化靶向测序筛选胰腺癌特异性的甲基化位点
发明人收集了总计541个各个癌种的患者,所有入组患者签署知情同意书。将这些样本按照一定的比例分为训练集和测试集,其中训练集用于下述机器学习模型的构建,测试集用于模型的性能测试,样本信息见下表6.1,训练集中胰腺癌样本总数为37个,测试集中胰腺癌样本总数为17个。
表6.1各个癌种血浆样本数量统计表
通过申请人自主研发的MethylTitanTM的方法获得目标样本血浆cfDNA的甲基化测序数据,鉴别出其中的DNA甲基化分类标志物。过程如下:
1、血浆cfDNA样本的提取
采用streck血液收集管收集患者2ml全血样本,及时离心分离血浆(3天内),转运至实验室后,采用QIAGEN QIAamp Circulating Nucleic Acid Kit试剂盒根据说明书提取cfDNA。
2、Illumina常规测序及数据预处理
a)文库用Illumina Nextseq 500测序仪进行双端测序。
b)Pear(v0.6.0)软件将Illumina Hiseq X10/Nextseq 500/Novaseq测序仪下机的双端150bp测序的同一片段双端测序数据合并成一条序列,最短重叠长度20bp,合并之后最短30bp。
c)使用Trim_galore v 0.6.0、cutadapt v1.8.1软件对合并后的测序数据进行去接头处理。在序列的5’端去除接头序列为“AGATCGGAAGAGCAC”,并去除两端测序质量值低于20的碱基。
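3. Alignment of sequencing data
The reference genome used herein is from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
a) First, Bismark is used to convert HG19 in silico from cytosine to thymine (CT) and from guanine to adenine (GA), and a Bowtie2 index is built for each converted genome.
b) The output data of the Illumina Nextseq 500 sequencer are likewise CT- and GA-converted.
c) Bowtie2 is used to align the converted reads to the converted HG19 reference genomes, with a minimum seed length of 20 and no mismatches allowed in the seed.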
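4. Calculation of methylation haplotype frequency (MHF)
For each CpG site in each HG19 target region, the methylation state of the site is obtained from the alignment results above. Nucleotide numbering of sites herein corresponds to HG19 nucleotide positions. A target methylation region may contain several methylation haplotypes, and the value must be calculated for every methylation haplotype in the target region. An example of the MHF formula is:
$$MHF_{i,h} = \frac{N_{i,h}}{N_{i}}$$
where i denotes the target methylation region, h denotes the target methylation haplotype, $N_{i}$ denotes the number of reads located in the target methylation region, and $N_{i,h}$ denotes the number of reads containing the target methylation haplotype.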
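5. Methylation data matrix
a) The methylation sequencing data (methylation haplotype fractions) of the samples were merged into one data matrix for the training set and one for the test set, and every site with a depth below 200 was treated as a missing value.
b) Sites with more than 10% missing values were removed.
c) The remaining missing values in the data matrix were imputed with the KNN algorithm.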
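Steps a) to c) can be sketched as follows. The text does not say which KNN implementation or neighbour count was used, so scikit-learn's `KNNImputer` with its default of 5 neighbours is an assumption, as are the function name `build_matrix` and the synthetic input.

```python
import numpy as np
from sklearn.impute import KNNImputer


def build_matrix(mhf, depth, min_depth=200, max_missing=0.10, n_neighbors=5):
    """mhf, depth: (samples x sites) arrays. Returns (imputed matrix, kept-site mask)."""
    x = np.where(depth < min_depth, np.nan, mhf.astype(float))   # a) low depth -> missing
    keep = np.isnan(x).mean(axis=0) <= max_missing               # b) drop sites >10% missing
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(x[:, keep])  # c) KNN imputation
    return imputed, keep


# Synthetic stand-in data: 20 samples x 3 sites.
rng = np.random.default_rng(0)
mhf = rng.uniform(0, 1, size=(20, 3))
depth = np.full((20, 3), 500)
depth[0, 0] = 50      # 5% missing at site 0: kept and imputed
depth[:4, 1] = 50     # 20% missing at site 1: dropped
matrix, kept = build_matrix(mhf, depth)
print(matrix.shape, kept)
```

In practice it is usually safer to choose the retained sites and fit the imputer on the training matrix, then apply the same column mask and the fitted imputer to the test matrix, so that no test-set information leaks into training.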
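6. Identifying pancreatic cancer tissue-specific methylation markers from the training set samples
a) For each methylation haplotype marker, the AUC for pancreatic cancer versus the other cancer types was calculated in the training set and the markers were ranked from highest to lowest; markers that distinguished pancreatic cancer from the other cancer types well were selected as candidate markers;
b) A logistic regression model was built on the training set from the markers selected in the previous step and then validated on the test set samples. This step was based mainly on the LogisticRegression function of the linear_model module of the python3 sklearn package. The specific steps were:
1. Standardize the training data with StandardScaler and save the fitted transformation, x* = (x - μ)/σ, where μ is the mean over all samples and σ is the standard deviation over all samples;
2. Feed the standardized data to the LogisticRegression function to train the logistic regression model;
3. Apply the saved standardization to the test set data;
4. Apply the trained logistic regression model to the test set samples.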
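The ranking in step a) and the four sub-steps of step b) can be sketched with scikit-learn. The data are synthetic, the function names are illustrative, and the AUC cutoff used to pick candidates is not stated in the text, so the sketch simply keeps the top-ranked marker.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler


def rank_markers_by_auc(x_train, y_train):
    """Step a: single-marker AUC on the training set, sorted high to low."""
    aucs = np.array([roc_auc_score(y_train, x_train[:, j])
                     for j in range(x_train.shape[1])])
    order = np.argsort(-aucs)
    return order, aucs[order]


def train_and_score(x_train, y_train, x_test):
    """Step b: standardize on the training set, fit, then score the test set."""
    scaler = StandardScaler().fit(x_train)                 # learns mean and std per marker
    model = LogisticRegression().fit(scaler.transform(x_train), y_train)
    return model.predict_proba(scaler.transform(x_test))[:, 1]


# Synthetic stand-in: 40 samples x 3 markers, only marker 1 is informative.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)          # 1 = pancreatic cancer, 0 = other cancer types
x = rng.normal(0.3, 0.05, size=(40, 3))
x[:, 1] += 0.5 * y
order, aucs = rank_markers_by_auc(x, y)
scores = train_and_score(x[:, order[:1]], y, x[:, order[:1]])
```

Note that a marker that is hypomethylated in the target cancer yields an AUC below 0.5 under this ranking; whether such markers were re-oriented before ranking is not stated in the text.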
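The pancreatic cancer tissue-specific methylation markers identified are listed in Table 6.2. Each marker lies within its target gene or in the region upstream or downstream of that gene, and any single marker or any combination of markers can be used as a pancreatic cancer-specific methylation marker.
The methylation levels of these markers in pancreatic cancer and in the other 6 cancer types are shown in Table 6.2 and Figure 51. As shown in Figure 52, in both the training and test sets the markers differ significantly between pancreatic cancer and the other cancer types (U test, p < 0.05), and the methylation levels themselves also differ substantially.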

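Taking the single methylation marker Seq ID NO:202 as an example, its methylation levels across the seven cancer types in the training and test sets are shown in Figure 53 and Figure 54, respectively. Its methylation level in pancreatic cancer differs significantly from that in each of the other 6 cancer types (Wilcoxon test: P ≤ 0.05), making it a good pancreatic cancer tissue-specific methylation marker.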
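Example 6.2: Discriminative performance of single pancreatic cancer methylation markers
To test the ability of each single pancreatic cancer methylation marker to distinguish pancreatic cancer from the other 6 cancer types, a model was trained on the methylation levels of that single marker in the training set of Example 6.1 and its performance was validated on the test set samples. The specific steps were:
1. Use the logistic regression model in the sklearn (V1.0.1) package in python (V3.9.7): AllModel=LogisticRegression(). The model formula is given below, where x is the methylation level of the target marker in the sample, w is the coefficient of the pancreatic cancer marker, b is the intercept, and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
2. Train on the training set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the data for the target methylation sites in the training set samples and TrainPheno is the phenotype of the training set samples (1 for pancreatic cancer, 0 for other cancer types), and determine the model threshold from the training set samples.
3. Test on the test set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the data for the target methylation sites in the test set samples and TestPred is the model prediction score; use this score and the threshold above to decide whether a sample is pancreatic cancer.
4. Compute the AUC of the model and, at the chosen threshold, its sensitivity, specificity, accuracy and other metrics.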
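Threshold selection and the metrics in step 4 can be sketched as follows. The text does not say how the threshold is derived from the training set, so maximizing Youden's J is an assumption made here for illustration; the toy labels and scores stand in for TrainPheno and predict_proba output and do not reproduce any value in Table 6.3.

```python
import numpy as np
from sklearn.metrics import roc_curve


def choose_threshold(y_train, train_scores):
    """Cutoff on the training set that maximizes Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_train, train_scores, drop_intermediate=False)
    return thresholds[np.argmax(tpr - fpr)]


def evaluate(y_true, scores, threshold):
    """Sensitivity, specificity and accuracy when score >= threshold means positive."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(scores) >= threshold
    tp = np.sum(pred & y_true)
    tn = np.sum(~pred & ~y_true)
    return {
        "sensitivity": tp / y_true.sum(),
        "specificity": tn / (~y_true).sum(),
        "accuracy": (tp + tn) / y_true.size,
    }


# Toy scores (1 = target cancer, 0 = other cancer types).
y_toy = [1, 1, 1, 0, 0, 0, 0]
s_toy = [0.9, 0.8, 0.2, 0.3, 0.1, 0.05, 0.4]
cutoff = choose_threshold(y_toy, s_toy)
metrics = evaluate(y_toy, s_toy, cutoff)
print(cutoff, metrics)
```

`roc_curve` treats a score equal to the threshold as positive, so `evaluate` uses `>=` to stay consistent with it; the examples in the text describe samples scoring above the threshold as positive, which gives the same calls whenever no score equals the cutoff exactly.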
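The performance of the single-marker logistic regression models in this example is shown in Table 6.3. Every pancreatic cancer methylation marker reached an AUC above 0.60 and an accuracy above 0.68 in both the test and training sets, so all of them are good pancreatic cancer tissue-specific markers. The best-performing markers, such as Seq ID NO:194 and Seq ID NO:189, reached a sensitivity above 40% at a specificity above 75% in the test set, with an overall accuracy above 73%.
Table 6.3: Performance of the single-marker logistic regression models for pancreatic cancer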
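Example 6.3: Machine learning model using all target pancreatic cancer methylation markers
In this example, a logistic regression machine learning model was built from the methylation levels of all 36 pancreatic cancer methylation markers in order to accurately pick out pancreatic cancer samples from data for multiple cancer types. The steps were the same as in Example 6.2, except that the data for all 36 target pancreatic cancer methylation markers were used for each sample.
1. Use the logistic regression model in the sklearn (V1.0.1) package in python (V3.9.7): AllModel=LogisticRegression(). The model formula is given below, where x is the vector of methylation levels of the target pancreatic cancer methylation markers in the sample, w is the vector of coefficients of the markers, b is the intercept (the parameters are obtained by training the logistic regression model), and y is the model prediction score:
$$y = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
2. Train on the training set samples: AllModel.fit(TrainData,TrainPheno), where TrainData is the training set data (methylation haplotype fractions) and TrainPheno is the phenotype of the training set samples (1 for pancreatic cancer, 0 for other cancer types), and determine the model threshold from the training set samples.
3. Test on the test set samples: TestPred=AllModel.predict_proba(TestData)[:,1], where TestData is the test set data (methylation haplotype fractions) and TestPred is the model prediction score; use this score and the threshold above to decide whether a sample is pancreatic cancer.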
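The distribution of model prediction scores in the training and test sets is shown in Figure 55; in both sets the scores of pancreatic cancer samples differ significantly from those of the other cancer types (Wilcoxon test: P ≤ 0.05). The ROC curve is shown in Figure 56. In the test set, the AUC for distinguishing pancreatic cancer from the other cancer types was 0.921. With the threshold set to 0.124 (samples scoring above it are predicted as pancreatic cancer, all others as another cancer type), sensitivity was 70.6% at a specificity of 93.5% and overall accuracy was 91.4%, so the model separates pancreatic cancer samples from samples of the 7 cancer types well.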
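The significance test on model scores and the AUC reported above can be sketched with SciPy and scikit-learn. The text names a "wilcox test" without further detail; reading it as the two-sided Wilcoxon rank-sum (Mann-Whitney U) test for two independent groups is an assumption, and the scores below are toy values rather than the values behind Figures 55 and 56.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score


def compare_score_groups(scores, labels):
    """Two-sided rank-sum p-value and AUC for target-cancer vs other-cancer scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    _, p_value = mannwhitneyu(pos, neg, alternative="two-sided")
    return p_value, roc_auc_score(labels, scores)


labels = [1] * 5 + [0] * 5
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.3, 0.2, 0.15, 0.1, 0.05]
p_value, auc = compare_score_groups(scores, labels)
print(p_value < 0.05, auc)
```

The two statistics are closely related: the AUC equals the Mann-Whitney U statistic of the positive group divided by the product of the two group sizes, so a significant rank-sum test and an AUC well above 0.5 describe the same separation.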
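Example 6.4: Machine learning model for pancreatic cancer methylation marker combination 1
To test the performance of marker combinations, this example randomly selected 11 of the 36 pancreatic cancer methylation markers, namely Seq ID NO:190, Seq ID NO:195, Seq ID NO:202, Seq ID NO:203, Seq ID NO:206, Seq ID NO:172, Seq ID NO:210, Seq ID NO:211, Seq ID NO:213, Seq ID NO:154 and Seq ID NO:214, and built a new machine learning model from their methylation levels.
The model was built as in Example 6.3, except that only the data for the 11 pancreatic cancer markers of this example were used. Model scores in the training and test sets are shown in Figure 57, and the ROC curve in Figure 58. In both sets, the scores of pancreatic cancer samples differed significantly from those of the other cancer types (Wilcoxon test: P ≤ 0.05). The model reached an AUC of 0.931 in the test set. With the threshold set to 0.114 (samples scoring above it are predicted as pancreatic cancer, samples below it as another cancer type), sensitivity was 64.7% at a specificity of 92.4% and overall accuracy was 89.8%, indicating good performance for this combination model.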
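Example 6.5: Machine learning model for pancreatic cancer methylation marker combination 2
This example used another combination of pancreatic cancer methylation markers, Seq ID NO:195, Seq ID NO:196, Seq ID NO:199, Seq ID NO:202, Seq ID NO:203, Seq ID NO:210, Seq ID NO:211, Seq ID NO:213, Seq ID NO:154 and Seq ID NO:216 (10 markers in total), to build a machine learning model.
The model was again built as in Example 6.3, except that only the data for the 10 markers of this example were used. Model scores in the training and test sets are shown in Figure 59, and the ROC curve in Figure 60. In both sets, pancreatic cancer samples scored significantly higher than the other cancer types (Wilcoxon test: P ≤ 0.05). In the test set the AUC was 0.909; with the threshold set to 0.111, sensitivity was 58.8% at a specificity of 91.2% and overall accuracy was 88.2%, so this combination also distinguishes pancreatic cancer from the other cancer types well.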
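Restricting the model to a marker combination amounts to selecting the corresponding columns before fitting. A minimal sketch, assuming the matrix columns are labelled by Seq ID; the labels, the random data and the helper name `fit_marker_combination` are illustrative only, and the pipeline bundles the standardization and logistic regression steps described earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_marker_combination(x_train, y_train, marker_ids, combination):
    """Fit standardization + logistic regression on the chosen marker columns only."""
    cols = [marker_ids.index(m) for m in combination]
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(x_train[:, cols], y_train)
    return model, cols


# Hypothetical column labels and random stand-in data (30 samples x 4 markers).
marker_ids = ["Seq ID NO:195", "Seq ID NO:196", "Seq ID NO:199", "Seq ID NO:202"]
combination = ["Seq ID NO:195", "Seq ID NO:202"]
rng = np.random.default_rng(1)
y_train = np.array([0] * 15 + [1] * 15)
x_train = rng.uniform(0, 1, size=(30, 4))
model, cols = fit_marker_combination(x_train, y_train, marker_ids, combination)
test_scores = model.predict_proba(x_train[:5, cols])[:, 1]
```

Keeping the scaler inside the pipeline means the statistics learned on the training set are reused automatically when test samples are scored.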
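From methylation NGS sequencing data for 7 cancer types, the present application screened 36 pancreatic cancer-specific methylation markers. Machine learning models built from the methylation levels of these markers separate pancreatic cancer samples from the data of the 7 cancer types well. All of these markers are good pancreatic cancer tissue-specific methylation markers, and they provide an important reference for tracing pancreatic cancer as the tissue of origin in pan-cancer early screening.
Although a number of embodiments have been described, it is apparent that the basic disclosure and examples can provide other embodiments that use or are encompassed by the markers and methods described herein. It should therefore be understood that the scope of the invention is defined by what can be understood from the disclosure and the appended claims, rather than by the specific examples.
The sequences of the pancreatic cancer methylation markers are as follows: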




Claims (84)

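  1. Use of a reagent or component in the preparation of a kit or device, wherein the kit or device is for (1) distinguishing colorectal cancer patients from patients with non-colorectal cancers, (2) diagnosing or assisting the diagnosis of colorectal cancer, or (3) tracing colorectal cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of a colorectal cancer tissue-specific methylation marker in genomic DNA of a sample, the methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 2.3 kb upstream region and 2.3 kb downstream region of that gene on its chromosome: SFN; GPR3; FCGR1B; FAM150B; RGPD3; NUP210; LMOD3; FOXF2; TBXT; PRR15; ELN; TFPI2; REPIN1; PDLIM2; SDC2; TRAPPC9; TJP2; DIP2C; DDIT4; MRPL23; PAX6; PLXNC1; MLNR; MYO16; TMEM179; GATM; CACNA1H; NLRC5; SHISA6; KCNJ12; PRAC1; MYO15B; CANT1; SALL3; THOP1; ZBTB7A; DNM2; LGALS4; WISP2; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 140 bp-510 bp in length, preferably 200 bp-470 bp.
  2. The use according to claim 1, wherein the non-colorectal cancers or the pan-cancer include lung cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  3. The use according to claim 1 or 2, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID No. 52-90.
  4. The use according to any one of claims 1-3, wherein the reagent or component comprises a reagent or component used in one or more of the following methylation detection methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  5. The use according to any one of claims 1-4, wherein the reagent or component comprises primers and/or probes for detecting the methylation marker, and/or the sample is cells, tissue, a fine-needle aspiration biopsy and/or plasma; preferably, the sample genomic DNA is cell-free DNA in plasma.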
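  6. A method of constructing a prediction model that distinguishes colorectal cancer from other, non-colorectal cancers, comprising:
    (1) obtaining the methylation levels of methylation markers in genomic DNA of colorectal cancer samples and non-colorectal cancer samples as a training set; the methylation markers being selected from the following regions or sites in those regions, each region being one of the following genes together with the 2.3 kb upstream region and 2.3 kb downstream region of that gene on its chromosome: SFN; GPR3; FCGR1B; FAM150B; RGPD3; NUP210; LMOD3; FOXF2; TBXT; PRR15; ELN; TFPI2; REPIN1; PDLIM2; SDC2; TRAPPC9; TJP2; DIP2C; DDIT4; MRPL23; PAX6; PLXNC1; MLNR; MYO16; TMEM179; GATM; CACNA1H; NLRC5; SHISA6; KCNJ12; PRAC1; MYO15B; CANT1; SALL3; THOP1; ZBTB7A; DNM2; LGALS4; WISP2; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 140 bp-510 bp in length, preferably 200 bp-470 bp; preferably, the non-colorectal cancer is lung cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer; and
    (2) constructing a logistic regression machine learning model from the methylation level data of the methylation markers.
  7. The method according to claim 6, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID No. 52-90;
    preferably, the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the genomic DNA is cell-free DNA in plasma.
  8. The method according to claim 6 or 7, wherein step (1) comprises obtaining methylation sequencing data of sample DNA.
  9. The method according to any one of claims 6-8, wherein step (2) comprises using a logistic regression model to obtain a model prediction score; and training with the obtained methylation levels of the methylation markers as the training set, and determining the relevant threshold of the model from the training set samples.
  10. A colorectal cancer prediction model constructed by the method according to any one of claims 6-9.
  11. A device for diagnosing colorectal cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method according to any one of claims 6-9 to construct a colorectal cancer prediction model; and using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as a test set to obtain a model prediction score, and using the prediction score and the threshold to determine whether the sample is colorectal cancer.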
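  12. A kit or device for detecting colorectal cancer tissue-specific methylation markers, comprising a reagent or component for detecting the state and/or level of one or more colorectal cancer tissue-specific methylation markers in genomic DNA from a sample, the colorectal cancer tissue-specific methylation marker being one of the following regions or a site therein, each region comprising one of the following genes together with the 2.3 kb upstream region and 2.3 kb downstream region of that gene on its chromosome: SFN; GPR3; FCGR1B; FAM150B; RGPD3; NUP210; LMOD3; FOXF2; TBXT; PRR15; ELN; TFPI2; REPIN1; PDLIM2; SDC2; TRAPPC9; TJP2; DIP2C; DDIT4; MRPL23; PAX6; PLXNC1; MLNR; MYO16; TMEM179; GATM; CACNA1H; NLRC5; SHISA6; KCNJ12; PRAC1; MYO15B; CANT1; SALL3; THOP1; ZBTB7A; DNM2; LGALS4; or WISP2; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated;
    preferably, the site is 140 bp-510 bp in length, preferably 200 bp-470 bp;
    preferably, the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID No. 52-90.
  13. The kit or device according to claim 12, wherein the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the nucleic acid is cell-free DNA in plasma.
  14. The kit or device according to claim 12 or 13, wherein the reagent or component comprises a reagent or component used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry;
    preferably, the reagent comprises oligonucleotides for detecting the methylation marker; preferably, the oligonucleotides are primers and/or probes;
    preferably, the primers are primers for detecting the methylation level/state of a site by methylation sequencing, or PCR primers for amplifying one or more methylation sites;
    preferably, the reagent comprises bisulfite and derivatives thereof, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or methylation-insensitive restriction enzymes, digestion buffer, fluorescent dyes, fluorescence quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the control being the aforementioned specific methylation marker from a normal subject or from a patient with a non-colorectal cancer; preferably, the non-colorectal cancer is lung cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.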
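  15. Use of a reagent or component in the preparation of a kit or device, wherein the kit or device is for (1) distinguishing lung cancer patients from patients with non-lung cancers, (2) diagnosing or assisting the diagnosis of lung cancer, or (3) tracing lung cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of a lung cancer tissue-specific methylation marker in genomic DNA of a sample, the methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 2.2 kb upstream region and 2.2 kb downstream region of that gene on its chromosome: ARHGEF16; CASZ1; MAP3K6; TRIM58; ARHGEF33; PSD4; HOXD4; SLC12A8; DGKG; TERT; NR2F1; PCDHGC5; KCNMB1; FOXC1; HIST1H4F; TYW1; LRRC4; DGKI; PDLIM2; RHOBTB2; TMEM75; OPLAH; NR5A1; SPAG6; WAPAL; BTBD16; DPYSL4; TTC40; ADAM8; SLC22A11; CPT1A; B4GALNT1; FBRSL1; XPO4; TFDP1; GCH1; TMEM179; ITPKA; SOX8; SLC9A3R2; SEPT-9; MBP; NFATC1; DNM2; RASAL3; TAF4; NTSR1; SLC17A9; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 120 bp-500 bp in length, preferably 200 bp-480 bp.
  16. The use according to claim 15, wherein the non-lung cancers or the pan-cancer include colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  17. The use according to claim 15 or 16, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 24, 65, 76 and 91-135.
  18. The use according to any one of claims 15-17, wherein the reagent or component comprises a reagent or component used in one or more of the following methylation detection methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  19. The use according to any one of claims 15-18, wherein the reagent or component comprises primers and/or probes for detecting the methylation marker, and/or the sample is cells, tissue, a fine-needle aspiration biopsy and/or plasma; preferably, the sample genomic DNA is cell-free DNA in plasma.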
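  20. A method of constructing a prediction model that distinguishes lung cancer from other, non-lung cancers, comprising:
    (1) obtaining the methylation levels of methylation markers in genomic DNA of lung cancer samples and non-lung cancer samples as a training set; the methylation markers being selected from the following regions or sites in those regions, each region being one of the following genes together with the 2.2 kb upstream region and 2.2 kb downstream region of that gene on its chromosome: ARHGEF16; CASZ1; MAP3K6; TRIM58; ARHGEF33; PSD4; HOXD4; SLC12A8; DGKG; TERT; NR2F1; PCDHGC5; KCNMB1; FOXC1; HIST1H4F; TYW1; LRRC4; DGKI; PDLIM2; RHOBTB2; TMEM75; OPLAH; NR5A1; SPAG6; WAPAL; BTBD16; DPYSL4; TTC40; ADAM8; SLC22A11; CPT1A; B4GALNT1; FBRSL1; XPO4; TFDP1; GCH1; TMEM179; ITPKA; SOX8; SLC9A3R2; SEPT-9; MBP; NFATC1; DNM2; RASAL3; TAF4; NTSR1; SLC17A9; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 120 bp-500 bp in length, preferably 200 bp-480 bp; preferably, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer; and
    (2) constructing a logistic regression machine learning model from the methylation level data of the methylation markers.
  21. The method according to claim 20, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 24, 65, 76 and 91-135;
    preferably, the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the genomic DNA is cell-free DNA in plasma.
  22. The method according to claim 20 or 21, wherein step (1) comprises obtaining methylation sequencing data of sample DNA.
  23. The method according to any one of claims 20-22, wherein step (2) comprises establishing a logistic regression model to obtain a model prediction score; and training with the obtained methylation levels of the methylation markers as the training set, and determining the relevant threshold of the model from the training set samples.
  24. A lung cancer prediction model constructed by the method according to any one of claims 20-23.
  25. A device for diagnosing lung cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method according to any one of claims 20-23 to construct a lung cancer prediction model; and using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as a test set to obtain a model prediction score, and using the prediction score and the threshold to determine whether the sample is lung cancer, a score above the threshold being predicted as lung cancer and any other score as another cancer type.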
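  26. A kit or device for detecting lung cancer tissue-specific methylation markers, comprising a reagent or component for detecting the state and/or level of one or more lung cancer tissue-specific methylation markers in genomic DNA from a sample, the lung cancer tissue-specific methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 2.2 kb upstream region and 2.2 kb downstream region of that gene on its chromosome: ARHGEF16; CASZ1; MAP3K6; TRIM58; ARHGEF33; PSD4; HOXD4; SLC12A8; DGKG; TERT; NR2F1; PCDHGC5; KCNMB1; FOXC1; HIST1H4F; TYW1; LRRC4; DGKI; PDLIM2; RHOBTB2; TMEM75; OPLAH; NR5A1; SPAG6; WAPAL; BTBD16; DPYSL4; TTC40; ADAM8; SLC22A11; CPT1A; B4GALNT1; FBRSL1; XPO4; TFDP1; GCH1; TMEM179; ITPKA; SOX8; SLC9A3R2; SEPT-9; MBP; NFATC1; DNM2; RASAL3; TAF4; NTSR1; SLC17A9; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 120 bp-500 bp in length, preferably 200 bp-480 bp;
    preferably, the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 24, 65, 76 and 91-135.
  27. The kit or device according to claim 26, wherein the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the nucleic acid is cell-free DNA in plasma.
  28. The kit or device according to claim 26 or 27, wherein the reagent or component comprises a reagent or component used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry;
    preferably, the reagent comprises oligonucleotides for detecting the methylation marker; preferably, the oligonucleotides are primers and/or probes;
    preferably, the primers are primers for detecting the methylation level/state of a site by methylation sequencing, or PCR primers for amplifying one or more methylation sites;
    preferably, the reagent comprises bisulfite and derivatives thereof, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or methylation-insensitive restriction enzymes, digestion buffer, fluorescent dyes, fluorescence quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the control being the aforementioned specific methylation marker from a normal subject or from a patient with a non-lung cancer; preferably, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.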
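  29. Use of a reagent or component in the preparation of a kit or device, wherein the kit or device is for (1) distinguishing liver cancer patients from patients with non-liver cancers, (2) diagnosing or assisting the diagnosis of liver cancer, or (3) tracing liver cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of a liver cancer tissue-specific methylation marker in genomic DNA of a sample, the methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 3 kb upstream region and 3 kb downstream region of that gene on its chromosome: TAL1; TRIM58; LBH; ABCG5; PAX8; DLEC1; AMIGO3; RASSF1; CLDN11; SLC2A9; SLC9A3; CXXC5; FOXC1; HIST1H4F; TRIM40; HOXA13; CRHR2; AGPAT6; TCF24; OPLAH; GPAM; ADAM8; GRASP; B4GALNT1; STX2; ATL1; ITPKA; PIF1; ZFHX3; C1QL1; SEPT-9; KCTD1; PIP5K1C; RASAL3; CYP2F1; WISP2; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 100 bp-550 bp in length, preferably 150 bp-480 bp.
  30. The use according to claim 29, wherein the non-liver cancers or the pan-cancer include colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.
  31. The use according to claim 29 or 30, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 7, 18, 23, 29, 41, 90, 94, 104, 117, 120, 125, 128, 132 and 136-159.
  32. The use according to any one of claims 29-31, wherein the reagent or component comprises a reagent or component used in one or more of the following methylation detection methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  33. The use according to any one of claims 29-32, wherein the reagent or component comprises primers and/or probes for detecting the methylation marker, and/or the sample is cells, tissue, a fine-needle aspiration biopsy and/or plasma; preferably, the sample genomic DNA is cell-free DNA in plasma.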
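  34. A method of constructing a prediction model that distinguishes liver cancer from other, non-liver cancers, comprising:
    (1) obtaining the methylation levels of methylation markers in genomic DNA of liver cancer samples and non-liver cancer samples as a training set; the methylation markers being selected from the following regions or sites in those regions, each region being one of the following genes together with the 2 kb upstream region and 2 kb downstream region of that gene on its chromosome: TAL1; TRIM58; LBH; ABCG5; PAX8; DLEC1; AMIGO3; RASSF1; CLDN11; SLC2A9; SLC9A3; CXXC5; FOXC1; HIST1H4F; TRIM40; HOXA13; CRHR2; AGPAT6; TCF24; OPLAH; GPAM; ADAM8; GRASP; B4GALNT1; STX2; ATL1; ITPKA; PIF1; ZFHX3; C1QL1; SEPT-9; KCTD1; PIP5K1C; RASAL3; CYP2F1; WISP2; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 100 bp-550 bp in length, preferably 150 bp-480 bp; preferably, the non-liver cancer is colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer; and
    (2) constructing a logistic regression machine learning model from the methylation level data of the methylation markers.
  35. The method according to claim 34, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 7, 18, 23, 29, 41, 90, 94, 104, 117, 120, 125, 128, 132 and 136-159;
    preferably, the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the genomic DNA is cell-free DNA in plasma.
  36. The method according to claim 34 or 35, wherein step (1) comprises obtaining methylation sequencing data of sample DNA.
  37. The method according to any one of claims 34-36, wherein step (2) comprises establishing a logistic regression model to obtain a model prediction score; and training with the obtained methylation levels of the methylation markers as the training set, and determining the relevant threshold of the model from the training set samples.
  38. A liver cancer prediction model constructed by the method according to any one of claims 34-37.
  39. A device for diagnosing liver cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method according to any one of claims 34-37 to construct a liver cancer prediction model; and using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as a test set to obtain a model prediction score, and using the prediction score and the threshold to determine whether the sample is liver cancer, a score above the threshold being predicted as liver cancer and any other score as another cancer type.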
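  40. A kit or device for detecting liver cancer tissue-specific methylation markers, comprising a reagent or component for detecting the state and/or level of one or more liver cancer tissue-specific methylation markers in genomic DNA from a sample, the liver cancer tissue-specific methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 3 kb upstream region and 3 kb downstream region of that gene on its chromosome: TAL1; TRIM58; LBH; ABCG5; PAX8; DLEC1; AMIGO3; RASSF1; CLDN11; SLC2A9; SLC9A3; CXXC5; FOXC1; HIST1H4F; TRIM40; HOXA13; CRHR2; AGPAT6; TCF24; OPLAH; GPAM; ADAM8; GRASP; B4GALNT1; STX2; ATL1; ITPKA; PIF1; ZFHX3; C1QL1; SEPT-9; KCTD1; PIP5K1C; RASAL3; CYP2F1; WISP2; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 100 bp-550 bp in length, preferably 150 bp-480 bp;
    preferably, the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 7, 18, 23, 29, 41, 90, 94, 104, 117, 120, 125, 128, 132 and 136-159.
  41. The kit or device according to claim 40, wherein the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the nucleic acid is cell-free DNA in plasma.
  42. The kit or device according to claim 40 or 41, wherein the reagent or component comprises a reagent or component used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry;
    preferably, the reagent comprises oligonucleotides for detecting the methylation marker; preferably, the oligonucleotides are primers and/or probes;
    preferably, the primers are primers for detecting the methylation level/state of a site by methylation sequencing, or PCR primers for amplifying one or more methylation sites;
    preferably, the reagent comprises bisulfite and derivatives thereof, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or methylation-insensitive restriction enzymes, digestion buffer, fluorescent dyes, fluorescence quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the control being the aforementioned specific methylation marker from a normal subject or from a patient with a non-liver cancer; preferably, the non-liver cancer is colorectal cancer, lung cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.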
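  43. Use of a reagent or component in the preparation of a kit or device, wherein the kit or device is for (1) distinguishing breast cancer patients from patients with non-breast cancers, (2) diagnosing or assisting the diagnosis of breast cancer, or (3) tracing breast cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of a breast cancer tissue-specific methylation marker in genomic DNA of a sample, the methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 2 kb upstream region and 2 kb downstream region of that gene on its chromosome: BARHL2; ALX3; TBX15; C2CD4D; RYR2; LBH; SIX3; SIX2; OTX1; EMX1; LBX2; BCL2L11; PAX8; HOXD1; SATB2; VILL; CLDN11; EPHB3; NKX3-2; KCTD8; PITX1; CXXC5; FOXC1; NRN1; HOXA9; DLX6; MOS; TCF24; CA3; GDF6; FOXD4; PTF1A; TLX1; INA; NKX6-2; PAX6; BCAT1; FAIM2; GRASP; CCNA1; SIX1; PRKCB; SOX9; ST8SIA5; NFIX; EPS8L1; ZIK1; KAL1; ZNF81; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 150 bp-500 bp in length, preferably 200 bp-470 bp.
  44. The use according to claim 43, wherein the non-breast cancers or the pan-cancer include colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or lung cancer.
  45. The use according to claim 43 or 44, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 1-51.
  46. The use according to any one of claims 43-45, wherein the reagent or component comprises a reagent or component used in one or more of the following methylation detection methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  47. The use according to any one of claims 43-46, wherein the reagent or component comprises primers and/or probes for detecting the methylation marker, and/or the sample is cells, tissue, a fine-needle aspiration biopsy and/or plasma; preferably, the sample genomic DNA is cell-free DNA in plasma.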
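  48. A method of constructing a prediction model that distinguishes breast cancer from other, non-breast cancers, comprising:
    (1) obtaining the methylation levels of methylation markers in genomic DNA of breast cancer samples and non-breast cancer samples as a training set; the methylation markers being selected from the following regions or sites in those regions, each region being one of the following genes together with the 2 kb upstream region and 2 kb downstream region of that gene on its chromosome: BARHL2; ALX3; TBX15; C2CD4D; RYR2; LBH; SIX3; SIX2; OTX1; EMX1; LBX2; BCL2L11; PAX8; HOXD1; SATB2; VILL; CLDN11; EPHB3; NKX3-2; KCTD8; PITX1; CXXC5; FOXC1; NRN1; HOXA9; DLX6; MOS; TCF24; CA3; GDF6; FOXD4; PTF1A; TLX1; INA; NKX6-2; PAX6; BCAT1; FAIM2; GRASP; CCNA1; SIX1; PRKCB; SOX9; ST8SIA5; NFIX; EPS8L1; ZIK1; KAL1; ZNF81; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 150 bp-500 bp in length, preferably 200 bp-470 bp; preferably, the non-breast cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or lung cancer; and
    (2) constructing a logistic regression machine learning model from the methylation level data of the methylation markers.
  49. The method according to claim 48, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 1-51;
    preferably, the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the genomic DNA is cell-free DNA in plasma.
  50. The method according to claim 48 or 49, wherein step (1) comprises obtaining methylation sequencing data of sample DNA.
  51. The method according to any one of claims 48-50, wherein step (2) comprises establishing a logistic regression model to obtain a model prediction score; and training with the obtained methylation levels of the methylation markers as the training set, and determining the threshold of the model from the training set samples.
  52. A breast cancer prediction model constructed by the method according to any one of claims 48-51.
  53. A device for diagnosing breast cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method according to any one of claims 48-51 to construct a breast cancer prediction model; and using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as a test set to obtain a model prediction score, and using the prediction score and the threshold to determine whether the sample is breast cancer.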
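  54. A kit or device for detecting breast cancer tissue-specific methylation markers, comprising a reagent or component for detecting the state and/or level of one or more breast cancer tissue-specific methylation markers in genomic DNA from a sample, the breast cancer tissue-specific methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 2 kb upstream region and 2 kb downstream region of that gene on its chromosome: BARHL2; ALX3; TBX15; C2CD4D; RYR2; LBH; SIX3; SIX2; OTX1; EMX1; LBX2; BCL2L11; PAX8; HOXD1; SATB2; VILL; CLDN11; EPHB3; NKX3-2; KCTD8; PITX1; CXXC5; FOXC1; NRN1; HOXA9; DLX6; MOS; TCF24; CA3; GDF6; FOXD4; PTF1A; TLX1; INA; NKX6-2; PAX6; BCAT1; FAIM2; GRASP; CCNA1; SIX1; PRKCB; SOX9; ST8SIA5; NFIX; EPS8L1; ZIK1; KAL1; ZNF81; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 150 bp-500 bp in length, preferably 200 bp-470 bp;
    preferably, the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 1-51.
  55. The kit or device according to claim 54, wherein the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the nucleic acid is cell-free DNA in plasma.
  56. The kit or device according to claim 54 or 55, wherein the reagent or component comprises a reagent or component used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry;
    preferably, the reagent comprises oligonucleotides for detecting the methylation marker; preferably, the oligonucleotides are primers and/or probes;
    preferably, the primers are primers for detecting the methylation level/state of a site by methylation sequencing, or PCR primers for amplifying one or more methylation sites;
    preferably, the reagent comprises bisulfite and derivatives thereof, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or methylation-insensitive restriction enzymes, digestion buffer, fluorescent dyes, fluorescence quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the control being the aforementioned specific methylation marker from a normal subject or from a patient with a non-breast cancer; preferably, the non-breast cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or lung cancer.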
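  57. Use of a reagent or component in the preparation of a kit or device, wherein the kit or device is for (1) distinguishing gastric and/or esophageal cancer patients from patients with cancers other than gastric and esophageal cancer, (2) diagnosing or assisting the diagnosis of gastric and/or esophageal cancer, or (3) tracing gastric and/or esophageal cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of a gastric and/or esophageal cancer tissue-specific methylation marker in genomic DNA of a sample, the methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 2 kb upstream region and 2 kb downstream region of that gene on its chromosome: TAL1; VAV3; PMF1; ATP2B4; SH3YL1; SLC9A3; CXXC5; PCDHGA11; FOXF2; ZNF273; KLRG2; CRB2; SEC16A; GPAM; ASCL2; PAX6; PTGDR2; PLEKHB1; TBX5; STX2; FBRSL1; ATP11A; BTBD6; CRIP2; ONECUT1; ZNF764; IGHV3OR16-17; SALL1; ACTG1; GATA6; KCTD1; CYP2F1; TPTE; CLDN5; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 150 bp-500 bp in length, preferably 200 bp-470 bp.
  58. The use according to claim 57, wherein the cancers other than gastric and esophageal cancer, or the pan-cancer, include lung cancer, liver cancer, colorectal cancer, pancreatic cancer and/or breast cancer.
  59. The use according to claim 57 or 58, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID No. 23, 72, 143, 150, 152, 157 and 160-187.
  60. The use according to any one of claims 57-59, wherein the reagent or component comprises a reagent or component used in one or more of the following methylation detection methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  61. The use according to any one of claims 57-60, wherein the reagent or component comprises primers and/or probes for detecting the methylation marker, and/or the sample is cells, tissue, a fine-needle aspiration biopsy and/or plasma; preferably, the sample genomic DNA is cell-free DNA in plasma.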
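  62. A method of constructing a prediction model that distinguishes gastric and/or esophageal cancer from cancers other than gastric and esophageal cancer, comprising:
    (1) obtaining the methylation levels of methylation markers in genomic DNA of gastric and/or esophageal cancer samples and of samples of cancers other than gastric and esophageal cancer as a training set; the methylation markers being selected from the following regions or sites in those regions, each region being one of the following genes together with the 2 kb upstream region and 2 kb downstream region of that gene on its chromosome: TAL1; VAV3; PMF1; ATP2B4; SH3YL1; SLC9A3; CXXC5; PCDHGA11; FOXF2; ZNF273; KLRG2; CRB2; SEC16A; GPAM; ASCL2; PAX6; PTGDR2; PLEKHB1; TBX5; STX2; FBRSL1; ATP11A; BTBD6; CRIP2; ONECUT1; ZNF764; IGHV3OR16-17; SALL1; ACTG1; GATA6; KCTD1; CYP2F1; TPTE; CLDN5; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 150 bp-500 bp in length, preferably 200 bp-470 bp; preferably, the cancers other than gastric and esophageal cancer, or the pan-cancer, include lung cancer, liver cancer, colorectal cancer, pancreatic cancer and/or breast cancer; and
    (2) constructing a logistic regression machine learning model from the methylation level data of the methylation markers.
  63. The method according to claim 62, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID No. 23, 72, 143, 150, 152, 157 and 160-187;
    preferably, the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the genomic DNA is cell-free DNA in plasma.
  64. The method according to claim 62 or 63, wherein step (1) comprises obtaining methylation sequencing data of sample DNA.
  65. The method according to any one of claims 62-64, wherein step (2) comprises establishing a logistic regression model to obtain a model prediction score; and training with the obtained methylation levels of the methylation markers as the training set, and determining the relevant threshold of the model from the training set samples.
  66. A gastric and/or esophageal cancer prediction model constructed by the method according to any one of claims 62-65.
  67. A device for diagnosing gastric and/or esophageal cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method according to any one of claims 62-65 to construct a gastric and/or esophageal cancer prediction model; and using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as a test set to obtain a model prediction score, and using the prediction score and the threshold to determine whether the sample is gastric and/or esophageal cancer, a score above the threshold being predicted as gastric and/or esophageal cancer and any other score as another cancer type.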
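  68. A kit or device for detecting gastric and/or esophageal cancer tissue-specific methylation markers, comprising a reagent or component for detecting the state and/or level of one or more gastric and/or esophageal cancer tissue-specific methylation markers in genomic DNA from a sample, the gastric and/or esophageal cancer tissue-specific methylation marker being one of the following regions or a site therein, each region comprising one of the following genes together with the 2 kb upstream region and 2 kb downstream region of that gene on its chromosome: TAL1; VAV3; PMF1; ATP2B4; SH3YL1; SLC9A3; CXXC5; PCDHGA11; FOXF2; ZNF273; KLRG2; CRB2; SEC16A; GPAM; ASCL2; PAX6; PTGDR2; PLEKHB1; TBX5; STX2; FBRSL1; ATP11A; BTBD6; CRIP2; ONECUT1; ZNF764; IGHV3OR16-17; SALL1; ACTG1; GATA6; KCTD1; CYP2F1; TPTE; CLDN5; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated;
    preferably, the site is 150 bp-500 bp in length, preferably 200 bp-470 bp;
    preferably, the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID No. 23, 72, 143, 150, 152, 157 and 160-187.
  69. The kit or device according to claim 68, wherein the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the nucleic acid is cell-free DNA in plasma.
  70. The kit or device according to claim 68 or 69, wherein the reagent or component comprises a reagent or component used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry;
    preferably, the reagent comprises oligonucleotides for detecting the methylation marker; preferably, the oligonucleotides are primers and/or probes;
    preferably, the primers are primers for detecting the methylation level/state of a site by methylation sequencing, or PCR primers for amplifying one or more methylation sites;
    preferably, the reagent comprises bisulfite and derivatives thereof, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or methylation-insensitive restriction enzymes, digestion buffer, fluorescent dyes, fluorescence quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the control being the aforementioned specific methylation marker from a normal subject or from a patient with a cancer other than gastric and esophageal cancer; preferably, the cancers other than gastric and esophageal cancer, or the pan-cancer, include lung cancer, liver cancer, colorectal cancer, pancreatic cancer and/or breast cancer.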
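  71. Use of a reagent or component in the preparation of a kit or device, wherein the kit or device is for (1) distinguishing pancreatic cancer patients from patients with non-pancreatic cancers, (2) diagnosing or assisting the diagnosis of pancreatic cancer, or (3) tracing pancreatic cancer as the tissue of origin during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of a pancreatic cancer tissue-specific methylation marker in genomic DNA of a sample, the methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 2.5 kb upstream region and 2.5 kb downstream region of that gene on its chromosome: TNFRSF14; PGM1; CELF3; ATP2B4; SF3B6; CNNM4; SP9; C2orf82; NEU4; RPL35A; HGFAC; EXOC3; GDNF; NEUROG1; HIST1H2BA; OSTM1; CCR6; CCAR2; TNFRSF10D; TJP2; DAB2IP; NTMT1; MKI67; PTGDR2; CCDC77; MYL2; FRY; SMEK1; BTBD6; PIF1; SRL; SPNS1; DNM2; ZNF569; or SDF2L1; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 130 bp-530 bp in length, preferably 150 bp-480 bp.
  72. The use according to claim 71, wherein the non-pancreatic cancers or the pan-cancer include colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer and/or lung cancer.
  73. The use according to claim 71 or 72, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 68, 88, 154, 163, 172, 177 and 188-217.
  74. The use according to any one of claims 71-73, wherein the reagent or component comprises a reagent or component used in one or more of the following methylation detection methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry.
  75. The use according to any one of claims 71-74, wherein the reagent or component comprises primers and/or probes for detecting the methylation marker, and/or the sample is cells, tissue, a fine-needle aspiration biopsy and/or plasma; preferably, the sample genomic DNA is cell-free DNA in plasma.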
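  76. A method of constructing a prediction model that distinguishes pancreatic cancer from other, non-pancreatic cancers, comprising:
    (1) obtaining the methylation levels of methylation markers in genomic DNA of pancreatic cancer samples and non-pancreatic cancer samples as a training set; the methylation markers being selected from the following regions or sites in those regions, each region being one of the following genes together with the 2.5 kb upstream region and 2.5 kb downstream region of that gene on its chromosome: TNFRSF14; PGM1; CELF3; ATP2B4; SF3B6; CNNM4; SP9; C2orf82; NEU4; RPL35A; HGFAC; EXOC3; GDNF; NEUROG1; HIST1H2BA; OSTM1; CCR6; CCAR2; TNFRSF10D; TJP2; DAB2IP; NTMT1; MKI67; PTGDR2; CCDC77; MYL2; FRY; SMEK1; BTBD6; PIF1; SRL; SPNS1; DNM2; ZNF569; SDF2L1; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 130 bp-530 bp in length, preferably 150 bp-480 bp; preferably, the non-pancreatic cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer and/or lung cancer; and
    (2) constructing a logistic regression machine learning model from the methylation level data of the methylation markers.
  77. The method according to claim 76, wherein the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 68, 88, 154, 163, 172, 177 and 188-217;
    preferably, the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the genomic DNA is cell-free DNA in plasma.
  78. The method according to claim 76 or 77, wherein step (1) comprises obtaining methylation sequencing data of sample DNA.
  79. The method according to any one of claims 76-78, wherein step (2) comprises establishing a logistic regression model to obtain a model prediction score; and training with the obtained methylation levels of the methylation markers as the training set, and determining the threshold of the model from the training set samples.
  80. A pancreatic cancer prediction model constructed by the method according to any one of claims 76-79.
  81. A device for diagnosing pancreatic cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method according to any one of claims 76-79 to construct a pancreatic cancer prediction model; and using the methylation levels of the methylation markers in genomic DNA of a sample to be tested as a test set to obtain a model prediction score, and using the prediction score and the threshold to determine whether the sample is pancreatic cancer.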
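  82. A kit or device for detecting pancreatic cancer tissue-specific methylation markers, comprising a reagent or component for detecting the state and/or level of one or more pancreatic cancer tissue-specific methylation markers in genomic DNA from a sample, the pancreatic cancer tissue-specific methylation marker being one of the following regions or a site therein, each region being one of the following genes together with the 2.5 kb upstream region and 2.5 kb downstream region of that gene on its chromosome: TNFRSF14; PGM1; CELF3; ATP2B4; SF3B6; CNNM4; SP9; C2orf82; NEU4; RPL35A; HGFAC; EXOC3; GDNF; NEUROG1; HIST1H2BA; OSTM1; CCR6; CCAR2; TNFRSF10D; TJP2; DAB2IP; NTMT1; MKI67; PTGDR2; CCDC77; MYL2; FRY; SMEK1; BTBD6; PIF1; SRL; SPNS1; DNM2; ZNF569; SDF2L1; or a complementary sequence or variant of any of these genes, provided that the methylation sites in the variant are not mutated; preferably, the site is 130 bp-530 bp in length, preferably 150 bp-480 bp;
    preferably, the methylation marker comprises the nucleotide sequence shown in any one or more of the following, or a complementary or variant sequence thereof: SEQ ID NO: 68, 88, 154, 163, 172, 177 and 188-217.
  83. The kit or device according to claim 82, wherein the sample is cells, tissue, a fine-needle aspiration biopsy or plasma; preferably, the nucleic acid is cell-free DNA in plasma.
  84. The kit or device according to claim 82 or 83, wherein the reagent or component comprises a reagent or component used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, chip-based methylation profiling, and mass spectrometry;
    preferably, the reagent comprises oligonucleotides for detecting the methylation marker; preferably, the oligonucleotides are primers and/or probes;
    preferably, the primers are primers for detecting the methylation level/state of a site by methylation sequencing, or PCR primers for amplifying one or more methylation sites;
    preferably, the reagent comprises bisulfite and derivatives thereof, PCR buffer, polymerase, dNTPs, primers, probes, methylation-sensitive or methylation-insensitive restriction enzymes, digestion buffer, fluorescent dyes, fluorescence quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards and/or controls, the control being the specific methylation marker from a normal subject or from a patient with a non-pancreatic cancer; preferably, the non-pancreatic cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, breast cancer and/or lung cancer.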
PCT/CN2023/105537 2022-07-04 2023-07-03 癌症特异性甲基化标志物及其应用 WO2024008040A1 (zh)

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
CN202210787623.2 2022-07-04
CN202210787425.6 2022-07-04
CN202210787623.2A CN118127150A (zh) 2022-07-04 胰腺癌特异性甲基化标志物及其诊断胰腺癌的应用
CN202210786398.0A CN117385026A (zh) 2022-07-04 2022-07-04 乳腺癌特异性甲基化标志物及其诊断乳腺癌的应用
CN202210787412.9 2022-07-04
CN202210787502.8A CN117385028A (zh) 2022-07-04 2022-07-04 结直肠癌特异性甲基化标志物及其应用
CN202210787313.0A CN117344012A (zh) 2022-07-04 2022-07-04 胃癌和/或食管癌特异性甲基化标志物及其应用
CN202210786398.0 2022-07-04
CN202210787425.6A CN117363728A (zh) 2022-07-04 2022-07-04 肝癌组织特异性甲基化标志物及其诊断肝癌的应用
CN202210787412.9A CN117385027A (zh) 2022-07-04 2022-07-04 肺癌特异性甲基化标志物及其诊断肺癌的应用
CN202210787313.0 2022-07-04
CN202210787502.8 2022-07-04

Publications (1)

Publication Number Publication Date
WO2024008040A1 true WO2024008040A1 (zh) 2024-01-11

Family

ID=89454391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105537 WO2024008040A1 (zh) 2022-07-04 2023-07-03 癌症特异性甲基化标志物及其应用

Country Status (2)

Country Link
TW (1) TW202403054A (zh)
WO (1) WO2024008040A1 (zh)

Also Published As

Publication number Publication date
TW202403054A (zh) 2024-01-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23834807

Country of ref document: EP

Kind code of ref document: A1