WO2016095093A1 - Method for screening tumor, method and device for detecting variation of target region - Google Patents


Publication number
WO2016095093A1
Authority
WO
WIPO (PCT)
Prior art keywords
reference sequence
region
genes
alignment
fragment
Prior art date
Application number
PCT/CN2014/093871
Other languages
French (fr)
Chinese (zh)
Inventor
蔡宇航
陈希
钱朝阳
管彦芳
易鑫
朱红梅
杨玲
吴仁花
Original Assignee
天津华大基因科技有限公司
深圳华大基因科技有限公司
Priority date
Filing date
Publication date
Application filed by 天津华大基因科技有限公司, 深圳华大基因科技有限公司
Priority to PCT/CN2014/093871
Publication of WO2016095093A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • The invention relates to the field of biomedicine. Specifically, it relates to a method and a device for detecting variation in a target region, and to a method and a device for screening tumors.
  • Tumors are classified as benign or malignant. Malignant tumors, also called cancers, are diseases caused by defects in the mechanisms that control cell growth and proliferation. Cancer is one of the major diseases threatening human health. Besides growing uncontrollably, cancer cells invade the surrounding normal tissue and can even spread to other parts of the body through the circulatory or lymphatic system.
  • The present invention aims to solve at least one of the technical problems existing in the prior art, or to provide at least one alternative means.
  • The present invention provides a method for detecting variation in a target region, the method comprising: (1) obtaining nucleic acid from a sample to be tested, the nucleic acid consisting of a plurality of nucleic acid fragments derived from fragmented genomic DNA and/or free DNA; (2) capturing the nucleic acid fragments with a kit to obtain the target region; (3) sequencing the target region to obtain sequencing data, the sequencing data consisting of a plurality of reads; and (4) detecting variation in the target region based on the sequencing data; wherein the kit comprises probes capable of specifically recognizing the following predetermined regions: the gene regions of at least 10 of the 547 genes listed in Table 1.
  • The predetermined regions are the gene regions of at least 20, 30, 40, 50, 60, 70, 80, 90, 100 or 200 of the 547 genes.
  • The combination of gene regions specifically recognized by the kit probes in the method of the present invention was obtained by the inventors through repeated collection, screening and trial combination. These gene regions are associated with the occurrence or development of common tumors.
  • The common tumors include lung cancer, colorectal cancer, stomach cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer, and liver cancer.
  • With the kit, the gene sequences associated with a plurality of common cancers can be obtained in a single run, simply, conveniently and with high specificity. These sequences can then be detected and analyzed, and the results used for early screening of various common cancers and for assessing the feasibility and effect of early intervention. At present, most cancers, such as lung cancer, liver cancer and gastric cancer, are identified only at the time of hospital pathological diagnosis, which delays treatment and greatly reduces the possibility of cure.
  • the predetermined region is a gene region of 145 genes listed in Table 2 of the 547 genes.
  • the gene region of the 145 genes in Table 2 that the probe can specifically recognize is obtained by the inventors through multiple collections, multiple screenings, and multiple trial combinations. These gene region combinations are associated with the development of lung cancer.
  • All lung cancer-related gene sequences can be obtained in a single run, simply, conveniently and with high specificity, and the information obtained by detecting these gene sequences can assist early screening and diagnosis of lung cancer.
  • the predetermined region is a gene region of 60 genes listed in Table 3 of the 547 genes.
  • the gene regions of the 60 genes of Table 3 that the probe can specifically recognize are obtained by the inventors through multiple collections, multiple screenings, and multiple trial combinations. These gene region combinations are associated with the development of colorectal cancer.
  • All colorectal cancer-related gene sequences can be obtained in a single run, simply, conveniently and with high specificity, and the information obtained by detecting these gene sequences can assist early screening and diagnosis of colorectal cancer.
  • the predetermined region is the gene region of the 43 genes listed in Table 4 of the 547 genes.
  • the gene regions of the 43 genes in Table 4 that the probe can specifically recognize are obtained by the inventors through multiple collections, multiple screenings, and multiple trial combinations. These gene region combinations are related to the occurrence and development of gynecological reproductive tract tumors.
  • the genital tract tumors include ovarian cancer, endometrial cancer, and cervical cancer.
  • the probe has a length of from 25 to 300 nt, preferably from 50 to 250 nt, and more preferably from 80 nt to 120 nt.
  • the probe is obtained by first obtaining an initial probe set and then screening the initial probe set.
  • Obtaining the initial probe set includes: determining a reference sequence of the gene region and, starting from one end of the reference sequence, sequentially taking DNA fragments along the reference sequence until the other end is reached. Each DNA fragment is one initial probe, and all of the DNA fragments together constitute the initial probe set. The DNA fragments may overlap completely, overlap partially or not overlap at all, and the initial probe set covers the gene region at least once.
  • The reference sequence of the gene region can be obtained from a reference genome. For example, the corresponding gene regions are taken from the human reference genome HG19, and together these regions on HG19 constitute the reference sequence of the gene region. HG19 can be downloaded from the NCBI database.
  • The initial probe set is designed with an iterative algorithm: determine the position of the gene region on the reference genome and obtain the reference sequence of the gene region; copy the reference sequence starting from its first nucleotide to obtain a first DNA fragment; copy it starting from the second nucleotide to obtain a second DNA fragment; copy it starting from the third nucleotide to obtain a third DNA fragment; and continue in this way until one end of the Nth DNA fragment exceeds the reference sequence. Each DNA fragment is one initial probe, and N is the total number of initial probes in the set. The result is an initial probe set that covers the entire target gene region, from which the final probes are obtained by screening.
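The iterative tiling described above can be sketched as a sliding window over the reference sequence. This is a minimal illustration, not the inventors' implementation: the function name `tile_probes`, the default probe length of 100 nt (chosen from the 80-120 nt range given earlier) and the `step` parameter are assumptions, and the sketch stops at the last window that still fits entirely inside the reference.

```python
def tile_probes(reference: str, probe_len: int = 100, step: int = 1) -> list[str]:
    """Slide a fixed-length window along `reference` and collect each window.

    step=1 reproduces the "start at nucleotide 1, then 2, then 3, ..." scheme;
    a larger step gives partially overlapping or non-overlapping probes.
    """
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    probes = []
    start = 0
    # Stop once the next window would run past the end of the reference.
    while start + probe_len <= len(reference):
        probes.append(reference[start:start + probe_len])
        start += step
    return probes
```

With `step=1` every position in the region is covered by up to `probe_len` probes, which is why the later screening steps can discard many candidates and still keep the region covered.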
  • Further screening of the initial probe set comprises: aligning the DNA fragments (the initial probe set) with the reference sequence to obtain the number of alignment positions of each DNA fragment on the reference sequence, and filtering out DNA fragments that align at more than one position.
  • So that the final probes can capture the gene regions in the same reaction system, and/or so that the captured gene regions can be eluted together under the same reaction conditions, the initial probe set is further screened by removing DNA fragments whose GC content falls outside 35-70%.
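The two screening rules above (a single alignment position, GC content within 35-70%) could be combined as in the sketch below. It rests on a loud simplification: exact substring matching stands in for a real aligner, so it only illustrates the filtering logic. The names `gc_fraction`, `count_exact_hits` and `screen_probes` are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in `seq` (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def count_exact_hits(probe: str, reference: str) -> int:
    """Count overlapping exact occurrences of `probe` in `reference`.

    Assumption: exact matching replaces the alignment step of the source.
    """
    count = 0
    pos = reference.find(probe)
    while pos != -1:
        count += 1
        pos = reference.find(probe, pos + 1)
    return count


def screen_probes(probes, reference, gc_min=0.35, gc_max=0.70):
    """Keep probes that hit the reference exactly once and have acceptable GC."""
    kept = []
    for probe in probes:
        if count_exact_hits(probe, reference) != 1:
            continue  # multi-mapping (or unmapped) probe
        if not gc_min <= gc_fraction(probe) <= gc_max:
            continue  # GC content outside the 35-70% window
        kept.append(probe)
    return kept
```

In practice the uniqueness check would be run against the whole reference genome with an aligner that tolerates mismatches, since near-identical off-target sites also reduce capture specificity.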
  • Step (2) and step (3) comprise: (h) end-repairing the nucleic acid fragments to obtain end-repaired fragments; (i) adding base A to both ends of the end-repaired fragments to obtain sticky-end fragments; (j) ligating adapters to both ends of the sticky-end fragments to obtain adapter-ligated fragments; (k) performing a first amplification of the adapter-ligated fragments to obtain a first amplification product; (l) capturing the first amplification product with the kit to obtain the target region; (m) performing a second amplification of the target region to obtain a second amplification product; and (n) sequencing the second amplification product to obtain the sequencing data.
  • the sequencing library construction included in steps (2) and (3) in the method of this aspect of the invention is particularly suitable for the construction of a sequencing library containing a trace amount of nucleic acid in a sample.
  • The sample contains a trace amount of free DNA; a plasma sample, for example, contains an extremely small amount of target free DNA. The first amplification brings the amount of nucleic acid up to what chip/probe hybridization capture requires. Chip hybridization captures a certain amount of nucleic acid, and the second amplification re-amplifies the captured target fragments to meet the requirements of on-machine sequencing and quality control.
  • This library construction method is particularly suitable for sequencing library construction of samples with a total free nucleic acid of not less than 10 ng or a conventional tissue genomic DNA of not less than 1 μg.
  • Steps (2) and (3) of the method of this aspect of the invention include sequencing the constructed sequencing library. Sequencing can be performed on known platforms, including but not limited to Illumina's HiSeq 2000/2500 platform, Life Technologies' Ion Torrent platform and single-molecule sequencing platforms.
  • Sequencing can be single-end or paired-end. In one embodiment of the invention, paired-end sequencing is used, and the resulting sequencing data consist of multiple read pairs.
  • The raw data coming off the sequencer are of high quality, which benefits subsequent accurate detection and analysis.
  • Step (4) comprises: performing a first alignment of the sequencing data against a reference sequence to obtain a first alignment result; performing a second alignment of the first alignment result against a part of the reference sequence to obtain a second alignment result, wherein the part of the reference sequence includes every known InDel site in the target region reference sequence together with the reference sequence 1000 bp upstream and 1000 bp downstream of each known InDel site; and simultaneously detecting SNP, InDel, SV and CNV variations in the target region based on the first alignment result and the second alignment result.
  • The second alignment is a local alignment. The first alignment is a conventional global alignment and can be performed with software such as SOAP or BWA at default settings; the first alignment result includes the position at which each read matches the reference sequence and information about the match.
  • The second alignment takes the first alignment result and locally realigns all reads near every known InDel in the reference sequence corresponding to the captured gene regions. This removes errors from the first alignment and improves the accuracy of subsequent variant detection. The second alignment can be performed with the GATK software ( https://www.broadinstitute.org/gatk/ ).
  • SNP and InDel variations are detected simultaneously with the GATK UnifiedGenotyper software.
  • The sequencing data are filtered before the first alignment. The filtering removes reads in which the proportion of undetermined bases exceeds 10% and/or reads in which the proportion of bases with a quality value of not more than 5 is not less than 50%.
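The read filter above can be expressed as a single predicate per read. The sketch assumes FASTQ-style input with Phred+33 quality encoding and treats `N` as the undetermined base; both are assumptions, as is the function name `keep_read`.

```python
def keep_read(seq: str, qual: str,
              max_n_frac: float = 0.10,
              low_q: int = 5,
              max_low_q_frac: float = 0.50,
              phred_offset: int = 33) -> bool:
    """Return True if the read passes both filters described in the text.

    A read is dropped when more than `max_n_frac` of its bases are N, or when
    at least `max_low_q_frac` of its bases have quality <= `low_q`.
    """
    if not seq or len(seq) != len(qual):
        return False  # malformed record
    n_frac = seq.upper().count("N") / len(seq)
    low_count = sum(1 for ch in qual if ord(ch) - phred_offset <= low_q)
    low_frac = low_count / len(qual)
    return n_frac <= max_n_frac and low_frac < max_low_q_frac
```

For paired-end data, a common convention (not stated in the source) is to drop the whole pair when either mate fails.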
  • Before the second alignment, where two read pairs in the first alignment result are identical, one of them is removed.
  • a portion of the reference sequence includes each known InDel site in the target region reference sequence, and a reference sequence of 1000 bp each upstream and downstream of each known InDel site.
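The realignment target (each known InDel site plus 1000 bp on either side) amounts to a list of genomic intervals. The sketch below builds them from `(chromosome, position)` pairs. Merging overlapping or adjacent windows is a convenience added here, not something the source specifies, and 1-based closed coordinates and the name `indel_windows` are likewise assumptions.

```python
def indel_windows(sites, flank: int = 1000):
    """Build merged (chrom, start, end) windows around known InDel sites.

    `sites` is an iterable of (chrom, pos) with 1-based positions.
    Windows on the same chromosome that overlap or touch are merged.
    """
    merged = []
    for chrom, pos in sorted(sites):
        start, end = max(1, pos - flank), pos + flank
        if merged and merged[-1][0] == chrom and start <= merged[-1][2] + 1:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Restricting the local realignment to these windows keeps the second alignment cheap relative to realigning the whole captured region.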
  • The reference sequence can be a human reference genome, that is, a known human genome sequence such as HG19, which can be downloaded from NCBI. The target region reference sequence is the part of the reference genome sequence that matches the target region.
  • Step (4) further comprises: determining that the sample to be tested is a positive sample when at least one of the detected variations satisfies (i) or (ii): (i) the number of supporting reads in the negative control sample is less than 2 and the mutation rate in the positive control sample is greater than 1%; (ii) the sequencing depth is not less than 10X, at least 3 reads support the variation, the number of supporting reads in the negative control sample is less than 2, the mutation rate in the positive control sample is greater than 1%, and the read support of the variant site differs significantly from the read support of the same site in the normal control sample.
  • the positive sample refers to the tumor sample.
  • The two determination conditions were established by the inventors by combining current database information, a large body of literature reports, and statistics from a large number of positive and negative samples. They are statistically significant, and the latter is stricter than the former. Preferably, there are more than 30 positive or negative control samples.
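The two determination conditions can be written as boolean checks on per-variant evidence. The sketch below is illustrative only: the `VariantEvidence` fields, the function names and the `strict` switch are hypothetical, and the significance flag is assumed to be computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class VariantEvidence:
    depth: int                 # sequencing depth at the site in the test sample
    alt_reads: int             # reads supporting the variant in the test sample
    neg_ctrl_alt_reads: int    # supporting reads in the negative control
    pos_ctrl_rate: float       # mutation rate in the positive control (0-1)
    differs_from_normal: bool  # significance vs. normal control, computed elsewhere


def meets_condition_i(v: VariantEvidence) -> bool:
    """(i): fewer than 2 supporting reads in negative control, >1% in positive control."""
    return v.neg_ctrl_alt_reads < 2 and v.pos_ctrl_rate > 0.01


def meets_condition_ii(v: VariantEvidence) -> bool:
    """(ii): condition (i) plus depth >= 10X, >= 3 supporting reads, significant difference."""
    return (meets_condition_i(v) and v.depth >= 10
            and v.alt_reads >= 3 and v.differs_from_normal)


def is_positive_sample(variants, strict: bool = False) -> bool:
    """A sample is positive if at least one variant meets the chosen condition."""
    check = meets_condition_ii if strict else meets_condition_i
    return any(check(v) for v in variants)
```

Because (ii) contains every requirement of (i), any variant that passes (ii) also passes (i); the `strict` flag simply selects which of the two the caller wants to apply.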
  • The control sample data can be obtained by extracting and sequencing nucleic acid from control samples directly, or by using sequencing data of samples in publicly available databases. Using data from multiple control samples makes the decision conditions and results statistically meaningful and more credible.
  • A determination based on either of the two criteria can be used for clinical tumor diagnosis and screening, and helps in understanding the type, likelihood and development of cancer in the sample.
  • The read support of the variant site in the sample to be tested differs significantly from the read support of the same site in the normal control sample (negative control sample). The read support can be the number of reads supporting the variation, or the proportion of variant-supporting reads among all reads aligned to the site; in one embodiment of the invention, the latter is used for comparison. A significant difference means a substantial difference. For example, for variant A in the sample to be tested, suppose the read support ratio across multiple positive samples (cancer samples) is 5/400 (5 variant reads out of 400 reads in total), that is, an average mutation frequency of 1.25% at the site in positive samples, and the ratio across multiple negative control samples is 1/200 (1 variant read out of 200 reads in total), that is, an average variation frequency of 0.5% in negative control samples. If the variation frequency of the site in the sample to be tested is closer to 1.25%, for example 0.9%, the difference is considered significant or substantial. Alternatively, the difference between the read support ratio (variation frequency) of variant site A and the mutation frequency of that site in the negative control samples can be tested statistically, for example with a z test or a t test; a significant result (p < 0.05) counts as a significant difference.
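One way to carry out the z test mentioned above is a pooled two-proportion z test on variant-supporting read counts. The source does not say which form of the test is used, so this choice, the function name `two_proportion_z` and the two-sided p-value are assumptions. The test relies on a normal approximation that can be poor when counts are very small.

```python
import math


def two_proportion_z(alt1: int, n1: int, alt2: int, n2: int):
    """Pooled two-proportion z test.

    alt1/n1: variant-supporting reads and total reads in the test sample.
    alt2/n2: the same counts in the negative control.
    Returns (z, two_sided_p).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("read totals must be positive")
    p1, p2 = alt1 / n1, alt2 / n2
    pooled = (alt1 + alt2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # no variation at all in either sample
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

A call such as `two_proportion_z(alt_test, depth_test, alt_ctrl, depth_ctrl)` returns the statistic and p-value, and the site would count as significantly different when the p-value is below 0.05. At low depth an exact test may be a safer choice than the normal approximation.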
  • The present invention further provides an apparatus for detecting variation in a target region, used to carry out the target region variation detection method of one or more of the embodiments described above. The apparatus comprises: a data acquisition unit configured to acquire sequencing data of the target region, the sequencing data consisting of a plurality of reads and/or a plurality of read pairs, wherein the data acquisition unit obtains nucleic acid from the sample to be tested, the nucleic acid consisting of a plurality of nucleic acid fragments derived from fragmented genomic DNA and/or free DNA, captures the nucleic acid fragments with a kit to obtain the target region, and sequences the target region, and wherein the kit comprises probes capable of specifically recognizing the following predetermined regions: the gene regions of at least 10 of the 547 genes listed in Table 1;
  • a detecting unit configured to detect the target region variation based on the sequencing data from the data acquiring unit, the variation including at least
  • The detecting unit 200 of the device 1000 includes a first alignment subunit 13, a second alignment subunit 15 and a variation identification subunit 17. The first alignment subunit 13 performs a first alignment of the sequencing data from the data acquisition unit 100 against a reference sequence to obtain a first alignment result. The second alignment subunit 15 performs a second alignment of the first alignment result from the first alignment subunit 13 against a part of the reference sequence to obtain a second alignment result. The variation identification subunit 17 simultaneously detects SNV, InDel, SV and CNV variations in the target region, based on the first alignment result from the first alignment subunit 13 and the second alignment result from the second alignment subunit 15, to obtain variant site information.
  • a portion of the reference sequence includes each known InDel site in the target region reference sequence, and a reference sequence of 1000 bp each upstream and downstream of each known InDel site.
  • The detecting unit 200 of the apparatus 1000 further includes a first filtering subunit 12 connected to the first alignment subunit 13, which filters the sequencing data before they enter the first alignment subunit 13. The filtering removes reads in which the proportion of undetermined bases exceeds 10% and/or reads in which the proportion of bases with a quality value of not more than 5 is not less than 50%.
  • The detecting unit 200 further includes a second filtering subunit 14 connected to both the first alignment subunit 13 and the second alignment subunit 15. Before the first alignment result from the first alignment subunit 13 enters the second alignment subunit 15, the second filtering subunit 14 removes one of any two identical read pairs in that result.
  • The reference sequence may be HG19; the first alignment performed in the first alignment subunit is a global alignment, and the second alignment performed in the second alignment subunit is a local alignment.
  • The detecting unit 200 of the apparatus 1000 further includes a determining subunit 19, which checks whether the variant sites from the variation identification subunit 17 satisfy the following condition and determines that the sample to be tested is a positive sample when at least one variant site does: the number of supporting reads in the negative control sample is less than 2 and the mutation rate in the positive control sample is greater than 1%.
  • Alternatively, the determining subunit 19 of the detecting unit 200 checks whether the variant sites from the variation identification subunit 17 satisfy the following condition and determines that the sample to be tested is a positive sample when at least one variant site does: the sequencing depth is not less than 10X, at least 3 reads support the variation, the number of supporting reads in the negative control sample is less than 2, the mutation rate in the positive control sample is greater than 1%, and the read support of the variant site differs significantly from the read support of the same site in the normal control sample.
  • A method of screening for tumors, particularly early screening, the tumors including but not limited to lung cancer, colorectal cancer, gastric cancer, breast cancer, renal cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer and liver cancer, the method comprising: obtaining nucleic acid from a sample to be tested, the nucleic acid consisting of a plurality of nucleic acid fragments derived from fragmented genomic DNA and/or free DNA; capturing the nucleic acid fragments with a kit to obtain a target region; sequencing the target region to obtain sequencing data consisting of a plurality of reads; detecting variation in the target region based on the sequencing data; and determining that the sample to be tested is a positive sample when at least one of the detected variations satisfies (i) or (ii): (i) read support in the negative control sample The number of mutations
  • The predetermined regions are the gene regions of at least 20, 30, 40, 50, 60, 70, 80, 90, 100 or 200 of the 547 genes.
  • The combination of gene regions specifically recognized by the kit probes in the method of the present invention was obtained by the inventors through repeated collection, screening and trial combination. These gene regions are associated with the occurrence or development of common tumors.
  • The common tumors include lung cancer, colorectal cancer, stomach cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer, and liver cancer.
  • With the kit, the gene sequences associated with a plurality of common cancers can be obtained in a single run, simply, conveniently and with high specificity. These sequences can then be detected and analyzed, and the results used for early screening of various common cancers and for assessing the feasibility and effect of early intervention. At present, most cancers, such as lung cancer, liver cancer and gastric cancer, are identified only at the time of hospital pathological diagnosis, which delays treatment and greatly reduces the possibility of cure.
  • the predetermined region is a gene region of 145 genes listed in Table 2 of the 547 genes.
  • the gene region of the 145 genes in Table 2 that the probe can specifically recognize is obtained by the inventors through multiple collections, multiple screenings, and multiple trial combinations. These gene region combinations are associated with the development of lung cancer.
  • All lung cancer-related gene sequences can be obtained in a single run, simply, conveniently and with high specificity, and the information obtained by detecting these gene sequences can assist early screening and diagnosis of lung cancer.
  • the predetermined region is a gene region of 60 genes listed in Table 3 of the 547 genes.
  • the gene regions of the 60 genes of Table 3 that the probe can specifically recognize are obtained by the inventors through multiple collections, multiple screenings, and multiple trial combinations. These gene region combinations are associated with the development of colorectal cancer.
  • The probes of the kit used in the method of the invention can capture all colorectal cancer-related gene sequences in a single run, simply, conveniently and with high specificity, and the information obtained by detecting these gene sequences can assist early screening and diagnosis of colorectal cancer.
  • the predetermined region is the gene region of the 43 genes listed in Table 4 of the 547 genes.
  • the gene regions of the 43 genes in Table 4 that the probe can specifically recognize are obtained by the inventors through multiple collections, multiple screenings, and multiple trial combinations. These gene region combinations are related to the occurrence and development of gynecological reproductive tract tumors.
  • the genital tract tumors include ovarian cancer, endometrial cancer, and cervical cancer.
  • the probe has a length of from 25 to 300 nt, preferably from 50 to 250 nt, and more preferably from 80 nt to 120 nt.
  • The probe is obtained by first obtaining an initial probe set and then screening the initial probe set.
  • Obtaining the initial probe set includes: determining a reference sequence of the gene region and, starting from one end of the reference sequence, sequentially taking DNA fragments along the reference sequence until the other end is reached. Each DNA fragment is one initial probe, and all of the DNA fragments together constitute the initial probe set. The DNA fragments may overlap completely, overlap partially or not overlap at all, and the initial probe set covers the gene region at least once.
  • The reference sequence of the gene region can be obtained from a reference genome. For example, the corresponding gene regions are taken from the human reference genome HG19, and together these regions on HG19 constitute the reference sequence of the gene region. HG19 can be downloaded from the NCBI database.
  • The initial probe set is designed with an iterative algorithm: determine the position of the gene region on the reference genome and obtain the reference sequence of the gene region; copy the reference sequence starting from its first nucleotide to obtain a first DNA fragment; copy it starting from the second nucleotide to obtain a second DNA fragment; copy it starting from the third nucleotide to obtain a third DNA fragment; and continue in this way until one end of the Nth DNA fragment exceeds the reference sequence. Each DNA fragment is one initial probe, and N is the total number of initial probes in the set. The result is an initial probe set that covers the entire target gene region, from which the final probes are obtained by screening.
  • Further screening of the initial probe set comprises: aligning the DNA fragments (the initial probe set) with the reference sequence to obtain the number of alignment positions of each DNA fragment on the reference sequence, and filtering out DNA fragments that align at more than one position.
  • So that the final probes can capture the gene regions in the same reaction system, and/or so that the captured gene regions can be eluted together under the same reaction conditions, the initial probe set is further screened by removing DNA fragments whose GC content falls outside 35-70%.
  • Capturing the nucleic acid fragments with the kit to obtain the target region, and sequencing the target region to obtain sequencing data, include: (h) end-repairing the nucleic acid fragments to obtain end-repaired fragments; (i) adding base A to both ends of the end-repaired fragments to obtain sticky-end fragments; (j) ligating adapters to both ends of the sticky-end fragments to obtain adapter-ligated fragments; (k) performing a first amplification of the adapter-ligated fragments to obtain a first amplification product; (l) capturing the first amplification product with the kit to obtain the target region; (m) performing a second amplification of the target region to obtain a second amplification product; and (n) sequencing the second amplification product to obtain the sequencing data.
  • the sequencing library construction included in the method of this aspect of the invention is particularly suitable for the construction of a sequencing library containing a trace amount of nucleic acid in a sample.
  • The sample is a plasma sample containing a trace amount of free DNA, including an extremely small amount of target free DNA. The first amplification brings the amount of nucleic acid up to what chip/probe hybridization capture requires. Chip hybridization captures a certain amount of nucleic acid, and the second amplification re-amplifies the captured target fragments to meet the requirements of sequencing and quality control testing.
  • This library construction method is particularly suitable for sequencing library construction of samples with a total free nucleic acid of not less than 10 ng or a conventional tissue genomic DNA of not less than 1 μg.
  • Sequencing can be performed on known platforms, including but not limited to Illumina's HiSeq 2000/2500 platform, Life Technologies' Ion Torrent platform and single-molecule sequencing platforms.
  • Sequencing can be single-end or paired-end. In one embodiment of the invention, paired-end sequencing is used, and the resulting sequencing data consist of multiple read pairs.
  • Detecting the variation in the target region based on the sequencing data comprises: performing a first alignment with the reference sequence to obtain a first alignment result; performing a second alignment with a portion of the reference sequence to obtain a second alignment result, the portion of the reference sequence including each known InDel site in the target region reference sequence and the 1000 bp of reference sequence upstream and downstream of each known InDel site; and, based on the first alignment result and the second alignment result, simultaneously detecting SNP, InDel, SV and CNV variations in the target region.
  • The second alignment is a local alignment.
  • The first alignment is a conventional global alignment and may be performed with software such as SOAP or BWA under default settings to obtain the first alignment result, which includes the matching position of each read on the reference sequence and its match information.
  • Performing the second alignment means locally re-aligning, on the basis of the first alignment result, all reads near all known InDels in the reference sequence corresponding to the captured gene region. This eliminates errors from the first alignment and improves the accuracy of subsequent variant detection. The second alignment can be performed using the GATK software ( https://www.broadinstitute.org/gatk/ ).
  • The SNP and InDel variations are detected simultaneously with the GATK UnifiedGenotyper software.
  • With the variation detecting method of the present invention, low-frequency mutations with a mutation frequency of 1% can be detected accurately.
  • With the tumor screening method of this aspect of the invention, low-frequency mutations with a mutation frequency of 1% can be detected accurately, which facilitates early screening of tumors, assists clinical diagnosis, and allows tumors to be prevented, treated, or monitored in time.
  • The sequencing data are filtered before the first alignment. The filtering includes removing reads in which the proportion of uncertain bases (N) exceeds 10% and/or reads in which bases with a base quality value of not more than 5 account for not less than 50% of the read.
  • Read pairs in the first alignment result whose two reads are identical are removed.
  • a portion of the reference sequence includes each known InDel site in the target region reference sequence, and a reference sequence of 1000 bp each upstream and downstream of each known InDel site.
  • The reference sequence can be a human reference genome, that is, a known human genome sequence such as HG19, which can be downloaded from NCBI. The target region reference sequence is the part of the reference genome sequence that matches the target region.
  • the at least one of the detected variations satisfies the following (i) or (ii), and the sample to be tested is determined to be a positive sample, wherein the positive sample refers to a tumor
  • The sample may be a lung cancer, colorectal cancer, stomach cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer or liver cancer sample.
  • The two judgment conditions were determined by the inventors by combining current database information, a large body of literature reports, and a large number of positive and negative samples. Both are statistically significant, and the latter is stricter than the former.
  • There are more than 30 positive or negative control samples. The data of the control samples can be obtained by extracting and sequencing the nucleic acid of the control samples oneself, or from sample sequencing data published by others or held in public databases.
  • Using data from multiple control samples makes the decision conditions and results statistically significant and more credible.
  • The result of the determination based on either of the two criteria can be used for clinical tumor diagnosis and screening, and to understand the type, likelihood and progression of cancer in the sample.
  • The read support of the variant site in the sample to be tested is significantly different from the read support of the same site in the normal control sample (negative control sample). The read support can be the number of reads supporting the variation, or the proportion of reads supporting the variation among the reads aligned to the site; in one embodiment of the invention, the latter is used for comparison. A significant difference means a substantial difference. For example, for variant A in the sample to be tested, suppose the read support ratio in multiple positive samples (cancer samples) is 5/400 (5 variant reads out of 400 reads in total), that is, the average mutation frequency of the site in positive samples is 1.25%, and the read support ratio in multiple negative control samples is 1/200 (1 variant read out of 200 reads in total), that is, the average mutation frequency in negative control samples is 0.5%. If the mutation frequency of the site in the sample to be tested is closer to 1.25%, for example 0.9%, the significant or substantial difference is reached.
  • The difference between the read support ratio (mutation frequency) of variant site A and the mutation frequency of the site in the negative control samples can also be assessed with, for example, a z-test or a t-test; if the difference is significant (p ≤ 0.05), a significant difference is reached.
  • The inventors have found that the concentration of plasma free DNA (cfDNA) in the peripheral blood of normal individuals is 1-100 ng/mL, and that the circulating tumor DNA (ctDNA) content in the peripheral blood of tumor patients increases significantly because genomic fragments produced by tumor cell secretion, apoptosis or necrosis enter the blood, so that the mean ctDNA level in the peripheral blood of tumor patients can reach 180 ng/mL. Regular monitoring of the changes and mutations of ctDNA in the peripheral blood of normal individuals and tumor patients can be applied to at least one of the following: early diagnosis of tumors, hereditary tumor prediction and status evaluation, early detection of tumor progression, tumor detection and evaluation, genetic variation analysis for targeted therapy and chemotherapy, detection of trace residues of tumor pathogenic genes, and mutation analysis of tumor drug-resistance genes.
  • Minimally invasive: the subject only needs to provide 5-10 mL of peripheral blood.
  • Real-time: the subject can be tested multiple times. Testing can be done regularly during early screening to monitor tumor risk, and tumor patients can be tested at any time after surgery or after chemotherapy/targeted therapy to analyze the prognosis of the surgery and medication.
  • FIG. 1 is a schematic structural view of a target region variation detecting device in an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a target region variation detecting device in an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a target region variation detecting device in an embodiment of the present invention.
  • Fig. 4 is a view showing the configuration of a target region variation detecting device in an embodiment of the present invention.
  • The variant types "SNP" (SNV), "CNV", "insertion/deletion" (indel) and "structural variation" (SV) in the present invention have their usual definitions, but the size of each variation is not particularly limited in the present invention, so the types overlap to some extent: when an insertion/deletion involves a large fragment or even a whole chromosome, for example, it is also a copy number variation (CNV) or a chromosomal aneuploidy, and also belongs to SV. This overlap in size between the variation types does not prevent one skilled in the art from carrying out the methods and/or devices of the present invention and achieving the results described.
  • The "reference sequence" in the present invention is a known genomic sequence or at least a part of one. "First", "second" and the like are used merely for convenience of description and are not to be construed as indicating or implying relative importance or a sequential relationship.
  • "Multiple" means two or more.
  • Obtaining the kit of one aspect of the invention, and carrying out the method and/or device of one aspect of the invention, generally comprises: designing a target region capture probe/chip; constructing a library from a trace sample, followed by hybridization and sequencing; bioinformatics analysis of the raw sequencing data; and interpretation of the variant data.
  • An iterative algorithm is used to design a target region capture probe/chip that can be used for, or assist in, early screening and diagnosis of common cancers or of a specific cancer.
  • The target region includes driver genes related to common cancers or specific cancers, high-frequency mutated genes, important genes in 12 cancer-related signaling pathways, and genes related to targeted drugs and chemotherapy drugs.
  • the designed probe/chip can specifically identify the target region from a complex genome and capture the target region with high specificity and high coverage on the same set of probes/chips.
  • cfDNA: plasma free DNA.
  • ctDNA: DNA derived from tumor cells.
  • Library adapter (linking adapter): a designed base sequence that binds to a primer when the cfDNA/ctDNA library is amplified, so that the DNA can be amplified, and binds to the sequencing primer during sequencing, so that the sequencing primer can anneal at the site to be sequenced and assist DNA sequencing.
  • the library was subjected to the first round of PCR amplification
  • the library is quality-controlled after amplification and hybridized with the above probe/chip;
  • the hybrid library is subjected to a second round of PCR amplification
  • Sequencing was performed on an Illumina HiSeq2500/2000 at a sequencing depth of over 2000X.
  • SOAPnuke filter: remove low-quality reads.
  • Individualized interpretation of the variant data after bioinformatics analysis: with reference to the constructed tumor database and related literature, the variations detected in the subject are analyzed to help judge the status of tumor-related genes in the tested sample, whether there is a risk of cancer, and whether early tumor tissue is benign or malignant, and, in combination with clinical test results, to help choose the most appropriate prevention and treatment.
  • test results obtained using the method/device of the present invention are described in detail below in connection with specific individual samples.
  • the following examples are merely illustrative of the invention and are not to be construed as limiting the invention.
  • Reagents, sequences (adapters, tags and primers), software and instruments not specifically described in the following examples are conventional commercial products or open source, such as the HiSeq2000 sequencing platform and the library construction kits from Illumina.
  • CANPer is a liquid phase chip.
  • The CANPer chip includes the driver genes of common high-incidence cancers, high-frequency mutated genes, and important genes in 12 cancer-related signaling pathways, 547 genes in total covering 300 Kb. The list of genes is shown in Table 1.
  • The peripheral blood plasma of a patient with early-stage lung nodules was used as the sample to be tested.
  • The sample was from Tianjin Maternal and Child Health Hospital. The procedure is as follows:
  • The separated plasma and remaining blood cells are stored in a -80 °C freezer, avoiding repeated freezing and thawing.
  • Centrifuge the empty filter column at 8000 rpm for 1 min;
  • Centrifuge the filter column at 14000 rpm for 3 min;
  • The above gene chip CANPer-1.75M, custom-made by Roche, was used, and hybridization capture and elution were carried out according to the chip manufacturer's instructions. Finally, the magnetic beads were resuspended in 21 μL of ddH2O for elution.
  • Sequencing was performed on an Illumina HiSeq2500 using the PE101+8+101 program.
  • The sequencing was carried out according to the manufacturer's operating instructions (see the official Illumina/Solexa cBot documentation).
  • SOAPnuke filter: remove reads with an N content ≥ 10% and reads in which bases with a base quality value ≤ 5 account for > 50% of the read;
  • Filt_bam: remove reads with ≥ 3 mismatched bases;
  • Collect QC statistics such as the chip capture efficiency, number of effective reads, average depth, duplication rate, coverage, and uncovered intervals;
  • The screening parameters used were: sequencing depth ≥ 10X; mutation frequency in negative (normal) samples ≤ 2%; mutation frequency in positive samples ≥ 1%; number of reads supporting the variation in the data of the sample to be tested ≥ 3; and a significant difference from the read support ratio of the normal control (somatic cells) (p ≤ 0.05);
  • Annotate each variation with its function, number of supporting reads, mutation frequency, amino acid change, and records in the COSMIC database; help determine possible sources of disease based on the variation.
  • The killing effect of chemotherapeutic drugs on tumor cells is significantly correlated with the expression and/or polymorphism of a specific gene or group of genes.
  • Testing the related genes to predict the efficacy of chemotherapeutic drugs and select appropriate drugs for individualized chemotherapy has become a reasonable way to improve efficacy and reduce ineffective treatment.
  • The PharmGKB database is used to integrate all current chemotherapeutic drugs with the genes related to their efficacy and to the prediction and evaluation of therapeutic effect, forming a database for individualized interpretation of chemotherapy drugs.
  • The chemotherapy data are integrated into the individualized tumor analysis pipeline to automate the interpretation of chemotherapy drugs.
  • Targeted drugs are highly effective and have few side effects in tumor therapy, but they depend on their targets (protein, DNA, etc.), so target analysis must be performed before it can be determined whether a patient can take the drug. Current FDA-approved targeted drugs, as well as drugs in phase III and IV clinical trials, are integrated. Based on the NCCN clinical guidelines and clinical pharmacogenomic research, the relationships between drug target genes and targeted drugs are collated to form a database for individualized interpretation of targeted drugs.
  • A missense mutation at amino acid 451 of the EGFR gene was detected in the sample. It lies in exon 12, in the extracellular domain of the protein. The mutation is not recorded in the COSMIC database, but the p.[R451H] missense mutation is recorded once and is reported to be associated with lung cancer (18948947). Functional prediction shows that the mutation is deleterious and may affect gene function.
  • The human epidermal growth factor receptor (EGFR), an expression product of the proto-oncogene c-erbB1, is a member of the receptor tyrosine kinase family.
  • EGFR is mainly located on the cell membrane surface, and ligand binding activates phosphorylation of its own tyrosine residues. Autophosphorylation activates downstream signaling pathways, including the MAPK, PI3K and JNK pathways, and induces cell proliferation and differentiation. Mutation or abnormal expression of EGFR is present in many solid tumors.
  • The LungPer chip includes lung cancer-related driver genes, high-frequency mutated genes, important genes in 12 cancer-related signaling pathways, and genes related to targeted drugs and chemotherapeutic drugs, 145 genes in total covering 250 Kb. The list of genes is shown in Table 2.
  • A missense mutation at amino acid 451 of the EGFR gene was detected in the sample. It lies in exon 12, in the extracellular domain of the protein. The mutation is not recorded in the COSMIC database, but the p.[R451H] missense mutation is recorded once and is reported to be associated with lung cancer (18948947). Functional prediction shows that the mutation is deleterious and may affect gene function.
  • The ColorectalPer chip includes colorectal cancer-related driver genes, high-frequency mutated genes, important genes in 12 cancer-related signaling pathways, and genes related to targeted drugs and chemotherapeutic drugs, 60 genes in total (shown in Table 3) covering 123 Kb.
  • the samples are from Tianjin Maternal and Child Health Hospital.
  • A KRAS p.[Gly12Asp] missense mutation was detected in this sample. It is recorded 10,303 times in the COSMIC database, and about 60% of the records are reported to be associated with the pathogenesis of colorectal cancer.
  • KRAS codon 12 is located in the GTP domain and is the most commonly mutated site in KRAS.
  • KRAS, a member of the Ras gene family, encodes the P21 protein and functions in the MAPK signaling pathway. It is an oncogene that binds to GDP/GTP and promotes GTPase activity. When KRAS is mutated, it cannot be hydrolyzed by hydrolase and remains continuously activated, which up-regulates RAF/MAPK and transmits a variety of survival pathway signals, allowing cells to overgrow and proliferate and to resist EGFR-TKIs. Mutations can lead to a variety of malignancies, including lung cancer, mucinous adenoma, pancreatic ductal cancer, and colon cancer.
  • The most common way in which the KRAS gene is activated is point mutation, occurring at codons 12, 13, 61 and 146 near the N-terminus, with codon 12 mutations being the most common. Different mutation sites activate the P21 protein by different mechanisms.
  • A codon 12 mutation can attenuate the intrinsic GTPase activity of P21, reduce apoptosis, and weaken contact inhibition of cells.
  • The encoded protein is a component of the DNA mismatch repair (MMR) system. It forms two different heterodimers, MutSα (MSH2-MSH6 heterodimer) and MutSβ (MSH2-MSH3 heterodimer), which can bind to DNA mismatches.
  • MutSα or MutSβ forms a ternary complex with the MutLα heterodimer, which is responsible for directing downstream MMR events, including strand recognition, excision, and resynthesis.
  • The binding and hydrolysis of ATP play an important role in the mismatch repair function, and the ATPase activity is associated with MutSα.
  • MutSα can also play a role in DNA homologous recombination repair. This gene is associated with hereditary nonpolyposis colorectal cancer type I and endometrial cancer.
  • The WCNPer chip includes driver genes related to gynecological reproductive tract tumors, high-frequency mutated genes, important genes in 12 cancer-related signaling pathways, and genes related to targeted drugs and chemotherapeutic drugs, 43 genes in total (shown in Table 4) covering 300 Kb.
  • The peripheral blood plasma of the subject was used as the sample, and the sample was from Tianjin Maternal and Child Health Hospital. The experiment and data analysis were carried out as in Example 1.
  • A BRAF p.[G469V] missense mutation was detected in this sample. It is recorded 17 times in the COSMIC database and has been found in tumors of the lung, large intestine, biliary tract, upper respiratory tract, and esophagus.
  • BRAF codon 469 is located in the ATP-binding domain of the protein kinase domain.
  • A melanoma study has shown that this is an activating mutation, which may switch BRAF from an inactive state to an active state or abnormally activate the BRAF signaling pathway, and may be related to the occurrence and development of disease.
  • the BRAF gene encodes a serine threonine protein kinase in the MAPK pathway, which transduces the signal from Ras to MEK1/2, thereby participating in the regulation of cell function and affecting cell sorting, differentiation and secretion. Mutations produced by this gene are associated with many types of cancer, such as colorectal cancer, lung cancer, liver cancer, pancreatic cancer, thyroid cancer, ovarian cancer, and the like. In ovarian cancer, the mutation frequency of BRAF gene is 8%, which is the driver gene in the development of ovarian cancer.
  • TP53 codon 266 is located in the sequence-specific DNA-binding domain, which is an important domain for TP53 function. This mutation may impair or abolish the full function of TP53.
  • TP53 is a driver gene in tumorigenesis and tumor development, and impairment or loss of its full function may be related to the development of the disease.
  • the TP53 gene is one of the genes most recently found to be associated with tumors. As an important tumor suppressor gene, it plays a key role in cell cycle regulation, DNA damage repair, cell differentiation, apoptosis and senescence. The TP53 gene is involved in more than 50% of human malignancies. Clinical studies have confirmed that 95.1% of p53 point mutations in tumors occur mainly at the highly conserved sites 175, 245, 248, 249, 273 and 282. Many tumor treatments are currently achieved by regulating TP53 protein. The TP53 gene has clinical application in a variety of cancers. Breast cancer patients with mutations in TP53 (exons 5-8) have a poor prognosis and tamoxifen has a significantly reduced efficacy. Mutation and loss of function of TP53 is one of the most common genetic abnormalities in ovarian cancer.
  • Important variations related to gynecological diseases were detected in the subject; combined with the clinical diagnosis, the risk of a gynecological tumor and whether it is benign or malignant can be judged.
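The partial reference used for the second (local) alignment is described above as every known InDel site plus 1000 bp of reference sequence on each side. A minimal sketch of how such windows might be computed is given below. The function name `indel_windows`, the 1-based inclusive coordinate convention, the optional `chrom_lengths` clipping and the merging of overlapping windows are all illustrative assumptions, not details taken from the description.

```python
FLANK = 1000  # bp kept upstream and downstream of each known InDel site


def indel_windows(sites, flank=FLANK, chrom_lengths=None):
    """Return merged re-alignment windows around known InDel sites.

    sites: iterable of (chrom, start, end), 1-based inclusive coordinates
           (an assumed convention for this sketch).
    chrom_lengths: optional dict used to clip windows at chromosome ends.
    """
    per_chrom = {}
    for chrom, start, end in sites:
        lo = max(1, start - flank)
        hi = end + flank
        if chrom_lengths and chrom in chrom_lengths:
            hi = min(hi, chrom_lengths[chrom])
        per_chrom.setdefault(chrom, []).append((lo, hi))

    merged = {}
    for chrom, intervals in per_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for lo, hi in intervals[1:]:
            last_lo, last_hi = out[-1]
            if lo <= last_hi + 1:
                # Overlapping or adjacent windows are fused (assumption:
                # each base should appear only once in the partial reference).
                out[-1] = (last_lo, max(last_hi, hi))
            else:
                out.append((lo, hi))
        merged[chrom] = out
    return merged
```

For instance, two known InDels at positions 5000-5002 and 5500 on the same chromosome would yield a single window from 4000 to 6500 rather than two overlapping ones.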
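The comparison of read support ratios described above (for example by a z-test, with significance at p ≤ 0.05) can be sketched as a pooled two-proportion z-test. This is only one possible reading: the description does not say which form of z-test is used, and the function name `two_proportion_z_test` and the two-sided p-value are assumptions made for illustration.

```python
import math


def two_proportion_z_test(alt_a, depth_a, alt_b, depth_b):
    """Pooled two-proportion z-test on variant-supporting read counts.

    alt_*: number of reads supporting the variation
    depth_*: total number of reads covering the site
    Returns (z, two_sided_p).
    """
    p_a = alt_a / depth_a
    p_b = alt_b / depth_b
    pooled = (alt_a + alt_b) / (depth_a + depth_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / depth_a + 1 / depth_b))
    if se == 0:
        # No variant reads at all, or only variant reads: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# The illustrative counts from the text: 5/400 versus 1/200.
z_low, p_low = two_proportion_z_test(5, 400, 1, 200)
# The same frequencies at ten times the depth: 50/4000 versus 10/2000.
z_deep, p_deep = two_proportion_z_test(50, 4000, 10, 2000)
```

With the counts used in the text (5/400 against 1/200) this test does not reach p ≤ 0.05, whereas the same frequencies at ten times the depth do, which is consistent with the deep sequencing (over 2000X) used in the examples.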

Abstract

Disclosed is a method for detecting variation in a target region, comprising: obtaining nucleic acid from a sample to be tested, wherein the nucleic acid is composed of a plurality of nucleic acid fragments; capturing the nucleic acid fragments using a kit to obtain the target region; sequencing the target region to obtain sequencing data composed of a plurality of reads; and detecting variation in the target region on the basis of the sequencing data. The kit comprises probes capable of specifically recognizing the following predetermined region: the gene regions of at least 10 of the 547 genes in Table 1. Also disclosed are a device for detecting variation in the target region and a method for screening for tumors.

Description

Tumor screening method, and method and device for detecting variation in a target region
Technical field
The present invention relates to the field of biomedicine. Specifically, it relates to a method for detecting variation in a target region, a device for detecting variation in a target region, a tumor screening method, and a tumor screening device.
Background art
Tumors are classified as benign or malignant. Malignant tumors, also called cancers, are diseases caused by malfunction of the mechanisms that control cell growth and proliferation, and cancer is one of the diseases that threaten human health. Besides growing uncontrollably, cancer cells invade the surrounding normal tissue locally and can even spread to other parts of the body through the circulatory or lymphatic system.
Current means of screening, diagnosing and monitoring cancer still need improvement.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the prior art, or at least to provide a means of doing so.
According to one aspect, the present invention provides a method for detecting variation in a target region, the method comprising: (1) obtaining nucleic acid from a sample to be tested, the nucleic acid consisting of a plurality of nucleic acid fragments derived from fragmented genomic DNA and/or free DNA; (2) capturing the nucleic acid fragments with a kit to obtain the target region; (3) sequencing the target region to obtain sequencing data consisting of a plurality of reads; and (4) detecting variation in the target region based on the sequencing data; wherein the kit comprises probes capable of specifically recognizing the following predetermined region: the gene regions of at least 10 of the 547 genes in Table 1.
Table 1
[Table 1 is reproduced as images in the original publication: Figures PCTCN2014093871-appb-000001 to PCTCN2014093871-appb-000003]
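Before variation is detected in step (4), the description elsewhere states that reads are filtered ahead of the first alignment: reads whose proportion of uncertain bases (N) exceeds 10% are removed, as are reads in which bases with a quality value of not more than 5 make up not less than 50% of the read. The sketch below applies those two thresholds in plain Python. It is not SOAPnuke itself; the function names, the Phred+33 quality encoding and the rule that a pair is dropped when either mate fails are assumptions made for illustration.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def keep_read(seq, quals, max_n_frac=0.10, low_q=5, low_q_frac=0.50):
    """Return True if a read passes both filters.

    Removed: reads with more than `max_n_frac` N bases, and reads in which
    bases with quality <= `low_q` make up at least `low_q_frac` of the read.
    """
    length = len(seq)
    if length == 0:
        return False
    n_frac = seq.upper().count("N") / length
    low_frac = sum(q <= low_q for q in quals) / length
    return n_frac <= max_n_frac and low_frac < low_q_frac


def filter_pairs(pairs):
    """Keep a read pair only if both mates pass (an assumed pairing rule).

    pairs: iterable of ((seq1, quals1), (seq2, quals2)).
    """
    return [(r1, r2) for r1, r2 in pairs if keep_read(*r1) and keep_read(*r2)]
```

A read with exactly 10% N bases is kept under this reading, while a read in which exactly half of the bases have quality 5 or lower is removed.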
In one embodiment of the present invention, the predetermined region is the gene regions of at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500 or all 547 of the 547 genes. The combination of gene regions that the kit probes in the method of the present invention can specifically recognize was obtained by the inventors through repeated collection, screening and experimental combination; these gene regions are associated with the occurrence or development of common tumors. The common tumors include lung cancer, colorectal cancer, gastric cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer and liver cancer. The method of this aspect of the invention can obtain the gene sequences related to multiple common cancers in a single, simple and highly specific step, and detect and analyze them; the results can assist early screening for multiple common cancers and increase the possibility and effectiveness of early intervention in tumor development. At present, most cancers, such as lung cancer, liver cancer and gastric cancer, are already at an advanced stage when pathologically confirmed in hospital, so that earlier treatment is delayed and the chance of cure is greatly reduced.
In one embodiment of the present invention, the predetermined region is the gene regions of the 145 genes listed in Table 2 among the 547 genes. The gene regions of the 145 genes of Table 2 that the probes can specifically recognize were obtained by the inventors through repeated collection, screening and experimental combination; this combination of gene regions is associated with the occurrence and development of lung cancer. With the probes in this kit, all lung cancer-related gene sequences can be obtained in a single, simple and highly specific step, and the information obtained by analyzing these gene sequences can assist early screening and diagnosis of lung cancer.
Table 2
KRAS       ALK        ROS1       ADAM23     KIAA0907   KRTAP5-5   MAP1B
EGFR       RB1        FGFR3      DNMT3B     GAB1       TSHZ3      ZNF814
TP53       PDGFRA     FGFR4      SDHAP2     OR10Z1     XIRP2      ZFHX4
BRAF       KDR        JAK3       DHX9       CNTNAP3B   NYAP2      ZNF804A
PIK3CA     FBXW7      APC        CSNK2A1    IL32       NUDT11     OR5D18
ERBB2      HRAS       FRG1B      CNTN5      NAV3       SNAPC4     ZNF479
CDKN2A     JAK2       CHEK2      ATXN3      TNRC6A     ZNF598     OR51V1
NRAS       ERBB4      KLK1       CLIP1      FAM135B    KIAA2022   OR4N2
STK11      KIT        NBPF10     OR4M2      VGLL3      DDX11L2    OR4C15
NFE2L2     SMAD4      PARG       OR10G8     KRTAP4-11  MUC6       OR14C36
CTNNB1     FGFR2      FBN2       PAPPA2     ANAPC1     ATXN1      CROCC
MET        DDR2       HSD17B7P2  OR8H2      FAM47C     MUC16      OR2T2
PTEN       ATM        WASH2P     PBX2       AKAP6      BEST3      PCDH11X
AKT1       RET        POTEC      POLDIP2    ZNF804B    DSPP       REG3A
KEAP1      NOTCH1     EEF1B2     SLC6A10P   ZEB1       MB21D2     REG1B
DDX11      EPB41L4A   TBX6       PRB2       OR2T34     NTRK3      LRRIQ3
DNAH8      OR2M2      WDR62      CNTNAP2    LPA        NTRK1      EPHA5
OR2B11     OR4C16     DCAF4L2    CDH10      MMP27      NF1        OR5L2
OR4K2      KCNB2      EPHA3      CDH12      VAV3       INHBA      OR2T33
FAM47A     STAG3L2    PTPRD      RALGAPB    THSD4      FGFR1      GNA15
RYR2       KRTAP4-8   NOTCH2     FOLH1      OR4N4
In one embodiment of the present invention, the predetermined region is the gene regions of the 60 genes listed in Table 3 among the 547 genes. The gene regions of the 60 genes of Table 3 that the probes can specifically recognize were obtained by the inventors through repeated collection, screening and experimental combination; this combination of gene regions is associated with the occurrence and development of colorectal cancer. With the probes in this kit, all colorectal cancer-related gene sequences can be obtained in a single, simple and highly specific step, and the information obtained by analyzing these gene sequences can assist early screening and diagnosis of colorectal cancer.
Table 3
KRAS       SRC        TLR3       EP300      TMPRSS13   EPHA5
BRAF       PTEN       MC4R       CYLD       PHF2       EPHA3
APC        AXIN1      MLH1       FBN2       OPRD1      PTPRD
TP53       FLG        AKT1       NF1        LILRB5     NTRK3
PIK3CA     LIG1       CASD1      ASXL1      COL18A1    NTRK1
CTNNB1     MAP2K1     PTCH1      SMAD4      LARP4B     ALK
NRAS       PIK3R1     ADAMTS18   IRF5       DMKN       ROS1
EGFR       ERBB2      MSH2       DOCK3      ROBO2      RET
FBXW7      STK11      BAP1       MYOM1      KCNN3      PDGFRA
ARID1A     IL7R       CTNNA1     NEFH       INHBA      FGFR1
In one embodiment of the invention, the predetermined region is the gene regions of the 43 genes listed in Table 4, which are among the 547 genes. The inventors arrived at this set of 43 gene regions, which the probes can specifically recognize, through repeated rounds of collection, screening, and experimental combination; the combination is associated with the occurrence and development of gynecological reproductive tract tumors. These reproductive tract tumors include ovarian cancer, endometrial cancer, and cervical cancer. With the probes in this kit, all of the reproductive-tract-tumor-related gene sequences can be captured in a single, simple, and highly specific step, and the information obtained by analyzing these sequences can assist early screening and diagnosis of reproductive tract tumors.
Table 4
Figure PCTCN2014093871-appb-000004
Figure PCTCN2014093871-appb-000005
In one embodiment of the invention, the probes are 25-300 nt long, preferably 50-250 nt, and more preferably 80-120 nt.
To obtain probes that can specifically capture the gene regions simultaneously in a single reaction system, in one embodiment of the invention the probes are determined by first obtaining an initial probe set and then screening it. Obtaining the initial probe set comprises: determining the reference sequence of the gene regions and, starting from one end of the reference sequence, successively taking DNA fragments along the reference sequence until the other end is reached. Each DNA fragment is one initial probe, and all of the DNA fragments together constitute the initial probe set; the fragments may overlap one another completely, partially, or not at all, and the initial probe set covers the gene regions at least once. The reference sequence of the gene regions can be taken from a reference genome: for example, the corresponding gene regions are extracted from the human reference genome HG19, which can be downloaded from the NCBI database, and together they constitute the reference sequence of the gene regions. In one embodiment of the invention, the initial probe set is designed with an iterative algorithm, comprising: determining the position of the gene regions on the reference genome and obtaining their reference sequence; copying the reference sequence starting from its first nucleotide to obtain the first DNA fragment, from its second nucleotide to obtain the second DNA fragment, from its third nucleotide to obtain the third DNA fragment, and so on, until one end of the N-th DNA fragment extends beyond the reference sequence. Each DNA fragment is one initial probe, all of the DNA fragments constitute the initial probe set, and N is the total number of initial probes in the set; this yields an initial probe set that fully covers the target gene regions. To give the final probes high specificity, in one embodiment of the invention the initial probe set is further screened by aligning the DNA fragments (the initial probe set) to the reference sequence, counting the number of alignments of each fragment on the reference sequence, and filtering out fragments that align more than once. To enable the final probes to capture the gene regions in the same reaction system, and/or to allow the captured gene regions to be eluted together under the same reaction conditions, the initial probe set is further screened by removing DNA fragments whose GC content falls outside 35-70%.
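The tiling and screening steps described above can be illustrated with a minimal Python sketch. It is not the inventors' implementation: the function names, the default probe length of 100 nt, and the use of exact substring matching in place of a real aligner are all assumptions made for illustration, and the sketch stops at the last full-length fragment rather than letting a fragment run past the end of the reference.

```python
def tile_probes(reference: str, probe_len: int = 100, step: int = 1) -> list[str]:
    """Copy fixed-length fragments starting at each successive nucleotide.

    Stops at the last start position that still yields a full-length fragment
    (an assumption; the text stops once a fragment would overrun the reference).
    """
    return [reference[i:i + probe_len]
            for i in range(0, len(reference) - probe_len + 1, step)]


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def count_occurrences(fragment: str, reference: str) -> int:
    """Count exact, possibly overlapping matches (a stand-in for an aligner)."""
    count, start = 0, reference.find(fragment)
    while start != -1:
        count += 1
        start = reference.find(fragment, start + 1)
    return count


def screen_probes(probes, reference, gc_min=0.35, gc_max=0.70):
    """Keep probes that match the reference at most once and have 35-70% GC."""
    return [p for p in probes
            if count_occurrences(p, reference) <= 1
            and gc_min <= gc_fraction(p) <= gc_max]
```

In practice the specificity check would be run against the whole reference genome with an alignment tool rather than by substring search; the sketch only shows the order of the two filters.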
In one embodiment of the invention, steps (2) and (3) comprise: (h) end-repairing the nucleic acid fragments to obtain end-repaired fragments; (i) adding a base A to both ends of the end-repaired fragments to obtain sticky-end fragments; (j) ligating adapters to both ends of the sticky-end fragments to obtain adapter-ligated fragments; (k) performing a first amplification on the adapter-ligated fragments to obtain a first amplification product; (l) capturing the first amplification product with the kit to obtain the target region; (m) performing a second amplification on the target region to obtain a second amplification product; and (n) sequencing the second amplification product to obtain the sequencing data. The sequencing library construction included in steps (2) and (3) is particularly suitable for samples containing trace amounts of nucleic acid. In one embodiment of the invention, the sample is a plasma sample containing trace amounts of free DNA, of which the target free DNA is only an extremely small part. The first amplification brings the amount of nucleic acid up to what chip/probe hybridization capture requires; because hybridization capture loses a certain amount of nucleic acid, the second amplification re-amplifies the captured target fragments so that they meet the requirements of sequencing and quality control. This library construction method is particularly suitable for samples with not less than 10 ng of total free nucleic acid or not less than 1 μg of conventional tissue genomic DNA. The sequencing of the constructed library included in steps (2) and (3) can be performed on known platforms, including but not limited to Illumina's Hiseq2000/2500 platform, Life Technologies' Ion Torrent platform, and single-molecule sequencing platforms. Sequencing can be single-end or paired-end; in one embodiment of the invention paired-end sequencing is used, and the resulting sequencing data consist of multiple read pairs. The target region library constructed and sequenced according to steps (2) and (3) yields high-quality raw data, which supports accurate downstream detection and analysis.
In one embodiment of the invention, step (4) comprises: performing a first alignment of the sequencing data against a reference sequence to obtain a first alignment result; performing a second alignment of the first alignment result against a part of the reference sequence to obtain a second alignment result, where the part of the reference sequence comprises each known InDel site in the target region reference sequence together with 1000 bp of reference sequence upstream and downstream of each such site; and, based on the first and second alignment results, simultaneously detecting SNP, InDel, SV, and CNV variations in the target region. Here the first alignment is a conventional global alignment, which can be performed with software such as, but not limited to, SOAP or BWA at default settings; the first alignment result includes the position at which each read matches the reference sequence and information on how it matches. The second alignment is a local alignment: based on the first alignment result, all reads near every known InDel in the reference sequence corresponding to the captured gene regions are locally realigned, which removes errors from the first alignment and improves the accuracy of subsequent variant detection. The second alignment can be performed with the GATK software (https://www.broadinstitute.org/gatk/). In one embodiment of the invention, the SNP and InDel variations are detected simultaneously with the GATK UnifiedGenotyper software. The variant detection method of this aspect of the invention can accurately detect low-frequency mutations with a mutation frequency of 1%.
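As a rough illustration of how the "part of the reference sequence" used for the second alignment could be derived, the following Python sketch builds windows of 1000 bp on each side of every known InDel site. The function name, the 1-based inclusive coordinates, the clipping to the target region, and the merging of overlapping windows are assumptions for illustration; the text specifies only the 1000 bp flanks.

```python
def realignment_windows(indel_positions, region_start, region_end, flank=1000):
    """Return merged (start, end) windows of +/- flank bp around known InDel sites.

    Coordinates are treated as 1-based and inclusive, and windows are clipped
    to [region_start, region_end]; both conventions are assumptions.
    """
    windows = []
    for pos in sorted(indel_positions):
        start = max(region_start, pos - flank)
        end = min(region_end, pos + flank)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return [tuple(w) for w in windows]
```

Reads from the first alignment that fall inside these windows would then be handed to the local realignment step.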
To make the variant detection results more accurate and reliable, in one embodiment of the invention the sequencing data are filtered before the first alignment. The filtering comprises removing reads in which the proportion of undetermined bases exceeds 10% and/or reads in which the proportion of bases with a base quality value of not more than 5 is not less than 50%. Optionally, before the second alignment, read pairs in the first alignment result in which the two reads of the pair are identical are removed. The part of the reference sequence comprises each known InDel site in the target region reference sequence, together with 1000 bp of reference sequence upstream and downstream of each such site. When the sample to be tested is from a human, the reference sequence can be the human reference genome, that is, a known human genome sequence such as HG19, which can be downloaded from NCBI; the target region reference sequence is the part of the reference genome sequence that matches the target region.
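The read-level filter described above can be sketched in Python as follows. This is an illustrative sketch only: the function name is hypothetical, undetermined bases are assumed to be written as `N`, and base qualities are assumed to be already decoded into integer scores.

```python
def passes_filter(seq, quals, max_n_frac=0.10, low_q=5, max_low_q_frac=0.50):
    """Return True if a non-empty read survives both filters.

    A read is removed when more than 10% of its bases are undetermined ('N'),
    or when at least 50% of its bases have a quality value of 5 or lower.
    """
    n_frac = seq.upper().count("N") / len(seq)
    low_frac = sum(q <= low_q for q in quals) / len(quals)
    return n_frac <= max_n_frac and low_frac < max_low_q_frac
```

Note the asymmetric boundaries, which follow the wording of the text: exactly 10% undetermined bases is kept, while exactly 50% low-quality bases is removed.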
In one embodiment of the invention, step (4) further comprises determining that the sample to be tested is a positive sample when at least one of the detected variations satisfies (i) or (ii) below: (i) fewer than 2 supporting reads in the negative control samples and a mutation rate greater than 1% in the positive control samples; (ii) a sequencing depth of not less than 10X, support from at least 3 reads, fewer than 2 supporting reads in the negative control samples, a mutation rate greater than 1% in the positive control samples, and a read support at the variant site that differs significantly from the read support at the same site in normal control samples. A positive sample here means a tumor sample. The inventors established the two criteria by combining information from current databases and a large body of literature with detection statistics from large numbers of positive and negative samples; the criteria are statistically meaningful, and the latter is stricter than the former. Preferably, more than 30 positive or negative control samples are used. Control sample data can be obtained by extracting and sequencing nucleic acid from control samples oneself, or taken from sample sequencing data published by others or held in public databases; using data from multiple control samples makes the criteria and the results statistically meaningful and more reliable. A result determined under either of the two criteria can assist clinical tumor diagnosis and screening, and can help in understanding the type, likelihood, and progression of cancer in the individual from whom the sample was taken. Regarding the requirement that the read support of the variant site in the sample to be tested differ significantly from that at the same site in normal control samples (negative control samples): read support can be the number of reads supporting the variation, or the proportion of reads supporting the variation among the reads aligned to that site; in one embodiment of the invention the latter is used for comparison. A significant difference can mean a substantial difference. For example, for variant site A in the sample to be tested, suppose the read support ratio in multiple positive samples (cancer samples) is 5/400 (5 variant reads out of 400), so the average variation frequency at the site in positive samples is 1.25%, while the read support ratio in multiple negative control samples is 1/200 (1 variant read out of 200), so the average variation frequency in negative control samples is 0.5%. If the variation frequency of the site in the sample to be tested is closer to 1.25%, for example 0.9%, the difference counts as significant, or substantial. A significant difference can also mean statistical significance in the usual sense. For example, variant site A in the sample to be tested is measured several times to obtain several sets of alignment results for the site, and a read support ratio is obtained from each set, where read support ratio = number of reads supporting the variant site / total number of reads aligned to the site. The read support ratio (variation frequency) of variant site A in the sample to be tested is then compared with the mutation frequency of the site in the negative control samples, for example by a z-test or t-test; if the difference is significant (p ≤ 0.05), the significant-difference requirement is met.
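The two criteria can be written out as a small Python sketch. The class, field, and function names are hypothetical, rates are expressed as fractions, and "significant difference" is implemented here only as the "closer to the positive-sample frequency" reading from the worked example; a z-test or t-test, as the text also allows, could be substituted.

```python
from dataclasses import dataclass


@dataclass
class VariantEvidence:
    depth: int                  # reads aligned to the site in the test sample
    alt_reads: int              # reads supporting the variant in the test sample
    neg_control_alt_reads: int  # supporting reads in negative control samples
    pos_control_rate: float     # mutation rate in positive control samples
    neg_control_rate: float     # mean variant frequency in negative controls


def closer_to_positive(freq, pos_rate, neg_rate):
    """True if freq lies nearer the positive-control rate than the negative one."""
    return abs(freq - pos_rate) < abs(freq - neg_rate)


def meets_criterion_i(v):
    return v.neg_control_alt_reads < 2 and v.pos_control_rate > 0.01


def meets_criterion_ii(v):
    freq = v.alt_reads / v.depth if v.depth else 0.0
    return (v.depth >= 10 and v.alt_reads >= 3 and meets_criterion_i(v)
            and closer_to_positive(freq, v.pos_control_rate, v.neg_control_rate))


def sample_is_positive(variants, strict=False):
    """Positive if any variant meets the chosen criterion (ii when strict)."""
    check = meets_criterion_ii if strict else meets_criterion_i
    return any(check(v) for v in variants)
```

With the figures from the example (positive-sample frequency 1.25%, negative-control frequency 0.5%), a site observed at 0.9% in the test sample passes the closeness check, while a site observed at 0.6% does not.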
According to another aspect of the invention, a device for detecting variation in a target region is provided, which implements or executes the target region variation detection method of the above aspect of the invention or of any of its specific embodiments. The device comprises: a data acquisition unit for acquiring sequencing data of the target region, the sequencing data consisting of multiple reads and/or multiple read pairs, in which the following is carried out: obtaining nucleic acid from the sample to be tested, the nucleic acid consisting of multiple nucleic acid fragments derived from fragmented genomic DNA and/or free DNA; capturing the nucleic acid fragments with a kit to obtain the target region; and sequencing the target region, where the kit contains probes that can specifically recognize the following predetermined region: the gene regions of at least 10 of the 547 genes in Table 1; and a detection unit for detecting variation in the target region based on the sequencing data from the data acquisition unit, the variation including at least one of SNP, InDel, SV, and CNV. Those skilled in the art will understand that all or some of the units of the device of this aspect of the invention may optionally and detachably include one or more subunits to execute or implement the specific embodiments of the method described above.
For example, in one embodiment of the invention, as shown in FIG. 1, the detection unit 200 of the device 1000 comprises a first alignment subunit 13, a second alignment subunit 15, and a variant identification subunit 17. The first alignment subunit 13 performs a first alignment of the sequencing data from the data acquisition unit 100 against a reference sequence to obtain a first alignment result. The second alignment subunit 15 performs a second alignment of the first alignment result from the first alignment subunit 13 against a part of the reference sequence to obtain a second alignment result. The variant identification subunit 17 simultaneously detects SNV, InDel, SV, and CNV variations in the target region, based on the first alignment result from the first alignment subunit 13 and the second alignment result from the second alignment subunit 15, to obtain variant site information. The part of the reference sequence comprises each known InDel site in the target region reference sequence, together with 1000 bp of reference sequence upstream and downstream of each such site.
In one embodiment of the invention, as shown in FIG. 2, the detection unit 200 of the device 1000 further comprises a first filtering subunit 12, connected to the first alignment subunit 13, which filters the sequencing data before they enter the first alignment subunit 13. The filtering comprises removing reads in which the proportion of undetermined bases exceeds 10% and/or reads in which the proportion of bases with a base quality value of not more than 5 is not less than 50%. Optionally, as shown in FIG. 3, the detection unit 200 further comprises a second filtering subunit 14, connected to both the first alignment subunit 13 and the second alignment subunit 15, which, before the first alignment result enters the second alignment subunit 15, removes from the first alignment result of the first alignment subunit 13 the read pairs in which the two reads of the pair are identical. The reference sequence may be HG19; the first alignment performed in the first alignment subunit is a global alignment, and the second alignment performed in the second alignment subunit is a local alignment.
In one embodiment of the invention, as shown in FIG. 4, the detection unit 200 of the device 1000 further comprises a determination subunit 19, which checks whether the variant sites from the variant identification subunit 17 satisfy the following condition and determines that the sample to be tested is a positive sample when at least one of the variant sites does: fewer than 2 supporting reads in the negative control samples and a mutation rate greater than 1% in the positive control samples. In another embodiment of the invention, the detection unit 200 further comprises a determination subunit 19, which checks whether the variant sites from the variant identification subunit 17 satisfy the following condition and determines that the sample to be tested is a positive sample when at least one of the variant sites does: a sequencing depth of not less than 10X, support from at least 3 reads, fewer than 2 supporting reads in the negative control samples, a mutation rate greater than 1% in the positive control samples, and a read support at the variant site that differs significantly from the read support at the same site in normal control samples. The foregoing description of the technical features and advantages of the target region variation detection method of one aspect of the invention, or of any of its specific embodiments, applies equally to the detection device of this aspect and is not repeated here.
According to yet another aspect of the invention, a method for screening for tumors, in particular early screening, is provided. The tumors include, but are not limited to, lung cancer, colorectal cancer, gastric cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer, and liver cancer. The method comprises: obtaining nucleic acid from the sample to be tested, the nucleic acid consisting of multiple nucleic acid fragments derived from fragmented genomic DNA and/or free DNA; capturing the nucleic acid fragments with a kit to obtain the target region; sequencing the target region to obtain sequencing data consisting of multiple reads; detecting variations in the target region based on the sequencing data; and determining that the sample to be tested is a positive sample when at least one of the detected variations satisfies (i) or (ii) below: (i) fewer than 2 supporting reads in the negative control samples and a mutation rate greater than 1% in the positive control samples; (ii) a sequencing depth of not less than 10X, support from at least 3 reads, fewer than 2 supporting reads in the negative control samples, a mutation rate greater than 1% in the positive control samples, and a read support that differs significantly from the read support at the same site in normal control samples. The kit contains probes that can specifically recognize the following predetermined region: the gene regions of at least 10 of the 547 genes in Table 1. The foregoing description of the corresponding technical features and advantages of the target region variation detection method and/or device of one aspect of the invention applies equally to the early cancer screening method of this aspect and is not repeated here. Those skilled in the art will understand that all or some of the steps of the method of this aspect of the invention can be implemented by a device comprising the corresponding functional units.
In one embodiment of the invention, the predetermined region is the gene regions of at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, or 500 of the 547 genes, or of all 547 genes. The inventors arrived at the combinations of gene regions that the kit probes in the method of the invention can specifically recognize through repeated rounds of collection, screening, and experimental combination; these combinations are regions associated with the occurrence or development of common tumors. The common tumors include lung cancer, colorectal cancer, gastric cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer, and liver cancer. With the method of this aspect of the invention, gene sequences associated with multiple common cancers can be captured in a single, simple, and highly specific step and then analyzed; the results can assist early screening for multiple common cancers and increase the possibility and effectiveness of early intervention in tumor occurrence and development. At present, most cancers, such as lung, liver, and gastric cancer, are already at an advanced stage when pathologically confirmed in hospital; the delay to treatment greatly reduces the chance of a cure.
In one embodiment of the invention, the predetermined region is the gene regions of the 145 genes listed in Table 2, which are among the 547 genes. The inventors arrived at this set of 145 gene regions, which the probes can specifically recognize, through repeated rounds of collection, screening, and experimental combination; the combination is associated with the occurrence and development of lung cancer. With the probes in this kit, all of the lung-cancer-related gene sequences can be captured in a single, simple, and highly specific step, and the information obtained by analyzing these sequences can assist early screening and diagnosis of lung cancer.
In one embodiment of the invention, the predetermined region is the gene regions of the 60 genes listed in Table 3, which are among the 547 genes. The inventors arrived at this set of 60 gene regions, which the probes can specifically recognize, through repeated rounds of collection, screening, and experimental combination; the combination is associated with the occurrence and development of colorectal cancer. With the probes in this kit, all of the colorectal-cancer-related gene sequences can be captured in a single, simple, and highly specific step, and the information obtained by analyzing these sequences can assist early screening and diagnosis of colorectal cancer.
In one embodiment of the invention, the predetermined region is the gene regions of the 43 genes listed in Table 4, which are among the 547 genes. The inventors arrived at this set of 43 gene regions, which the probes can specifically recognize, through repeated rounds of collection, screening, and experimental combination; the combination is associated with the occurrence and development of gynecological reproductive tract tumors. These reproductive tract tumors include ovarian cancer, endometrial cancer, and cervical cancer. With the probes in this kit, all of the reproductive-tract-tumor-related gene sequences can be captured in a single, simple, and highly specific step, and the information obtained by analyzing these sequences can assist early screening and diagnosis of reproductive tract tumors.
In one embodiment of the present invention, the probes are 25-300 nt long, preferably 50-250 nt, and more preferably 80-120 nt. To obtain probes that can specifically capture the gene regions simultaneously in a single reaction system, in one embodiment the probes are determined by first obtaining an initial probe set and then screening it. Obtaining the initial probe set includes: determining the reference sequence of the gene regions and, starting from one end of the reference sequence, taking DNA fragments in succession along the reference sequence until its other end is reached. Each DNA fragment is one initial probe, and all of the DNA fragments together constitute the initial probe set. The DNA fragments may overlap completely, overlap partially, or not overlap at all, and the initial probe set covers the gene regions at least once. The reference sequence of the gene regions can be taken from a reference genome; for example, the corresponding gene regions are extracted from the human reference genome HG19, which can be downloaded from the NCBI database, and together they constitute the reference sequence of the gene regions.
In one embodiment of the present invention, the initial probe set is designed with an iterative algorithm, which includes: determining the positions of the gene regions on the reference genome and obtaining their reference sequence; copying the reference sequence starting from its first nucleotide to obtain the first DNA fragment, from its second nucleotide to obtain the second DNA fragment, from its third nucleotide to obtain the third DNA fragment, and so on, until one end of the N-th DNA fragment extends beyond the reference sequence. Each DNA fragment is one initial probe, all of the DNA fragments constitute the initial probe set, and N is the total number of initial probes in the set. This yields an initial probe set that fully covers the target gene regions.
To make the final probes highly specific, in one embodiment the initial probe set is further screened by aligning the DNA fragments (the initial probe set) to the reference sequence, obtaining the number of alignments of each DNA fragment on the reference sequence, and filtering out DNA fragments that align more than once. To allow the final probes to capture the gene regions in the same reaction system, and/or to allow the captured gene regions to be eluted together under the same reaction conditions, the initial probe set is screened further by removing DNA fragments whose GC content falls outside 35-70%.
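The tiling and screening steps described above can be sketched in Python. This is an illustrative sketch only, not the inventors' implementation: the function names, the fixed probe length, and the `step` parameter are assumptions; exact substring matching stands in for alignment against the reference sequence; and tiling stops at the last fragment that still fits inside the reference rather than at the first one that overruns it.

```python
from typing import List


def tile_probes(reference: str, probe_len: int, step: int = 1) -> List[str]:
    """Copy fixed-length fragments from the reference, shifting `step` bases each time."""
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    # The range is empty when the reference is shorter than one probe.
    return [
        reference[start:start + probe_len]
        for start in range(0, len(reference) - probe_len + 1, step)
    ]


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(seq)


def count_exact_hits(probe: str, reference: str) -> int:
    """Count exact occurrences of the probe in the reference, overlaps included.

    Assumption: exact matching is a simplified stand-in for a real aligner.
    """
    hits = 0
    pos = reference.find(probe)
    while pos != -1:
        hits += 1
        pos = reference.find(probe, pos + 1)
    return hits


def screen_probes(
    probes: List[str],
    reference: str,
    gc_min: float = 0.35,
    gc_max: float = 0.70,
) -> List[str]:
    """Keep probes that hit the reference exactly once and have GC within range."""
    return [
        p for p in probes
        if count_exact_hits(p, reference) == 1
        and gc_min <= gc_fraction(p) <= gc_max
    ]
```

With `step=1` the fragments start at the first, second, third nucleotide and so on, as in the iterative design; a larger `step` gives the partially overlapping or non-overlapping variants.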
In one embodiment of the present invention, capturing the nucleic acid fragments with the kit to obtain the target region, and sequencing the target region to obtain sequencing data, includes: (h) end-repairing the nucleic acid fragments to obtain end-repaired fragments; (i) adding a base A to both ends of the end-repaired fragments to obtain sticky-end fragments; (j) ligating adapters to both ends of the sticky-end fragments to obtain adapter-ligated fragments; (k) performing a first amplification on the adapter-ligated fragments to obtain a first amplification product; (l) capturing the first amplification product with the kit to obtain the target region; (m) performing a second amplification on the target region to obtain a second amplification product; and (n) sequencing the second amplification product to obtain the sequencing data. The sequencing library construction in this aspect of the invention is particularly suited to samples containing trace amounts of nucleic acid. In one embodiment, the sample is a plasma sample containing trace amounts of cell-free DNA, of which the target cell-free DNA is an extremely small fraction. The first amplification brings the amount of nucleic acid up to what chip/probe hybridization capture requires; because hybridization capture consumes some nucleic acid, the second amplification re-amplifies the captured target fragments to meet the requirements of sequencing and quality control. This library construction method is particularly suited to samples with at least 10 ng of total cell-free nucleic acid or at least 1 μg of conventional tissue genomic DNA. The constructed library can be sequenced on known platforms, including but not limited to the Illumina HiSeq 2000/2500 platform, the Life Technologies Ion Torrent platform, and single-molecule sequencing platforms. Sequencing may be single-end or paired-end; one embodiment of the invention uses paired-end sequencing, and the resulting sequencing data consist of multiple read pairs. Constructing and sequencing the target region library in this way yields high-quality raw sequencing data, which supports accurate downstream detection and analysis.
In one embodiment of the present invention, detecting variation in the target region based on the sequencing data includes: performing a first alignment of the sequencing data against a reference sequence to obtain a first alignment result; performing a second alignment of the first alignment result against a portion of the reference sequence to obtain a second alignment result, where that portion comprises every known InDel site in the target region reference sequence together with 1000 bp of reference sequence upstream and downstream of each known InDel site; and, based on the first and second alignment results, simultaneously detecting SNP, InDel, SV, and CNV variation in the target region. Here the second alignment is a local alignment and the first alignment is a conventional global alignment. The first alignment can be performed with software such as, but not limited to, SOAP or BWA at default settings; the first alignment result includes the position at which each read matches the reference sequence and information about the match. The second alignment takes the first alignment result and locally realigns all reads near every known InDel in the reference sequence corresponding to the captured gene regions, which removes errors from the first alignment and improves the accuracy of subsequent variant detection. The second alignment can be performed with the GATK software (https://www.broadinstitute.org/gatk/). In one embodiment of the invention, the SNP and InDel variants are detected simultaneously with GATK UnifiedGenotyper. The variation detection method of the present invention can accurately detect low-frequency mutations with a mutation frequency of 1%. The tumor screening method of this aspect of the invention can likewise accurately detect low-frequency mutations at a frequency of 1%, which facilitates early tumor screening, assists clinical diagnosis, and supports prevention, early treatment, and timely monitoring and follow-up of tumors.
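The portion of the reference sequence used for the second alignment (each known InDel site plus 1000 bp on either side) can be sketched as a set of merged windows. This is a minimal illustration under stated assumptions: the function name is hypothetical, coordinates are taken to be 1-based and inclusive on a single contig, and every site is assumed to lie within the contig. It is not how GATK itself selects realignment targets.

```python
from typing import Iterable, List, Tuple


def realignment_windows(
    indel_positions: Iterable[int],
    ref_length: int,
    flank: int = 1000,
) -> List[Tuple[int, int]]:
    """Return merged, 1-based inclusive windows of +/- flank bp around each site."""
    # Clip each window to the ends of the contig, then sort by start.
    windows = sorted(
        (max(1, pos - flank), min(ref_length, pos + flank))
        for pos in indel_positions
    )
    merged: List[Tuple[int, int]] = []
    for start, end in windows:
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or directly adjacent windows collapse into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging keeps a read that sits between two nearby InDel sites from being considered twice.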
To make the variation detection results more accurate and reliable, in one embodiment of the present invention the sequencing data are filtered before the first alignment. The filtering removes reads in which the proportion of uncertain bases exceeds 10% and/or reads in which the proportion of bases with a base quality value of no more than 5 is not less than 50%. Optionally, before the second alignment, read pairs in the first alignment result in which both reads of the pair are identical are removed. The portion of the reference sequence comprises every known InDel site in the target region reference sequence together with 1000 bp of reference sequence upstream and downstream of each known InDel site. When the sample to be tested comes from a human, the reference sequence may be a human reference genome, that is, a known human genome sequence such as HG19, which can be downloaded from NCBI; the target region reference sequence is the part of the reference genome sequence that matches the target region.
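The two read-level filtering thresholds stated above can be written as a single predicate. The sketch below is illustrative only: the function name and its parameters are assumptions, uncertain bases are assumed to be encoded as `N`, and qualities are assumed to be already decoded to integers. The example section of this document applies such filtering with SOAPnuke; this code does not reproduce that tool.

```python
from typing import Sequence


def passes_read_filter(
    bases: str,
    quals: Sequence[int],
    max_n_fraction: float = 0.10,
    low_qual_cutoff: int = 5,
    low_qual_fraction: float = 0.50,
) -> bool:
    """Return True if a read survives both filters.

    A read is removed when its fraction of N bases exceeds max_n_fraction,
    or when the fraction of bases with quality <= low_qual_cutoff is
    not less than low_qual_fraction.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    if not bases:
        return False  # Assumption: an empty read is never kept.
    n_fraction = bases.upper().count("N") / len(bases)
    low_fraction = sum(1 for q in quals if q <= low_qual_cutoff) / len(quals)
    return n_fraction <= max_n_fraction and low_fraction < low_qual_fraction
```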
In one embodiment of the present invention, the sample to be tested is determined to be a positive sample when at least one of the detected variants satisfies (i) or (ii) below. A positive sample here means a tumor sample, which may be a lung cancer, colorectal cancer, gastric cancer, breast cancer, kidney cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer, or liver cancer sample. The inventors established conditions (i) and (ii) by combining information from current databases and a large body of literature with statistics from testing large numbers of positive and negative samples; both are statistically meaningful, and the latter is stricter than the former. Preferably, more than 30 positive or negative control samples are used. Control data can be obtained by extracting and sequencing nucleic acid from control samples directly, or taken from sample sequencing data that others have published or deposited in public databases; using multiple control samples makes the decision conditions and results statistically meaningful and more reliable. A determination made under either condition can assist clinical tumor screening and diagnosis, and can help in understanding the type, likelihood, and progression of cancer in the individual tested.
It should be noted that, in the statement that the read support for a variant site in the sample to be tested differs significantly from the read support at the same site in the normal control samples (negative control samples), read support may be the number of reads supporting the variant, or the proportion of reads supporting the variant among the reads aligned to that site; one embodiment of the invention compares the latter. "Significantly different" can mean substantively different. For example, for variant site A in the sample to be tested, suppose the read support ratio in multiple positive (cancer) samples is 5/400 (5 variant reads out of 400 total), giving an average variant frequency of 1.25% at that site in positive samples, while the read support ratio in multiple negative control samples is 1/200 (1 variant read out of 200 total), giving an average variant frequency of 0.5% in negative controls. If the variant frequency at that site in the sample to be tested is closer to 1.25%, for example 0.9%, the difference counts as significant, that is, substantive.
"Significantly different" can also mean statistically significant in the usual sense. For example, variant site A in the sample to be tested is assayed multiple times to obtain several sets of alignment results for the site; each set yields a read support ratio, where read support ratio = number of reads supporting the variant at the site / total number of reads aligned to the site. The read support ratio (variant frequency) at site A in the sample to be tested is then compared with the mutation frequency at that site in the negative control samples, for example with a z-test or t-test, and a significant difference (p ≤ 0.05) is taken as meeting the criterion.
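The read support ratio and a z-test on it can be sketched as follows. This is one possible reading of the comparison, not the inventors' procedure: the function names are assumptions, the test shown is a pooled two-proportion z-test on a single pair of read counts (the text also allows a t-test over repeated measurements), and the normal approximation is crude when variant read counts are small.

```python
import math
from typing import Tuple


def support_ratio(variant_reads: int, total_reads: int) -> float:
    """Reads supporting the variant divided by all reads aligned to the site."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both sites share one variant frequency.

    x1/n1 is the test sample, x2/n2 the negative control.
    """
    p1 = support_ratio(x1, n1)
    p2 = support_ratio(x2, n2)
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # No variation at all in either sample: nothing to distinguish.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def is_significant(x1: int, n1: int, x2: int, n2: int, alpha: float = 0.05) -> bool:
    """Apply the p <= alpha criterion from the text."""
    return two_proportion_z_test(x1, n1, x2, n2)[1] <= alpha
```

Under this approximation, counts as small as 5/400 against 1/200 do not reach p ≤ 0.05, whereas the same frequencies at much higher depth do; this is one reason depth matters when testing low-frequency variants.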
The inventors' research shows that the concentration of cell-free plasma DNA (cfDNA) in the peripheral blood of healthy individuals is 1-100 ng/mL, whereas the circulating tumor DNA (ctDNA) content in the peripheral blood of tumor patients is markedly higher: genomic fragments released into the blood by tumor cell secretion, apoptosis, or necrosis can raise the average ctDNA concentration in patients' peripheral blood to 180 ng/mL. Regular monitoring of changes in ctDNA content and mutation status in the peripheral blood of healthy individuals and tumor patients can be applied to at least one of the following: early diagnostic testing for tumors; prediction and status assessment of hereditary tumors; detection of early tumor progression; evaluation of post-operative outcome; analysis of gene variation during targeted therapy and chemotherapy; detection of minimal residual tumor-causing genes; and analysis of variation in tumor drug-resistance genes.
The inventors found that, compared with traditional clinical tumor detection methods, the methods and/or devices of the present invention or of any specific embodiment have the following advantages. Minimally invasive: the subject only needs to provide 5-10 mL of peripheral blood. Real-time: blood can be drawn from the subject repeatedly; for early screening, testing can be done at regular intervals to monitor tumor risk, and tumor patients can be tested at any time after surgery or after chemotherapy or targeted therapy to assess surgical prognosis, drug sensitivity, and drug resistance. High sensitivity: detection is not limited by lesion location or size; high-depth target region capture sequencing can detect low-frequency variants at a mutation frequency of 1%, so variants arising early in tumor development or at relapse after treatment can be detected promptly and accurately. High specificity: even when ctDNA content is low, low false-positive and false-negative rates are maintained, so that the results accurately reflect the subject's peripheral blood status at the time of testing. High throughput: target region capture sequencing based on next-generation sequencing can test many samples simultaneously in a short time, and with the target region capture chip the same amount of data supports deeper data mining.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the drawings, in which:
Figure 1 is a schematic structural diagram of a target region variation detection device in one embodiment of the present invention;
Figure 2 is a schematic structural diagram of a target region variation detection device in one embodiment of the present invention;
Figure 3 is a schematic structural diagram of a target region variation detection device in one embodiment of the present invention;
Figure 4 is a schematic structural diagram of a target region variation detection device in one embodiment of the present invention.
Detailed Description
In the present invention, "variation", "nucleic acid variation", and "gene variation" are used interchangeably. "SNP" (SNV), "CNV", "insertion/deletion" (indel), and "structural variation" (SV) have their usual definitions, but the invention places no particular limit on the size of each type of variation, so some of these types overlap. For example, when the inserted or deleted segment is a large fragment or even a whole chromosome, the event is also a copy number variation (CNV) or a chromosomal aneuploidy, and is also an SV. This overlap in size between variation types does not prevent a person skilled in the art from carrying out the methods and/or devices of the invention as described above and achieving the described results.
In the present invention, a "reference sequence" is a known genome sequence or at least part of a known genome sequence. Terms such as "first" and "second" are used only for convenience of description; they neither indicate nor imply relative importance, and they do not imply any order. In the description of the invention, unless otherwise stated, "multiple" means two or more.
The invention is described below through specific examples. These examples are for illustration only and are not to be construed as limiting the invention in any way.
Obtaining the kit of one aspect of the invention, and implementing the method and/or device of one aspect of the invention, generally involves designing the target region capture probes/chip, constructing libraries from trace samples and performing hybridization and sequencing, bioinformatic analysis of the raw sequencing data, and interpretation of the variant data.
General Method
Unless otherwise specified, the following examples follow the general method below:
1. Target region capture chip design
Based on databases such as TCGA, ICGC, and COSMIC and on information collected and extracted from a large number of relevant references, an iterative algorithm is used to design target region capture probes/chips that can be used, or used as an aid, for early screening and diagnosis of common cancers or of a specific cancer. The target regions include, for common cancers or a specific cancer, the relevant driver genes, frequently mutated genes, important genes in the 12 cancer-related signaling pathways, and genes related to targeted drugs and chemotherapy drugs. The designed probes/chip specifically recognize the target regions within a complex genome and capture them with high specificity and high coverage on a single set of probes/chip.
2. Experiments and data analysis
(1) Sample preparation
Draw 5-10 mL of peripheral blood from the subject into an EDTA anticoagulant tube and separate the blood within 4-6 hours to obtain plasma cell-free DNA (cfDNA), which may contain DNA from tumor cells (ctDNA);
Quantify the cfDNA.
(2) Library preparation and sequencing
End-repair the cfDNA fragments;
Add A to the ends of the cfDNA fragments;
Ligate the library adapters: a library adapter is a designed base sequence that binds the primers during amplification of the cfDNA/ctDNA library so that amplification can proceed, and binds the sequencing primers during sequencing so that they anneal at the sites to be sequenced;
Perform the first round of PCR amplification on the library;
Quality-check the amplified library and hybridize it to the probes/chip described above;
Perform the second round of PCR amplification on the hybridized library;
Quantify and quality-check the library;
Sequence on an Illumina HiSeq2500/2000 to a sequencing depth of at least 2000X.
(3) Bioinformatic analysis of the target region capture sequencing data
Once the raw sequencing data are available, the following bioinformatic analysis is performed to obtain the final variant results.
SOAPnuke filter: remove low-quality reads;
Align to the reference sequence to produce a BAM file;
Mark duplicate sequences;
Realign poorly aligned sequences and recalibrate quality values;
Remove sequences with mismatches;
Run QC on the sequencing data;
Call variants;
Annotate the variant results to obtain the final results.
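The QC step in the list above is only named; the detailed example later in this document mentions metrics such as average depth, coverage, and uncovered intervals. The sketch below shows one way those three could be computed from per-base depths over a target region. The function name, the input layout (a list of depths, one per target base), and the definition of coverage as the fraction of bases with depth of at least 1 are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple, Union

QcValue = Union[float, List[Tuple[int, int]]]


def depth_qc(depths: Sequence[int]) -> Dict[str, QcValue]:
    """Summarise per-base depths over a target region.

    Index 0 is the first target base; uncovered intervals are returned as
    0-based inclusive (start, end) index pairs.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    uncovered: List[Tuple[int, int]] = []
    run_start = None
    for i, d in enumerate(depths):
        if d == 0 and run_start is None:
            run_start = i  # A zero-depth run begins here.
        elif d > 0 and run_start is not None:
            uncovered.append((run_start, i - 1))
            run_start = None
    if run_start is not None:
        # The region ends inside a zero-depth run.
        uncovered.append((run_start, len(depths) - 1))
    covered = sum(1 for d in depths if d > 0)
    return {
        "mean_depth": sum(depths) / len(depths),
        "coverage": covered / len(depths),
        "uncovered_intervals": uncovered,
    }
```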
(4) Interpretation of variant data
The variant data from the bioinformatic analysis are interpreted for each individual. With reference to the constructed tumor database and the relevant literature, the variants detected in the subject are analyzed to help assess the status of tumor-related genes in the sample, whether there is a risk of developing cancer, and whether early tumor tissue is benign or malignant, so that, together with clinical test results, the most suitable prevention and treatment can be chosen.
The results obtained with the method/device of the present invention are described in detail below for specific individual samples. The following examples only illustrate the invention and are not to be construed as limiting it. Unless otherwise stated, the reagents, sequences (adapters, tags, and primers), software, and instruments not specifically described in the following examples are conventional commercial products or open source; for example, library construction kits for the HiSeq2000 sequencing platform purchased from Illumina are used for library construction.
Example 1: Detection of genes associated with common cancers
1. Target region capture chip design
Based on databases such as TCGA, ICGC, and COSMIC and on information collected and extracted from a large number of relevant references, an iterative algorithm was used to design CANPer, a target region capture chip that can be used, or used as an aid, for early tumor diagnosis, post-operative monitoring, and monitoring the effect of tumor treatment (radiotherapy and chemotherapy, targeted drug therapy, and so on). CANPer is a liquid-phase chip. It covers driver genes of common high-incidence cancers, frequently mutated genes, and important genes in the 12 cancer-related signaling pathways, for a total of 547 genes and 300 Kb. The gene list is given in Table 1.
2. Peripheral blood plasma from a patient with an early-stage lung cancer nodule was used as the test sample. The sample came from Tianjin Maternal and Child Health Hospital. The procedure was as follows:
(1) Separation of peripheral blood samples
1) Collect 1-2 tubes (5 mL per tube) of the subject's peripheral blood in EDTA anticoagulant tubes, invert 6-8 times to mix thoroughly (gently, to prevent cell rupture), and carry out the following steps within 4-6 hours on the day of collection;
2) Centrifuge at 1600 g for 10 minutes at 4 °C, then aliquot the supernatant (plasma) into several 1.5 mL/2 mL centrifuge tubes, taking care not to aspirate the intermediate white blood cell layer;
3) Centrifuge at 16000 g for 10 minutes at 4 °C to remove residual cells, then transfer the supernatant (plasma) to new 1.5 mL/2 mL centrifuge tubes without aspirating the white blood cells at the bottom of the tube; this is the separated plasma required;
4) After processing, store both the separated plasma and the remaining blood cells in a -80 °C freezer and avoid repeated freeze-thaw cycles.
(2) Plasma cell-free DNA extraction (using the QIAamp Circulating Nucleic Acid Kit)
1) Add 30 μL of proteinase K to a 1.5 mL centrifuge tube;
2) Add 300 μL of plasma;
3) Add 240 μL of Buffer ACL and 1.68 μL of Carrier RNA (0.2 μg/μL), vortex for 30 s, and incubate at 60 °C for 30 min, removing the tube to shake it as needed during incubation;
4) Add 540 μL of Buffer ACB, vortex for 15-30 s, and place on ice or in a -20 °C freezer for 5 min;
5) Load 700 μL of the plasma mixture onto the filter column and centrifuge at 7500 rpm for 30 s;
6) Spin the empty filter column at 8000 rpm for 1 min;
7) Add 600 μL of Buffer ACW1 and wash by centrifuging at 8000 rpm for 1 min;
8) Add 700 μL of Buffer ACW2 and wash by centrifuging at 8000 rpm for 1 min;
9) Add 700 μL of absolute ethanol and wash by centrifuging at 8000 rpm for 1 min;
10) Spin the empty filter column at 14000 rpm for 3 min;
11) Place the filter column in a new collection tube, open the lid, and incubate in a metal bath at 56 °C for 10 min;
12) Place the column in a new centrifuge tube, add 60 μL of Buffer AVE, and let stand for 3 min to elute;
13) Centrifuge at 14000 rpm for 1 min, then quantify and quality-check the extracted cfDNA with Qubit (Invitrogen, the Quant-iT™ dsDNA HS Assay Kit).
(3) Library construction (using the KAPA LTP Library Preparation Kit)
1) End repair
Figure PCTCN2014093871-appb-000006
After the reaction, add 120 μL of Agencourt AMPure XP reagent, purify with magnetic beads, resuspend in 42 μL of ddH2O, and carry the beads into the next reaction;
2) A-tailing
Figure PCTCN2014093871-appb-000007
Figure PCTCN2014093871-appb-000008
After the reaction, add 90 μL of PEG/NaCl SPRI Solution, mix thoroughly, and purify with magnetic beads. The amounts of Adapter and ddH2O for the adapter ligation reaction in the next step are calculated as follows: 10 nM × starting DNA input for library construction (ng) × Adapter volume (μL) = 15 μM (Adapter concentration) × 50 μL; ddH2O volume (μL) = 35 μL - Adapter volume (μL). Resuspend in that volume of ddH2O and proceed to the next reaction;
3) Adapter ligation
Figure PCTCN2014093871-appb-000009
After the reaction, add 50 μL of PEG/NaCl SPRI Solution for the first magnetic bead purification and resuspend in 50 μL of Tris-HCl (1 mM, pH 8.0);
Add another 50 μL of PEG/NaCl SPRI Solution for the second magnetic bead purification and resuspend in 25 μL of Tris-HCl (1 mM, pH 8.0);
4) First round of PCR amplification
Figure PCTCN2014093871-appb-000010
After the reaction, add 90 μL of Agencourt AMPure XP reagent, purify with magnetic beads, resuspend in 31 μL of ddH2O, and take the supernatant for quality control and chip hybridization.
5) Target region capture chip hybridization
This example used the CANPer-1.75M gene chip described above, synthesized by Roche on commission; hybridization capture and elution followed the chip manufacturer's instructions. Finally, the hybridization-eluted magnetic beads were resuspended in 21 μL of ddH2O.
6) Second round of PCR amplification
Figure PCTCN2014093871-appb-000011
After the reaction, add 108 μL of Agencourt AMPure XP reagent, purify with magnetic beads, resuspend in 31 μL of EB, and take the supernatant for quality control and sequencing.
7)上机测序7) Sequencing on the machine
本实施例中,采用Illumina HiSeq2500 PE101+8+101程序进行上机测序,测序实验操作按照制造商提供的操作说明书(参见Illumina/Solexa官方公布cBot)进行上机测序操作。In this example, the Illumina HiSeq2500 PE101+8+101 program was used for sequencing on the machine. The sequencing experiment was performed according to the manufacturer's operating instructions (see Illumina/Solexa official cBot) for sequencing.
(四)下机数据生物信息分析解读(4) Interpretation of biological data analysis of off-machine data
1) SOAPnuke filter: remove reads with an N content ≥10% and reads in which more than 50% of the bases have a quality value ≤5;
2) bwa aln -> sampe | samtools view | samtools sort: align to the reference sequence and produce a BAM file;
3) MarkDuplicates.jar: mark identical reads of the same paired-end (PE) pair as duplicates;
4) GenomeAnalysisTK.jar -T RealignerTargetCreator, IndelRealigner: realign poorly aligned reads;
5) GenomeAnalysisTK.jar -T BaseRecalibrator, PrintReads: recalibrate base quality values;
6) Filt_bam: remove reads with ≥3 mismatched bases;
7) QC: compute statistics such as the chip capture efficiency, number of effective reads, mean depth, duplication rate, coverage, and uncovered intervals;
8) Identify SNV/InDel/SV/CNV and screen for high-frequency variant sites among them:
Identify SNP variants with the MuTect (http://www.broadinstitute.org/cancer/cga/mutect) and VarScan (http://massgenomics.org/varscan) pipelines;
Identify InDel variants with the GATK (https://www.broadinstitute.org/gatk/), VarScan, and ForestSV (http://sebatlab.ucsd.edu/index.php/software-data) pipelines;
Identify CNVs with the contra.py (http://contra-cnv.sourceforge.net/) pipeline;
Identify SVs with the ForestSV (http://sebatlab.ucsd.edu/index.php/software-data) pipeline;
The screening parameters used were: sequencing depth ≥10X; variant rate in negative (normal) samples ≤2%; variant rate in positive samples ≥1%; number of reads supporting the variant in the data of the sample to be tested ≥3; and a significant difference in the proportion of supporting reads compared with the normal control (somatic cells) (p ≤ 0.05);
9) Annotation and interpretation
Annotate each variant's function, number of supporting reads, variant frequency, amino acid change, and corresponding entries in the COSMIC database; use the variants to help determine the possible origin of the disease. The cytotoxic effect of chemotherapeutic drugs on tumor cells is significantly associated with the expression and/or polymorphism of a specific gene or group of genes. Testing the relevant genes to predict the efficacy of chemotherapeutic drugs, and selecting suitable drugs for individualized chemotherapy, has become a reasonable way to improve efficacy and reduce ineffective treatment. Based on these characteristics of chemotherapeutic drugs and with reference to the PharmGKB database, all chemotherapeutic drugs currently in clinical use, together with efficacy-related genes and efficacy prediction criteria, were integrated to form an interpretation database for individualized chemotherapy. The chemotherapy data were then integrated into the individualized tumor information pipeline to automate the interpretation of chemotherapeutic drugs.
Targeted drugs are highly effective and have few side effects in tumor therapy, but they depend on their targets (including proteins, DNA, etc.), so target analysis must be performed on a patient before it can be determined whether the patient can use the drug. Targeted drugs currently approved by the FDA, as well as drugs in clinical phases III and IV, were integrated. Based on the NCCN clinical guidelines and clinical pharmacogenetic studies, the relationships between drug target genes and targeted drug efficacy were collated to form an interpretation database for individualized targeted tumor therapy.
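The thresholds in steps 1) and 8) above can be illustrated with a minimal Python sketch. It is not the SOAPnuke or variant-caller implementation used in this example. The function names, the `Candidate` record, and its field names are hypothetical. The p-value is taken as an input because the text does not name the statistical test.

```python
from dataclasses import dataclass
from typing import Sequence


def keep_read(bases: str, quals: Sequence[int],
              max_n_frac: float = 0.10,
              low_qual: int = 5,
              max_low_qual_frac: float = 0.50) -> bool:
    """Step 1) thresholds: drop a read if N content >= 10%,
    or if bases with quality <= 5 make up more than 50% of the read."""
    n = len(bases)
    if n == 0:
        return False
    n_frac = bases.upper().count("N") / n
    low_frac = sum(1 for q in quals if q <= low_qual) / n
    return n_frac < max_n_frac and low_frac <= max_low_qual_frac


@dataclass
class Candidate:
    """Hypothetical per-variant summary; all values are computed upstream."""
    depth: int            # sequencing depth at the site
    support_reads: int    # reads supporting the variant in the test sample
    neg_var_rate: float   # variant rate in the negative (normal) sample
    pos_var_rate: float   # variant rate in the positive sample
    p_value: float        # significance versus the normal control


def passes_screen(v: Candidate) -> bool:
    """Step 8) screening parameters, applied as a simple conjunction."""
    return (v.depth >= 10
            and v.neg_var_rate <= 0.02
            and v.pos_var_rate >= 0.01
            and v.support_reads >= 3
            and v.p_value <= 0.05)
```

A candidate with 30 supporting reads at 200X depth, no signal in the normal sample, and p = 1e-6 would pass under these assumptions, while one with only 2 supporting reads would not.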
3. Result analysis
The sample was found to carry a missense mutation at amino acid 451 of the EGFR gene, in exon 12. The variant site lies within the extracellular topological domain of the protein. The variant itself is not yet recorded in the COSMIC database, but the p.[R451H] missense mutation at the same site is recorded once and is reported to be associated with lung cancer (18948947). Functional prediction indicates that the variant is deleterious and may affect gene function.
The human epidermal growth factor receptor (EGFR), the expression product of the proto-oncogene c-erbB1, is a member of the receptor tyrosine kinase family. EGFR is located mainly on the cell membrane surface. Ligand binding activates its tyrosine autophosphorylation, and autophosphorylation drives downstream signaling pathways, including the MAPK, PI3K, and JNK pathways, inducing cell proliferation, differentiation, and so on. EGFR mutations or abnormal expression are present in many solid tumors. Clinical studies show that patients positive for EGFR mutations (exon 18 mutations, exon 19 deletions, exon 21 mutations) are sensitive to EGFR-TKIs (23344264), whereas EGFR-TKIs are essentially ineffective in wild-type patients (23883922); exon 20 mutations (mainly T790M, and insertion mutations) are associated with acquired resistance to EGFR-TKIs (22263058).
Example 2 Detection of lung cancer-related genes
1. Target region capture chip design
Based on databases such as TCGA, ICGC, and COSMIC and on information collected and extracted from a large number of relevant references, an iterative algorithm was used to design LungPer, a target region capture chip that can be used, or used as an aid, for early screening and diagnosis of lung cancer. The LungPer chip covers lung cancer-related driver genes, high-frequency mutated genes, important genes in 12 cancer-related signaling pathways, and genes related to targeted and chemotherapeutic drugs, for a total of 145 genes and 250 kb. The gene list is shown in Table 2.
2. The procedure is described using the testing of a subject's peripheral blood sample as an example; the sample came from Tianjin Maternal and Child Health Hospital.
Same as section 2 of Example 1.
3. Sequencing analysis
The analysis followed the steps of the general method. The sequencing results are shown in Table 5 and the detection results in Table 6.
Table 5
Figure PCTCN2014093871-appb-000012
Table 6
Figure PCTCN2014093871-appb-000013
The sample was found to carry a missense mutation at amino acid 451 of the EGFR gene, in exon 12. The variant site lies within the extracellular topological domain of the protein. The variant itself is not yet recorded in the COSMIC database, but the p.[R451H] missense mutation at the same site is recorded once and is reported to be associated with lung cancer (18948947). Functional prediction indicates that the variant is deleterious and may affect gene function.
The human epidermal growth factor receptor (EGFR), the expression product of the proto-oncogene c-erbB1, is a member of the receptor tyrosine kinase family. EGFR is located mainly on the cell membrane surface. Ligand binding activates its tyrosine autophosphorylation, and autophosphorylation drives downstream signaling pathways, including the MAPK, PI3K, and JNK pathways, inducing cell proliferation, differentiation, and so on. EGFR mutations or abnormal expression are present in many solid tumors. Clinical studies show that patients positive for EGFR mutations (exon 18 mutations, exon 19 deletions, exon 21 mutations) are sensitive to EGFR-TKIs (23344264), whereas EGFR-TKIs are essentially ineffective in wild-type patients (23883922); exon 20 mutations (mainly T790M, and insertion mutations) are associated with acquired resistance to EGFR-TKIs (22263058).
The mutation analysis shows that important variants associated with the onset and progression of lung cancer were detected in this subject. Combined with the clinical diagnosis, this allows the subject's risk of lung cancer, and whether the disease is benign or malignant, to be assessed.
Example 3 Detection of colorectal cancer-related genes
1. Chip design
Based on databases such as TCGA, ICGC, and COSMIC and on information collected and extracted from a large number of relevant references, an iterative algorithm was used to design ColorectalPer, a target region capture chip that can be used, or used as an aid, for early screening and diagnosis of colorectal cancer. The ColorectalPer chip covers colorectal cancer-related driver genes, high-frequency mutated genes, important genes in 12 cancer-related signaling pathways, and genes related to targeted and chemotherapeutic drugs, for a total of 60 genes (listed in Table 3) and 123 kb.
2. The procedure is described using the testing of a subject's peripheral blood sample as an example; the sample came from Tianjin Maternal and Child Health Hospital.
Same as section 2 of Example 1.
3. Sequencing analysis
The analysis followed the steps of the general method. The sequencing data statistics are shown in Table 7 and the variant detection results in Table 8.
Table 7
Figure PCTCN2014093871-appb-000014
Table 8
Gene      cHGVS              pHGVS          Function    tumor_var_freq(%)
KRAS      c.[35G>A]          p.[Gly12Asp]   missense    15.07
ARID1A    c.[805C>T]         p.[Gln269*]    nonsense    1.22
ROS1      c.[5557+750T>G]    .              intron      1.01
NRAS      c.[291-59C>A]      .              intron      1.09
MSH2      c.[1663C>T]        p.[Arg555*]    nonsense    9.38
The sample was found to carry the KRAS p.[Gly12Asp] missense mutation, which is recorded 10,303 times in the COSMIC database; about 60% of these records report an association with colorectal cancer. KRAS codon 12 lies in the GTP domain and is the most commonly mutated site in KRAS.
KRAS is a member of the Ras gene family and encodes the P21 protein, which acts in the MAPK signaling pathway. It is an oncogene whose product can bind GDP/GTP and promote GTPase activity. When KRAS is mutated, it cannot be inactivated by hydrolysis and remains constitutively active, which up-regulates RAF/MAPK and transmits multiple survival pathway signals, causing excessive cell growth and proliferation and resistance to EGFR-TKIs. KRAS mutations can lead to a variety of malignant tumors, including lung cancer, mucinous adenoma, pancreatic ductal carcinoma, and colon cancer. The most common way the KRAS gene is activated is by point mutation, mostly at N-terminal codons 12, 13, 61, and 146, with codon 12 mutations being the most common. Different mutation sites activate the P21 protein through different mechanisms; codon 12 mutations can weaken the intrinsic GTPase activity of P21, reduce apoptosis, and weaken cell-to-cell contact inhibition.
The sample was found to carry the MSH2 p.[Arg555*] nonsense mutation, which is not yet recorded in the COSMIC database. The function of the region containing MSH2 codon 555 has not been well studied, but this nonsense mutation causes premature termination of protein translation, so that the main functional regions of the gene cannot be expressed and gene function may be impaired or lost.
The protein encoded by MSH2 is a component of the DNA mismatch repair (MMR) system and forms two different heterodimers, MutSα (the MSH2-MSH6 heterodimer) and MutSβ (the MSH2-MSH3 heterodimer), which bind to DNA mismatch sites and thereby initiate DNA repair. After mismatch binding, MutSα or MutSβ forms a ternary complex with the MutLα heterodimer, which directs downstream MMR events, including strand discrimination, excision, and resynthesis. ATP binding and hydrolysis play an important role in mismatch repair, and the ATPase activity is associated with MutSα. MutSα can also function in DNA homologous recombination repair. This gene is associated with hereditary nonpolyposis colorectal cancer type I and with endometrial cancer.
The mutation analysis shows that important variants associated with the onset and progression of colorectal cancer were detected in this subject. Combined with the clinical diagnosis, this allows the subject's risk of colorectal cancer, and whether the disease is benign or malignant, to be assessed.
Example 4 Detection of gynecological reproductive tract tumor-related genes
1. Target region capture chip design
Based on databases such as TCGA, ICGC, and COSMIC and on information collected and extracted from a large number of relevant references, an iterative algorithm was used to design WCNPer, a target region capture chip that can be used, or used as an aid, for early screening and diagnosis of gynecological reproductive tract tumors. The WCNPer chip covers driver genes related to gynecological reproductive tract tumors, high-frequency mutated genes, important genes in 12 cancer-related signaling pathways, and genes related to targeted and chemotherapeutic drugs, for a total of 43 genes (listed in Table 4) and 300 kb.
2. Peripheral blood plasma from a subject was used as the study material; the sample came from Tianjin Maternal and Child Health Hospital. Experiments and data analysis were performed with reference to Example 1.
3. Result analysis
The sequencing data statistics are shown in Table 9 and the variant detection results in Table 10.
Table 9
Figure PCTCN2014093871-appb-000015
Table 10
Figure PCTCN2014093871-appb-000016
Figure PCTCN2014093871-appb-000017
The sample was found to carry the BRAF p.[G469V] missense mutation, which is recorded 17 times in the COSMIC database and has been detected in tumors of the lung, large intestine, biliary tract, upper respiratory tract, esophagus, and other sites. BRAF codon 469 lies in the ATP-binding region of the protein kinase domain. A melanoma study showed that this is an activating mutation that may switch BRAF from an inactive to an active state or abnormally activate the BRAF signaling pathway, and it may be related to disease onset and progression.
The BRAF gene encodes a serine/threonine protein kinase in the MAPK pathway. The enzyme transduces signals from Ras to MEK1/2 and thereby takes part in regulating cell function, affecting cell division, differentiation, and secretion. Mutations in this gene are associated with many types of cancer, such as colorectal, lung, liver, pancreatic, thyroid, and ovarian cancer. In ovarian cancer, the BRAF mutation frequency is 8%, and BRAF is a driver gene in the onset and progression of ovarian cancer.
The sample was found to carry the TP53 p.[G266V] missense mutation, which is recorded 43 times in the COSMIC database and has been detected in tumors of the lung, large intestine, pancreas, ovary, and other sites. TP53 codon 266 lies in the sequence-specific DNA-binding domain, which is essential for TP53 function, so the variant may impair or abolish the full function of TP53. TP53 is a driver gene in tumor onset and progression, and impairment or loss of its full function may be related to disease onset and progression.
The TP53 gene is one of the genes found so far to be most strongly associated with tumors. As an important tumor suppressor gene, it plays a key role in cell cycle regulation, DNA damage repair, cell differentiation, apoptosis, and senescence. The TP53 gene is implicated in more than 50% of human malignant tumors. Clinical studies have confirmed that 95.1% of p53 point mutations in tumors occur mainly at the highly conserved positions 175, 245, 248, 249, 273, and 282. Many current tumor therapies work by modulating the TP53 protein. Clinical application studies of the TP53 gene exist for many cancers. Breast cancer patients with TP53 mutations (exons 5-8) have a poorer prognosis and a markedly reduced response to tamoxifen. TP53 mutation and loss of function is one of the most common genetic abnormalities in ovarian cancer.
The mutation analysis shows that important variants associated with gynecological disease were detected in this subject. Combined with the clinical diagnosis, this allows the subject's risk of gynecological tumors, and whether the disease is benign or malignant, to be assessed.

Claims (39)

  1. A method for detecting variation in a target region, characterized by comprising:
    (1) obtaining nucleic acid from a sample to be tested, the nucleic acid consisting of a plurality of nucleic acid fragments, the nucleic acid fragments being derived from fragmented genomic DNA and/or cell-free DNA;
    (2) capturing the nucleic acid fragments with a kit to obtain the target region;
    (3) sequencing the target region to obtain sequencing data, the sequencing data consisting of a plurality of reads;
    (4) detecting variation in the target region based on the sequencing data; wherein
    the kit comprises probes, the probes being capable of specifically recognizing the following predetermined region: the gene regions of at least 10 of the 547 genes in Table 1.
  2. The method of claim 1, characterized in that the predetermined region is the gene regions of at least 20 of the 547 genes.
  3. The method of claim 1, characterized in that the predetermined region is the gene regions of at least 30 of the 547 genes.
  4. The method of claim 1, characterized in that the predetermined region is the gene regions of at least 40 of the 547 genes.
  5. The method of claim 1, characterized in that the predetermined region is the gene regions of the 145 genes listed in Table 2 among the 547 genes.
  6. The method of claim 1, characterized in that the predetermined region is the gene regions of the 60 genes listed in Table 3 among the 547 genes.
  7. The method of claim 1, characterized in that the predetermined region is the gene regions of the 43 genes listed in Table 4 among the 547 genes.
  8. The method of claim 1, characterized in that the predetermined region is the gene regions of the 547 genes.
  9. The method of any one of claims 1-8, characterized in that the probes are 25-300 nt in length.
  10. The method of any one of claims 1-9, characterized in that step (2) and step (3) comprise:
    (h) end-repairing the nucleic acid fragments to obtain end-repaired fragments;
    (i) adding base A to both ends of the end-repaired fragments to obtain sticky-end fragments;
    (j) ligating adapters to both ends of the sticky-end fragments to obtain adapter-ligated fragments;
    (k) performing a first amplification on the adapter-ligated fragments to obtain a first amplification product;
    (l) capturing the first amplification product with the kit to obtain the target region;
    (m) performing a second amplification on the target region to obtain a second amplification product; and
    (n) sequencing the second amplification product to obtain the sequencing data.
  11. The method of any one of claims 1-10, characterized in that step (4) comprises:
    performing a first alignment of the sequencing data against a reference sequence to obtain a first alignment result;
    performing a second alignment of the first alignment result against a portion of the reference sequence to obtain a second alignment result, the portion of the reference sequence comprising each known InDel site in the target region reference sequence and the reference sequence 1000 bp upstream and 1000 bp downstream of each known InDel site;
    simultaneously detecting SNP, InDel, SV, and CNV variation in the target region based on the first alignment result and the second alignment result.
  12. The method of claim 11, characterized in that the reference sequence is HG19.
  13. The method of claim 11, characterized in that, before the first alignment, the sequencing data are filtered, the filtering comprising removing reads in which the proportion of undetermined bases exceeds 10% and/or reads in which the proportion of bases with a quality value of not more than 5 is not less than 50%.
  14. The method of any one of claims 11-13, characterized in that the sequencing data comprise read pairs and, before the second alignment, read pairs in which the two reads of a read pair are identical are removed from the first alignment result.
  15. The method of claim 11, characterized in that the first alignment is a global alignment and the second alignment is a local alignment.
  16. The method of any one of claims 1-15, characterized in that step (4) further comprises determining that the sample to be tested is a positive sample when at least one of the detected variants satisfies (i) or (ii) below:
    (i) fewer than 2 supporting reads in the negative control sample and a mutation rate greater than 1% in the positive control sample,
    (ii) a sequencing depth of not less than 10X, support from at least 3 reads, fewer than 2 supporting reads in the negative control sample, a mutation rate greater than 1% in the positive control sample, and a significant difference between the read support at the variant site and the read support at the same site in the normal control sample.
  17. A device for detecting variation in a target region, characterized by comprising:
    a data acquisition unit for acquiring sequencing data of the target region, the sequencing data consisting of a plurality of reads and/or a plurality of read pairs, wherein the following is performed in the data acquisition unit:
    obtaining nucleic acid from a sample to be tested, the nucleic acid consisting of a plurality of nucleic acid fragments, the nucleic acid fragments being derived from fragmented genomic DNA and/or cell-free DNA,
    capturing the nucleic acid fragments with a kit to obtain the target region, and sequencing the target region,
    wherein the kit comprises probes, the probes being capable of specifically recognizing the following predetermined region: the gene regions of at least 10 of the 547 genes in Table 1;
    a detection unit for detecting variation in the target region based on the sequencing data from the data acquisition unit.
  18. The device of claim 17, characterized in that the detection unit comprises a first alignment subunit, a second alignment subunit, and a variant identification subunit,
    the first alignment subunit being configured to perform a first alignment of the sequencing data from the data acquisition unit against a reference sequence to obtain a first alignment result,
    the second alignment subunit being configured to perform a second alignment of the first alignment result from the first alignment subunit against a portion of the reference sequence to obtain a second alignment result,
    the variant identification subunit being configured to simultaneously detect SNV, InDel, SV, and CNV variation in the target region based on the first alignment result from the first alignment subunit and the second alignment result from the second alignment subunit, wherein
    the portion of the reference sequence comprises each known InDel site in the target region reference sequence and the reference sequence 1000 bp upstream and 1000 bp downstream of each known InDel site.
  19. The device of claim 18, characterized in that the reference sequence is HG19.
  20. The device of claim 18, characterized in that the detection unit further comprises a first filtering subunit connected to the first alignment subunit and configured to filter the sequencing data before the sequencing data enter the first alignment subunit, the filtering comprising removing reads in which the proportion of undetermined bases exceeds 10% and/or reads in which the proportion of bases with a quality value of not more than 5 is not less than 50%.
  21. The device of any one of claims 18-20, characterized in that the detection unit further comprises a second filtering subunit connected to both the first alignment subunit and the second alignment subunit and configured to remove, before the first alignment result enters the second alignment subunit, read pairs in which the two reads of a read pair are identical from the first alignment result from the first alignment subunit.
  22. The device of claim 18, characterized in that the first alignment in the first alignment subunit is a global alignment and the second alignment in the second alignment subunit is a local alignment.
  23. The device of any one of claims 18-22, wherein the detection unit further comprises a determination subunit configured to determine whether the variant sites from the variant identification subunit satisfy the following, the sample to be tested being determined to be a positive sample when at least one of the variant sites satisfies (i) or (ii) below:
    (i) fewer than 2 supporting reads in the negative control sample and a mutation rate greater than 1% in the positive control sample,
    (ii) a sequencing depth of not less than 10X, support from at least 3 reads, fewer than 2 supporting reads in the negative control sample, a mutation rate greater than 1% in the positive control sample, and a significant difference between the read support at the variant site and the read support at the same site in the normal control sample.
  24. A method of screening for a tumor, comprising:
    obtaining nucleic acid from a sample to be tested, the nucleic acid consisting of a plurality of nucleic acid fragments derived from fragmented genomic DNA and/or cell-free DNA;
    capturing the nucleic acid fragments with a kit to obtain a target region;
    sequencing the target region to obtain sequencing data, the sequencing data consisting of a plurality of reads;
    detecting variants in the target region based on the sequencing data, and determining that the sample to be tested is a positive sample when at least one of the detected variants satisfies (i) or (ii) below:
    (i) the number of supporting reads in a negative control sample is less than 2 and the mutation rate in a positive control sample is greater than 1%,
    (ii) the sequencing depth is not less than 10X, the variant is supported by at least 3 reads, the number of supporting reads in the negative control sample is less than 2, the mutation rate in the positive control sample is greater than 1%, and the read support for the variant differs significantly from the read support at the same site in a normal control sample; wherein
    the kit comprises probes capable of specifically recognizing the following predetermined region: the gene regions of at least 10 of the 547 genes in Table 1.
  25. The method of claim 24, wherein the predetermined region is the gene regions of at least 20 of the 547 genes.
  26. The method of claim 24, wherein the predetermined region is the gene regions of at least 30 of the 547 genes.
  27. The method of claim 24, wherein the predetermined region is the gene regions of at least 40 of the 547 genes.
  28. The method of claim 24, wherein the predetermined region is the gene regions of the 145 genes, among the 547 genes, that are listed in Table 2.
  29. The method of claim 24, wherein the predetermined region is the gene regions of the 60 genes, among the 547 genes, that are listed in Table 3.
  30. The method of claim 24, wherein the predetermined region is the gene regions of the 43 genes, among the 547 genes, that are listed in Table 4.
  31. The method of claim 24, wherein the predetermined region is the gene regions of the 547 genes.
  32. The method of any one of claims 24-31, wherein the probes have a length of 25-300 nt.
  33. The method of claim 32, wherein obtaining the probes comprises obtaining an initial probe set and screening the initial probe set.
  34. The method of claim 33, wherein obtaining the initial probe set comprises:
    determining a reference sequence of the gene regions,
    starting from one end of the reference sequence, sequentially taking DNA fragments along the reference sequence until the other end of the reference sequence is reached, wherein
    each DNA fragment is one initial probe, all of the DNA fragments constitute the initial probe set, the DNA fragments overlap completely, overlap partially or do not overlap at all, and the initial probe set covers the gene regions at least once.
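The tiling of claim 34 can be sketched in Python as follows. The helper names `tile_probes` and `min_coverage`, the fixed probe length, and the fixed `step` between fragment starts are assumptions for illustration; the claim itself leaves the amount of overlap open. The sketch adds a final fragment flush with the far end of the reference so that every position is covered at least once.

```python
def tile_probes(reference: str, probe_len: int, step: int) -> list[tuple[int, str]]:
    """Take fragments from one end of the reference to the other.

    Returns (start, fragment) tuples. step == probe_len gives non-overlapping
    fragments; step < probe_len gives partially overlapping fragments.
    """
    if probe_len <= 0 or not 0 < step <= probe_len:
        raise ValueError("need 0 < step <= probe_len")
    if len(reference) < probe_len:
        raise ValueError("reference shorter than probe length")
    last_start = len(reference) - probe_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Add one fragment ending exactly at the far end of the reference.
        starts.append(last_start)
    return [(s, reference[s:s + probe_len]) for s in starts]


def min_coverage(ref_len: int, probes: list[tuple[int, str]]) -> int:
    """Smallest number of probes covering any single reference position."""
    cov = [0] * ref_len
    for start, frag in probes:
        for i in range(start, start + len(frag)):
            cov[i] += 1
    return min(cov)
```

For example, `tile_probes("ACGTACGTAC", 4, 3)` yields fragments starting at positions 0, 3 and 6, and `min_coverage` confirms that each base is covered at least once. Real probes would fall in the 25-300 nt range of claim 32; the short sequence here is only for readability.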
  35. The method of claim 33, wherein obtaining the initial probe set comprises:
    determining the position of the gene regions on a reference genome and obtaining a reference sequence of the gene regions,
    copying the reference sequence starting from the first nucleotide at one end of the reference sequence to obtain a first DNA fragment,
    copying the reference sequence starting from the second nucleotide at the one end of the reference sequence to obtain a second DNA fragment,
    copying the reference sequence starting from the third nucleotide at the one end of the reference sequence to obtain a third DNA fragment,
    obtaining subsequent DNA fragments in the same manner until one end of the N-th DNA fragment extends beyond the other end of the reference sequence, wherein
    each DNA fragment is one initial probe, all of the DNA fragments constitute the initial probe set, and N is the total number of initial probes contained in the initial probe set.
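The single-nucleotide sliding window of claim 35 can be sketched in Python as follows. The function name `sliding_probes` and the fixed probe length are assumptions for illustration. The sketch also assumes that generation stops at the last fragment lying entirely within the reference, which is one possible reading of the stopping condition.

```python
def sliding_probes(reference: str, probe_len: int) -> list[str]:
    """Copy a fragment starting at nucleotide 1, then 2, then 3, and so on.

    Assumption: only fragments that fit entirely inside the reference are
    kept, giving len(reference) - probe_len + 1 initial probes.
    """
    if probe_len <= 0:
        raise ValueError("probe_len must be positive")
    return [reference[i:i + probe_len]
            for i in range(len(reference) - probe_len + 1)]
```

Under that assumption, a reference of length L and probes of length k give N = L - k + 1 initial probes, each shifted by one nucleotide from the previous one.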
  36. The method of claim 34 or 35, wherein screening the initial probe set comprises:
    aligning the DNA fragments to the reference sequence to obtain the number of times each DNA fragment aligns to the reference sequence, and filtering out DNA fragments that align more than once.
  37. The method of claim 36, wherein screening the initial probe set further comprises removing DNA fragments whose GC content is not within 35-70%.
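The two screening steps of claims 36 and 37 can be sketched in Python as follows. The claims call for aligning each fragment to the reference sequence; this sketch substitutes exact forward-strand substring matching for a real aligner, which is a deliberate simplification. The function names are hypothetical.

```python
def count_exact_hits(fragment: str, reference: str) -> int:
    """Count (possibly overlapping) exact occurrences of fragment in reference.

    Stand-in for the alignment count of claim 36; a real aligner would also
    report inexact and reverse-strand hits.
    """
    count = 0
    pos = reference.find(fragment)
    while pos != -1:
        count += 1
        pos = reference.find(fragment, pos + 1)
    return count


def gc_fraction(fragment: str) -> float:
    s = fragment.upper()
    return (s.count("G") + s.count("C")) / len(s)


def screen_probes(fragments: list[str], reference: str,
                  gc_min: float = 0.35, gc_max: float = 0.70) -> list[str]:
    """Keep fragments that hit the reference at most once (claim 36)
    and whose GC content lies within 35-70% (claim 37)."""
    return [f for f in fragments
            if count_exact_hits(f, reference) <= 1
            and gc_min <= gc_fraction(f) <= gc_max]
```

With the reference `ACGTACGTTTTTGGCC`, the fragment `ACGT` is dropped because it occurs twice, `GGCC` is dropped because its GC content is 100%, and `CGTT` and `TACG` are kept.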
  38. The method of any one of claims 24-37, wherein the tumor comprises lung cancer, colorectal cancer, gastric cancer, breast cancer, renal cancer, pancreatic cancer, ovarian cancer, endometrial cancer, thyroid cancer, cervical cancer, esophageal cancer and liver cancer.
  39. The method of any one of claims 24-38, wherein capturing the nucleic acid fragments with the kit to obtain the target region and sequencing the target region to obtain the sequencing data comprise:
    (a) end-repairing the nucleic acid fragments to obtain end-repaired fragments;
    (b) adding a base A to both ends of the end-repaired fragments to obtain sticky-end fragments;
    (c) ligating adapters to both ends of the sticky-end fragments to obtain adapter-ligated fragments;
    (d) performing a first amplification on the adapter-ligated fragments to obtain a first amplification product;
    (e) capturing the first amplification product with the kit to obtain the target region;
    (f) performing a second amplification on the target region to obtain a second amplification product; and
    (g) sequencing the second amplification product to obtain the sequencing data.
PCT/CN2014/093871 2014-12-15 2014-12-15 Method for screening tumor, method and device for detecting variation of target region WO2016095093A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/093871 WO2016095093A1 (en) 2014-12-15 2014-12-15 Method for screening tumor, method and device for detecting variation of target region

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/093871 WO2016095093A1 (en) 2014-12-15 2014-12-15 Method for screening tumor, method and device for detecting variation of target region

Publications (1)

Publication Number Publication Date
WO2016095093A1 (en) 2016-06-23

Family

ID=56125560

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/093871 WO2016095093A1 (en) 2014-12-15 2014-12-15 Method for screening tumor, method and device for detecting variation of target region

Country Status (1)

Country Link
WO (1) WO2016095093A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107723351A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 A kind of high-flux detection method of Circulating tumor DNA lung cancer driving gene
CN107723364A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 A kind of screening method of susceptibility gene of colorectal cancer
CN107723352A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 A kind of Circulating tumor DNA liver cancer drives gene high-flux detection method
CN110570904A (en) * 2019-08-27 2019-12-13 深圳百诺精准医疗科技有限公司 tumor mutation analysis method, system, terminal and readable storage medium
CN110592208A (en) * 2019-10-08 2019-12-20 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN110806479A (en) * 2019-11-15 2020-02-18 复旦大学附属肿瘤医院 Detection panel of breast cancer related kinase mutation and application thereof
CN112086131A (en) * 2020-08-18 2020-12-15 西安医学院 Screening method of false positive variant sites in high-throughput sequencing
CN112270952A (en) * 2020-10-30 2021-01-26 广西师范大学 Method for identifying cancer drive pathway
CN112739828A (en) * 2018-06-11 2021-04-30 深圳华大生命科学研究院 Method and system for determining type of sample to be tested
CN113355421A (en) * 2021-07-03 2021-09-07 南京世和基因生物技术股份有限公司 Lung cancer early screening marker, model construction method, detection device and computer readable medium
CN113373225A (en) * 2021-06-10 2021-09-10 谱天(天津)生物科技有限公司 Combined analysis method for clinical sample gene and protein high-throughput detection result
CN113481299A (en) * 2021-06-30 2021-10-08 苏州京脉生物科技有限公司 Targeted sequencing panel for lung cancer detection, kit and method for obtaining targeted sequencing panel
CN113736878A (en) * 2021-08-24 2021-12-03 复旦大学附属肿瘤医院 Gene panel for detecting nervous system tumor, kit and application thereof
CN114410763A (en) * 2022-02-11 2022-04-29 武汉艾迪康医学检验所有限公司 NGS-based colorectal cancer gene mutation detection and analysis method
CN115691665A (en) * 2022-12-30 2023-02-03 北京求臻医学检验实验室有限公司 Transcription factor-based cancer early-stage screening and diagnosis method
CN115910349A (en) * 2023-01-09 2023-04-04 北京求臻医学检验实验室有限公司 Cancer early stage prediction method based on low-depth WGS sequencing end characteristics
CN116994656A (en) * 2023-09-25 2023-11-03 北京求臻医学检验实验室有限公司 Method for improving second generation sequencing detection accuracy

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103119179A (en) * 2010-07-23 2013-05-22 哈佛大学校长及研究员协会 Methods for detecting signatures of disease or conditions in bodily fluids
WO2014039556A1 (en) * 2012-09-04 2014-03-13 Guardant Health, Inc. Systems and methods to detect rare mutations and copy number variation

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107723364A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 A kind of screening method of susceptibility gene of colorectal cancer
CN107723352A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 A kind of Circulating tumor DNA liver cancer drives gene high-flux detection method
CN107723351A (en) * 2016-08-12 2018-02-23 嘉兴允英医学检验有限公司 A kind of high-flux detection method of Circulating tumor DNA lung cancer driving gene
CN112739828A (en) * 2018-06-11 2021-04-30 深圳华大生命科学研究院 Method and system for determining type of sample to be tested
CN112739828B (en) * 2018-06-11 2024-04-09 深圳华大生命科学研究院 Method and system for determining type of sample to be detected
CN110570904A (en) * 2019-08-27 2019-12-13 深圳百诺精准医疗科技有限公司 tumor mutation analysis method, system, terminal and readable storage medium
CN110592208A (en) * 2019-10-08 2019-12-20 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN110592208B (en) * 2019-10-08 2022-05-03 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN110806479A (en) * 2019-11-15 2020-02-18 复旦大学附属肿瘤医院 Detection panel of breast cancer related kinase mutation and application thereof
CN112086131A (en) * 2020-08-18 2020-12-15 西安医学院 Screening method of false positive variant sites in high-throughput sequencing
CN112086131B (en) * 2020-08-18 2024-05-24 西安医学院 Screening method for false positive variation sites in resequencing database
CN112270952B (en) * 2020-10-30 2022-04-05 广西师范大学 Method for identifying cancer drive pathway
CN112270952A (en) * 2020-10-30 2021-01-26 广西师范大学 Method for identifying cancer drive pathway
CN113373225A (en) * 2021-06-10 2021-09-10 谱天(天津)生物科技有限公司 Combined analysis method for clinical sample gene and protein high-throughput detection result
CN113481299A (en) * 2021-06-30 2021-10-08 苏州京脉生物科技有限公司 Targeted sequencing panel for lung cancer detection, kit and method for obtaining targeted sequencing panel
CN113481299B (en) * 2021-06-30 2022-05-10 苏州京脉生物科技有限公司 Targeted sequencing panel for lung cancer detection, kit and method for obtaining targeted sequencing panel
CN113355421A (en) * 2021-07-03 2021-09-07 南京世和基因生物技术股份有限公司 Lung cancer early screening marker, model construction method, detection device and computer readable medium
CN113736878A (en) * 2021-08-24 2021-12-03 复旦大学附属肿瘤医院 Gene panel for detecting nervous system tumor, kit and application thereof
CN114410763A (en) * 2022-02-11 2022-04-29 武汉艾迪康医学检验所有限公司 NGS-based colorectal cancer gene mutation detection and analysis method
CN115691665A (en) * 2022-12-30 2023-02-03 北京求臻医学检验实验室有限公司 Transcription factor-based cancer early-stage screening and diagnosis method
CN115910349A (en) * 2023-01-09 2023-04-04 北京求臻医学检验实验室有限公司 Cancer early stage prediction method based on low-depth WGS sequencing end characteristics
CN116994656A (en) * 2023-09-25 2023-11-03 北京求臻医学检验实验室有限公司 Method for improving second generation sequencing detection accuracy
CN116994656B (en) * 2023-09-25 2024-01-02 北京求臻医学检验实验室有限公司 Method for improving second generation sequencing detection accuracy

Similar Documents

Publication Publication Date Title
WO2016095093A1 (en) Method for screening tumor, method and device for detecting variation of target region
CN106715723B (en) Method for determining PIK3CA mutation state in sample
JP2020127416A (en) Methods and materials for assessing loss of heterozygosity
JP2022519159A (en) Analytical method of circulating cells
TW201840853A (en) Diagnostic applications using nucleic acid fragments
CN113151474A (en) Plasma DNA mutation analysis for cancer detection
AU2016293025A1 (en) System and methodology for the analysis of genomic data obtained from a subject
Turner et al. Kinase gene fusions in defined subsets of melanoma
Ledgerwood et al. The degree of intratumor mutational heterogeneity varies by primary tumor sub-site
CN105779434A (en) Kit and applications thereof
CN107849569B (en) Lung adenocarcinoma biomarker and application thereof
WO2018229487A1 (en) Method for determining the susceptibility of a patient suffering from proliferative disease to treatment using an agent which targets a component of the pd1/pd-l1 pathway
CN110117652A (en) Hepatocarcinoma early diagnosis method
Li et al. Oncogene mutation profiling reveals poor prognosis associated with FGFR1/3 mutation in liposarcoma
Hirsch et al. Molecular characterization of ulcerative colitis-associated colorectal carcinomas
US20230002831A1 (en) Methods and compositions for analyses of cancer
CN110004229A (en) Application of the polygenes as EGFR monoclonal antibody class Drug-resistant marker
CN105779433A (en) Kit and applications thereof
JP5865241B2 (en) Prognostic molecular signature of sarcoma and its use
CN107974504A (en) The method of lung cancer and colorectal cancer genetic test based on NGS methods
US20180044722A1 (en) Tri-color probes for detecting multiple gene rearrangements in a fish assay
WO2018103679A1 (en) Benign thyroid nodule-specific gene
CA3151627A1 (en) Use of simultaneous marker detection for assessing difuse glioma and responsiveness to treatment
CN108342488A (en) A kind of kit for detecting gastric cancer
US9976184B2 (en) Mutations in pancreatic neoplasms

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 14908128

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 14908128

Country of ref document: EP

Kind code of ref document: A1