CN117947163A - Method for evaluating background level of variant nucleic acid sample - Google Patents

Publication number
CN117947163A
Authority
CN
China
Prior art keywords
mutation
base
indel
sample
sites
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311849925.9A
Other languages
Chinese (zh)
Inventor
张之宏
吴帅来
邱福俊
汉雨生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Burning Rock Dx Co ltd
Original Assignee
Guangzhou Burning Rock Dx Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Burning Rock Dx Co ltd
Priority to CN202311849925.9A
Publication of CN117947163A

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/11: DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50: Mutagenesis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Abstract

The present application relates to a method for detecting variant nucleic acids, in particular to a method for detecting the presence and/or amount of variant nucleic acids, which method comprises determining the presence and/or amount of variant nucleic acids based on a somatic mutation region and a background mutation region in a sample to be tested. The application also relates to the use of the method according to the application for sample detection.

Description

Method for evaluating background level of variant nucleic acid sample
The application is a divisional application of application number 202111600502.4, filed on December 24, 2021, and entitled "A detection method of variant nucleic acid".
Technical Field
The application relates to the biomedical field, in particular to a method for detecting variant nucleic acid.
Background
Detecting the presence and/or proportion of circulating tumor DNA (ctDNA) in peripheral blood is the primary method for detecting minimal residual disease (MRD). MRD refers to the small number of cancer cells remaining in the body after cancer treatment; it is a potential source of tumor recurrence and distant metastasis and has good prognostic value in various solid tumors such as lung cancer, colorectal cancer, and esophageal cancer. At present, MRD status is judged positive or negative mainly by measuring the ctDNA content in peripheral blood after surgery, and the first national consensus on the detection and clinical application of lung cancer MRD specifies that, when ctDNA is used for MRD detection, the detection limit for ctDNA needs to be as low as 0.02%. Thus, there is an urgent need in the art for a method capable of accurately detecting the presence and/or proportion of ctDNA.
Disclosure of Invention
In one aspect, the application provides a method of detecting the presence and/or amount of a variant nucleic acid, the method comprising determining the presence and/or amount of the variant nucleic acid based on a somatic mutation site and a background mutation site of a test region in a test sample, wherein the background mutation site is determined by removing the somatic mutation site from all mutation sites in the test sample.
In one aspect, the application provides an analysis device for a method of detecting the presence and/or quantity of a variant nucleic acid, the device comprising a judgment module for determining the presence and/or quantity of the variant nucleic acid based on the somatic mutation sites and the background mutation sites of a region to be tested in a sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites of the sample to be tested.
In one aspect, the present application provides a storage medium recording a program that can implement the method of the present application.
In one aspect, the present application provides an apparatus comprising a storage medium according to the present application.
In one aspect, the application provides a method according to the application for detecting and/or quantifying circulating tumor DNA in a test sample obtained from a subject.
The present application provides a method for detecting a variant nucleic acid, for example a method for detecting the presence and/or amount of a variant nucleic acid, the method comprising determining the presence and/or amount of the variant nucleic acid based on a somatic mutation site and a background mutation site of a region to be tested in a sample to be tested, wherein the background mutation site is determined by removing the somatic mutation site from all mutation sites of the sample to be tested. The detection method can accurately evaluate the ctDNA proportion of a sample and the significance level of the sample's ctDNA.
Other aspects and advantages of the present application will become readily apparent to those skilled in the art from the following detailed description. Only exemplary embodiments of the present application are shown and described in the following detailed description. As those skilled in the art will recognize, the present disclosure enables one skilled in the art to make modifications to the disclosed embodiments without departing from the spirit and scope of the application as claimed. Accordingly, the drawings and descriptions of the present application are to be regarded as illustrative in nature and not as restrictive.
Drawings
The specific features of the invention to which the application relates are set forth in the appended claims. A better understanding of the features and advantages of the invention to which the application relates will be obtained by reference to the exemplary embodiments and the accompanying drawings that are described in detail below. The drawings are briefly described as follows:
FIGS. 1A-1B show the observable signal frequency for insertion or deletion of 1 repeat unit at different repeat counts of the repeat unit.
FIGS. 2A-2B show the observable signal frequency for insertion or deletion of 1, 2 or 3 repeat units at different repeat counts of the repeat unit.
FIGS. 3A-3B show the observable signal frequency for insertion or deletion of 1 repeat unit that is 1, 2 or 3 bases in length at different repeat counts of the repeat unit.
FIG. 4 shows the observable signal frequencies for random insertions or deletions of 1 or 2 bases.
FIGS. 5A-5B show the results of evaluating the dilution ratio based on different numbers of sites, wherein the abscissa is the number of sites, the ordinate is the evaluated dilution ratio, and the broken line represents the experimental dilution ratio.
FIG. 6 shows the detection sensitivity results of different detection methods.
FIGS. 7A-7E show the sensitivity and specificity of detection by the method of the application for samples diluted from different cell lines. FIG. 7A shows detection of diluted samples of the H2009 cell line (human lung adenocarcinoma cells), based on 88 positive sites and 265 negative sites; FIG. 7B shows detection of diluted samples of the HCC38 cell line (human ductal breast cancer cells), based on 41 positive sites and 312 negative sites; FIG. 7C shows detection of diluted samples of the H1437 cell line (human non-small cell lung cancer cells), based on 48 positive sites and 305 negative sites; FIG. 7D shows detection of diluted samples of the HCC1395 cell line (human breast cancer cells), based on 85 positive sites and 268 negative sites; FIG. 7E shows detection of diluted samples of the H2126 cell line (human lung cancer cell line), based on 91 positive sites and 262 negative sites. On the horizontal axis, 05pct represents a 5e-03 dilution, 01pct represents a 1e-03 dilution, 002pct represents a 2e-04 dilution, 0004pct represents a 4e-05 dilution, 00008pct represents an 8e-06 dilution, and the negative sample may represent a dilution of 0.
Detailed Description
Further advantages and effects of the present application will become readily apparent to those skilled in the art from the present disclosure, by describing embodiments of the present application with specific examples.
Definition of terms
In the present application, the term "variant nucleic acid" generally refers to a mutated nucleic acid fragment in which insertions, additions, deletions, and/or substitutions occur at one or more positions in the nucleic acid sequence. For example, the variant nucleic acid may comprise a variant nucleic acid derived from tumor tissue, such as ctDNA. For example, the variant nucleic acid may comprise a variant nucleic acid derived from fetal tissue. For example, the variant nucleic acid may comprise a variant nucleic acid derived from a heterologous tissue or organ.
In the present application, the term "somatic mutation" generally refers to an acquired mutation that occurs in non-germline cells. In the present application, the somatic mutation may include a genetic alteration that occurs in a somatic tissue (e.g., a cell outside the germ line). In the present application, the somatic mutation may include point mutations (e.g., exchange of a single nucleotide for another nucleotide (e.g., silent mutation, missense mutation, and nonsense mutation)), insertions and deletions (e.g., addition and/or removal of one or more nucleotides), amplifications, gene duplications, copy number alterations (CNAs), rearrangements, and splice variants. The somatic mutations can be closely related to the growth, programming, senescence and apoptotic processes of the cell. For example, the somatic mutation may be associated with altered signaling pathways in tumorigenesis, angiogenesis, and/or metastasis of a tumor.
In the present application, the term "background mutation" generally refers to a mutation that can be used for background reference in a test sample. For example, the background mutation may be a heritable mutation in the subject, e.g., the background mutation may be a mutation that may be present in both normal tissue and tumor tissue of the subject. For example, to determine a more accurate background mutation, the method provided by the application can remove, from all mutations of the sample to be tested, the somatic mutations detected in tumor tissue and the information corresponding to other sites to be excluded, so as to exclude the influence of known mutation sites or regions on the background calculation.
In the present application, the term "mutation site" generally refers to a site at which a nucleotide differs from the nucleotide sequence of a control sequence. For example, the control sequence may be a reference sequence used in gene sequencing (e.g., may be a human reference genome). In the present application, the mutation site may include a difference in nucleotide sequence at at least 1 (e.g., 1, 2, 3, 4 or more) sites (e.g., the difference may include nucleotide substitutions, repetitions, deletions and/or additions). For example, the mutation site may include a nucleotide mutation at at least 1 nucleotide site. The nucleotide mutation may be a natural mutation or an artificial mutation. The mutation site may include a Single Nucleotide Variation (SNV).
In the present application, the term "database" generally refers to an organized entity of related data, regardless of the manner in which the data or organized entity is represented. For example, the organized bodies of related data may take the form of tables, maps, grids, packets, datagrams, files, documents, lists, or any other form. In the present application, the database may include any data that is collected and maintained in a computer accessible manner.
In the present application, the term "calculation module" generally refers to a functional module for calculation. The calculation module may calculate the output value from the input value or draw a conclusion or result, e.g. the calculation module may be primarily for calculating the output value. The computing module may be a tangible computer, such as a processor of an electronic computer, a computer or an electronic device or network of computers with a processor, or a program, command line, or software package stored on an electronic medium.
In the present application, the term "processing module" generally refers to a functional module for data processing. The processing module may be based on processing the input values into statistically significant data, for example, a classification of the data for the input values. The processing module may be tangible, such as an electronic or magnetic medium for storing data, and a processor of an electronic computer, a computer or electronic device with a processor or a computer network, or a program, command line, or software package stored on an electronic medium.
In the present application, the term "judgment module" generally refers to a functional module for obtaining a relevant judgment result. In the present application, the judging module may calculate an output value or draw a conclusion or a result according to an input value, for example, the judging module may be mainly used for drawing the conclusion or the result. The determination module may be tangible, such as a processor of an electronic computer, a computer or an electronic device or network of computers with a processor, or a program, command line, or software package stored on an electronic medium.
In the present application, the term "sample obtaining module" generally refers to a functional module for obtaining said sample of a subject. For example, the sample acquisition module may include reagents and/or instrumentation required to obtain the sample (e.g., tissue sample, blood sample, saliva, pleural effusion, peritoneal effusion, cerebrospinal fluid, etc.). For example, blood lancets, blood collection tubes, and/or blood sample transport boxes may be included. For example, the device of the present application may contain no or 1 or more of the sample acquisition modules, and may optionally have a function of outputting a measured value of the sample of the present application.
In the present application, the term "receiving module" generally refers to a functional module for obtaining the measurement values in the sample. In the present application, the receiving module may input the sample of the present application (e.g., a tissue sample, a blood sample, saliva, pleural effusion, peritoneal effusion, cerebrospinal fluid, etc.). In the present application, the receiving module may input the measured value of the sample (e.g., tissue sample, blood sample, saliva, pleural effusion, peritoneal effusion, cerebrospinal fluid, etc.) of the present application. The receiving module may detect a state of the sample. For example, the data receiving module may optionally perform gene sequencing (e.g., second generation gene sequencing) as described herein on the sample. For example, the data receiving module may optionally include reagents and/or instrumentation necessary to perform the gene sequencing. The data receiving module may optionally detect sequencing depth, sequencing read length count, or sequencing sequence information.
In the present application, the terms "second generation gene sequencing", "high throughput sequencing" and "next generation sequencing" generally refer to second generation high throughput sequencing techniques and later developed higher throughput sequencing methods. Next generation sequencing platforms include, but are not limited to, existing sequencing platforms such as Illumina. With the continued development of sequencing technology, one skilled in the art will appreciate that other sequencing methods and devices may also be employed for the present method. For example, second generation gene sequencing may have the advantages of high sensitivity, large throughput, high sequencing depth, or low cost. According to development history, influence, sequencing principle and technical differences, the main methods are as follows: massively parallel signature sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, the DNA nanoarray and combinatorial probe-anchor ligation sequencing of Complete Genomics, and the like. The second generation gene sequencing may enable detailed, comprehensive analysis of the transcriptome and genome of a species, and is therefore also referred to as deep sequencing. For example, the methods of the application are equally applicable to first generation gene sequencing, second generation gene sequencing, third generation gene sequencing, or Single Molecule Sequencing (SMS).
In the present application, the term "test sample" generally refers to a sample that is to be tested and that is to be determined for the presence of variant nucleic acid in one or more gene regions on the sample. For example, the sample to be tested or its data may be pre-stored in a memory before the test is performed.
In the present application, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. Information on the human reference genome may be obtained from UCSC (University of California, Santa Cruz). The human reference genome may have different versions, for example, hg19, GRCh37, or Ensembl 75.
In the present application, the term "sequencing depth" generally refers to the number of times a specific region (e.g., a specific gene, a specific interval, a specific base) is detected. A sequencing read may refer to a stretch of base sequence detected by sequencing. For example, by aligning the sequencing reads to a human reference genome, and optionally deduplicating, the number of sequencing reads at a particular gene, a particular interval, or a particular base position can be determined and counted as the sequencing depth. In some cases, the sequencing depth may be correlated to the sequencing depth. For example, the sequencing depth may be affected by the status of the mutation in the gene.
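As a minimal sketch of the depth definition above, assume aligned reads on one reference sequence are represented as hypothetical `(start, end, umi)` tuples with 0-based, half-open coordinates. The tuple layout, the function name `depth_at`, and the rule that reads with identical coordinates and tag count as one molecule are illustrative assumptions, not details taken from the application.

```python
from typing import Iterable, Tuple

# Hypothetical read record: (start, end, umi), covering [start, end)
Read = Tuple[int, int, str]


def depth_at(reads: Iterable[Read], position: int, deduplicate: bool = True) -> int:
    """Count the reads that cover `position`, optionally after deduplication."""
    covering = [r for r in reads if r[0] <= position < r[1]]
    if deduplicate:
        # Assumption: identical coordinates plus identical tag means one molecule.
        return len(set(covering))
    return len(covering)


reads = [
    (100, 250, "AAC"),
    (100, 250, "AAC"),  # duplicate of the first read
    (120, 270, "GTT"),
    (300, 450, "CCA"),
]
print(depth_at(reads, 130))                     # deduplicated depth
print(depth_at(reads, 130, deduplicate=False))  # raw depth
```

Counting per position keeps the example short; a real pipeline would more likely compute depth over intervals from an alignment file.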
In the present application, the term "sequencing data" generally refers to data of a short sequence obtained after sequencing. For example, the sequencing data includes a base sequence of a sequencing short sequence (sequencing read length), the number of sequencing reads, and the like.
In the present application, the term "significance test" generally refers to the manner in which it is determined whether the difference between the sample and the hypothetical distribution is significant. For example, it can be determined whether the somatic mutation of the test sample is significantly different by a significance test.
In the present application, the term "T-test" generally refers to a statistical hypothesis test based on Student's t-distribution. For example, it can be confirmed by a T-test that the somatic mutation of a certain target gene of a test sample is significant.
In the present application, the term "comprising" is generally intended to include the explicitly specified features, but not to exclude other elements.
In the present application, the term "about" generally means ranging from 0.5% to 10% above or below the specified value, e.g., ranging from 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the specified value.
Detailed Description
In one aspect, the application provides a method of detecting the presence and/or amount of a variant nucleic acid, which may comprise determining the presence and/or amount of the variant nucleic acid based on a somatic mutation site and a background mutation site of a region to be tested in a test sample, wherein the background mutation site is determined by removing the somatic mutation site from all mutation sites of the test sample. For example, the test region may be targeted and detected based on a probe or combination of probes. For example, the region to be tested may be selected based on tumor mutation regions known in the art. For example, the region to be tested may be selected based on the mutated region obtained after sequencing the tumor tissue. For example, the somatic mutation site may be selected based on sequencing data of a tumor sample of the subject. For example, the somatic mutation sites may be randomly selected based on somatic mutation of a tumor sample of the subject, or one or more sites with higher priority may be selected according to the ordering of the frequency of somatic mutation, etc. For example, 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 50 or more, 60 or more, or 100 or more sites are selected for somatic mutation sites.
For example, the variant nucleic acid may be selected from the group consisting of: circulating tumor nucleic acid, fetal free nucleic acid (alternatively referred to as circulating fetal nucleic acid), and circulating nucleic acid derived from heterologous organs and/or tissues. For example, the variant nucleic acid may be circulating tumor DNA.
In one aspect, the method provided by the application can further comprise base error correction of all mutation sites of the sample to be tested. Correction for base errors, for example, may be a correction means commonly used in the art.
For example, the base error correction may comprise correcting the base type at each position of the sequencing reads derived from the same site based on a majority voting rule, thereby determining a consensus sequence. For example, the correction includes adjusting to 0 the base quality of a position where the base type cannot be determined. For example, the correction of the application may comprise jointly correcting the sequencing reads of the sense and antisense strands derived from the same site, i.e., after correction of the sense and antisense strands derived from the same nucleic acid fragment, a single corrected consensus sequence is retained. For example, the correction of the application may comprise separately correcting the sequencing reads of the sense strand and the antisense strand derived from the same site, i.e., after correction of the sense strand and the antisense strand derived from the same nucleic acid fragment, two corrected consensus sequences are retained, one for each strand.
For example, the base error correction may further comprise correcting the base type at each position of the sense strand and the antisense strand derived from the same site, retaining the respective consensus sequences of the sense strand and the antisense strand. For example, the correction may comprise adjusting to 0 the base quality at positions where reads derived from the same site carry non-identical bases. For example, the correction may comprise deleting the base information at positions where reads derived from the same site carry non-identical bases. For example, the correction may comprise not using the base information at such positions for subsequent data analysis.
For example, the sequencing reads derived from the same site may comprise sequencing reads that are aligned to the same location of the human reference genome and comprise the same unique molecular identifier (UMI). For example, the sequencing reads derived from the same site may comprise sequencing reads that are aligned to substantially the same location in the human reference genome.
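The majority-vote correction described above can be sketched as follows, under stated assumptions: reads are hypothetical `(aligned start, tag, bases)` tuples, a read family is every read sharing the same start and tag, a tie between the top two bases counts as "base type cannot be determined", and resolved positions receive an arbitrary placeholder quality. None of these representation choices come from the application.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Read = Tuple[int, str, str]  # (aligned start, tag, bases); hypothetical layout
PLACEHOLDER_QUAL = 30        # assumed quality for resolved positions


def consensus_by_family(reads: List[Read]) -> Dict[Tuple[int, str], Tuple[str, List[int]]]:
    """Group reads by (start, tag) and majority-vote every column."""
    families: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for start, tag, bases in reads:
        families[(start, tag)].append(bases)

    result = {}
    for key, members in families.items():
        length = min(len(m) for m in members)  # compare only shared columns
        seq: List[str] = []
        quals: List[int] = []
        for i in range(length):
            ranked = Counter(m[i] for m in members).most_common()
            top_base, top_n = ranked[0]
            tied = len(ranked) > 1 and ranked[1][1] == top_n
            if tied:
                # Base type cannot be determined: mask it and set quality to 0.
                seq.append("N")
                quals.append(0)
            else:
                seq.append(top_base)
                quals.append(PLACEHOLDER_QUAL)
        result[key] = ("".join(seq), quals)
    return result


reads = [
    (100, "AAC", "ACGT"),
    (100, "AAC", "ACGT"),
    (100, "AAC", "ACTT"),  # one discordant base, outvoted 2 to 1
    (100, "GTT", "ACGT"),
    (100, "GTT", "AGGT"),  # 1-to-1 tie at the second base
]
for family, (seq, quals) in consensus_by_family(reads).items():
    print(family, seq, quals)
```

This sketch corrects each family on its own; merging sense and antisense families into one consensus, as the joint correction describes, would need strand information that the hypothetical tuple does not carry.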
For example, the method may comprise determining the mutation site in the test sample based on the base error corrected site. For example, the method of the application may comprise selecting a mutation site in the test sample from the base error corrected sites.
For example, the method of the present application may further comprise obtaining the background mutation site by removing high frequency mutation sites from all mutation sites of the sample to be tested. For example, the background mutation sites of the present application may comprise the mutation sites that remain after known tumor somatic mutation sites and high frequency mutation sites are removed from the mutation sites of the sample to be tested.
For example, the high frequency mutation site may comprise a site with a mutation frequency of about 5e-03 or higher. For example, the high frequency mutation sites may be adjusted according to factors such as the accuracy of sequencing and the quality of the sample. For example, the high frequency mutation site may comprise a site with a mutation frequency of about 1e-03 or higher, 5e-03 or higher, 1e-02 or higher, 5e-02 or higher, 1e-01 or higher, or 5e-01 or higher.
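The selection of background sites described above amounts to a filtered set difference. The sketch below assumes each site is keyed by a hypothetical `(chrom, pos, ref, alt)` tuple mapped to its observed mutation frequency, and uses the 5e-03 example cutoff from the text as the default; the data structure and function name are illustrative.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); hypothetical key


def background_sites(
    all_sites: Dict[Site, float],
    somatic_sites: Set[Site],
    high_freq_cutoff: float = 5e-3,
) -> Set[Site]:
    """Keep sites that are neither known somatic nor high-frequency."""
    return {
        site
        for site, freq in all_sites.items()
        if site not in somatic_sites and freq < high_freq_cutoff
    }


all_sites = {
    ("chr1", 100, "A", "T"): 0.012,   # known somatic site
    ("chr1", 200, "C", "T"): 0.0004,
    ("chr2", 50, "G", "A"): 0.48,     # high frequency, removed
    ("chr3", 75, "T", "C"): 0.0009,
}
somatic = {("chr1", 100, "A", "T")}
print(sorted(background_sites(all_sites, somatic)))
```

The cutoff is a parameter because the text notes it may be adjusted for sequencing accuracy and sample quality.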
For example, the method of the application may comprise removing sequence information that fails quality control from the sequence information of the sample to be tested. For example, the sequence information that fails quality control may comprise sequence information rejected by sequencing quality control methods commonly used in the art. For example, the sequence information that fails quality control may include sequence information of low quality sequencing reads, sequence information of low quality bases, and the like.
For example, the method may further comprise removing sequence information of low quality sequencing reads from the sequence information of the sample to be tested. For example, low quality sequencing reads may comprise sequencing reads that are misaligned or difficult to align. For example, a low quality sequencing read may be a sequencing read for which the probability that its alignment to a position in the human reference genome is correct is low. For example, the low quality sequencing reads may comprise sequencing reads having a mapping quality of less than 60. For example, for a sequencing read that is misaligned or difficult to align, the sequence information of the sequencing read may not be the sequence information of the aligned location. For example, the sequencing quality of a sequencing read can be confirmed by a sequencing instrument and quality control methods commonly used in the art. For example, the low quality sequencing reads may also comprise sequencing reads that include 8 or more base mismatches.
For example, the method further comprises removing the sequence information of low quality bases from the sequence information of the sample to be tested. For example, the sequence information of bases having a low base quality after correction is removed. For example, the low quality base may comprise a base having a corrected base quality of less than 20. For example, at a base quality of 20, the sequencing accuracy may be 99.99% or more.
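The read-level and base-level filters above can be sketched with the example thresholds from the text (mapping quality below 60, 8 or more mismatches, corrected base quality below 20). The `AlignedRead` record and both helper functions are hypothetical, and masking low quality bases with `N` is one assumed way of excluding them from later analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedRead:  # hypothetical record, not a real library type
    name: str
    mapq: int          # mapping quality reported by the aligner
    mismatches: int    # base mismatches against the reference
    bases: str
    quals: List[int]   # per-base quality after error correction


def passes_read_qc(read: AlignedRead, min_mapq: int = 60, max_mismatches: int = 7) -> bool:
    """Reject reads with mapping quality < 60 or with 8 or more mismatches."""
    return read.mapq >= min_mapq and read.mismatches <= max_mismatches


def mask_low_quality_bases(read: AlignedRead, min_qual: int = 20) -> str:
    """Replace bases whose corrected quality is below the cutoff with 'N'."""
    return "".join(b if q >= min_qual else "N" for b, q in zip(read.bases, read.quals))


reads = [
    AlignedRead("r1", 60, 2, "ACGT", [35, 12, 40, 20]),
    AlignedRead("r2", 42, 0, "ACGT", [40, 40, 40, 40]),  # low mapping quality
    AlignedRead("r3", 60, 8, "ACGT", [40, 40, 40, 40]),  # too many mismatches
]
kept = [r for r in reads if passes_read_qc(r)]
print([r.name for r in kept])
print(mask_low_quality_bases(kept[0]))
```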
For example, the methods of the application may further comprise determining a mutation frequency selected from the group consisting of the somatic mutation frequency of the somatic mutation site and the background mutation frequency of the background mutation site, which are used to evaluate the significance level of a site mutation. For example, the mutation frequency derived from a somatic mutation site may be a somatic mutation frequency. For example, the mutation frequency derived from the background mutation site may be the background mutation frequency.
For example, the mutation frequencies of the methods of the application can comprise multimer mutation frequencies and/or insertion or deletion (INDEL) mutation frequencies. For example, the mutation frequency may be calculated as a multimer mutation frequency of the sequencing data. For example, the mutation frequency may be calculated as an insertion or deletion (INDEL) mutation frequency of the sequencing data. For example, INDEL may represent an insertion or a deletion.
For example, the frequency of the mutation of the multimer may comprise the frequency at which a base at a specific position in a specific consecutively arranged base sequence is mutated to another base. For example, the single base mutation frequency may comprise the frequency at which a single base is mutated. For example, the frequency of the mutation of the multimer may include the frequency of the mutation of the bases at the intermediate positions in the base sequence arranged consecutively.
For example, the consecutively arranged base sequence may contain 2 or more bases consecutively arranged. For example, the consecutively arranged base sequence may contain 2 or more, 3 or more, 5 or more, 7 or more, or 9 or more bases which are consecutively arranged. For example, the consecutively arranged base sequence may contain 3 or 5 bases which are consecutively arranged.
For example, the frequency of the multimeric mutation may comprise the frequency of the mutation of the base at position 2 to another specific base in a specific contiguous arrangement of sequences. For example, regarding the frequency of the trimeric mutation, attention is paid to the frequency of the mutation of the second base to another base under the specific arrangement environment of the first base and the third base.
For example, the INDEL mutation frequency can comprise the following group: random INDEL mutation frequencies, and base repeat INDEL mutation frequencies.
For example, the INDEL mutation frequency can comprise a random INDEL mutation frequency. For example, the random INDEL mutation frequency may comprise a frequency of insertions or deletions of one or more bases. For example, the random INDEL mutation frequency can comprise a frequency of insertion or deletion of one or more bases after a particular one or more bases. For example, the random INDEL mutation frequency can comprise a frequency of insertion or deletion of one or more bases after a particular one base; for example, the random INDEL mutation frequency can comprise a frequency of insertion or deletion of one or more bases after a particular two or more bases.
For example, where a base is inserted or deleted, the random INDEL mutation frequency may comprise the frequency of insertion or deletion of a particular base after a particular base. For example, when 2 or more bases are inserted or deleted, the random INDEL mutation frequency may comprise the frequency of insertion or deletion of a base of a particular length after a particular one base. For example, when 2 or more bases are inserted or deleted, the specific base combination of the insertion or deletion may not be considered, and only the frequency of insertion or deletion of a base of a specific length after a specific base or bases may be considered.
For example, the INDEL mutation frequency can comprise a base repeat region INDEL mutation frequency. For example, the base repeat region INDEL mutation frequency can comprise a frequency of insertion or deletion of one or more base repeat units (units), the units being 1 or more in length.
For example, the base repeat region INDEL mutation frequency can comprise a frequency of insertion or deletion of one or more base repeat units (units), the units being 2 or more in length. For example, the Unit length is 2 bases or more, 3 bases or more, 4 bases or more, 5 bases or more, 6 bases or more, 7 bases or more, 8 bases or more, 9 bases or more, or 10 bases or more.
For example, the base repeat region INDEL mutation frequencies may comprise frequencies of a specific number of insertions or deletions of units in sequences of the same Unit length and the same Unit number of repetitions. For example, when a Unit is 2 or more in length, the specific base combination of the Unit may not be considered, and the frequency of insertion or deletion of one or more units in the Unit of a specific number of repetitions may only be considered. For example, the frequency of INDEL mutation in a defined base repeat region may comprise the frequency of insertion or deletion of a specific number of units in a sequence of the same Unit length and the same number of Unit repeats, wherein a Unit may comprise any sequence. For example, unit combinations of arbitrary base combinations in this case may be calculated by merging.
For example, the methods of the application may further comprise determining the level of significance of the presence of variant nucleic acid and/or the presence of mutation at the somatic mutation site in the test sample. For example, somatic mutation sites at which significant mutations occur can be used to assess the presence of variant nucleic acids. For example, the ratio of variant nucleic acids can be estimated using only the data for the somatic mutation sites where significant mutations occur.
For example, the method may comprise measuring the level of significance by determining the cumulative probability of treating the somatic mutation site as a background mutation. For example, the cumulative probability in this case can be estimated assuming that the candidate somatic mutation site is a background mutation. For example, the cumulative probability may be used to represent a level of significance.
For example, the method may include determining the cumulative probability based on a poisson distribution or a binomial distribution. For example, the method may include determining the cumulative probability based on a binomial distribution. For example, the method may include determining the cumulative probability based on a poisson distribution.
For example, the method may include determining the cumulative probability based on the following formula:
$$P = 1 - \sum_{k=0}^{x-1} \frac{e^{-np}\,(np)^{k}}{k!}$$
wherein P represents the cumulative probability, k is summed from 0 to x-1, x represents the coverage depth of the mutated sequence at the somatic mutation site, n represents the total coverage depth of the somatic mutation site, p represents the background mutation frequency of the somatic mutation site, and e represents the base of the natural logarithm.
For example, the method may include determining the cumulative probability based on the following formula:
$$P = 1 - \sum_{k=0}^{x-1} \binom{n}{k}\, p^{k}\,(1-p)^{n-k}$$
wherein P represents the cumulative probability, k is summed from 0 to x-1, x represents the coverage depth of the mutated sequence at the somatic mutation site, n represents the total coverage depth of the somatic mutation site, and p represents the background mutation frequency of the somatic mutation site.
For example, the method may comprise determining the presence of a variant nucleic acid when the cumulative probability is less than a significance threshold. For example, the determination of the significance threshold may be adjusted by one skilled in the art based on the accuracy of the sequencing instrument and the quality of the sample to be tested. For example, the significance threshold is 0.05 or less. For example, the significance threshold is 0.05 or less, 0.01 or less, 0.005 or less, 0.001 or less, 0.0005 or less, or 0.0001 or less.
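As a hedged sketch of the cumulative probability described above, the following code computes the upper-tail probability of observing x or more mutated reads at depth n when the background frequency is p, under either a Poisson or a binomial assumption. The function names are hypothetical, and the default threshold of 0.05 is just one of the example values listed in the text.

```python
import math


def poisson_tail(x: int, n: int, p: float) -> float:
    """P(X >= x) for X ~ Poisson(n * p): one minus the sum over k = 0 .. x-1."""
    lam = n * p
    term = math.exp(-lam)          # Poisson probability of k = 0
    cdf = 0.0
    for k in range(x):
        cdf += term
        term *= lam / (k + 1)      # recurrence from P(k) to P(k + 1)
    return max(0.0, 1.0 - cdf)


def binomial_tail(x: int, n: int, p: float) -> float:
    """P(X >= x) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    cdf = 0.0
    for k in range(min(x, n + 1)):
        # Work in log space so that large n does not overflow.
        log_term = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p)
        )
        cdf += math.exp(log_term)
    return max(0.0, 1.0 - cdf)


def is_significant(p_value: float, threshold: float = 0.05) -> bool:
    """A site is called significant when the cumulative probability is below the threshold."""
    return p_value < threshold
```

For instance, `is_significant(poisson_tail(8, 10000, 1e-4))` evaluates a site with 8 mutated reads at depth 10000 against a background frequency of 1e-04.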
For example, the methods of the application may further comprise determining the presence and/or amount of variant nucleic acid in the test sample. For example, the methods of the application can be used to accurately determine the fraction of a variant nucleic acid, e.g., ctDNA, in a test sample. For example, in the methods of the application, the fraction of a variant nucleic acid and/or the significance level of that fraction can be assessed based on data for the somatic mutation sites and the background mutation sites.
For example, the method may comprise determining the presence and/or amount of variant nucleic acid in the sample to be tested by a likelihood estimation algorithm. For example, the method may comprise determining the presence and/or amount of variant nucleic acid in the sample to be tested based on a Poisson distribution or binomial distribution likelihood estimation algorithm. For example, the amount of the variant nucleic acid may comprise the ratio of circulating tumor DNA (ctDNA) in the test sample to total DNA in the test sample. For example, the maximum likelihood estimate of the ctDNA fraction π is determined by a maximum likelihood estimation algorithm.
For example, the method may comprise determining a maximum likelihood estimate $\hat{\pi}$ of the ctDNA fraction $\pi$: when $\pi$ takes the value $\hat{\pi}$, the function $l(\pi)$ of the following formula takes its maximum value (as known in the art, $\ln(x)$ denotes the logarithm of x to the base e):
$$l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$$
where $w_i$ is the weight of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the formula:
$$l_i(\pi; x_i, n_i, p_i, q_i) = \frac{e^{-n_i f_i}\,(n_i f_i)^{x_i}}{x_i!}$$
wherein $f_i$ is calculated by the following formula:
$$f_i = \pi q_i + (1-\pi)\, p_i$$
$n_i$ represents the total coverage depth of the i-th somatic mutation site, $x_i$ represents the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ represents the mutation frequency of the i-th somatic mutation site in a tumor tissue sample, $p_i$ represents the background mutation frequency of the corresponding mutation in the sample to be tested, and e represents the base of the natural logarithm.
For example, the method may comprise determining a maximum likelihood estimate $\hat{\pi}$ of the ctDNA fraction $\pi$: when $\pi$ takes the value $\hat{\pi}$, the function $l(\pi)$ of the following formula takes its maximum value:
$$l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$$
where $w_i$ is the weight of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the formula:
$$l_i(\pi; x_i, n_i, p_i, q_i) = \binom{n_i}{x_i}\, f_i^{x_i}\,(1-f_i)^{n_i-x_i}$$
wherein $f_i$ is calculated by the following formula:
$$f_i = \pi q_i + (1-\pi)\, p_i$$
$n_i$ represents the total coverage depth of the i-th somatic mutation site, $x_i$ represents the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ represents the mutation frequency of the i-th somatic mutation site in a tumor tissue sample, and $p_i$ represents the background mutation frequency of the corresponding mutation in the sample to be tested.
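The maximum likelihood estimation of π under the binomial model can be sketched as below. This is an illustrative assumption-laden example rather than the application's implementation: the dictionary keys (`x`, `n`, `p`, `q`, `w`), the clamping constant, and the use of a ternary search (which relies on the weighted log-likelihood being concave in π for non-negative weights) are choices made here. The binomial coefficient is omitted because it does not depend on π.

```python
import math


def log_likelihood(pi, sites):
    """Weighted binomial log-likelihood l(pi), up to an additive constant."""
    total = 0.0
    for s in sites:
        f = pi * s["q"] + (1.0 - pi) * s["p"]      # expected mutation frequency f_i
        f = min(max(f, 1e-12), 1.0 - 1e-12)        # keep the logarithms finite
        total += s.get("w", 1.0) * (
            s["x"] * math.log(f) + (s["n"] - s["x"]) * math.log1p(-f)
        )
    return total


def estimate_pi(sites, tol=1e-8):
    """Ternary search for the pi in [0, 1] that maximizes l(pi)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, sites) < log_likelihood(m2, sites):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0
```

With a single site where the background frequency is 0 and the tumor frequency is 1, the estimate reduces to x/n, which is a convenient sanity check.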
For example, the method may comprise determining a significance level for the maximum likelihood estimate of the ctDNA fraction π. For example, the method may comprise determining the significance level by a likelihood ratio test algorithm. For example, the method may comprise determining the significance level by a likelihood ratio test algorithm based on a chi-square distribution. For example, in the method according to the application, the significance level is determined from the value of the likelihood ratio statistic $\Lambda$ and the probability density function of a chi-square distribution with 1 degree of freedom.
For example, the method may include determining the value of the likelihood ratio statistic $\Lambda$ by the following formula:
$$\Lambda = 2\left[\,l(\hat{\pi}) - l(0)\,\right]$$
wherein
$$l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$$
wherein $\hat{\pi}$ is the maximum likelihood estimate of the ctDNA fraction $\pi$, $w_i$ is the weight of the i-th somatic mutation site, the value of $w_i$ is determined according to the somatic mutation frequency or sequencing coverage depth of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
$$l_i(\pi; x_i, n_i, p_i, q_i) = \frac{e^{-n_i f_i}\,(n_i f_i)^{x_i}}{x_i!}$$
wherein $f_i$ is calculated by the following formula:
$$f_i = \pi q_i + (1-\pi)\, p_i$$
$n_i$ represents the total coverage depth of the i-th somatic mutation site, $x_i$ represents the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ represents the mutation frequency of the i-th somatic mutation site in a tumor tissue sample, $p_i$ represents the background mutation frequency of the corresponding mutation in the sample to be tested, and e represents the base of the natural logarithm.
For example, the method includes determining the value of the likelihood ratio statistic $\Lambda$ by the following formula:
$$\Lambda = 2\left[\,l(\hat{\pi}) - l(0)\,\right]$$
wherein
$$l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$$
wherein $\hat{\pi}$ is the maximum likelihood estimate of the ctDNA fraction $\pi$, $w_i$ is the weight of the i-th somatic mutation site, the value of $w_i$ is determined according to the somatic mutation frequency or sequencing coverage depth of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
$$l_i(\pi; x_i, n_i, p_i, q_i) = \binom{n_i}{x_i}\, f_i^{x_i}\,(1-f_i)^{n_i-x_i}$$
wherein $f_i$ is calculated by the following formula:
$$f_i = \pi q_i + (1-\pi)\, p_i$$
$n_i$ represents the total coverage depth of the i-th somatic mutation site, $x_i$ represents the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ represents the mutation frequency of the i-th somatic mutation site in a tumor tissue sample, and $p_i$ represents the background mutation frequency of the corresponding mutation in the sample to be tested. For example, each of the $w_i$ values may be the same, e.g., each of the $w_i$ values may be 1. For example, one skilled in the art can set a particular $w_i$ to a value from 0 to 1 depending on the importance of the i-th somatic mutation site, e.g., the mutation frequency or sequencing coverage depth of the site.
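A likelihood ratio test of the null hypothesis π = 0 can be sketched as follows, here under the Poisson model. The tuple layout `(x, n, p, q, w)`, the coarse grid search for the maximizing π, and the function names are assumptions made for this example. The term ln(x!) is dropped because it cancels in the difference of log-likelihoods. For a chi-square distribution with 1 degree of freedom, the upper-tail probability of a statistic s equals erfc(sqrt(s/2)), which avoids any dependency beyond the standard library.

```python
import math


def poisson_log_likelihood(pi, sites):
    """Weighted Poisson log-likelihood l(pi), up to a constant independent of pi."""
    total = 0.0
    for x, n, p, q, w in sites:
        f = pi * q + (1.0 - pi) * p           # expected mutation frequency f_i
        mean = max(n * f, 1e-300)             # keep the logarithm finite
        total += w * (x * math.log(mean) - mean)
    return total


def likelihood_ratio_test(sites, grid_size=10000):
    """Return (pi_hat, statistic, p_value) for the null hypothesis pi = 0."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    pi_hat = max(grid, key=lambda v: poisson_log_likelihood(v, sites))
    statistic = max(
        0.0,
        2.0 * (poisson_log_likelihood(pi_hat, sites) - poisson_log_likelihood(0.0, sites)),
    )
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return pi_hat, statistic, p_value
```

A sample whose mutated read counts match the background expectation yields a statistic of 0 and a P value of 1, while a strong excess of mutated reads drives the P value toward 0.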
In one aspect, the application provides a method of detecting the presence and/or amount of a variant nucleic acid, which method may comprise determining the presence and/or amount of the variant nucleic acid based on a somatic mutation site and a background mutation site of a region to be tested in a sample to be tested, wherein the background mutation site is determined by removing the somatic mutation site from all mutation sites of the sample to be tested; optionally, base error correction can be performed on all mutation sites of the sample to be tested; optionally, the background mutation sites of the application may comprise the mutation sites that remain after known tumor somatic mutation sites and high-frequency mutation sites are removed from the mutation sites of the sample to be tested; optionally, sequence information that fails quality control can be removed from the sequence information of the sample to be tested; optionally, the type of mutation frequency evaluated in the present application may be selected from the group consisting of single-base mutation frequency, multimer mutation frequency, and insertion or deletion (INDEL) mutation frequency; optionally, the cumulative probability of treating the somatic mutation site as a background mutation may be determined based on a Poisson distribution or a binomial distribution; optionally, the presence and/or amount of variant nucleic acid in the sample to be tested may be estimated based on a Poisson distribution or binomial distribution likelihood estimation algorithm, and the significance level of the estimated proportion of the variant nucleic acid may be determined.
For example, the minimal residual disease (MRD) detection method of the present application (PROPHET) can determine whether MRD is positive or negative by analyzing tumor somatic mutations in next-generation sequencing data generated by an amplicon method or a hybrid capture method, and can be classified as a tumor-informed assay strategy.
The detection method of the application can utilize definite tumor somatic mutation information, which can be obtained, for example, from tumor tissue, to detect tumor somatic mutations in peripheral blood. Specifically, the steps may be as follows: 1) performing whole exome sequencing on the tumor tissue sample and the paired sample; 2) aligning the sequencing results to a human reference genome using commonly used alignment software such as bwa; 3) detecting somatic mutations in the tumor tissue using commonly used somatic mutation analysis software such as mutect; 4) ranking the somatic mutations by priority and selecting a certain number of mutations based on the priority; 5) designing hybridization capture probes based on the selected mutations for subsequent use in peripheral blood sample detection. Mutations in highly repetitive regions, high-GC regions, and regions homologous to sequences at other positions can be filtered out before the somatic mutations are ranked, to reduce the difficulty of hybridization capture. The somatic mutations may be ranked from high to low priority as follows: 1) driver mutations; 2) mutations causing amino acid sequence changes, including nonsynonymous mutations, alternative splicing mutations, in-frame/out-of-frame INDELs, and the like; 3) synonymous mutations. Within each of these three types, mutations are arranged from high to low mutation frequency.
The analysis of the present application may specifically comprise the steps of: 1) data preparation, including correcting base errors based on unique molecular identifiers (UMI) and aligning the corrected reads to the human genome; 2) calculating a sample-specific background level based on the read alignment results; 3) calculating the mutation frequency of each somatic mutation site to be detected; 4) for each somatic mutation site to be detected, evaluating the significance level of a true mutation against the background level; 5) based on all selected somatic mutation sites, evaluating the ctDNA fraction of the sample and its significance level against the background level.
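The priority ranking of somatic mutations described above can be sketched as a two-level sort: first by mutation category, then by mutation frequency within a category. The category labels, dictionary keys, and function name below are hypothetical and chosen only for illustration; the middle category stands for all mutations that change the amino acid sequence.

```python
# Lower number = higher priority; labels are illustrative assumptions.
PRIORITY = {"driver": 0, "nonsynonymous": 1, "synonymous": 2}


def select_mutations(mutations, top_n):
    """Rank by category priority, then by descending mutation frequency, and keep top_n."""
    ranked = sorted(
        mutations,
        key=lambda m: (PRIORITY[m["category"]], -m["frequency"]),
    )
    return ranked[:top_n]
```

The selected mutations would then serve as the basis for designing hybridization capture probes.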
Data preparation
Since the probability of base errors in library construction and sequencing can be at the level of 1e-03, while MRD detection requires detection at the level of 1e-04, base correction can optionally be performed using UMI, or base errors in library construction and sequencing can be reduced using methods commonly used in the art, such as selecting a more accurate library construction and sequencing method. The data preparation step may produce a sequence alignment file in BAM format after UMI deduplication and base correction. The principle of UMI base correction is to correct base errors introduced during library construction and sequencing by using the sequencing reads of multiple PCR products derived from the same molecule. The specific steps can be as follows: 1) aligning the sequencing reads to a human reference genome using the commonly used next-generation sequencing alignment software bwa (version 0.7.10); 2) using the alignment information and UMI information, regarding all reads that align to the same genomic position and carry the same UMI as reads derived from the same molecule, grouping them into one unit, and retaining the units whose number of reads is larger than a certain threshold; 3) determining the base at each position in the unit based on the majority voting rule, and finally generating a consensus read representing the unit; 4) aligning the consensus reads to the genome to generate the BAM file.
In order to utilize sequence information from the same molecule for base error correction, the application can optionally adopt a duplex library construction method with double-ended UMI during hybridization library construction. UMI duplex library construction can distinguish molecules derived from the different strands of double-stranded DNA, and this information can be used for mutual correction in subsequent base correction. When base error correction is performed, first, based on UMI and alignment information, reads from the same DNA strand can be corrected based on the majority voting rule: an indeterminate base is set to N with a quality of 0, the quality of the other bases can be set to the highest value, and single-strand consensus sequences (SSCSs) are generated. The sequences from the different strands of the same DNA molecule are then compared, so that the quality of the bases that disagree between the two strands can be adjusted to 0, while both SSCSs can be retained. Since ctDNA molecules are only about 164 bp long and sequencing reads can typically reach about 150 bp, the R1 and R2 reads sequenced from the same DNA strand overlap. The method of the present application can optionally use this overlap for a further correction, adjusting the quality of the bases that disagree between R1 and R2 to 0. The method provided by the application can optionally distinguish reads from the different strands of the same DNA molecule, which avoids losing part of the correction information during subsequent base correction.
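Steps 2) and 3) above, grouping reads by alignment position and UMI and then calling a consensus by majority vote, can be sketched as follows. The input layout (tuples of position, UMI, and sequence), the default family-size threshold, and the rule that a base without a strict majority is written as `N` are assumptions made for this illustration; real pipelines operate on aligned BAM records rather than plain strings.

```python
from collections import Counter, defaultdict


def build_consensus_reads(reads, size_threshold=1):
    """Group reads by (position, UMI) and call a majority-vote consensus per group.

    reads: iterable of (position, umi, sequence) tuples (illustrative layout).
    Groups are kept only if they contain more than size_threshold reads.
    """
    families = defaultdict(list)
    for position, umi, sequence in reads:
        families[(position, umi)].append(sequence)

    consensus = {}
    for key, sequences in families.items():
        if len(sequences) <= size_threshold:
            continue
        length = min(len(s) for s in sequences)
        called = []
        for i in range(length):
            base, count = Counter(s[i] for s in sequences).most_common(1)[0]
            # Without a strict majority the base is treated as indeterminate.
            called.append(base if 2 * count > len(sequences) else "N")
        consensus[key] = "".join(called)
    return consensus
```

In this sketch a single sequencing error inside a family of three reads is outvoted, and singleton families are discarded.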
Sample-specific SNV background
Based on the alignment information in the BAM file for the sequencing target region, the frequencies of various multimer mutations can optionally be calculated as the sample-specific background frequencies. For example, the trimer mutation frequency can be calculated by considering only whether the second base has changed, with the other two bases fixed. For example, if the trimer formed by the base at a certain position of the target region and its left and right neighboring bases is AGC, and the alignment results at that position comprise 4 ACC, 6 ATC, 10 AAC and 99980 AGC, then the AGC->ACC trimer conversion frequency is 4e-05, the AGC->ATC conversion frequency is 6e-05, and the AGC->AAC conversion frequency is 1e-04. The trimer may be replaced by a multimer of another length, and the calculation method may be similar to that for the trimer. The specific steps of the sample-specific background calculation are as follows: 1) counting the numbers of the various trimers corresponding to all sites of the sequencing target region; 2) removing the trimer information corresponding to all somatic mutation sites and other sites to be removed, so as to remove the influence of definite mutation sites or regions on the background calculation; 3) optionally removing all trimer information corresponding to sites with mutation frequencies above a certain threshold, such as 5e-03, to exclude interference from other potential mutations in the background calculation; 4) pooling the trimer information of the remaining sites and calculating the frequency of each mutation by trimer mutation type as the specific background mutation level of the sample. To exclude the effects of sequence misalignment and low-quality bases on the background noise assessment, reads with an alignment quality of less than 60 or with 8 or more base mismatches can optionally be filtered out, and trimers with low base quality can optionally be discarded when calculating the background.
When the sample-specific background is calculated, the sequencing data of the sample under analysis itself can be used, without relying on other normal samples or other samples of the same batch as controls, so that background fluctuation caused by between-sample factors or experimental batch factors can be eliminated. In addition, when the sample-specific background is calculated, all information in the sequencing target region is fully utilized, and information belonging to the same trimer at different positions is pooled, which effectively addresses the problem of inaccurate background evaluation caused by insufficient data.
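The four background calculation steps above can be sketched with per-site trimer counts. The input layout (a mapping from a site identifier to its reference trimer and observed trimer counts), the site identifiers, and the interpretation of the site-level threshold as the total fraction of non-reference trimers at that site are assumptions made for the example.

```python
from collections import defaultdict


def trimer_background(site_counts, excluded_sites=frozenset(), max_site_frequency=5e-3):
    """Pool trimer counts over sites and return background frequencies.

    site_counts: {site_id: (reference_trimer, {observed_trimer: count})}
    Returns {(reference_trimer, mutated_trimer): frequency}.
    """
    depth = defaultdict(int)      # pooled depth per reference trimer
    mutated = defaultdict(int)    # pooled counts per (reference, mutated) pair
    for site, (ref_trimer, counts) in site_counts.items():
        total = sum(counts.values())
        if site in excluded_sites or total == 0:
            continue              # step 2: drop somatic and other excluded sites
        if (total - counts.get(ref_trimer, 0)) / total > max_site_frequency:
            continue              # step 3: drop sites that look like real mutations
        depth[ref_trimer] += total
        for observed, count in counts.items():
            if observed != ref_trimer:
                mutated[(ref_trimer, observed)] += count
    # Step 4: frequency of each trimer mutation type over the pooled depth.
    return {key: count / depth[key[0]] for key, count in mutated.items()}
```

Applied to the single AGC site from the example in the text, the function reproduces the stated frequencies of 4e-05, 6e-05, and 1e-04.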
Sample-specific InDel background
In order to make full use of the mutation information of the sample, the method can also use InDel mutations in addition to SNVs. When calculating the sample-specific InDel background, InDels are divided into two major classes based on their sequence characteristics: 1) random InDels; 2) base repeat region InDels, expressed as (Unit)n, where Unit represents a repeat unit, which may be a single base or multiple bases, and n represents the number of repetitions, typically 2 or more. The steps for calculating the InDel background are similar to those for SNVs, specifically: 1) based on the reference sequence, counting the numbers of InDel signals and non-InDel signals sequenced at all sites of the sequencing target region; 2) removing the information corresponding to all somatic mutation sites and other sites to be removed, so as to remove the influence of definite mutation sites or regions on the background calculation; 3) optionally removing all information corresponding to sites with mutation frequencies above a certain threshold, such as 5e-03, to exclude interference from other potential mutations in the background calculation; 4) pooling the information of the remaining sites and calculating the frequency of each mutation by InDel type as the specific InDel background mutation level of the sample.
For random InDels, the application can compute the background values of different InDel types separately, according to the type of the base preceding the InDel position and the insertion or deletion length of the InDel. In the case of a single-base insertion or deletion, the preceding base may be combined with the inserted or deleted base and the corresponding frequencies computed separately; for example, when a single base A is inserted or deleted, the background frequencies of TA->T, GA->G, CA->C, T->TA, G->GA, and C->CA are computed separately. When 2 or more bases are inserted or deleted, because the number of combinations is too large and the number of target sites of any single type is small, the background mean for insertions or deletions of the same length can optionally be calculated without computing separate statistics.
For (Unit)n where the Unit is a single base, the application computes the background values separately based on the kind of Unit in the reference sequence, the value of n, and the number of Units inserted or deleted. For (Unit)n where the Unit is 2 bases, optionally, the specific sequence of the Unit is not considered: InDels with a Unit length of 2 and the same number of repetitions n are merged, and the corresponding background is calculated based on the number of Units inserted or deleted. For example, the mutations GATAT->GAT and CTGTG->CTG are both mutations in which one Unit is deleted, with a Unit length of 2 and n of 2, and they are merged when calculating the background noise. When the Unit length is greater than 2, the processing method can be the same as when the Unit length is 2.
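The grouping rules for InDel background statistics can be sketched as functions that map an observed InDel to the category under which its counts are pooled. The tuple-based keys, the `kind` labels, and the decision to pool multi-base random InDels by length alone are assumptions made for this illustration.

```python
def random_indel_key(previous_base, kind, sequence):
    """Category for a random InDel; kind is 'ins' or 'del' (illustrative labels).

    Single-base events keep the preceding base and the inserted or deleted base;
    longer events are pooled by length only.
    """
    if len(sequence) == 1:
        return ("random", kind, previous_base, sequence)
    return ("random", kind, len(sequence))


def repeat_indel_key(unit, repeats, units_changed, kind):
    """Category for a repeat-region InDel of the form (Unit)n.

    Single-base Units keep the base identity; longer Units are pooled by Unit
    length, number of repetitions n, and number of Units inserted or deleted.
    """
    if len(unit) == 1:
        return ("repeat", kind, unit, repeats, units_changed)
    return ("repeat", kind, len(unit), repeats, units_changed)
```

Under these rules GATAT->GAT (Unit AT, n = 2, one Unit deleted) and CTGTG->CTG (Unit TG, n = 2, one Unit deleted) share one category, whereas the single-base deletions TA->T and GA->G remain separate.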
For the number of repetitions n of a Unit, the application assumes that, when the base type of the Unit and the number of insertions or deletions are the same, there is a correlation between the background error rate and n, and the specific relation is as follows:
where $p_n(\mathrm{Unit} \mid n_1)$ represents the background error rate of inserting (or deleting) $n_1$ Units when a specific Unit is repeated n times. Under this assumption, using the detection information of all sites satisfying the condition, the error rate for a site with n repetitions can be estimated as:
Somatic mutation site mutation signal
According to the alignment information in the BAM file, for each preselected somatic mutation site to be detected, the specific SNV mutation frequency is calculated in the trimer mode, or the corresponding mutation frequency is calculated by InDel type. If the original trimer at a specific somatic mutation site to be detected is CAG and the somatic mutation is A->G, the mutation frequency of CAG->CGG can be calculated. Likewise, the effect of low-quality alignments or low-quality bases is optionally excluded in the calculation.
Somatic mutation site significance assessment
A binomial distribution describes n independent repeated Bernoulli trials, each with only two possible outcomes that are mutually exclusive, with the trials independent of one another, which fits the sample background mutation scenario. In addition, when n of the binomial distribution is sufficiently large and the event probability p is sufficiently small, the observed number of events approximately follows a Poisson distribution (λ=np). The method of the application can therefore use the assumption of a Poisson distribution (x ~ Poisson(np)) or a binomial distribution (x ~ Binom(n, p)) to calculate somatic mutation significance. For example, the application adopts the Poisson distribution assumption for the calculation, which can give higher accuracy of the evaluation results.
According to the method, the cumulative probability P value under the background condition is calculated from the observed value of the specific mutation at the somatic mutation site and the mutation frequency in the sample background. When the P value is less than a certain threshold, the mutation frequency of the site can be considered significantly above the sample background, and the site is a true mutation. Assuming that the mutation type of the site to be detected is A->G, the original trimer is CAG, the observed coverage depth of the site is n, and the number of CGG observations is x, the P value of the mutation detection at the site is:
$$P = 1 - \sum_{k=0}^{x-1} \frac{e^{-np}\,(np)^{k}}{k!}$$
or alternatively
$$P = 1 - \sum_{k=0}^{x-1} \binom{n}{k}\, p^{k}\,(1-p)^{n-k}$$
where $p = p(\mathrm{CAG} \to \mathrm{CGG})$ is the background frequency.
In addition to calculating SNV significance, the method is equally applicable to calculating INDEL significance. If a certain INDEL to be detected is AGGG->AGG, the coverage depth of the site is n1, and the number of AGG observations is x1, then n in the formula can be replaced by n1, x can be replaced by x1, and p(CAG->CGG) can be replaced by p(AGGG->AGG). By analogy, the site significance of all types of INDEL or SNV mutations can be calculated in this way.
Sample ctDNA significance level and ctDNA fraction assessment
In practical applications, the effective sequencing depth of a sample is limited by the amount of peripheral blood sampled and the cost of detection. When the ctDNA fraction is as low as 0.02% or less and the average effective depth is 10000X, there are on average only about 2 or fewer mutation signals per site, so that it may be difficult to detect mutation signals in the peripheral blood data for some somatic mutation sites, and, considering the background levels of the various mutations, it may be difficult to calculate the ctDNA fraction directly. The method can adopt a multi-site joint test to judge whether ctDNA is present in the sample, and a likelihood method to estimate the ctDNA fraction in the sample. Assuming that the ctDNA fraction is $\pi$, the frequency of the i-th somatic mutation to be detected in the tumor tissue sample is $q_i$, and the background frequency of the corresponding mutation in the test sample is $p_i$, the expected frequency $f_i$ of the somatic mutation in the test sample satisfies:
$$f_i = \pi q_i + (1-\pi)\, p_i$$
The parameter $\pi$ is estimated by the likelihood method, and the log-likelihood function is:
$$l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$$
where $w_i$ is the weight of the i-th somatic mutation to be detected, whose value can be set according to the type and reliability of the mutation in the actual analysis, $n_i$ represents the effective coverage depth of the i-th somatic mutation to be detected, $x_i$ is the depth of the target mutation for the i-th somatic mutation to be detected, and $l_i(\pi)$ is the likelihood of the i-th somatic mutation to be detected:
$$l_i(\pi; x_i, n_i, p_i, q_i) = \frac{e^{-n_i f_i}\,(n_i f_i)^{x_i}}{x_i!}$$
or alternatively
$$l_i(\pi; x_i, n_i, p_i, q_i) = \binom{n_i}{x_i}\, f_i^{x_i}\,(1-f_i)^{n_i-x_i}$$
where $f_i = \pi q_i + (1-\pi)\, p_i$.
The parameter $\pi$ is estimated using a maximum likelihood estimation algorithm to obtain the maximum likelihood estimate $\hat{\pi}$. The null hypothesis $\pi = 0$ is tested using a likelihood ratio test algorithm, with the likelihood ratio statistic:
$$\Lambda = 2\left[\,l(\hat{\pi}) - l(0)\,\right]$$
The P value may be calculated from the probability density function of the $\chi^2$ distribution with 1 degree of freedom.
In another aspect, the application also provides a method for detecting the presence and/or amount of a variant nucleic acid, which method may comprise determining a set of somatic mutation sites based on the mutation priority of the somatic mutation sites of a variant sample to be tested, which set of somatic mutation sites may be used for detecting the presence and/or amount of a variant nucleic acid, which mutation priority may comprise, from high to low: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations.
For example, the test variant sample may be derived from a sample obtained from the subject prior to receiving treatment. For example, the treatment may comprise tumor treatment.
For example, the somatic mutation site can be determined by comparing the test variant sample with a negative sample.
For example, the non-synonymous mutations other than driver mutations may be selected from the group consisting of: alternative splicing mutations, insertions or deletions that do not shift the gene reading frame (in-frame InDel), and insertions or deletions that shift the gene reading frame (out-of-frame InDel).
For example, the method may comprise ordering the somatic mutation sites by mutation priority from high to low, wherein somatic mutation sites of the same mutation priority may be ordered by mutation frequency from high to low.
For example, the method may comprise selecting the top ranked 5 or more mutation sites as the set of somatic mutation sites. For example, the methods of the application may comprise selecting as the set of somatic mutation sites, the highest ranked 1 or more, the highest 2 or more, the highest 3 or more, the highest 4 or more, the highest 5 or more, the highest 6 or more, the highest 7 or more, the highest 8 or more, the highest 9 or more, the highest 10 or more, the highest 15 or more, the highest 20 or more, the highest 25 or more, the highest 30 or more, the highest 40 or more, the highest 50 or more, or the highest 100 or more mutation sites.
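The ordering and selection just described amount to a two-key sort. The following Python sketch shows one way to express it; the class labels (`driver`, `nonsynonymous`, `synonymous`), the dictionary keys, and the function name are hypothetical choices made for illustration, and how a mutation is assigned to a class is outside the scope of the sketch.

```python
# Lower rank means higher priority, following the order given in the text:
# driver mutations, then other non-synonymous mutations, then synonymous.
PRIORITY = {"driver": 0, "nonsynonymous": 1, "synonymous": 2}

def select_sites(mutations, top_n=5):
    """Rank by mutation priority, then by mutation frequency (high to low),
    and keep the top_n sites.

    Each mutation is a dict with hypothetical keys:
    "id" (site label), "cls" (priority class), "vaf" (mutation frequency).
    """
    ranked = sorted(mutations, key=lambda m: (PRIORITY[m["cls"]], -m["vaf"]))
    return ranked[:top_n]

candidates = [
    {"id": "A", "cls": "synonymous", "vaf": 0.40},
    {"id": "B", "cls": "driver", "vaf": 0.10},
    {"id": "C", "cls": "nonsynonymous", "vaf": 0.30},
    {"id": "D", "cls": "driver", "vaf": 0.25},
    {"id": "E", "cls": "nonsynonymous", "vaf": 0.35},
]
top3 = [m["id"] for m in select_sites(candidates, top_n=3)]
```

Here the two driver mutations come first regardless of frequency, and the remaining slot goes to the non-synonymous mutation with the highest frequency.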
For example, the method may further comprise determining a test region of the test sample based on the set of somatic mutation sites. For example, the method may further comprise determining nucleic acids that can bind to the test region based on the set of somatic mutation sites. For example, the methods of the application may comprise designing probes for detecting a test sample based on the collection of somatic mutation sites.
In another aspect, the present application also provides an analysis device for a method of detecting the presence and/or quantity of a variant nucleic acid, the device comprising a determination module or judgment module operable to determine the presence and/or quantity of the variant nucleic acid based on a somatic mutation site and a background mutation site of a region to be tested in a sample, wherein the background mutation site can be determined by removing the somatic mutation site from all mutation sites in the sample to be tested. For example, the analysis device of the method for detecting the presence and/or amount of a variant nucleic acid of the present application may comprise a module for performing the method for detecting the presence and/or amount of a variant nucleic acid of the present application.
In another aspect, the present application also provides a method for creating a database comprising a set of somatic mutation sites. The method may comprise determining the set of somatic mutation sites, which is used for detecting the presence and/or amount of variant nucleic acids, based on the mutation priority of the somatic mutation sites of a variant sample to be tested, the mutation priority comprising, from high to low: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations. For example, the method of database creation of the present application may include a method of determining a set of somatic mutation sites based on the mutation priorities of the somatic mutation sites of the variant samples to be tested.
In another aspect, the present application further provides a device for creating a database, where the database includes a set of somatic mutation sites, and the device includes a determining module configured to determine the set of somatic mutation sites based on the mutation priority of the somatic mutation sites of a variant sample to be tested, where the set of somatic mutation sites is used to detect the presence and/or amount of variant nucleic acids, and the mutation priority includes, from high to low: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations. For example, the database creation device of the present application may include a module that performs the database creation method of the present application.
In another aspect, the present application also provides a database, which may be built according to the method for building a database of the present application.
In another aspect, the present application also provides a storage medium that records a program capable of executing the method of the present application. For example, the storage medium may be a non-volatile computer-readable storage medium, which may include a floppy disk, a flexible disk, a hard disk, solid-state storage (SSS) (e.g., a solid-state drive (SSD), a solid-state card (SSC), or a solid-state module (SSM)), an enterprise-level flash drive, a magnetic tape, or any other non-transitory magnetic medium. The non-volatile computer-readable storage medium may also include punch cards, paper tape, optical discs (or any other physical medium having a hole pattern or other optically recognizable indicia), compact disc read-only memory (CD-ROM), compact disc rewritable (CD-RW), digital versatile discs (DVD), Blu-ray discs (BD), and/or any other non-transitory optical medium.
In another aspect, the present application also provides an apparatus comprising the storage medium of the present application. For example, the apparatus of the present application may further comprise a processor coupled to the storage medium, the processor being configured to execute based on a program stored in the storage medium to implement the method of the present application.
In another aspect, the method of the application may be used for detecting and/or quantifying circulating tumor DNA in a test sample obtained from a subject, that is, for determining the presence and/or amount of circulating tumor DNA in the test sample of the subject. For example, any one or more of the methods of the application may be for non-diagnostic purposes. For example, any one or more of the methods of the application may be for diagnostic purposes.
In a further aspect, the application also provides a method according to the application, which may be used for diagnosis, prevention and/or concomitant treatment of a disease or residual disease.
In another aspect, the methods of the application may be used for the prediction, selection and/or assessment of disease treatment methods. For example, the methods may be used to determine, or assist in determining, the likelihood that a subject has cancer or a recurrence of cancer, or the likelihood that the subject would benefit from an anti-cancer treatment, the anti-cancer treatment including chemotherapy, immunotherapy, radiation therapy, surgery, or a combination thereof.
In the present application, the method can be used in clinical practice (e.g., it can be speculated whether certain specific tumor treatment modalities are appropriate for the subject) by detecting the presence and/or amount of circulating tumor DNA in the test sample. In certain instances, the presence and/or amount of circulating tumor DNA in a test sample detected by the method may be used in clinical practice in combination with biomarkers known in the art.
Without intending to be limited by any theory, the following examples are meant to illustrate the methods and uses of the present application and the like and are not intended to limit the scope of the application.
Examples
Example 1
In this application, 25 real samples were selected, the observable InDel signals were analyzed, and the background mutation frequencies were preliminarily estimated.
The statistics showed that, for InDels of the (Unit)n type, the observable signal frequency increases exponentially with n when the number of inserted or deleted repeat units is held the same, as shown in FIGS. 1A-1B. Here, Unit represents the base length of the repeat unit and n represents the number of repeat units.
For the same (Unit)n, the observable signal frequency for 2 repeat-unit insertions or deletions is always lower than that for 1 insertion or deletion, and the observable signal is weak when 3 or more are inserted or deleted, as shown in FIGS. 2A-2B.
Compared with repeat units 1 base in length, insertions or deletions of repeat units 2-3 bases in length show comparable or higher observable signal frequencies, as shown in FIGS. 3A-3B.
Compared with repeat-unit insertions or deletions, random insertions or deletions of 1 to 2 bases show a very low observable signal frequency, at the level of 1e-7, as shown in FIG. 4.
In MRD detection the ctDNA ratio is relatively low, for example 2e-4 or less. When the repeat number n <= 3, or when the insertion or deletion is random, the observable InDel signal can be below 1e-5. Such mutations can therefore be incorporated into MRD analysis, enabling accurate MRD detection for samples with a ctDNA ratio above 1e-5.
Example 2
In this application, 1 cell line to be tested and 1 background cell line were selected as research materials, and diluted samples at 5 gradients (5e-03, 1e-03, 2e-04, 4e-05 and 8e-06) were prepared to simulate samples with different ctDNA ratios. From the cell line to be tested, 88 mutation sites that differ from the background cell line were selected for probe design, and capture sequencing was performed. Each dilution was sequenced in triplicate, giving 15 diluted samples in total; the library construction input of each sample was 30 ng, and the average sequencing depth of the target region was 100000X. From these 88 sites, 5-60 mutation sites were then arbitrarily selected for analysis of the ctDNA ratio, with 50 repetitions, so that 150 analytical tests were performed per dilution gradient. With a sample P value < 0.01 as the threshold for sample detection, 5 mutation sites were sufficient for 100% detection of the samples at dilution gradients of 5e-3 and 1e-3; 15 or more mutation sites achieved 100% detection at a dilution gradient of 2e-4; and 40 or more mutation sites achieved 100% detection at a dilution gradient of 4e-5, as shown in Table 1.
TABLE 1 sample detection ratio results
In evaluating the dilution ratio, 5 mutation sites were sufficient to estimate the dilution ratio fairly accurately at dilution gradients of 5e-3 and 1e-3; 15 mutation sites were sufficient at a dilution gradient of 2e-4; and 40 or more mutation sites were sufficient at a dilution gradient of 4e-5, as shown in FIGS. 5A-5B.
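The resampling design of this example (random subsets of k sites, 50 repetitions on each of 3 replicates, detection called at P < 0.01) can be sketched as follows. This is an illustrative harness only: the function name, the `p_value_fn` callable that stands in for the sample-level test, and the fixed random seed are assumptions introduced here, and the sketch does not reproduce the results in Table 1.

```python
import random

def detection_rate(replicates, k, p_value_fn, repeats=50, alpha=0.01, seed=0):
    """Fraction of random k-site subsets in which the sample is called detected.

    replicates : list of replicate datasets, each a list of per-site records
    k          : number of mutation sites drawn per test (5 to 60 in the example)
    p_value_fn : hypothetical callable mapping a list of site records to a
                 sample-level P value (for instance, a likelihood ratio test)
    """
    rng = random.Random(seed)              # fixed seed for reproducibility
    detected = 0
    total = 0
    for sites in replicates:
        for _ in range(repeats):
            subset = rng.sample(sites, k)  # draw k distinct sites
            total += 1
            if p_value_fn(subset) < alpha:
                detected += 1
    return detected / total
```

With 3 replicates and 50 repetitions the function runs 150 tests per value of k, matching the count given for each dilution gradient.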
Example 3
To compare the detection performance of the background construction method used in the method of the present application (PROPHET) with that of the background construction method that uses the 10 bp regions before and after the mutation site (INVAR), the two methods were applied in parallel to the 15 samples sequenced in Example 2. As before, 5-60 sites were arbitrarily selected from the 88 sites, with 50 repetitions. In addition, to evaluate the specificity of detection, 5-60 negative sites were arbitrarily selected from 88 sites, also with 50 repetitions. With a sample P value < 0.01 as the threshold, the detection sensitivity of the method of the present application may be better than that of the INVAR method when fewer than 40 sites are used, as shown in Table 2 and FIG. 6.
Table 2 comparison of the detection effects of different methods
Meanwhile, the known INVAR method uses the sequencing information of the 10 bp before and after each mutation site, and also uses the sequencing information of multiple samples captured and sequenced with the same target combination (panel). Thus, in addition to the mutation sites of the sample itself, the INVAR method can include the sequencing information of the 10 bp before and after the mutation sites of other samples sequenced at the same time, so the total number of selectable mutation sites is relatively large. The method of the present application is better suited to single-sample panels and can be applied in detection settings where the total number of selectable mutation sites is small.
Example 4
To measure the effect of INDELs and SNVs on the evaluation of the ctDNA ratio, one reference standard and one background cell line were selected, and the standard was diluted with the background cell line into seven gradients (2.5e-3, 1.25e-3, 6.25e-4, 3.125e-4, 1.6e-4, 8e-5 and 4e-5) to simulate samples with different ctDNA ratios. Within the sequencing window, the standard contained 28 effective mutations in total: 8 INDEL mutations and 20 SNV mutations. The average sequencing depth of the diluted samples was 60000X. The ctDNA ratios were analyzed according to the present application using the 8 INDELs, 8 arbitrarily picked SNVs, and all 28 mutations, respectively, and the results are shown in Table 3.
TABLE 3 analysis results of diluted samples of standards
The results showed that, when SNVs or INDELs were analyzed alone, dilution levels of 1.6e-4 and above could be estimated accurately with a significance P value < 0.01, and the INDEL results could be better than the SNV results at 8e-5 and below. When SNVs and INDELs were analyzed in combination, the dilution gradient was estimated accurately at all dilutions in the experiment, with a significance P value < 0.01.
Example 5
In this application, 5 cell lines to be tested and 1 background cell line were selected as research materials; each cell line to be tested was diluted with the background cell line into samples at 5 gradients (5e-03, 1e-03, 2e-04, 4e-05 and 8e-06) to simulate samples with different ctDNA ratios. From each cell line to be tested, 40-100 germline mutations specific to that cell line were selected to serve as somatic mutation sites, and corresponding hybridization probes were designed for the subsequent experiments. Each diluted sample and background sample was tested in triplicate, giving data for 90 samples; the library construction input of each sample was 30 ng, and the average sequencing depth of the target region was 100000X. The sequencing data were analyzed by the method of the present application, and site-level and sample-level detection were calculated. Table 4 shows the estimated sample mixing ratio (simulated ctDNA ratio) and the significance level, and FIGS. 7A-7E show site-level detection at a site P value < 0.05.
TABLE 4 cell line dilution sample analysis results
Based on the sample-level results, the method accurately estimated the dilution level, with low sample P values, at dilution gradients from 5e-03 to 4e-05, whereas at a dilution gradient of 8e-06 the estimated dilution level differed substantially from the actual value. Based on the site-level results, sensitivity fell from 100% to about 15% as the dilution gradient went from 5e-03 to 4e-05 but remained significantly higher than (1 - specificity); at 8e-06, sensitivity fell to about 5%, which is relatively close to (1 - specificity). Therefore, the detection method provided by the present application can detect ctDNA down to about 4e-05, a detection limit below the consensus level of 2e-04, which provides data support for subsequent application to the detection of minimal residual disease (MRD).
The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Numerous variations of the presently illustrated embodiments of the application will be apparent to those of ordinary skill in the art and are intended to be within the scope of the appended claims and equivalents thereof.

Claims (10)

1. A method of assessing background levels of a variant nucleic acid sample, comprising:
data preparation, including correcting base errors in the sequencing data of a sample to be tested based on unique molecular identifiers (UMIs), and aligning the corrected reads to a human reference genome;
and calculating the background level of the variant nucleic acid sample based on the alignment result of the corrected reads, wherein the background level comprises a sample-specific SNV background level and/or a sample-specific InDel background level.
2. The method of claim 1, wherein, after said aligning the corrected reads to a human reference genome, the method comprises:
Using the alignment information and UMI information, treating all reads that align to the same position of the human reference genome and carry the same UMI as reads of the same molecular origin, and grouping the reads of the same molecular origin into one unit;
Determining whether the number of reads in each unit is greater than a threshold, and retaining the unit in response to determining that the number of reads is greater than the threshold;
Determining the base at each position within each retained unit based on a majority voting rule, and finally generating a consensus sequence representing the unit;
and aligning the consensus sequences of all units to the human reference genome to obtain a BAM file.
3. The method according to claim 1 or 2, wherein the correcting of base errors in the sequencing data of the sample to be tested based on UMIs is base error correction using sequence information of the same molecular origin;
Optionally, the base error correction comprises: correcting, based on the UMI and alignment information and according to a majority voting rule, the reads of the same strand derived from the same DNA molecule, setting each indeterminate base to N with a base quality of 0 and setting the base quality of the other bases to the highest value, to generate single-strand consensus sequences (SSCSs); and correcting the reads of the sense strand and the antisense strand derived from the same DNA molecule against each other, setting the base quality of bases that disagree between the sense strand and the antisense strand to 0, and retaining the reads of the sense strand and the antisense strand as separate consensus sequences.
4. A method according to claim 3, wherein the method further comprises:
Correcting again using the overlapping portion of read 1 and read 2 of the same strand derived from the same DNA molecule after sequencing, and setting the base quality of bases that disagree between read 1 and read 2 to 0.
5. The method of claim 1, wherein the calculating a sample-specific SNV background level comprises:
Counting the numbers of all trimers corresponding to each position of the sequencing target region;
removing the trimer information corresponding to all somatic mutation sites and to other sites preset for removal, so as to eliminate the influence of known mutation sites or regions on the background calculation;
optionally, removing the trimer information corresponding to all sites whose mutation frequency is above a certain threshold, such as 5e-03, so as to exclude interference from other potential mutations with the background calculation;
pooling the trimer information of the remaining sites, and calculating the frequency of each mutation according to trimer mutation type, as the sample-specific SNV background level of the variant nucleic acid sample.
6. The method of claim 5, wherein the method further comprises:
optionally filtering out reads having an alignment quality of less than 60 or comprising 8 or more base mismatches; or, optionally filtering out trimers having a base quality less than a preset threshold;
optionally, the preset threshold is 20.
7. The method of claim 1, wherein the InDel is selected from a random InDel or a base repeat InDel, wherein the base repeat InDel is represented by (Unit) n, wherein Unit represents a repeat Unit, is a single base or multiple bases, and n is a positive integer greater than or equal to 2.
8. The method of claim 7, wherein the calculating a sample-specific InDel background level comprises:
Counting, for each site of the sequencing target region and based on the reference sequence at that site, the numbers of InDel signals and non-InDel signals sequenced;
removing the information corresponding to all somatic mutation sites and to other sites preset for removal, so as to eliminate the influence of known mutation sites or regions on the background calculation;
Optionally, removing the information corresponding to all sites whose mutation frequency is above a certain threshold, such as 5e-03, so as to exclude interference from other potential mutations with the background calculation;
pooling the information of the remaining sites, and calculating the frequency of each mutation according to InDel type, as the sample-specific InDel background level of the variant nucleic acid sample.
9. The method of claim 7, wherein, when the InDel is a random InDel, the background levels of the different types of InDel are counted separately according to the kind of base preceding the position of the random InDel and the length of the random insertion or deletion;
optionally, when a single base is inserted or deleted, the preceding base is combined with the random InDel base and the corresponding background frequencies are counted separately;
optionally, when 2 or more bases are inserted or deleted, the mean background frequency of insertions or deletions of the same length is calculated directly, optionally without counting the base combinations separately.
10. The method according to claim 7, wherein, when the InDel is a base repeat InDel, the background frequencies are counted separately according to the kind of Unit in the reference sequence, the value of n, and/or the number of bases inserted or deleted;
optionally, when the Unit is 2 bases long, InDels of all Units of length 2 with the same repeat number n are optionally merged irrespective of the specific sequence of the Unit, and the corresponding background frequency is calculated according to the number of bases inserted or deleted.
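The background calculations recited in claims 5 and 8 share one structure: exclude known sites, optionally exclude high-frequency sites, then pool the remaining counts by mutation type. A minimal Python sketch of that shared structure follows. It is an illustration under assumptions rather than the claimed implementation: the record layout `(site_id, mutation_type, alt_count, depth)`, the function name, and the example type labels are hypothetical, and the frequency filter is applied per record here, which simplifies the per-site filter of the claims.

```python
from collections import defaultdict

def background_levels(records, excluded_sites, max_site_freq=5e-3):
    """Pool counts by mutation type to obtain sample-specific background levels.

    records        : iterable of (site_id, mutation_type, alt_count, depth).
                     mutation_type is any hashable label, e.g. "ACG>ATG" for a
                     trimer SNV type, or ("repeat", unit_len, n, delta) for a
                     base repeat InDel type (both labels are hypothetical).
    excluded_sites : set of site ids to drop (somatic sites and preset sites).
    max_site_freq  : records with a frequency above this value are dropped.
    """
    alt_total = defaultdict(int)
    depth_total = defaultdict(int)
    for site_id, mutation_type, alt_count, depth in records:
        if site_id in excluded_sites or depth == 0:
            continue                      # known mutation site or no coverage
        if alt_count / depth > max_site_freq:
            continue                      # likely a real mutation, not background
        alt_total[mutation_type] += alt_count
        depth_total[mutation_type] += depth
    # Background frequency of each type = pooled signal / pooled depth.
    return {t: alt_total[t] / depth_total[t] for t in depth_total}

example_records = [
    ("chr1:100", "ACG>ATG", 1, 10000),
    ("chr1:200", "ACG>ATG", 0, 10000),
    ("chr1:300", "ACG>ATG", 5000, 10000),  # somatic site, excluded by id
    ("chr1:400", "ACG>ATG", 100, 10000),   # frequency 1e-2, above the threshold
]
example_bg = background_levels(example_records, excluded_sites={"chr1:300"})
```

In this example only the first two records contribute, so the pooled background for the type is 1 mutant signal in 20000 observations. Pooling across sites of the same type is what makes it possible to estimate background frequencies well below the reciprocal of the depth at any single site.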
CN202311849925.9A 2021-12-24 2021-12-24 Method for evaluating background level of variant nucleic acid sample Pending CN117947163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311849925.9A CN117947163A (en) 2021-12-24 2021-12-24 Method for evaluating background level of variant nucleic acid sample

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111600502.4A CN114292912A (en) 2021-12-24 2021-12-24 Detection method of variant nucleic acid
CN202311849925.9A CN117947163A (en) 2021-12-24 2021-12-24 Method for evaluating background level of variant nucleic acid sample

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202111600502.4A Division CN114292912A (en) 2021-12-24 2021-12-24 Detection method of variant nucleic acid

Publications (1)

Publication Number Publication Date
CN117947163A true CN117947163A (en) 2024-04-30

Family

ID=80970042

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311849925.9A Pending CN117947163A (en) 2021-12-24 2021-12-24 Method for evaluating background level of variant nucleic acid sample
CN202111600502.4A Pending CN114292912A (en) 2021-12-24 2021-12-24 Detection method of variant nucleic acid

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202111600502.4A Pending CN114292912A (en) 2021-12-24 2021-12-24 Detection method of variant nucleic acid

Country Status (2)

Country Link
CN (2) CN117947163A (en)
WO (1) WO2023115662A1 (en)


Also Published As

Publication number Publication date
CN114292912A (en) 2022-04-08
WO2023115662A1 (en) 2023-06-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination