CN117253546A - Method, system and storable medium for reducing targeted second-generation sequencing background noise

Info

Publication number: CN117253546A
Application number: CN202311315265.6A
Authority: CN
Inventors: 翟兵兵; 刘建红; 邓涛; 孙立超
Original assignee: Beijing Capitalbio Medlab Co ltd
Current assignee: Beijing Capitalbio Medlab Co ltd
Priority date: 2023-10-11
Filing date: 2023-10-11
Publication date: 2023-12-19
Anticipated expiration: 2043-10-11
Also published as: CN117253546B

Abstract

The invention provides a method, a system and a storage medium for reducing targeted second-generation sequencing background noise, and relates to the field of bioinformatics analysis. The method comprises the following steps: obtaining targeted capture sequencing data and aligning it to a reference genome to obtain the variant sites of a sample to be denoised; calculating the reference and variant read support numbers for each site of each sample to be denoised; calculating the false positive occurrence rate with a beta-binomial distribution and screening out the true positive sites; and finally removing the targeted capture sequencing data of the false positive sites. Through these screening and calculation steps, the method effectively reduces the influence of targeted second-generation sequencing background noise, has high accuracy and reliability, and has important scientific research value.

Description

Method, system and storable medium for reducing targeted second-generation sequencing background noise

Technical Field

The invention belongs to the field of bioinformatics analysis, and in particular relates to a method, a system, a device and a computer-readable storage medium for reducing targeted second-generation sequencing background noise, and to applications thereof.

Background

Single nucleotide variants (SNV) and insertion/deletion mutations (InDel) are common types of variation in the human genome. Many of them cause abnormal, uncontrolled cell growth and disturb the homeostasis of a range of critical cell functions. Accurate detection and identification of SNVs and InDels is therefore very important in clinical diagnosis, medication guidance, treatment and prognosis. With the continued development of sequencing technology and bioinformatics, more and more sequencing platforms and mutation recognition algorithms are used to detect SNVs and InDels. The most commonly used detection methods are polymerase chain reaction (PCR), first-generation sequencing (Sanger sequencing) and second-generation sequencing (NGS).

Targeted sequencing is a method that uses second-generation sequencing to isolate and sequence a set of specific regions of interest, such as genes or genomic regions. By enriching and sequencing only these regions, it screens for variant sites with high sensitivity from a small data volume, and it plays an important role in fields such as targeted medication guidance and tumor recurrence monitoring. However, a significant amount of false positive noise can arise during targeted capture and sequencing. This noise usually originates from three sources: (1) targeted capture; (2) incorrect incorporation by the PCR polymerase during library amplification and enrichment; and (3) sequencing errors of the sequencer itself. Reducing false positive background noise is therefore an important precondition for obtaining high-accuracy SNV and InDel detection results.

Common background noise reduction techniques include unique molecular identifier (UMI) technology, the GATK false positive removal tool (GATK VariantFiltration), the VarScan false positive removal tool (fpfilter), and the like. UMI technology attaches a molecular tag to each single DNA molecule and then sequences the tagged molecule multiple times to improve the single-base accuracy for that molecule. UMI technology is highly sensitive and accurate, but it requires a special library preparation kit and a very high sequencing depth, so its cost is very high. Algorithmic tools such as the GATK and VarScan false positive removal tools are relatively inaccurate at removing false positive background noise.

Disclosure of Invention

To address the low accuracy and high cost of background noise removal when detecting point mutations and insertion/deletion mutations by targeted second-generation sequencing, the application provides a method, a system, a device and a computer-readable storage medium for reducing targeted second-generation sequencing background noise, and applications thereof. False positive sites caused by sequencing errors and/or germline mutations are first removed based on a series of mutation site indexes; a beta-binomial distribution and/or a one-sample t-test is then used to identify the remaining false positive sites; and the targeted capture sequencing data of the false positive sites is removed to reduce the second-generation sequencing background noise.

The application discloses a method for reducing targeted second generation sequencing background noise, which comprises the following steps:

S11: acquiring targeted capture sequencing data of a sample to be denoised;

S21: aligning the targeted capture sequencing data of the sample to be denoised to a reference genome, and obtaining, through a mutation detection tool, a file containing the SNV and InDel mutations of the sample to be denoised;

S31: for each site of each sample to be denoised, calculating the positive-strand reference read support number, denoted VzA; the negative-strand reference read support number, denoted VfA; the positive-strand mutation read support number, denoted VzB; and the negative-strand mutation read support number, denoted VfB; the sum of the positive-strand reference read support number and the positive-strand mutation read support number is denoted Zsum; the sum of the negative-strand reference read support number and the negative-strand mutation read support number is denoted Fsum;

S41: using a beta-binomial distribution, calculating, for the given Zsum, the positive-strand false-positive read support number at which the false positive occurrence rate reaches a threshold, denoted zN; and calculating, for the given Fsum, the negative-strand false-positive read support number at which the false positive occurrence rate reaches the threshold, denoted fN; if zN ≤ VzB or fN ≤ VfB, the site is a true positive site; otherwise, it is a false positive site;

S5: removing the targeted capture sequencing data of the false positive sites.

Further, the method further comprises:

performing S11, S21 and S31;

S32: calculating the variation frequency of each site of each sample to be denoised, vaf = (VzB + VfB)/(Zsum + Fsum);

S42: using a one-sample t-test with the site list filRate taken as the population, testing whether the site's variation frequency vaf differs significantly from the mean of the site's filRate; if there is a significant difference, the site is a true positive site; otherwise, it is a false positive site; preferably, the site is a true positive site if P < 0.05;

performing S5.

Further, the site list filRate is obtained as follows:

S12: acquiring targeted capture sequencing data of healthy peripheral blood samples;

S22: aligning the targeted capture sequencing data of the healthy peripheral blood samples to a reference genome, and obtaining, through a mutation detection tool, a file containing the SNV and InDel mutations of the healthy peripheral blood samples;

S33: setting sites whose mutation frequency exceeds a threshold to NA, and calculating, for each site, the number of samples that are not NA; deleting the site if the number of samples that are not NA exceeds the threshold, and otherwise retaining the site as a candidate site; preferably, setting sites with a mutation frequency greater than or equal to 15% to NA, deleting the site if the number of NA samples is greater than or equal to 2, and otherwise retaining the site as a candidate site;

S34: each candidate site has a list of the mutation frequencies observed in each healthy peripheral blood sample, and this mutation frequency list is the site list filRate; where a site is not found in a sample, the entry is recorded as NA.

Further, the method further comprises:

S13: acquiring targeted capture sequencing data of a sample to be denoised and of a control blood sample;

S23: aligning the targeted capture sequencing data of the sample to be denoised and the control blood sample to a reference genome, and obtaining, through a mutation detection tool, a file containing the SNV and InDel mutations of the sample to be denoised and the control blood sample;

performing S31 and S32;

S35: for each site of each control blood sample, calculating the positive-strand reference read support number, denoted BzA; the negative-strand reference read support number, denoted BfA; the positive-strand mutation read support number, denoted BzB; and the negative-strand mutation read support number, denoted BfB; the variation frequency of each site of the control blood sample is bvaf = (BzB + BfB)/(BzA + BfA + BzB + BfB);

S43: based on the statistics of S31, S32 and S35, identifying and filtering out false positive sites caused by sequencing errors and/or germline mutations, and retaining the positive candidate sites; performing S41 and/or S42 on the positive candidate sites, wherein, when both S41 and S42 are performed, a positive candidate site is a true positive site only if it satisfies both, and is otherwise a false positive site;

performing S5.

Further, when identifying and filtering false positive sites caused by sequencing errors and/or germline mutations, a site is a false positive site if it meets any one or more of the following conditions, and is otherwise a positive candidate site:

1) VzA < 3 and VfB < 3;

2) BzB >= 2 and BfB >= 2;

3) bvaf > 0 and vaf/bvaf < 5;

4) repeat number >= 5 and vaf < 0.01;

5) the insertion or deletion length of the variant site is more than 50 bp and vaf < 0.02;

6) vaf < 0.007.
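The six conditions above can be sketched as a small rule table. This is a minimal illustration, not the patented implementation: the `SiteStats` container, the `repeat_number` and `indel_length` field names, and the helper functions are hypothetical, and condition 1 is coded exactly as it is written in the text (VzA and VfB).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteStats:
    """Per-site statistics; field names mirror the symbols used in the text."""
    VzA: int            # positive-strand reference reads, sample to be denoised
    VfB: int            # negative-strand mutation reads, sample to be denoised
    BzB: int            # positive-strand mutation reads, control blood sample
    BfB: int            # negative-strand mutation reads, control blood sample
    vaf: float          # variation frequency in the sample to be denoised
    bvaf: float         # variation frequency in the control blood sample
    repeat_number: int = 0   # hypothetical field: repeat count at the site
    indel_length: int = 0    # hypothetical field: 0 for an SNV


def matched_conditions(s: SiteStats) -> List[int]:
    """Return the numbers of the conditions (1-6) that the site meets."""
    checks = {
        1: s.VzA < 3 and s.VfB < 3,              # as written in the text
        2: s.BzB >= 2 and s.BfB >= 2,
        3: s.bvaf > 0 and s.vaf / s.bvaf < 5,    # short-circuits when bvaf == 0
        4: s.repeat_number >= 5 and s.vaf < 0.01,
        5: s.indel_length > 50 and s.vaf < 0.02,
        6: s.vaf < 0.007,
    }
    return [number for number, met in checks.items() if met]


def is_positive_candidate(s: SiteStats) -> bool:
    """A site is kept as a positive candidate only if no condition is met."""
    return not matched_conditions(s)
```

Returning the list of matched conditions, rather than a bare boolean, makes it easy to log why a site was discarded.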

Further, the method further comprises:

S6: selecting the true positive sites, constructing references with different mutation frequencies, performing targeted capture sequencing, and performing experimental verification; preferably, the different mutation frequencies include 1%, 2% and 5%.

Further, the mutation detection tool comprises any one or more of the following: VarDict, GATK, VarScan, FreeBayes, MuTect, Strelka and SAMtools.

A system for reducing targeted second generation sequencing background noise, the system comprising:

an acquisition unit: used for acquiring the targeted capture sequencing data of the sample to be denoised;

an SNV and InDel mutation detection unit: used for aligning the targeted capture sequencing data of the sample to be denoised to a reference genome, and obtaining, through a mutation detection tool, a file containing the SNV and InDel mutations of the sample to be denoised;

a statistics unit: used for calculating, for each site of each sample to be denoised, the positive-strand reference read support number, denoted VzA; the negative-strand reference read support number, denoted VfA; the positive-strand mutation read support number, denoted VzB; and the negative-strand mutation read support number, denoted VfB; the sum of the positive-strand reference read support number and the positive-strand mutation read support number is denoted Zsum; the sum of the negative-strand reference read support number and the negative-strand mutation read support number is denoted Fsum;

a positive site judgment unit: used for calculating, with a beta-binomial distribution and for the given Zsum, the positive-strand false-positive read support number at which the false positive occurrence rate reaches a threshold, denoted zN; and for calculating, for the given Fsum, the negative-strand false-positive read support number at which the false positive occurrence rate reaches the threshold, denoted fN; if zN ≤ VzB or fN ≤ VfB, the site is a true positive site; otherwise, it is a false positive site;

a noise reduction unit: used for removing the targeted capture sequencing data of the false positive sites.

An apparatus for reducing targeted second generation sequencing background noise, the apparatus comprising: a memory and a processor;

the memory is used for storing program instructions;

the processor is configured to invoke the program instructions, which when executed, are configured to perform the method of reducing background noise of targeted second generation sequencing described above.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of reducing targeted second generation sequencing background noise described above.

The invention has the advantages that:

1. The method disclosed by the application performs targeted capture sequencing of peripheral blood samples from healthy people, aligns the data to a reference genome and, with a mutation detection tool, quickly and accurately obtains the SNV and InDel mutation files. It combines the information for each site across the healthy peripheral blood samples, calculates the mutation frequency of each sample at each site, and retains the sites whose mutation frequency is below 15% in more than 1 sample to obtain baseline data for the site mutation frequencies. False positive sites in the sample to be denoised are then screened against this baseline, which improves the reliability and accuracy of the data analysis and provides a reliable basis for the subsequent one-sample t-test.

2. The method disclosed by the application judges whether each site is a positive site by means of a beta-binomial distribution and/or a one-sample t-test, and also takes into account the length of the repeat region flanking the site, the number of reads supporting the variant in the sample to be tested, the frequency of the variant in the sample to be tested, the number of reads supporting the variant in the control blood sample, and the frequency of the variant in the control blood sample. This enables efficient and accurate detection and identification of SNVs and InDels and reduces the false alarm rate.

3. To improve the accuracy of candidate site detection and reduce the false positive rate, the method disclosed by the application uses a control blood sample to help identify and filter out germline mutations.

4. To verify the accuracy of true positive site identification, the method disclosed by the application uses references with different mutation frequencies for capture sequencing and experimental verification. The results show that the noise reduction is highly accurate and feasible, and that the method has important scientific research value and clinical application value in the fields of bioinformatics and clinical diagnosis.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method for reducing background noise of targeted second-generation sequencing according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a system for reducing background noise in targeted second generation sequencing according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an apparatus for reducing background noise in targeted second generation sequencing according to an embodiment of the present invention;

FIG. 4 is a flow chart of the S41 Beta-binomial distribution used in a method for reducing background noise of targeted second generation sequencing according to an embodiment of the present invention;

FIG. 5 is a flow chart of a single sample t-test S42 used in a method for reducing background noise in targeted second generation sequencing according to an embodiment of the present invention;

FIG. 6 is a flow chart of S41 and S42 employed in a method for reducing background noise in targeted second generation sequencing provided by an embodiment of the present invention;

FIG. 7 is a flow chart of S41 and S43 employed in a method for reducing background noise in targeted second generation sequencing provided by an embodiment of the present invention;

FIG. 8 is a flow chart of S42 and S43 employed in a method for reducing background noise in targeted second generation sequencing provided by an embodiment of the present invention;

FIG. 9 is a flow chart of S41, S42 and S43 used in a method for reducing background noise of targeted second generation sequencing according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of a system for reducing background noise in targeted second generation sequencing, comprising 101, 201, 301, 401, and 501, in accordance with an embodiment of the present invention.

Detailed Description

In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings.

In some of the flows described in the specification and claims of the present invention and in the above figures, a plurality of operations appearing in a particular order are included, but it should be clearly understood that the operations may be performed in other than the order in which they appear herein or in parallel, the sequence numbers of the operations such as S101, S102, etc. are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments according to the invention without any creative effort, are within the protection scope of the invention.

FIG. 1 shows a method for reducing background noise in targeted second generation sequencing according to an embodiment of the present invention, the method comprising:

S1: acquiring targeted capture sequencing data, including S11, S12 and S13;

S11: acquiring targeted capture sequencing data of a sample to be denoised;

S13: acquiring targeted capture sequencing data of the sample to be denoised and of the control blood sample.

S2: SNV and InDel mutation detection, including S21, S22 and S23;

S23: aligning the targeted capture sequencing data of the sample to be denoised and the control blood sample to a reference genome, and obtaining, through a mutation detection tool, a file containing the SNV and InDel mutations of the sample to be denoised and the control blood sample.

S3: statistics of the sequencing data for the sites and calculation of baseline features, including S31, S32, S33, S34 and S35;

S35: for each site of each control blood sample, calculating the positive-strand reference read support number, denoted BzA; the negative-strand reference read support number, denoted BfA; the positive-strand mutation read support number, denoted BzB; and the negative-strand mutation read support number, denoted BfB; the variation frequency of each site of the control blood sample is bvaf = (BzB + BfB)/(BzA + BfA + BzB + BfB).

S4: judging whether the site is a false positive site, including S41, S42 and S43;

S43: based on the statistics of S31, S32 and S35, identifying and filtering out false positive sites caused by sequencing errors and/or germline mutations, and retaining the positive candidate sites; performing S41 and/or S42 on the positive candidate sites, wherein, when both S41 and S42 are performed, a positive candidate site is a true positive site only if it satisfies both, and is otherwise a false positive site.

S5: removing the targeted capture sequencing data of the false positive site.
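The way S43 combines the three checks can be sketched as follows. This is only an illustration of the decision logic described above; the function name `classify_site` and its boolean arguments are hypothetical, and a test that is not executed in a given embodiment is passed as `None`.

```python
from typing import Optional


def classify_site(passes_s43_filter: bool,
                  s41_true_positive: Optional[bool] = None,
                  s42_true_positive: Optional[bool] = None) -> str:
    """Combine the S43 filter with the outcomes of S41 and/or S42.

    passes_s43_filter: True if the site survived the sequencing-error and
        germline filter and is therefore a positive candidate site.
    s41_true_positive / s42_true_positive: result of the beta-binomial test
        (S41) and the one-sample t-test (S42); None means "not executed".
    """
    if not passes_s43_filter:
        return "false positive"
    executed = [r for r in (s41_true_positive, s42_true_positive) if r is not None]
    if not executed:
        raise ValueError("at least one of S41 and S42 must be executed")
    # When both tests are executed, both must be satisfied.
    return "true positive" if all(executed) else "false positive"
```

For example, `classify_site(True, True, False)` yields "false positive", because the site passes S41 but not S42.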

Depending on the application scenario and research purpose, the steps of the method can be combined into the following 6 embodiments:

In example 1, a method of reducing background noise in targeted second generation sequencing comprises performing S11, S21, S31, S41 and S5; the flow of the method is shown in fig. 4;

In example 2, a method of reducing background noise in targeted second generation sequencing comprises performing S11, S12, S21, S22, S31, S32, S33, S34, S42 and S5; the flow of the method is shown in fig. 5;

In example 3, a method of reducing background noise in targeted second generation sequencing comprises performing S11, S12, S21, S22, S31, S32, S33, S34, S41, S42 and S5; the flow of the method is shown in fig. 6;

In example 4, a method of reducing background noise in targeted second generation sequencing comprises performing S13, S23, S31, S32, S35, S43, S41 and S5; the flow of the method is shown in fig. 7;

In example 5, a method of reducing background noise in targeted second generation sequencing comprises performing S12, S13, S22, S23, S31, S32, S33, S34, S35, S43, S42 and S5; the flow of the method is shown in fig. 8;

In example 6, a method of reducing background noise in targeted second generation sequencing comprises performing S12, S13, S22, S23, S31, S32, S33, S34, S35, S43, S41, S42 and S5; the flow of the method is shown in fig. 9.

Taking example 6 as an example, the method for reducing background noise of targeted second generation sequencing comprises:

In one embodiment, the healthy peripheral blood samples are from 40 young healthy people. A peripheral blood sample is a sample collected from the blood circulating through the body, which includes arterial blood and venous blood. Peripheral blood samples are typically collected from venous blood and can be used for various medical examinations and studies, such as clinical assays, disease diagnosis, genetic testing and drug monitoring. A sample is typically collected by puncturing a vein with a needle and drawing the blood into a test tube or blood collection tube, which is then sent to a laboratory for analysis. Such analysis can provide information on various blood parameters, such as blood cell count, hemoglobin level, clotting function, immunological indexes and biochemical indexes.

In one embodiment, the samples undergo targeted capture sequencing to a depth of more than 1000X, the sequencing data are then aligned to a reference genome, and vcf files of the SNV and InDel mutations of the samples are obtained with VarDict.

The mutation detection tool comprises any one or more of the following: VarDict, GATK, VarScan, FreeBayes, MuTect, Strelka and SAMtools.

GATK (Genome Analysis Toolkit) is a widely used genome analysis toolkit for detecting and annotating single nucleotide variations and structural variations in the genome. GATK provides a range of tools and algorithms, including mutation detection, mutation filtering and genotype quality assessment. It uses statistical models and algorithms to detect various types of variation, such as SNPs and InDels, efficiently and accurately, and to assess and filter the genotype quality of the variants. Its VariantFiltration tool can be used for false positive removal. VariantFiltration filters out variant sites with a high false positive rate according to user-defined filtering conditions, for example filtering based on quality and depth or filtering out low-frequency variants. These filtering conditions can be adjusted to the actual situation to reduce false positive noise.

VarScan is a tool for detecting and annotating single nucleotide variations and structural variations in DNA sequencing data. It compares the sequencing data of the sequenced sample and a control sample, identifies possible mutation sites by statistical methods, and provides the corresponding mutation information and annotations. VarScan can detect multiple types of variation, such as SNPs, InDels and CNVs, and provides a rich set of variant filtering and annotation options. Its fpfilter tool can be used for false positive removal. fpfilter filters the variant results according to statistical principles and user-defined filtering conditions, for example indexes such as sequencing depth, mutation frequency and sequencing error rate, to reduce false positive noise.

FreeBayes is an open source mutation detection tool for detecting single nucleotide mutations and insertions/deletions from sequencing data. It uses a Bayesian statistical method to identify possible mutation sites by comparing the sequencing data of the sequenced sample with the reference genome, and provides the corresponding mutation information and annotations. FreeBayes has high sensitivity and accuracy and supports parallel runs and batch processing.

MuTect is a tool specifically used to detect single nucleotide variations in tumor samples. It compares the sequencing data of a tumor sample and a control sample, identifies possible tumor-specific mutation sites by statistical models and algorithms, and provides the corresponding mutation information and annotations. MuTect has high sensitivity and specificity, and can accurately detect low-frequency tumor variation.

Strelka is a mutation detection tool for whole-exome sequencing data that detects single nucleotide and structural variations. By comparing the sequencing data of tumor samples and control samples, it identifies possible mutation sites using various algorithms and statistical models, and provides mutation information and annotations. Strelka has high accuracy and sensitivity, and can detect various types of variation.

SAMtools is a common set of sequencing data processing tools that includes the ability to detect single nucleotide variations. It can process and analyze sequencing data, including reading and converting sequencing file formats, aligning sequencing reads, and identifying and filtering variations. SAMtools provides a variety of commands and options that allow data processing and analysis to be customized as desired.

VarDict is a high performance mutation detection tool for detecting single nucleotide variations and insertions/deletions from sequencing data. It compares the sequencing data of the sequenced sample with the reference genome, identifies possible mutation sites using mutation frequencies in combination with other algorithms, and provides the corresponding mutation information and annotations. VarDict has high sensitivity and accuracy and can process large-scale sequencing data.

The positive-strand reference read support number is the number of non-mutated reads from the positive strand; the negative-strand reference read support number is the number of non-mutated reads from the negative strand; the positive-strand mutation read support number is the number of mutated reads from the positive strand; and the negative-strand mutation read support number is the number of mutated reads from the negative strand. Sequencing depth (all reads covering the mutation site) = VzA + VfA + VzB + VfB.

Mutation frequency: vaf = number of mutated reads / sequencing depth, i.e., vaf = (positive-strand mutation read support number + negative-strand mutation read support number)/(positive-strand reference read support number + negative-strand reference read support number + positive-strand mutation read support number + negative-strand mutation read support number).
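As a worked illustration of these definitions, the four strand-specific counts can be held in a small record from which Zsum, Fsum, the sequencing depth and vaf are derived. The class name `StrandCounts` and the example numbers are hypothetical; only the symbols VzA, VfA, VzB, VfB, Zsum, Fsum and vaf come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrandCounts:
    """Strand-specific read support numbers for one site of one sample."""
    VzA: int  # positive-strand reference reads
    VfA: int  # negative-strand reference reads
    VzB: int  # positive-strand mutation reads
    VfB: int  # negative-strand mutation reads

    @property
    def Zsum(self) -> int:
        """All positive-strand reads covering the site."""
        return self.VzA + self.VzB

    @property
    def Fsum(self) -> int:
        """All negative-strand reads covering the site."""
        return self.VfA + self.VfB

    @property
    def depth(self) -> int:
        """Sequencing depth = VzA + VfA + VzB + VfB."""
        return self.Zsum + self.Fsum

    @property
    def vaf(self) -> float:
        """Mutation frequency = mutated reads / sequencing depth."""
        return (self.VzB + self.VfB) / self.depth if self.depth else 0.0


site = StrandCounts(VzA=600, VfA=380, VzB=12, VfB=8)
print(site.Zsum, site.Fsum, site.depth, site.vaf)
```

With these illustrative counts, 20 of the 1000 reads covering the site carry the mutation, so vaf is 2%.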

S34: each candidate site has a list of the mutation frequencies observed in each healthy peripheral blood sample, and this mutation frequency list is the site list filRate; where a site is not found in a sample, the entry is recorded as NA;

In one embodiment, the vcf files of the 40 samples are integrated to obtain every mutation site, and the mutation frequency of each sample at each site is calculated. If the mutation frequency of a site in a sample exceeds 15%, the value is set to NA. The number of samples whose value is not NA is then counted for each mutation site, and the sites for which this number exceeds 1 are retained as the candidate sites.
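The baseline construction in this embodiment can be sketched as below. This is a minimal sketch under stated assumptions: the input is taken to be a mapping from a site key to the per-sample mutation frequencies already extracted from the vcf files (with `None` where the site was not found), and the function name `build_filrate`, its parameters and the site keys are hypothetical.

```python
import math
from typing import Dict, List, Optional

NA = math.nan  # stand-in for the NA value used in the text


def build_filrate(site_vafs: Dict[str, List[Optional[float]]],
                  max_vaf: float = 0.15,
                  min_valid_samples: int = 2) -> Dict[str, List[float]]:
    """Build the per-site baseline frequency lists from healthy samples.

    Frequencies above max_vaf (likely germline variants) and missing sites
    are set to NA; a site is kept only if more than one sample is not NA.
    """
    baseline: Dict[str, List[float]] = {}
    for site, vafs in site_vafs.items():
        masked = [NA if (v is None or v > max_vaf) else v for v in vafs]
        n_valid = sum(1 for v in masked if not math.isnan(v))
        if n_valid >= min_valid_samples:
            baseline[site] = masked
    return baseline


healthy = {
    "chr1:100:A>T": [0.001, 0.002, None, 0.5],   # two usable low-frequency values
    "chr2:200:G>C": [0.48, 0.51, None],          # germline-like in every sample
    "chr3:300:C>T": [0.003, None, None],         # seen in only one sample
}
print(sorted(build_filrate(healthy)))
```

Only the first site is retained here, because it is the only one with more than one non-NA value after masking.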

In one embodiment, a sample to be denoised and its control blood sample are collected for targeted capture sequencing, with a sequencing depth higher than 1000X for the sample to be denoised and higher than 200X for the control blood sample. Both are then aligned to a reference genome, and a vcf file of the SNV and InDel somatic mutations of the sample is obtained with VarDict. The control blood sample is mainly used to remove germline mutations (mutations carried from the embryonic development stage, which are therefore present in both normal cells and tumor cells). Because the mutation frequency of a germline mutation is theoretically 50% (heterozygous) or 100% (homozygous), a sequencing depth of 200X is sufficient to detect germline mutations accurately so that they can be removed, which reduces sequencing cost in actual production.

In one embodiment, when identifying and filtering false positive sites caused by sequencing errors and/or germline mutations, a site is a false positive site if it meets any one or more of the following conditions, and is otherwise a positive candidate site:

1) VzA < 3 and VfB < 3;

2) BzB >= 2 and BfB >= 2;

3) bvaf > 0 and vaf/bvaf < 5;

4) repeat number >= 5 and vaf < 0.01;

5) the insertion or deletion length of the variant site is more than 50 bp and vaf < 0.02;

6) vaf < 0.007.

The beta-binomial distribution is a probability distribution that describes the number of successes in a series of Bernoulli trials. It combines a beta distribution with a binomial distribution: the beta distribution describes the uncertainty in the success probability of each trial, and the binomial distribution describes the distribution of the number of successes given that probability. The two parameters of the beta distribution represent the prior weights of success and failure, respectively, while the parameters of the binomial distribution are the total number of trials and the probability of success.

In one embodiment, the probability mass formula for the beta-binomial distribution is:

f(k) = \binom{n}{k} \frac{B(k+a,\, n-k+b)}{B(a,\, b)},

The positive-strand false-positive reads support number zN, given the known positive-strand reads support number Zsum of the site and a false-positive occurrence probability of 90%, is calculated by inverting the cumulative distribution function (using betabinom.ppf in the scipy.stats package): zN = scipy.stats.betabinom.ppf(0.90, zsum, zb, za). Likewise, the negative-strand false-positive reads support number fN, given the known negative-strand reads support number Fsum of the site and a false-positive occurrence probability of 90%, is calculated as fN = scipy.stats.betabinom.ppf(0.90, fsum, fb, fa). If zN is higher than VzB or fN is higher than VfB, the site is considered to be a false-positive site.
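To make the quantile step concrete, the following sketch implements the probability mass formula and its inverse cumulative distribution using only the Python standard library. It is an illustrative stand-in for scipy.stats.betabinom.ppf, not the patented code: the function names and the shape parameters passed in are hypothetical, and how the per-site shape parameters are estimated is not covered here.

```python
from math import exp, lgamma


def log_beta(x: float, y: float) -> float:
    """Natural log of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def betabinom_pmf(k: int, n: int, a: float, b: float) -> float:
    """f(k) = C(n, k) * B(k + a, n - k + b) / B(a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def betabinom_ppf(q: float, n: int, a: float, b: float) -> int:
    """Smallest k whose cumulative probability P(X <= k) reaches q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += betabinom_pmf(k, n, a, b)
        if cumulative >= q:
            return k
    return n


def strand_false_positive(VzB, VfB, Zsum, Fsum, z_shape, f_shape, q=0.90):
    """Apply the rule in the text: false positive if zN > VzB or fN > VfB.

    z_shape and f_shape are (a, b) tuples; their values are assumptions here.
    """
    zN = betabinom_ppf(q, Zsum, *z_shape)
    fN = betabinom_ppf(q, Fsum, *f_shape)
    return zN > VzB or fN > VfB
```

With background shape parameters that put most of the probability mass near zero, a site supported by many mutation reads on both strands is not flagged, while a site with no positive-strand mutation reads is.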

The single-sample t-test is a statistical method for testing whether the mean of a sample differs significantly from a known population mean. It is suitable when the sample data follow a normal distribution.

In a specific embodiment, a single-sample t-test is performed. Taking the site list filRate as the population and the site frequency as vaf, the test determines whether there is a significant difference between the site frequency vaf and the mean of the site's filRate values; p is the p-value returned by scipy.stats.ttest_1samp, with the filRate values as the sample and vaf as the hypothesized mean. If p < 0.05, the site is considered to be a true positive site; if there is no significant difference, the site is considered to be a false positive site.
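The test can be sketched without third-party packages as follows. This is a standard-library illustration of a two-sided one-sample t-test, offered as an assumption-laden stand-in for scipy.stats.ttest_1samp: the function names are hypothetical, NA entries are represented as None and dropped, and the p-value is obtained by numerically integrating the t density after the substitution x = sqrt(df) * tan(theta).

```python
from math import atan, cos, exp, lgamma, pi, sqrt
from statistics import mean, stdev


def two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided tail probability of Student's t via Simpson's rule."""
    theta = atan(abs(t) / sqrt(df))
    h = (pi / 2 - theta) / steps
    power = df - 1
    total = cos(theta) ** power + cos(pi / 2) ** power
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * cos(theta + i * h) ** power
    integral = total * h / 3
    norm = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(pi)
    return min(1.0, 2 * norm * integral)


def one_sample_t_test(background, vaf):
    """Compare the mean of the background frequencies with the observed vaf."""
    values = [v for v in background if v is not None]  # drop NA entries
    spread = stdev(values)  # needs at least two non-NA values
    if spread == 0:
        raise ValueError("background frequencies have zero variance")
    t = (mean(values) - vaf) / (spread / sqrt(len(values)))
    return t, two_sided_p(t, len(values) - 1)


def is_true_positive(background, vaf, alpha=0.05):
    """True positive when the difference is significant (p < alpha)."""
    _, p = one_sample_t_test(background, vaf)
    return p < alpha
```

A site whose observed frequency lies far above the background frequencies seen in healthy samples yields a small p-value and is kept; a site whose frequency is indistinguishable from that background is treated as noise.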

S5: removing the targeted capture sequencing data of the false positive site.

In one embodiment, the method further comprises S6: selecting the true positive sites, constructing references with different mutation frequencies, capturing and sequencing, and performing experimental verification; the different mutation frequencies include 1%, 2% and 5%.

In one embodiment, ^TM Tumor Mutation DNA Mix v2 AF10 (cat# 0710-0094) was used as the reference material and mixed in proportion with human genomic DNA derived from the GM24385 cell line to construct references with different mutation frequencies.

The verification sites are selected as follows:

Mixed gDNA references at mutation frequencies of 1%, 2% and 5% were subjected to capture sequencing, with 3 samples per frequency and 9 samples tested in total.

The experimental results are:

Analytical sensitivity: 99.00% (97.10%-99.66%);

Analytical specificity: 100.00% (100.00%-100.00%);

Positive predictive value: 99.33% (97.59%-99.82%);

Negative predictive value: 100.00% (100.00%-100.00%);

Accuracy: 100.00% (100.00%-100.00%).

Based on these indices, the analysis method performs very well.

Sensitivity is the proportion of all true positive samples that the method correctly identifies. The sensitivity of the method was 99.00%, indicating that 99.00% of all true positive samples were correctly identified. The interval (97.10%-99.66%) is the confidence interval of this index, i.e. the range within which the sensitivity may vary across different sample sets and experimental conditions.

Specificity is the proportion of all true negative samples that the method correctly excludes. The specificity of the method was 100.00%, indicating that all true negative samples were correctly excluded. The confidence interval for the specificity is likewise (100.00%-100.00%), meaning that, at the reported precision, the specificity is not expected to change across different sample sets and experimental conditions.

The positive predictive value is the proportion of the samples judged positive by the method that are actually positive. The positive predictive value of the method was 99.33%, indicating that 99.33% of all samples judged positive were truly positive. The confidence interval (97.59%-99.82%) is the range within which the positive predictive value may vary across different sample sets and experimental conditions.

The negative predictive value is the proportion of the samples judged negative by the method that are actually negative. The negative predictive value of the method was 100.00%, indicating that all samples judged negative were truly negative. As with specificity, the confidence interval for the negative predictive value is (100.00%-100.00%), meaning that, at the reported precision, it is not expected to change across different sample sets and experimental conditions.

Accuracy is the proportion of all samples that the method classifies correctly. The accuracy of the method was 100.00%, indicating that, at the reported precision, the method's judgment was correct for all samples. The confidence interval was (100.00%-100.00%), indicating that, at the reported precision, the accuracy is not expected to change across different sample sets and experimental conditions.
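The five indices defined above follow directly from the counts of true positives, false positives, true negatives and false negatives. The sketch below shows the arithmetic. The text does not state which confidence interval method was used, so the Wilson score interval here is an assumption, and any counts passed in are hypothetical rather than taken from the experiment.

```python
from math import sqrt


def performance_indices(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the five indices from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 for about 95%)."""
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

For example, `wilson_interval(tp, tp + fn)` gives an interval for sensitivity and `wilson_interval(tn, tn + fp)` gives one for specificity.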

In conclusion, the analysis method shows very good performance in terms of sensitivity, specificity, positive predictive value, negative predictive value, accuracy and the like. This means that the method is able to identify true positive and negative samples with high accuracy and is able to exclude most false positive and false negative results. This also demonstrates that the disclosed methods of reducing background noise in second generation sequencing are of great significance to genomic research and clinical diagnosis, and can improve the reliability of the results.

FIG. 2 shows a system for reducing background noise in targeted second generation sequencing, provided by an embodiment of the present invention, the system comprising:

101: acquisition unit: used for acquiring the targeted capture sequencing data of the sample to be denoised;

102: healthy peripheral blood sample acquisition unit: used for acquiring the targeted capture sequencing data of the healthy peripheral blood samples;

103: acquisition unit for the sample to be denoised and its control blood sample: used for acquiring the targeted capture sequencing data of the sample to be denoised and of the control blood sample;

201: SNV and InDel mutation detection unit: used for aligning the targeted capture sequencing data of the sample to be denoised to a reference genome and obtaining, through a mutation detection tool, a file containing the SNV and InDel mutations of the sample to be denoised;

202: SNV and InDel mutation detection unit for healthy peripheral blood samples: used for aligning the targeted capture sequencing data of the healthy peripheral blood samples to a reference genome and obtaining, through a mutation detection tool, a file containing the SNV and InDel mutations of the healthy peripheral blood samples;

203: SNV and InDel mutation detection unit for the sample to be denoised and the control blood sample: used for aligning the targeted capture sequencing data of the sample to be denoised and of the control blood sample to a reference genome and obtaining, through a mutation detection tool, a file containing SNV and InDel mutation of a healthy peripheral blood sample;

301: first statistical unit: used for calculating, for each site of each sample to be denoised, the positive-strand reference reads support number, recorded as VzA; the negative-strand reference reads support number, recorded as VfA; the positive-strand mutation reads support number, recorded as VzB; the negative-strand mutation reads support number, recorded as VfB; the sum of the positive-strand reference reads support number and the positive-strand mutation reads support number, recorded as Zsum; and the sum of the negative-strand reference reads support number and the negative-strand mutation reads support number, recorded as Fsum;

302: second statistical unit: used for calculating the mutation frequency vaf = (VzB + VfB)/(Zsum + Fsum) of each site of each sample to be denoised;

303: third statistical unit: used for setting sites whose mutation frequency exceeds a threshold to NA and calculating, for each site, the number of samples that are not NA; if the number of samples that are not NA exceeds the threshold, the site is deleted, otherwise the site is retained, yielding candidate sites; preferably, sites with a mutation frequency greater than or equal to 15% are set to NA, and a site is deleted if its number of NA samples is greater than or equal to 2, otherwise it is retained, yielding candidate sites;

304: fourth statistical unit: used for generating a list of the mutation frequency of each candidate site in each healthy peripheral blood sample, the mutation frequency list being a site list file; where a site is not found in a sample, it is recorded as NA;

305: fifth statistical unit: used for calculating, for each site of each control blood sample, the positive-strand reference reads support number, recorded as BzA; the negative-strand reference reads support number, recorded as BfA; the positive-strand mutation reads support number, recorded as BzB; the negative-strand mutation reads support number, recorded as BfB; and the variation frequency Bvaf = (BzB + BfB)/(BzA + BfA + BzB + BfB) of each site of the control blood sample.
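The arithmetic performed by the statistical units can be sketched as below. This is an illustration under stated assumptions rather than the patented code: the function names and the dictionary layout are hypothetical, and the site-list rule follows one reading of the "preferably" wording of unit 303 (frequencies of 15% or more are masked as NA, represented here as None, and a site is dropped when two or more samples were masked).

```python
def mutation_frequency(ref_pos: int, ref_neg: int, alt_pos: int, alt_neg: int):
    """vaf = (VzB + VfB) / (Zsum + Fsum); returns None when there is no coverage."""
    zsum = ref_pos + alt_pos   # Zsum: positive-strand reference + mutation reads
    fsum = ref_neg + alt_neg   # Fsum: negative-strand reference + mutation reads
    depth = zsum + fsum
    return (alt_pos + alt_neg) / depth if depth else None


def build_site_list(frequencies: dict, vaf_cutoff: float = 0.15, max_masked: int = 2):
    """Mask high frequencies as None and drop sites masked in too many samples.

    `frequencies` maps a site key to the per-sample frequencies observed in
    healthy peripheral blood samples (None where the site was not found).
    """
    site_list = {}
    for site, values in frequencies.items():
        masked = sum(1 for v in values if v is not None and v >= vaf_cutoff)
        if masked >= max_masked:
            continue  # likely a common germline site, so it is removed
        site_list[site] = [
            None if (v is None or v >= vaf_cutoff) else v for v in values
        ]
    return site_list
```

The same `mutation_frequency` function applies to the control blood sample by passing BzA, BfA, BzB and BfB, which gives Bvaf.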

401: positive site judgment unit: used for calculating, with the beta-binomial distribution, the positive-strand false-positive reads support number at which, given Zsum, the false-positive occurrence rate reaches a threshold, recorded as zN; and the negative-strand false-positive reads support number at which, given Fsum, the false-positive occurrence rate reaches the threshold, recorded as fN; if zN is less than or equal to VzB or fN is less than or equal to VfB, the site is a true positive site, otherwise it is a false positive site;

402: single-sample t-test unit: used for performing a single-sample t-test, taking the site list filRate as the population and the site variation frequency as vaf, to determine whether there is a significant difference between the site variation frequency vaf and the mean of the site's filRate values; if there is a significant difference, the site is a true positive site, otherwise it is a false positive site; preferably, the site is a true positive site if P < 0.05;

403: site filtration unit: used for identifying and filtering false positive sites caused by sequencing errors and/or germline mutations based on the statistics of 301, 302 and 305, removing the false positive sites and retaining the positive candidate sites; 401 and/or 402 are then executed on the positive candidate sites, and when both 401 and 402 are executed, a site is a true positive site only if it satisfies both, otherwise it is a false positive site.

501: noise reduction unit: used for removing the targeted capture sequencing data of the false positive sites.
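How the judgment units combine can be sketched as a small driver function. The function name, argument names and the idea of passing each unit as a callable are hypothetical choices for illustration; the only logic taken from the text is that a site removed by the site filtration unit is discarded, and that when both the beta-binomial check and the t-test are run, a site must pass both to be kept.

```python
def denoise(sites, is_hard_filtered=None, passes_401=None, passes_402=None):
    """Return the sites kept after filtering; everything else is removed (unit 501).

    Each argument after `sites` is an optional callable taking one site and
    returning a bool, so that different unit combinations can be expressed.
    """
    checks = [check for check in (passes_401, passes_402) if check is not None]
    kept = []
    for site in sites:
        if is_hard_filtered is not None and is_hard_filtered(site):
            continue  # unit 403: sequencing-error or germline false positive
        if all(check(site) for check in checks):
            kept.append(site)  # true positive under every executed test
    return kept
```

Passing only `passes_401` corresponds to a configuration that uses the beta-binomial check alone, while passing all three callables corresponds to the fullest configuration.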

The units of the system can be combined according to the application scenario and research purpose, giving the following 6 embodiments:

In example 1, a system for reducing background noise in targeted second generation sequencing comprises: 101, 201, 301, 401 and 501; a schematic diagram of the system is shown in FIG. 10;

In example 2, a system for reducing background noise in targeted second generation sequencing comprises: 101, 102, 201, 202, 301, 302, 303, 304, 402 and 501;

In example 3, a system for reducing background noise in targeted second generation sequencing comprises: 101, 102, 201, 202, 301, 302, 303, 304, 401, 402 and 501;

In example 4, a system for reducing background noise in targeted second generation sequencing comprises: 103, 203, 301, 302, 305, 403, 401 and 501;

In example 5, a system for reducing background noise in targeted second generation sequencing comprises: 102, 103, 202, 203, 301, 302, 303, 304, 305, 403, 402 and 501;

In example 6, a system for reducing background noise in targeted second generation sequencing comprises: 102, 103, 202, 203, 301, 302, 303, 304, 305, 403, 401, 402 and 501.
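The six combinations can be written down as a lookup table, which makes it easy to check which judgment units a given example runs. The table below copies the unit numbers from the list above; the constant and function names are hypothetical.

```python
# Unit numbers per example, in the order given in the text.
EMBODIMENTS = {
    1: [101, 201, 301, 401, 501],
    2: [101, 102, 201, 202, 301, 302, 303, 304, 402, 501],
    3: [101, 102, 201, 202, 301, 302, 303, 304, 401, 402, 501],
    4: [103, 203, 301, 302, 305, 403, 401, 501],
    5: [102, 103, 202, 203, 301, 302, 303, 304, 305, 403, 402, 501],
    6: [102, 103, 202, 203, 301, 302, 303, 304, 305, 403, 401, 402, 501],
}

JUDGMENT_UNITS = (401, 402, 403)


def judgment_units(example: int) -> list:
    """Return the judgment units (401, 402, 403) used by an example, in order."""
    return [unit for unit in EMBODIMENTS[example] if unit in JUDGMENT_UNITS]


def needs_control_blood(example: int) -> bool:
    """True when the example acquires a matched control blood sample (unit 103)."""
    return 103 in EMBODIMENTS[example]
```

Every example ends with unit 501, and only examples 4 to 6 include the control blood sample units and the site filtration unit 403.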

FIG. 3 shows an apparatus for reducing background noise in targeted second generation sequencing provided by an embodiment of the present invention, the apparatus comprising: a memory and a processor; the apparatus may further include an input device and an output device.

The memory, processor, input device, and output device may be connected by a bus or other means; FIG. 3 illustrates a bus connection as an example. The memory is used for storing program instructions; the processor is configured to invoke the program instructions, which, when executed, implement the method described above for reducing background noise in targeted second generation sequencing.

In some embodiments, the memory may be understood as any device holding a program and the processor may be understood as a device using the program.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method of reducing targeted second generation sequencing background noise as described above.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the above-described systems, apparatuses and units may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc, and the like.

The foregoing describes in detail a computer device provided by the present invention. Those skilled in the art will appreciate that this description is not intended to limit the invention, the scope of which is defined by the appended claims.

Claims

1. A method of reducing background noise in targeted second generation sequencing, the method comprising:

S11: acquiring targeted capture sequencing data of a sample to be denoised;

S5: removing the targeted capture sequencing data of the false positive site.

2. The method of reducing targeted second generation sequencing background noise of claim 1, further comprising:

S11, S21, and S31 are performed;

S5 is performed.

3. The method for reducing background noise in targeted second generation sequencing of claim 2, wherein the analysis procedure of the site list filRate comprises:

4. The method of reducing targeted second generation sequencing background noise of claim 1, further comprising:

S31 and S32 are executed;

S43: identifying and filtering false positive sites caused by sequencing errors and/or germline mutations based on the statistical data of S31, S32 and S35, removing the false positive sites, and retaining the positive candidate sites; S41 and/or S42 are performed on the positive candidate sites, and when both S41 and S42 are performed, a site is a true positive site only if it satisfies both, otherwise it is a false positive site;

S5 is performed.

5. The method for reducing background noise in targeted second-generation sequencing according to claim 4, wherein the criterion for identifying and filtering false positive sites caused by sequencing errors and/or germline mutations is that a site meeting any one or more of the following conditions is a false positive site, and otherwise is a positive candidate site:

1) VzA < 3 and VfB < 3;

2) BzB >= 2 and BfB >= 2;

3) Bvaf > 0 and vaf/Bvaf < 5;

4) Repeat number >= 5 and vaf < 0.01;

6) vaf < 0.007.

6. The method of reducing targeted second generation sequencing background noise of claim 1, further comprising:

7. The method of reducing background noise in targeted second generation sequencing of claim 1, wherein the mutation detection tool comprises any one or more of the following: VarDict, GATK, VarScan, FreeBayes, MuTect, Strelka and SAMtools.

8. A system for reducing background noise in targeted second generation sequencing, the system comprising:

9. An apparatus for reducing background noise in targeted second generation sequencing, the apparatus comprising: a memory and a processor;

the memory is used for storing program instructions;

the processor is configured to invoke program instructions, which when executed, are configured to perform the method of reducing background noise of targeted second-generation sequencing of any of claims 1-7.

10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the method of reducing targeted second generation sequencing background noise of any of claims 1-7.