CN116959579B - System for reducing errors of second generation sequencing system - Google Patents
- Publication number
- CN116959579B (application CN202311207718.3A)
- Authority
- CN
- China
- Prior art keywords
- snp
- palindromic
- palindromic sequence
- sequence
- list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention relates to the technical field of medical molecular biology, in particular to a system for reducing errors of a second generation sequencing system.
Description
Technical Field
The invention relates to the technical field of medical molecular biology, in particular to a system for reducing errors of a second generation sequencing system.
Background
In tumor DNA detection on a short-read high-throughput sequencing platform, the DNA is first fragmented. Fragmentation methods fall into two types: mechanical disruption (the ultrasonic method) and endonuclease-based cleavage (the enzymatic digestion method). The enzymatic digestion method requires no consumables and is easily integrated into an automated library-construction workflow, so it is gradually replacing the ultrasonic method. However, because the enzymatic digestion method has a certain preference, artificial mutations can be introduced during library construction. Conventional data filtering removes all chimeric reads and thereby also loses the true mutations carried by those reads, which reduces sensitivity and makes the detected frequencies inaccurate. A blacklist therefore needs to be established to filter these artificial mutations and ensure the accuracy of the result.
Disclosure of Invention
To address the shortcomings described in the background, and to obtain higher stability while maintaining accuracy, the invention establishes a blacklist specific to the enzymatic digestion method on a second-generation sequencing platform, filters artificially introduced mutations from second-generation sequencing libraries, and improves detection accuracy.
A system for reducing errors in a second-generation sequencing system, comprising:
a second-generation sequencing module, used to perform second-generation sequencing of a DNA sample;
a blacklist module, used to compare the second-generation sequencing data against the blacklist and screen out false-positive mutation sites;
wherein the blacklist module comprises artificially introduced mutation sites.
Further, the second-generation sequencing is performed on an Illumina sequencing platform.
Further, the blacklist is established by the following method:
S1, extending each cancer-related hotspot interval by 50 bp upstream and downstream, and using the extended sequence as the reference sequence for searching for palindromic sequences;
S2, artificially dividing the reference sequence into $n = \sum_{K=2}^{L/2} [(L - K) + 1]$ subsequences, wherein L is the length of the extended reference sequence, K is the candidate palindromic sequence length, and K ranges from 2 to L/2;
S3, using a getSeq function to acquire the palindromic sequence centered at position i, and checking whether each expanded base satisfies the palindromic property;
S4, using a mergeOut function to merge and retain the palindromic structures that have overlapping regions;
S5, judging whether each retained palindromic sequence has missing bases, converting the positional relationship of the missing bases into SNP information, and storing the SNP information in snp_list to form the blacklist.
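Steps S1 and S2 can be illustrated with a minimal Python sketch. The function names, the 0-based half-open coordinates (the BED convention), the clamping to the chromosome length, and the use of integer division for L/2 are assumptions made for illustration; the patent does not specify them.

```python
from typing import List, Tuple

FLANK = 50  # bp added on each side, as in step S1


def extend_interval(start: int, end: int, chrom_len: int,
                    flank: int = FLANK) -> Tuple[int, int]:
    """Extend a 0-based, half-open interval by `flank` bp on each side.

    The result is clamped to [0, chrom_len] (an assumption, so that
    intervals near a chromosome end stay valid).
    """
    return max(0, start - flank), min(chrom_len, end + flank)


def count_subsequences(L: int) -> int:
    """n = sum over K = 2 .. L // 2 of (L - K + 1)."""
    return sum(L - K + 1 for K in range(2, L // 2 + 1))


def kmers(seq: str, K: int) -> List[str]:
    """All (len(seq) - K + 1) substrings of length K."""
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]
```

For a reference of length 10, K runs from 2 to 5 and the count is 9 + 8 + 7 + 6 = 30, which matches the number of substrings that `kmers` enumerates over the same range of K.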
Further, in step S3, when a base that does not satisfy the palindromic property is found during expansion, the position of the mismatched base is recorded; when 3 mismatched sites appear consecutively, expansion stops and the palindromic sequence coordinates are obtained.
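The expansion rule of step S3 might look like the following sketch. It is not the patent's getSeq implementation, which is not disclosed: the name `get_seq`, the `odd` flag, the choice to end the reported span on the last matching pair, and the returned tuple layout are all assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
MAX_RUN = 3  # stop after this many consecutive mismatched pairs


def get_seq(seq, i, odd=False, max_run=MAX_RUN):
    """Expand outward from position i and return (start, end, mismatches).

    `end` is exclusive. With odd=False the center lies between seq[i] and
    seq[i + 1]; with odd=True seq[i] is an unpaired central base.
    `mismatches` lists the (left, right) index pairs inside the span that
    fail A-T / C-G pairing.
    """
    left, right = (i - 1, i + 1) if odd else (i, i + 1)
    start, end = (i, i + 1) if odd else (i + 1, i + 1)  # span found so far
    mismatches = []  # mismatches enclosed by a later matching pair
    pending = []     # mismatches seen since the last matching pair
    while left >= 0 and right < len(seq) and len(pending) < max_run:
        if COMPLEMENT.get(seq[left]) == seq[right]:
            mismatches.extend(pending)
            pending = []
            start, end = left, right + 1
        else:
            pending.append((left, right))
        left -= 1
        right += 1
    return start, end, mismatches
```

On the EcoRI site `GAATTC` with `i=2` this returns the whole hexamer with no mismatches; on `GTATTC` it returns the same span with one recorded mismatch at index pair (1, 4). Trailing mismatches that are never enclosed by a matching pair are dropped, so the span never ends on an unpaired base.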
Further, in step S4, filtering parameters are set and only palindromic sequences 17 bp to 40 bp in length are retained.
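A minimal sketch of the merge-and-filter idea in step S4 follows. The function names are hypothetical and do not reproduce the patent's mergeOut signature; applying the length filter after merging, and treating intervals that merely touch as non-overlapping, are assumptions.

```python
MIN_LEN, MAX_LEN = 17, 40  # bp, the retained length range from step S4


def merge_overlapping(intervals):
    """Merge overlapping half-open (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:  # overlaps the previous span
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def keep_by_length(intervals, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep intervals whose length lies in [min_len, max_len]."""
    return [(s, e) for s, e in intervals if min_len <= e - s <= max_len]
```

For example, `(0, 10)` and `(5, 20)` merge into `(0, 20)`, which is 20 bp long and is kept, while an isolated 5 bp span is discarded.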
Further, in S5:
when the length of the palindromic sequence is odd and a missing base exists, the position of the missing base is converted into SNP information and stored in snp_list;
when the length of the palindromic sequence is even and one or two missing bases exist, they are converted into SNP information according to the positional relationship of the missing bases and stored in snp_list;
when the length of the palindromic sequence is even and two missing bases exist at adjacent positions, they are combined into one MNP record and stored in snp_list.
Further, in S5, sites in snp_list that lie at the extreme ends of the palindromic sequence are ignored; sites in snp_list that do not lie at the extreme ends of the palindromic sequence are added to the blacklist.
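The S5 rules can be sketched as follows. The record layout `(kind, first_pos, last_pos)`, the function name, and the half-open palindrome coordinates are assumptions; the sketch records positions only and does not attempt to derive alternate alleles, which the text does not specify.

```python
def to_blacklist(start, end, unpaired):
    """Convert unpaired positions inside palindrome [start, end) to records.

    Two adjacent unpaired bases in an even-length palindrome become one
    MNP record; every other unpaired base becomes an SNP record. Records
    touching either end of the palindrome are ignored.
    """
    even_length = (end - start) % 2 == 0
    positions = sorted(set(unpaired))
    records = []
    idx = 0
    while idx < len(positions):
        pos = positions[idx]
        adjacent = idx + 1 < len(positions) and positions[idx + 1] == pos + 1
        if even_length and adjacent:
            records.append(("MNP", pos, pos + 1))
            idx += 2
        else:
            records.append(("SNP", pos, pos))
            idx += 1
    terminal = {start, end - 1}  # first and last base of the palindrome
    return [r for r in records
            if r[1] not in terminal and r[2] not in terminal]
```

With a 20 bp palindrome at `[100, 120)`, unpaired bases at 109 and 110 yield a single MNP record, while an unpaired base at position 100 or 119 is dropped because it sits at an end.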
Beneficial effects: the system for reducing errors of a second-generation sequencing system provided by the invention statistically characterizes the false-positive mutations produced by the enzymatic digestion method, uses bioinformatics methods to generate blacklists for the regions of different panels that are prone to forming chimeras, and sets a filtering rule so that only mutations located in those chimera-prone regions are filtered, which improves the sensitivity and accuracy of detection.
Drawings
FIG. 1 is a flow chart of blacklist establishment according to the present invention;
FIG. 2 is a comparison of SNV detection by second-generation sequencing using the enzymatic digestion method and the mechanical disruption method;
FIG. 3 is a summary of the features of false-positive mutations produced by the enzymatic digestion method;
FIG. 4 is a schematic of the mechanism by which the enzymatic digestion method introduces mutations;
FIG. 5 is a comparison of SNV detection by the enzymatic digestion method and the mechanical disruption method when the system of the present invention is used.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be described in detail with reference to the following detailed description and the accompanying drawings. The experimental procedures, which do not address the specific conditions in the examples below, are generally carried out under conventional conditions or under conditions recommended by the manufacturer. The test materials used in the examples described below, unless otherwise specified, were purchased from conventional biochemical reagent stores. Percentages and parts are by weight unless otherwise indicated. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, any methods and materials similar or equivalent to those described herein can be used in the present invention. The preferred methods and materials described herein are presented for illustrative purposes only.
By comparing SNV detection between the ultrasonic method and the enzymatic digestion method, the inventors found that the enzymatic digestion method yields significantly more detected mutations than the mechanical disruption method (FIG. 2), and that individual sites are detected repeatedly across different samples, which indicates a very high probability that they are false positives. Summarizing the features of these sites shows that they lie at unpaired positions within, or at the junction between, two immediately adjacent palindromic sequences (FIG. 3). It is therefore presumed that chimeric reads form between the two palindromic sequences during enzymatic cleavage and repair (mechanism shown in FIG. 4): because the palindromic regions contain mismatches, one strand serves as the template during repair, and a mutation is introduced whose type matches the sequence complementary to the template strand.
Based on these features, the inventors consider it necessary to search for the unmatched sites in the palindromic sequence regions of a specific panel, generate a blacklist of false-positive mutation sites, and set a filtering rule so that only false-positive mutations are filtered out, thereby ensuring the accuracy of the result.
Example 1 blacklist establishment
Blacklist sites were generated taking the 1123 panel (1123 genes, covering a genomic region of approximately 2 Mb) as an example. First, each BED interval of the given panel is extended by 50 bp upstream of its start site and 50 bp downstream of its end site to serve as the reference sequence for searching for palindromic sequences, which avoids missing palindromic regions that lie at the ends of a BED interval.
A series of K-mers is then generated artificially to segment the sequences and search for palindromic structures. In total, $n = \sum_{K=2}^{L/2} [(L - K) + 1]$ subsequences can be produced, where L is the length of the reference sequence, K is the palindromic sequence length, and K ranges from 2 to L/2.
The getSeq(string, i) function obtains the palindromic sequence centered at position i: it extends to the left and right from position i while checking whether the extended bases satisfy the palindromic property (base complementarity, A-T and C-G). If bases that do not satisfy the palindromic property are found during extension, one or more mismatched sites are present, and the positions of the unmatched bases, which may be a single base or two adjacent bases, are recorded. Extension stops when 3 unmatched sites appear consecutively, and the palindromic sequence coordinates are obtained.
The mergeOut(seq_dic, ch, start, end, seq) function merges the palindromic structures that have overlapping regions; filtering parameters are set so that only palindromic sequences 17 bp to 40 bp in length are retained.
Each retained palindromic sequence is handled according to whether it has missing bases. If the length of the palindromic sequence is odd and a missing base exists, the position of the missing base is converted into SNP information and stored in snp_list. If the length is even and one or two missing bases exist, they are converted into SNP information according to the positional relationship of the missing bases and stored in snp_list. If the length is even and two missing bases are adjacent in position, they are combined into one MNP (multi-nucleotide polymorphism) record and stored in snp_list.
Finally, each site in snp_list is checked against the ends of its palindromic sequence: sites at the extreme ends are ignored, and all other sites are added to the blacklist.
Example 2 Single sample data extraction and detection
DNA libraries were constructed from 54 paired tumor samples using both the Anzan enzymatic digestion library construction kit and the KAPA mechanical disruption kit, followed by hybridization capture and sequencing. The raw data were quality-controlled by removing adapter sequences, low-quality data, and reads that were too short. The data were aligned to the human genome, duplicate reads were removed with Picard, and SNVs were called with VarDict. The SNVs were then compared with the blacklist: a variant present in the blacklist was retained if its mutation frequency was greater than 10% and filtered out if its mutation frequency was less than or equal to 10%. Comparing the mutation detection results of the two methods shows that their consistency is markedly improved (see FIG. 5).
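The blacklist comparison in this example can be sketched as below. The variant dictionaries, the `(chrom, pos)` blacklist key, and the function name are assumptions for illustration; they are not the output format of VarDict or of the patent's implementation.

```python
AF_THRESHOLD = 0.10  # blacklisted variants at or below this frequency are filtered


def filter_variants(variants, blacklist, threshold=AF_THRESHOLD):
    """Drop variants that are blacklisted and have frequency <= threshold.

    `variants` is a list of dicts with keys "chrom", "pos" and "af";
    `blacklist` is a set of (chrom, pos) tuples.
    """
    kept = []
    for variant in variants:
        site = (variant["chrom"], variant["pos"])
        if site in blacklist and variant["af"] <= threshold:
            continue  # likely artifact introduced during library construction
        kept.append(variant)
    return kept
```

Variants outside the blacklist pass through unchanged regardless of frequency, so only calls at blacklisted sites are subject to the 10% cutoff.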
Finally, it should be noted that the above description is only a preferred embodiment of the present invention, and that many similar changes can be made by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (1)
1. A system for reducing errors in a second generation sequencing system, comprising:
a second-generation sequencing module, used to perform second-generation sequencing of a DNA sample;
a blacklist module, used to compare the second-generation sequencing data against the blacklist and screen out false-positive mutation sites;
wherein the blacklist module comprises artificially introduced mutation sites;
the blacklist module is established by the following method:
S1, extending each cancer-related hotspot interval by 50 bp upstream and downstream, and using the extended sequence as the reference sequence for searching for palindromic sequences;
S2, artificially dividing the reference sequence into $n = \sum_{K=2}^{L/2} [(L - K) + 1]$ subsequences, wherein L is the length of the extended reference sequence, K is the candidate palindromic sequence length, and K ranges from 2 to L/2;
S3, using a getSeq function to acquire the palindromic sequence centered at position i, and checking whether each expanded base satisfies the palindromic property;
S4, using a mergeOut function to merge and retain the palindromic structures that have overlapping regions;
S5, judging whether each retained palindromic sequence has missing bases, converting the positional relationship of the missing bases into SNP information, and storing the SNP information in snp_list to form the blacklist;
in step S3, when a base that does not satisfy the palindromic property is found during expansion, the position of the unmatched base is recorded; when 3 unmatched sites appear consecutively, expansion stops and the palindromic sequence coordinates are obtained;
in step S4, filtering parameters are set and only palindromic sequences 17 bp to 40 bp in length are retained;
in step S5: when the length of the palindromic sequence is odd and a missing base exists, the position of the missing base is converted into SNP information and stored in snp_list;
when the length of the palindromic sequence is even and one or two missing bases exist, they are converted into SNP information according to the positional relationship of the missing bases and stored in snp_list;
when the length of the palindromic sequence is even and two missing bases exist at adjacent positions, they are combined into one MNP record and stored in snp_list;
in S5, sites in snp_list that lie at the extreme ends of the palindromic sequence are ignored; sites in snp_list that do not lie at the extreme ends of the palindromic sequence are added to the blacklist.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311207718.3A CN116959579B (en) | 2023-09-19 | 2023-09-19 | System for reducing errors of second generation sequencing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116959579A (en) | 2023-10-27 |
CN116959579B (en) | 2023-12-22 |
Family
ID=88458691
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311207718.3A Active CN116959579B (en) | 2023-09-19 | 2023-09-19 | System for reducing errors of second generation sequencing system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116959579B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109658983A (en) * | 2018-12-20 | 2019-04-19 | 深圳市海普洛斯生物科技有限公司 | A kind of method and apparatus identifying and eliminate false positive in variance detection |
CN111415707A (en) * | 2020-03-10 | 2020-07-14 | 四川大学 | Prediction method of clinical individualized tumor neoantigen |
CN112116956A (en) * | 2020-09-29 | 2020-12-22 | 深圳裕策生物科技有限公司 | Tumor single sample TMB detection method and device based on second-generation sequencing |
CN116064755A (en) * | 2023-01-12 | 2023-05-05 | 华中科技大学同济医学院附属同济医院 | Device for detecting MRD marker based on linkage gene mutation |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090123928A1 (en) * | 2007-10-11 | 2009-05-14 | The Johns Hopkins University | Genomic Landscapes of Human Breast and Colorectal Cancers |
-
2023
- 2023-09-19 CN CN202311207718.3A patent/CN116959579B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN116959579A (en) | 2023-10-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||