CN116959579B - System for reducing errors of second generation sequencing system - Google Patents

System for reducing errors of second generation sequencing system

Publication number
CN116959579B
CN116959579B CN202311207718.3A CN202311207718A CN116959579B CN 116959579 B CN116959579 B CN 116959579B CN 202311207718 A CN202311207718 A CN 202311207718A CN 116959579 B CN116959579 B CN 116959579B
Authority
CN
China
Prior art keywords
snp
palindromic
palindromic sequence
sequence
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311207718.3A
Other languages
Chinese (zh)
Other versions
CN116959579A (en)
Inventor
张怡然
陈慧娟
王冰
段小红
郝艳同
蔡丽丽
周启明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qiuzhen Medical Laboratory Co ltd
Original Assignee
Beijing Qiuzhen Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-09-19
Filing date
2023-09-19
Publication date
2023-12-22
Application filed by Beijing Qiuzhen Medical Laboratory Co ltd
Priority to CN202311207718.3A
Publication of CN116959579A
Application granted
Publication of CN116959579B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of medical molecular biology, in particular to a system for reducing errors of a second generation sequencing system.

Description

System for reducing errors of second generation sequencing system
Technical Field
The invention relates to the technical field of medical molecular biology, in particular to a system for reducing errors of a second generation sequencing system.
Background
For tumor DNA detection on short-read high-throughput sequencing platforms, the DNA must first be fragmented. Fragmentation methods fall into two types: mechanical disruption (the ultrasonic method) and endonuclease-based fragmentation (the enzymatic cleavage method). The enzymatic cleavage method requires no consumables and is easily integrated into automated library construction workflows, so it is gradually replacing the ultrasonic method. However, because the enzymatic cleavage method has certain sequence preferences, artificial mutations can be introduced during library construction. Conventional data filtering removes all chimeric reads, which also discards the true mutations carried by those reads, so sensitivity is reduced and the detected mutation frequencies become inaccurate. A blacklist therefore needs to be established to filter these artificial mutations and ensure the accuracy of the results.
Disclosure of Invention
To address the shortcomings described in the background, and to obtain higher stability while maintaining accuracy, the invention establishes a blacklist specific to the enzymatic cleavage method on a second generation sequencing platform, filters artificially introduced mutations in second generation sequencing libraries, and improves detection accuracy.
A system for reducing errors in a second generation sequencing system, comprising:
the second generation sequencing module is used for second generation sequencing of the DNA sample;
the blacklist module is used for comparing the second generation sequencing data and screening false positive mutation points;
wherein the blacklist module comprises artificial mutation sites.
Further, the second generation sequencing is second generation sequencing using an Illumina sequencing platform.
Further, the blacklist is established by the following method:
S1, extending each cancer-related hotspot interval by 50 bp upstream and downstream, and using the extended sequence as the reference sequence for searching for palindromic sequences;
S2, artificially splitting the reference sequence into K-mers to obtain $n = \sum_{K=2}^{L/2} [(L - K) + 1]$ subsequences, wherein L is the length of the extended reference sequence, K is the palindromic sequence length, and K ranges from 2 to L/2;
S3, using a getSeq function to obtain the palindromic sequence centered at position i, checking whether each extended base satisfies the palindromic property;
S4, using a mergeOut function to merge palindromic structures that have overlapping regions and retain them;
S5, determining whether each retained palindromic sequence has missing bases, converting the positions of the missing bases into SNP information, and storing the SNP information in a snp_list to form the blacklist.
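As a rough illustration of step S2, the sketch below counts and enumerates the K-mers of a reference sequence. It assumes that the sum runs over every K from 2 to L/2 (rounded down) and that each K contributes (L - K) + 1 subsequences; the function names `count_subsequences` and `enumerate_kmers` are hypothetical and do not appear in the patent.

```python
def count_subsequences(L: int) -> int:
    """Total number of K-mers for all K in 2..L//2 (assumed reading of S2)."""
    return sum((L - K) + 1 for K in range(2, L // 2 + 1))


def enumerate_kmers(seq: str):
    """Yield (start, K, subsequence) for every K in 2..len(seq)//2."""
    L = len(seq)
    for K in range(2, L // 2 + 1):
        # A sequence of length L contains (L - K) + 1 windows of width K.
        for start in range(L - K + 1):
            yield start, K, seq[start:start + K]


if __name__ == "__main__":
    reference = "ACGTACGTAC"  # toy 10-base sequence, for illustration only
    print(count_subsequences(len(reference)))
    print(sum(1 for _ in enumerate_kmers(reference)))
```

For a 10-base toy sequence the two counts agree, which is a quick consistency check on the formula as read here.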
Further, in step S3, when bases that do not satisfy the palindromic property are found during extension, the positions of the mismatched bases are recorded; when 3 mismatched sites appear consecutively, extension stops and the palindromic sequence coordinates are obtained.
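The extension rule in step S3 can be sketched as follows. This is not the patent's getSeq function, whose internals are not disclosed; it is a minimal sketch under stated assumptions: the center is taken as the gap between `seq[i-1]` and `seq[i]` (an even-length palindrome), pairing follows A-T and C-G, and the reported coordinates end at the last matching pair. The name `expand_palindrome` and its parameters are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def expand_palindrome(seq, i, max_consecutive_mismatches=3):
    """Expand outward from the gap between seq[i-1] and seq[i].

    Returns (start, end, mismatches): half-open coordinates ending at the
    last matching pair, plus the (left, right) index pairs inside those
    coordinates that failed the A-T / C-G complement check.
    """
    left, right = i - 1, i
    start, end = i, i        # empty span until a pair matches
    mismatches = []          # every mismatched pair seen so far
    kept = 0                 # mismatches that lie inside [start, end)
    run = 0                  # current run of consecutive mismatches
    while left >= 0 and right < len(seq):
        if COMPLEMENT.get(seq[left]) == seq[right]:
            start, end = left, right + 1
            kept = len(mismatches)
            run = 0
        else:
            mismatches.append((left, right))
            run += 1
            if run >= max_consecutive_mismatches:
                break        # 3 consecutive mismatches: stop extending
        left -= 1
        right += 1
    return start, end, mismatches[:kept]


if __name__ == "__main__":
    print(expand_palindrome("GAATTC", 3))   # perfect palindrome
    print(expand_palindrome("GACTTC", 3))   # one mismatched central pair
```

An odd-length variant would start from `left = i - 1` and `right = i + 1`, leaving the base at position i unpaired.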
Further, in step S4, filtering parameters are set so that only palindromic sequences 17 bp to 40 bp in length are retained.
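A minimal sketch of the merge-and-filter idea in step S4 is given below. It is not the patent's mergeOut function; it assumes half-open (start, end) coordinates on a single chromosome and assumes that the length filter is applied after merging. The name `merge_and_filter` is hypothetical.

```python
def merge_and_filter(intervals, min_len=17, max_len=40):
    """Merge overlapping half-open (start, end) intervals, then keep only
    those whose length falls within [min_len, max_len] bp."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if min_len <= e - s <= max_len]


if __name__ == "__main__":
    candidates = [(0, 20), (10, 30), (100, 105), (200, 260)]
    print(merge_and_filter(candidates))
```

Filtering after merging means that two short overlapping palindromes can jointly pass the 17 bp lower bound; filtering before merging would be a different design and is not implied here.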
Further, in S5:
when the palindromic sequence has an odd length and one missing base exists, the position of the missing base is converted into SNP information and stored in the snp_list;
when the palindromic sequence has an even length and one or two missing bases exist, they are converted into SNP information according to their positional relationship and stored in the snp_list;
when the palindromic sequence has an even length and two missing bases exist at adjacent positions, they are combined into one MNP record and stored in the snp_list.
Further, in S5, sites in the snp_list that lie at the extreme ends of the palindromic sequence are ignored, and sites that do not lie at the extreme ends are added to the blacklist.
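One possible reading of step S5 is sketched below. The patent does not spell out the conversion rule, so several points here are assumptions: each mismatched pair is turned into candidate substitutions whose alternate allele is the complement of the partner base (in line with the description that artifact mutations match the sequence complementary to the template strand), end positions are dropped before adjacent sites are merged, and the merged MNP simply concatenates the two alternate alleles. The function name `mismatches_to_blacklist`, its parameters, and the record layout are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatches_to_blacklist(chrom, seq, offset, start, end, mismatches):
    """Convert mismatched pairs inside palindrome [start, end) into records.

    seq is an uppercase A/C/G/T reference string and offset is the 1-based
    genomic coordinate of seq[0]. Records are (chrom, pos, ref, alt).
    """
    candidates = {}
    for left, right in mismatches:
        # Assumed rule: the artifact copies the complement of the partner base.
        candidates[left] = COMPLEMENT[seq[right]]
        candidates[right] = COMPLEMENT[seq[left]]
    # Sites at the extreme ends of the palindrome are ignored.
    positions = sorted(p for p in candidates if p not in (start, end - 1))
    records, k = [], 0
    while k < len(positions):
        p = positions[k]
        if k + 1 < len(positions) and positions[k + 1] == p + 1:
            q = positions[k + 1]          # adjacent sites: one MNP record
            records.append((chrom, offset + p, seq[p:q + 1],
                            candidates[p] + candidates[q]))
            k += 2
        else:                             # isolated site: one SNP record
            records.append((chrom, offset + p, seq[p], candidates[p]))
            k += 1
    return records


if __name__ == "__main__":
    print(mismatches_to_blacklist("chr1", "GACTTC", 100, 0, 6, [(2, 3)]))
```

Storing records as (chrom, pos, ref, alt) tuples makes a later blacklist lookup a simple set membership test.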
The beneficial effects are as follows: the system for reducing errors of the second generation sequencing system provided by the invention uses bioinformatics methods to summarize the characteristics of false positive mutations produced by the enzymatic cleavage method, generates blacklists of the chimera-prone regions of different panels, and sets a filtering principle so that only mutations located in those chimera-prone regions are filtered, thereby improving the sensitivity and accuracy of detection.
Drawings
FIG. 1 is a flow chart of blacklist establishment according to the present invention;
FIG. 2 is a graph showing SNV detection comparison based on second generation sequencing by an enzymatic cleavage method and a mechanical disruption method;
FIG. 3 is a summary of the features of false positive mutations in the enzymatic cleavage process;
FIG. 4 is a schematic diagram of how the enzymatic cleavage method introduces mutations;
FIG. 5 is a graph showing SNV detection by the enzyme digestion method and the mechanical disruption method based on the system of the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be described in detail with reference to the following detailed description and the accompanying drawings. The experimental procedures, which do not address the specific conditions in the examples below, are generally carried out under conventional conditions or under conditions recommended by the manufacturer. The test materials used in the examples described below, unless otherwise specified, were purchased from conventional biochemical reagent stores. Percentages and parts are by weight unless otherwise indicated. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, any methods and materials similar or equivalent to those described herein can be used in the present invention. The preferred methods and materials described herein are presented for illustrative purposes only.
The inventors compared SNV detection between the ultrasonic (mechanical disruption) method and the enzymatic cleavage method and found that the enzymatic cleavage method detected significantly more mutations than the mechanical disruption method (as shown in FIG. 2). Certain sites were detected repeatedly in different samples, which makes them very likely to be false positives. Summarizing the features of these sites showed that they are located at unpaired bases or junctions inside two immediately adjacent palindromic sequences (as shown in FIG. 3). It is therefore presumed that chimeric reads are generated between the two palindromic sequences during enzymatic cleavage and repair; the mechanism is shown in FIG. 4. Because the palindromic regions contain mismatches, one strand serves as the template during repair and mutations are introduced, and the resulting mutation types are consistent with the sequence complementary to the template strand.
Based on these characteristics, the inventors consider it necessary to search for the unmatched sites within the palindromic sequence regions of a specific panel, generate a blacklist of false positive mutation sites, and set a filtering principle so that only false positive mutations are filtered out and true positive mutations are retained, thereby ensuring the accuracy of the results.
Example 1 blacklist establishment
Taking the 1123 panel (1123 genes, covering a genomic region of approximately 2 Mb) as an example, blacklist sites were generated. First, each BED interval of the given panel was extended by 50 bp upstream of its start site and 50 bp downstream of its end site to serve as the reference sequence for the palindromic sequence search, which avoids missing palindromic regions located at the ends of the BED intervals.
Next, a series of K-mers was artificially generated to segment the sequences and search for palindromic structures. In total, $n = \sum_{K=2}^{L/2} [(L - K) + 1]$ subsequences can be produced, where L is the length of the reference sequence, K is the palindromic sequence length, and K ranges from 2 to L/2. The getSeq(string, i) function obtains the palindromic sequence centered at position i: it extends to the left and right from position i while checking whether the extended bases satisfy the palindromic property (the base complementarity principle, A-T and C-G). If bases that do not satisfy the palindromic property are found during extension, one or more mismatched sites are present, and the positions of the unmatched bases, which may be a single base or two adjacent bases, are recorded. Extension stops when 3 unmatched sites appear consecutively, and the palindromic sequence coordinates are obtained.
Palindromic structures with overlapping regions were merged using the mergeOut(seq_dic, ch, start, end, seq) function, and filtering parameters were set so that only palindromic sequences 17 bp to 40 bp in length were retained.
Each retained palindromic sequence was processed according to whether missing bases are present. If the palindromic sequence has an odd length and one missing base exists, the position of the missing base is converted into SNP information and stored in the snp_list. If the palindromic sequence has an even length and one or two missing bases exist, they are converted into SNP information according to their positional relationship and stored in the snp_list. If the palindromic sequence has an even length and two missing bases exist at adjacent positions, they are combined into one MNP (multi-nucleotide polymorphism) record and stored in the snp_list.
Finally, each site in the snp_list was checked to determine whether it lies at the extreme end of the palindromic sequence. Sites at the extreme ends were ignored; all other sites were added to the blacklist.
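The interval extension that opens this example can be sketched in a few lines. The sketch assumes standard BED conventions (0-based start, half-open end) and clamps the extended start at 0; the end is not clamped because chromosome lengths are not given in the text. The name `extend_intervals` and the sample coordinates are hypothetical.

```python
def extend_intervals(bed_rows, flank=50):
    """Extend each (chrom, start, end) BED interval by `flank` bp on both
    sides, clamping the start at 0."""
    return [(chrom, max(0, start - flank), end + flank)
            for chrom, start, end in bed_rows]


if __name__ == "__main__":
    panel = [("chr7", 30, 200), ("chr7", 1000, 1100)]  # toy intervals
    for row in extend_intervals(panel):
        print(*row, sep="\t")
```

In a real pipeline the extended end would also be capped at the chromosome length, and overlapping extended intervals might be merged before sequence extraction.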
Example 2 Single sample data extraction and detection
DNA libraries were constructed from 54 paired tumor samples using the Anzan enzymatic cleavage library construction kit and the KAPA mechanical disruption kit, followed by hybridization capture and sequencing. The raw data underwent quality control: adapter sequences, low-quality data, and overly short reads were removed. The data were then aligned to the human genome, duplicate reads were removed with Picard, and SNVs were identified with VarDict. The SNVs were compared against the blacklist: a variant present in the blacklist was retained if its mutation frequency was greater than 10% and filtered out if its mutation frequency was less than or equal to 10%. Comparing the mutation detection results of the two methods showed that their consistency was markedly improved (see FIG. 5).
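The blacklist comparison in this example reduces to a simple rule, sketched below. The dictionary keys, the tuple layout of the blacklist, and the function name `filter_variants` are assumptions for illustration; only the 10% threshold and the retain/filter rule come from the text.

```python
def filter_variants(variants, blacklist, min_vaf=0.10):
    """Drop variants that are in the blacklist and have a mutation frequency
    at or below min_vaf; keep everything else.

    variants: iterable of dicts with chrom, pos, ref, alt, vaf keys.
    blacklist: set of (chrom, pos, ref, alt) tuples.
    """
    kept = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        if key in blacklist and v["vaf"] <= min_vaf:
            continue  # blacklisted and low frequency: treated as an artifact
        kept.append(v)
    return kept


if __name__ == "__main__":
    blacklist = {("chr1", 102, "C", "A")}
    calls = [
        {"chrom": "chr1", "pos": 102, "ref": "C", "alt": "A", "vaf": 0.04},
        {"chrom": "chr1", "pos": 102, "ref": "C", "alt": "A", "vaf": 0.25},
        {"chrom": "chr2", "pos": 500, "ref": "G", "alt": "T", "vaf": 0.03},
    ]
    for call in filter_variants(calls, blacklist):
        print(call)
```

Variants outside the blacklist pass through regardless of frequency, which is what keeps low-frequency true mutations from being discarded along with the artifacts.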
Finally, it should be noted that the above description is only a preferred embodiment of the present invention, and that many similar changes can be made by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (1)

1. A system for reducing errors in a second generation sequencing system, comprising:
the second generation sequencing module is used for second generation sequencing of the DNA sample;
the blacklist module is used for comparing the second generation sequencing data and screening false positive mutation points;
wherein the blacklist module comprises artificial mutation sites;
the blacklist module is established by adopting the following method:
S1, extending each cancer-related hotspot interval by 50 bp upstream and downstream, and using the extended sequence as the reference sequence for searching for palindromic sequences;
S2, artificially splitting the reference sequence into K-mers to obtain $n = \sum_{K=2}^{L/2} [(L - K) + 1]$ subsequences, wherein L is the length of the extended reference sequence, K is the palindromic sequence length, and K ranges from 2 to L/2;
S3, using a getSeq function to obtain the palindromic sequence centered at position i, checking whether each extended base satisfies the palindromic property;
S4, using a mergeOut function to merge palindromic structures that have overlapping regions and retain them;
S5, determining whether each retained palindromic sequence has missing bases, converting the positions of the missing bases into SNP information, and storing the SNP information in a snp_list to form the blacklist;
in step S3, when bases that do not satisfy the palindromic property are found during extension, the positions of the unmatched bases are recorded; when 3 unmatched sites appear consecutively, extension stops and the palindromic sequence coordinates are obtained;
in step S4, filtering parameters are set so that only palindromic sequences 17 bp to 40 bp in length are retained;
in step S5: when the palindromic sequence has an odd length and one missing base exists, the position of the missing base is converted into SNP information and stored in the snp_list;
when the palindromic sequence has an even length and one or two missing bases exist, they are converted into SNP information according to their positional relationship and stored in the snp_list;
when the palindromic sequence has an even length and two missing bases exist at adjacent positions, they are combined into one MNP record and stored in the snp_list;
in S5, sites in the snp_list that lie at the extreme ends of the palindromic sequence are ignored, and sites that do not lie at the extreme ends are added to the blacklist.
CN202311207718.3A 2023-09-19 2023-09-19 System for reducing errors of second generation sequencing system Active CN116959579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311207718.3A CN116959579B (en) 2023-09-19 2023-09-19 System for reducing errors of second generation sequencing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311207718.3A CN116959579B (en) 2023-09-19 2023-09-19 System for reducing errors of second generation sequencing system

Publications (2)

Publication Number Publication Date
CN116959579A (en) 2023-10-27
CN116959579B (en) 2023-12-22

Family

ID=88458691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311207718.3A Active CN116959579B (en) 2023-09-19 2023-09-19 System for reducing errors of second generation sequencing system

Country Status (1)

Country Link
CN (1) CN116959579B (en)

Also Published As

Publication number Publication date
CN116959579A (en) 2023-10-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant