CN116959579B - System for reducing errors of second generation sequencing system

Info

Publication number: CN116959579B
Application number: CN202311207718.3A
Authority: CN
Inventors: 张怡然; 陈慧娟; 王冰; 段小红; 郝艳同; 蔡丽丽; 周启明
Original assignee: Beijing Qiuzhen Medical Laboratory Co ltd
Current assignee: Beijing Qiuzhen Medical Laboratory Co ltd
Priority date: 2023-09-19
Filing date: 2023-09-19
Publication date: 2023-12-22
Anticipated expiration: 2043-09-19
Also published as: CN116959579A

Abstract

The invention relates to the technical field of medical molecular biology, in particular to a system for reducing errors of a second generation sequencing system.

Description

System for reducing errors of second generation sequencing system

Technical Field

Background

For tumor DNA detection on short-read high-throughput sequencing platforms, the DNA must first be fragmented. Fragmentation methods fall into two types: mechanical disruption (the ultrasonic method) and endonuclease-based fragmentation (the enzymatic digestion method). The enzymatic digestion method requires no consumables and is easily integrated into an automated library construction workflow, so it is gradually replacing the ultrasonic method. However, because enzymatic digestion has a certain sequence preference, artificial mutations can be introduced during library construction. Conventional data filtering removes all chimeric reads, which also discards the true mutations carried by those reads, so sensitivity is reduced and the detected mutation frequency is inaccurate. A blacklist therefore needs to be established to filter these artificial mutations and ensure the accuracy of the results.

Disclosure of Invention

To address the shortcomings described in the background, and to obtain higher stability while maintaining accuracy, the invention establishes an enzymatic-digestion-specific blacklist based on a second generation sequencing platform, filters artificially introduced mutations in second generation sequencing libraries, and improves detection accuracy.

A system for reducing errors in a second generation sequencing system, comprising:

the second generation sequencing module is used for second generation sequencing of the DNA sample;

the blacklist module is used for comparing the second generation sequencing data and screening false positive mutation points;

wherein the blacklist module comprises artificial mutation sites.

Further, the second generation sequencing is performed on an Illumina sequencing platform.

Further, the blacklist is established by the following method:

S1, extending each cancer-related hotspot interval by 50 bp upstream and downstream, and using the extended intervals as reference sequences for the palindromic sequence search;

S2, artificially dividing the reference sequences to obtain n = Σ[(L - K) + 1] subsequences, wherein L is the length of the extended reference sequence, K is the length of the palindromic sequence, and K ranges from 2 to L/2;

S3, using a getSeq function to obtain the palindromic sequence centered at position i, and checking whether the bases reached during extension satisfy the palindromic property;

S4, merging and retaining palindromic structures that have overlapping regions by using a mergeOut function;

S5, judging whether each retained palindromic sequence has missing bases, converting the positional relationship of the missing bases into SNP information, and storing the SNP information in a snp_list to form the blacklist.
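As a worked illustration of the subsequence count in step S2, the following minimal sketch sums (L - K) + 1 over every K from 2 to L/2. The function name `count_subsequences` is hypothetical, and treating L/2 as integer division with an inclusive upper bound is an assumption, since the text does not state how odd L is handled.

```python
def count_subsequences(L: int) -> int:
    """Number of K-mers over all lengths K = 2 .. L // 2 in a sequence of length L.

    A sequence of length L contains (L - K) + 1 windows of length K.
    Assumption: L / 2 is rounded down and the range of K is inclusive.
    """
    return sum((L - K) + 1 for K in range(2, L // 2 + 1))


if __name__ == "__main__":
    # L = 10 gives K = 2, 3, 4, 5 and window counts 9 + 8 + 7 + 6.
    print(count_subsequences(10))
```

For L = 10 the terms are 9, 8, 7 and 6, so n = 30.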

Further, in step S3, when a base that does not satisfy the palindromic property is found during extension, the position of the unmatched base is recorded; when 3 unmatched sites appear consecutively, extension stops and the palindromic sequence coordinates are obtained.

Further, in step S4, filtering parameters are set, and only palindromic sequences with a length of 17 bp to 40 bp are retained.

Further, in S5:

when the length of the palindromic sequence is odd and a missing base exists, converting the position of the missing base into SNP information, and storing the SNP information into a snp_list;

when the length of the palindromic sequence is even and one or two missing bases exist, converting the palindromic sequence into SNP information according to the position relationship of the missing bases and storing the SNP information into a snp_list;

when the length of the palindromic sequence is even and two missing bases exist at adjacent positions, they are combined into one MNP record and stored in the snp_list.

Further, in S5, when the sites in the snp_list are located at the extreme ends of the palindromic sequence, these sites are ignored; when the sites in the snp_list are not at the very end of the palindromic sequence, these sites are added to the blacklist.

Beneficial effects: the system for reducing errors of a second generation sequencing system provided by the invention statistically characterizes the false positive mutations produced by the enzymatic digestion method, uses bioinformatics methods to generate blacklists for the chimera-prone regions of different panels, and sets a filtering rule so that only mutations located in the chimeric regions are filtered, thereby improving the sensitivity and accuracy of detection.

Drawings

FIG. 1 is a flow chart of blacklist establishment according to the present invention;

FIG. 2 is a graph showing SNV detection comparison based on second generation sequencing by an enzymatic cleavage method and a mechanical disruption method;

FIG. 3 is a summary of the features of false positive mutations in the enzymatic cleavage process;

FIG. 4 is a schematic representation of how the enzymatic cleavage method introduces mutations;

FIG. 5 is a graph showing SNV detection by the enzyme digestion method and the mechanical disruption method based on the system of the present invention.

Detailed Description

In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be described in detail with reference to the following detailed description and the accompanying drawings. The experimental procedures, which do not address the specific conditions in the examples below, are generally carried out under conventional conditions or under conditions recommended by the manufacturer. The test materials used in the examples described below, unless otherwise specified, were purchased from conventional biochemical reagent stores. Percentages and parts are by weight unless otherwise indicated. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, any methods and materials similar or equivalent to those described herein can be used in the present invention. The preferred methods and materials described herein are presented for illustrative purposes only.

The inventors compared SNV detection between the ultrasonic method and the enzymatic digestion method and found that significantly more mutations are detected with the enzymatic digestion method than with the mechanical disruption method (shown in FIG. 2), and that certain sites are detected repeatedly in different samples and are very likely to be false positives. Summarizing the features of these sites shows that they lie at unpaired positions inside, or at the junction of, two immediately adjacent palindromic sequences (as shown in FIG. 3). It is presumed that chimeric reads are generated between the two palindromic sequences through enzymatic cleavage and repair; the mechanism is shown in FIG. 4. Because the palindromic regions contain mismatches, one strand is used as the template during repair, and a mutation is introduced whose type is consistent with the sequence complementary to the template strand.

Based on these characteristics, the inventors consider it necessary to search for unmatched sites within the palindromic sequence regions of a specific panel, generate a blacklist of false positive mutation sites, and set a filtering rule so that only the false positive mutations are filtered out and true positive mutations are retained, thereby ensuring the accuracy of the results.

Example 1 blacklist establishment

Blacklist sites were generated taking the 1123 panel (1123 genes, covering a genomic region of approximately 2 Mb) as an example. First, each bed interval of the given panel was extended by 50 bp upstream of its start site and 50 bp downstream of its stop site, and the extended intervals were used as reference sequences for the palindromic sequence search, so as to avoid missed detection when a palindromic region lies at the end of a bed interval;
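The interval extension described above can be sketched as follows. This is only an illustration: the function name `pad_intervals`, the tuple layout, and the clamping of the start coordinate at 0 are assumptions, and the sketch does not clamp the end coordinate to the chromosome length.

```python
from typing import List, Tuple

# (chromosome, start, end) with 0-based, half-open coordinates as in a bed file
Interval = Tuple[str, int, int]


def pad_intervals(intervals: List[Interval], flank: int = 50) -> List[Interval]:
    """Extend every interval by `flank` bp upstream and downstream."""
    padded = []
    for chrom, start, end in intervals:
        # Assumption: a start coordinate cannot go below 0.
        padded.append((chrom, max(0, start - flank), end + flank))
    return padded


if __name__ == "__main__":
    panel = [("chr7", 100, 200), ("chr1", 20, 80)]
    print(pad_intervals(panel))
```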

then a series of K-mers was artificially generated to segment the sequences and search for palindromic structures; in total n = Σ[(L - K) + 1] subsequences can be produced, wherein L represents the length of the reference sequence, K represents the palindromic sequence length, and K ranges from 2 to L/2; a getSeq(string, i) function was used to obtain the palindromic sequence centered at position i: within the function, the sequence is extended to the left and right from position i while checking whether the extended bases satisfy the palindromic property (base complementarity, A-T and C-G); if bases that do not satisfy the palindromic property are found during extension, one or more mismatched sites are present, and the positions of the unmatched bases, which may be a single base or two adjacent bases, are recorded. Extension stops when 3 unmatched sites appear consecutively, and the palindromic sequence coordinates are obtained;
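A minimal sketch of this center-outward extension is given below. It is not the getSeq function of the invention, whose body is not disclosed; the name `expand_palindrome`, its return format, and the choice to expand from the boundary between two bases (the even-length case only) are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def expand_palindrome(seq, i, max_run=3):
    """Expand outwards from the boundary between seq[i - 1] and seq[i].

    Returns (start, end, mismatches): half-open coordinates of the widest span
    that ends on a complementary pair, plus the (left, right) index pairs
    inside that span that are not complementary.
    """
    left, right = i - 1, i
    mismatches = []
    run = 0                      # current number of consecutive unmatched pairs
    best = (i, i, 0)             # (start, end, mismatches recorded so far)
    while left >= 0 and right < len(seq):
        if COMPLEMENT.get(seq[left]) == seq[right]:
            run = 0
            best = (left, right + 1, len(mismatches))
        else:
            run += 1
            mismatches.append((left, right))
            if run == max_run:   # stop after 3 consecutive unmatched sites
                break
        left -= 1
        right += 1
    start, end, n_inside = best
    # Drop trailing mismatches that lie outside the reported span.
    return start, end, mismatches[:n_inside]


if __name__ == "__main__":
    print(expand_palindrome("GAATTC", 3))        # perfect palindrome
    print(expand_palindrome("GTATTC", 3))        # one interior unmatched pair
    print(expand_palindrome("AAAGAATTCAAA", 6))  # stops after 3 mismatches
```

The odd-length case, where a single central base has no partner, would need a second expansion mode starting from `left = i - 1, right = i + 1`; it is left out to keep the sketch short.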

palindromic structures with overlapping regions were merged using a mergeOut(seq_dic, ch, start, end, seq) function, filtering parameters were set, and only palindromic sequences with a length of 17 bp to 40 bp were retained;
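The merge-and-filter step can be illustrated with the sketch below. It does not reproduce the mergeOut function, whose implementation is not given; the name `merge_and_filter`, the input format, the definition of overlap as sharing at least one base, and applying the length filter after merging are all assumptions.

```python
def merge_and_filter(spans, min_len=17, max_len=40):
    """Merge overlapping spans per chromosome, then keep those of 17-40 bp.

    spans: iterable of (chrom, start, end) with half-open coordinates.
    """
    merged = []
    for chrom, start, end in sorted(spans):
        last = merged[-1] if merged else None
        # Overlap means the new span starts before the previous one ends.
        if last and last[0] == chrom and start < last[2]:
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return [s for s in merged if min_len <= s[2] - s[1] <= max_len]


if __name__ == "__main__":
    spans = [
        ("chr1", 100, 115),
        ("chr1", 110, 125),   # overlaps the span above, merged to 100-125
        ("chr1", 300, 310),   # 10 bp, too short
        ("chr2", 5, 60),      # 55 bp, too long
    ]
    print(merge_and_filter(spans))
```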

each retained palindromic sequence was then handled according to whether it contains missing bases: if the length of the palindromic sequence is odd and a missing base exists, the position of the missing base is converted into SNP information and stored in a snp_list; if the length of the palindromic sequence is even and one or two missing bases exist, they are converted into SNP information according to the positional relationship of the missing bases and stored in the snp_list. If the length of the palindromic sequence is even and two missing bases exist at adjacent positions, they are combined into one MNP (multi-nucleotide polymorphism) record and stored in the snp_list.

Finally, each site in the snp_list is checked: sites at the extreme ends of a palindromic sequence are ignored, and all other sites are added to the blacklist.
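The conversion of unmatched positions into blacklist records might be sketched as follows. This is a simplification under stated assumptions: the function name `sites_to_records` and the record layout (type, position, reference bases) are hypothetical, the odd-length and even-length cases are not distinguished (any two adjacent positions are simply merged into one MNP), and no alternate allele is derived because the text does not specify how it is computed.

```python
def sites_to_records(seq, start, end, unpaired):
    """Convert unmatched positions in the palindrome seq[start:end] to records.

    unpaired: 0-based positions that lack a complementary partner.
    Returns a list of (kind, position, reference_bases) tuples.
    """
    # Sites at the extreme ends of the palindrome are ignored.
    positions = sorted(p for p in set(unpaired) if p not in (start, end - 1))
    records = []
    k = 0
    while k < len(positions):
        p = positions[k]
        if k + 1 < len(positions) and positions[k + 1] == p + 1:
            # Two adjacent positions are combined into one MNP record.
            records.append(("MNP", p, seq[p:p + 2]))
            k += 2
        else:
            records.append(("SNP", p, seq[p]))
            k += 1
    return records


if __name__ == "__main__":
    print(sites_to_records("GACTTC", 0, 6, [2, 3]))  # adjacent pair
    print(sites_to_records("GTATTC", 0, 6, [1, 4]))  # two separate sites
    print(sites_to_records("GTATTC", 0, 6, [0, 1]))  # position 0 is an end site
```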

Example 2 Single sample data extraction and detection

DNA libraries were constructed for 54 paired tumor samples using the Anzan enzymatic digestion library construction kit and the KAPA mechanical disruption kit, respectively, followed by hybridization capture and sequencing. The raw data were quality-controlled by removing adapter sequences, low-quality data, and reads that were too short. The data were then aligned to the human genome, duplicate reads were removed with Picard, and SNVs were called with VarDict. The SNVs were compared with the blacklist: a variant present in the blacklist was retained if its mutation frequency was greater than 10% and filtered out if its mutation frequency was less than or equal to 10%. Comparing the mutation detection results of the two methods shows that their consistency is markedly improved (see FIG. 5).
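The blacklist filtering rule in this example can be expressed as the short sketch below. The variant representation (a dict with `chrom`, `pos` and `af` keys), the blacklist key of (chromosome, position), and the function name `filter_variants` are assumptions for illustration; they are not the output format of VarDict or of the system itself.

```python
def filter_variants(variants, blacklist, min_af=0.10):
    """Drop blacklisted variants whose allele frequency is <= min_af.

    variants: list of dicts with keys "chrom", "pos" and "af" (fraction 0-1).
    blacklist: set of (chrom, pos) tuples.
    """
    kept = []
    for v in variants:
        on_blacklist = (v["chrom"], v["pos"]) in blacklist
        if on_blacklist and v["af"] <= min_af:
            continue             # likely enzymatic-digestion artifact
        kept.append(v)           # not blacklisted, or frequency above 10%
    return kept


if __name__ == "__main__":
    blacklist = {("chr7", 55191822)}
    calls = [
        {"chrom": "chr7", "pos": 55191822, "af": 0.03},   # filtered
        {"chrom": "chr7", "pos": 55191822, "af": 0.25},   # retained
        {"chrom": "chr12", "pos": 25245350, "af": 0.02},  # retained
    ]
    for call in filter_variants(calls, blacklist):
        print(call)
```

The coordinates in the usage block are arbitrary placeholders, not sites taken from the blacklist of the invention.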

Finally, it should be noted that the above description is only a preferred embodiment of the present invention, and that many similar changes can be made by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A system for reducing errors in a second generation sequencing system, comprising:

wherein the blacklist module comprises artificial mutation sites;

the blacklist module is established by adopting the following method:

s5, judging whether the reserved palindromic sequence has missing bases or not, converting the position relationship of the missing bases into SNP information, and storing the SNP information into a snp_list to form a blacklist;

in the step S3, the base which does not meet the palindromic property is found in the expansion process, the unmatched base positions are recorded, when 3 unmatched sites continuously appear, the continuous expansion is stopped, and palindromic sequence coordinates are obtained;

in the step S4, filtering parameters are set, and palindromic sequences within the length range of 17bp-40bp are reserved;

in the step S5: when the length of the palindromic sequence is odd and a missing base exists, converting the position of the missing base into SNP information, and storing the SNP information into a snp_list;

when the length of the palindromic sequence is even, two missing bases exist and are adjacent to each other, combining the two missing bases into MNP information, and storing the MNP information into a snp_list;

in S5, when the locus in the snp_list is positioned at the extreme end of the palindromic sequence, the locus is ignored; when the sites in the snp_list are not at the very end of the palindromic sequence, these sites are added to the blacklist.