CN114334006B - Method and device for filtering noise introduced by enzymatic fragmentation library construction

Info

Publication number: CN114334006B
Application number: CN202111649916.6A
Authority: CN
Inventors: 蒋才; 戴鹏; 程陶然; 朱文鑫
Original assignee: Naonda Nanjing Biological Technology Co ltd
Current assignee: Naonda Nanjing Biological Technology Co ltd
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-11-29
Anticipated expiration: 2041-12-29
Also published as: CN114334006A

Abstract

The invention discloses a method and a device for filtering noise introduced by enzymatic fragmentation library construction. The method comprises the following steps: acquiring paired-end raw sequencing data from an enzymatic fragmentation library, aligning the data to a reference genome sequence, extracting the reads containing Soft Clip marks from the alignment result, and counting the number of Soft Clip bases in each read; recording the reads whose number of Soft Clip bases is greater than a threshold T1 as candidate processing sequences, and extracting the alignment position and the Soft Clip base sequence of each candidate processing sequence; extending a length D before and after the alignment position of each candidate processing sequence to obtain an extension region, searching the extension region for a sequence similar to the Soft Clip base sequence, and, if the similarity is greater than a threshold T2, regarding the read as a read containing enzymatic fragmentation noise and storing it in a removal file; and filtering out the reads contained in the removal file to obtain a noise-free alignment file. The Artifact sequences are thereby effectively filtered, and detection accuracy is improved.

Description

Method and device for filtering noise introduced by enzymatic fragmentation library construction

Technical Field

The invention relates to a bioinformatics analysis method, in particular to a method and a device for filtering noise introduced by enzymatic fragmentation library construction.

Background

With the development of Next-Generation Sequencing (NGS) technology, researchers' requirements for library construction efficiency have increased. Because the read length of current sequencers from every manufacturer is limited, genome fragmentation is the first step of library construction, and the fragmentation methods in common use today are mechanical fragmentation and enzymatic fragmentation. Mechanical fragmentation mainly uses ultrasound to break the genome, the principle being that the genome is fragmented by ultrasonic stretching resonance. Its advantage is that the fragments produced are stable, uniform and free of sequence preference, and it is currently the gold standard for fragmentation in NGS library construction. However, ultrasonic fragmentation also has limitations that are hard to overcome: the instruments and consumables are expensive, samples of different quality and degradation level each require their own fragmentation time to be worked out, and an overlong fragmentation time damages the DNA. Convenient, economical and efficient enzymatic fragmentation methods are therefore increasingly used for manual and automated library construction. Such methods break the genome randomly with a fragmentation enzyme; they are notably mild, preserve DNA integrity better, and can considerably simplify the library construction workflow and shorten its time cost.

However, under the action of the enzyme, partial inverted repeat sequences on a DNA fragment can pair abnormally to form a stem-loop structure. After downstream end repair and PCR amplification, the stem-loop structure gives rise to abnormal repeated sequences; these are errors introduced artificially during library preparation and are called Artifact sequences. Artifact sequences are the main noise that enzymatic fragmentation NGS library construction may introduce. Because an Artifact sequence is an abnormal combination of real molecules on a DNA fragment rather than an error generated during sequencing, its base quality is high, and even at a low proportion it has a marked effect on mutation analysis, especially on the detection of low-frequency mutations. The frequency of false-positive mutations caused by Artifact sequences is about 0.1-30%, and such false positives are difficult to remove with molecular tags. A method that eliminates, as far as possible, the noise introduced by enzymatic fragmentation library construction is therefore urgently needed.

Disclosure of Invention

The invention provides a method and a device for filtering noise introduced by enzymatic fragmentation library construction, in order to solve the problem that this noise is difficult to eliminate in the prior art.

According to a first aspect of the application, a method for filtering noise introduced by enzymatic fragmentation library construction is provided, comprising the following steps: acquiring an initial alignment result file obtained by aligning paired-end raw sequencing data from an enzymatic fragmentation library to a reference genome sequence; extracting the reads containing Soft Clip marks from the initial alignment result file, and counting the number of Soft Clip bases in each read; recording the reads whose number of Soft Clip bases is greater than a threshold T1 as candidate processing sequences, and extracting the alignment position of each candidate processing sequence on the reference genome and the Soft Clip base sequence in each candidate processing sequence; extending a length D before and after the alignment position of each candidate processing sequence on the reference genome to obtain an extension region, searching each extension region for a sequence similar to the Soft Clip base sequence, and, if the similarity of the similar sequence is greater than a threshold T2, regarding the read as a read containing enzymatic fragmentation noise and storing it in a removal file; and filtering the reads contained in the removal file out of the initial alignment result file to obtain an alignment file free of the noise introduced by enzymatic fragmentation library construction; wherein the similarity refers to the alignment match rate of the Soft Clip base sequence within the extension region.

Further, in the step of extracting the reads containing Soft Clip marks from the initial alignment result file and counting the number of Soft Clip bases in each read, the initial alignment result file is split into several parts for multi-process parallel processing. Preferably, when the file is split, the size of each split file is calculated as int(M/N) + 1, where M is the number of lines in the initial alignment result file and N is the number of processes, so that equally sized split alignment files are obtained.

Further, recording the reads whose number of Soft Clip bases is greater than the threshold T1 as candidate processing sequences covers the following cases: (i) for Soft Clip bases located at only the front end or only the rear end of a read, the read is recorded as a candidate processing sequence if the number of Soft Clip bases is greater than the threshold T1; (ii) for Soft Clip bases present at both the front end and the rear end of a read, the read is recorded as a candidate processing sequence when the number of Soft Clip bases at at least one end is greater than the threshold T1; (iii) for Soft Clip bases present at both the front end and the rear end of a read, when the front-end and rear-end Soft Clip base numbers are each smaller than the threshold T1, the read is not recorded as a candidate processing sequence even if their sum is greater than the threshold T1.

Further, the length D is extended around the alignment position of each candidate processing sequence on the reference genome according to the following rules: (i) when the alignment position lies near the chromosome start and its distance from the chromosome start is less than D, the extension stops at the chromosome start; (ii) when the alignment position lies near the chromosome end and its distance from the chromosome end is less than D, the extension stops at the chromosome end; (iii) when the alignment position lies in the middle of the chromosome and its distances from both the chromosome start and the chromosome end are greater than D, the length D is extended both forwards and backwards. Preferably, D is 200 to 400 bp, more preferably 250 to 350 bp.

Further, the sequence similar to the Soft Clip base sequence is searched for in each extension region by local alignment. Preferably, the scoring scheme used by the local alignment is as follows: an identical base scores +2, a mismatched base scores -3, the opening of a gap scores -10, and consecutive gap positions after the first are not scored; the optimal similar sequence of the Soft Clip base sequence in the extension region is the one with the best score.

Further, searching each extension region for a sequence similar to the Soft Clip base sequence and, if the similarity of the similar sequence is greater than the threshold T2, regarding the read as a read containing enzymatic fragmentation noise covers the following cases: (i) for a read with Soft Clip bases at only the front end or only the rear end whose number is greater than the threshold T1, or a read with Soft Clip bases at both ends of which only one end exceeds the threshold T1, the read is regarded as a sequence containing enzymatic fragmentation noise when the extension region contains a similar sequence whose similarity to that Soft Clip base sequence is greater than the threshold T2; (ii) for a read with Soft Clip bases at both the front end and the rear end, both exceeding the threshold T1, the read is regarded as a sequence containing enzymatic fragmentation noise when similar sequences with similarity greater than the threshold T2 can be found in the extension region for the Soft Clip sequences at both ends.

According to a second aspect of the present application, a device for filtering noise introduced by enzymatic fragmentation library construction is provided, the device comprising: an acquisition module, configured to acquire an initial alignment result file obtained by aligning paired-end raw sequencing data from an enzymatic fragmentation library to a reference genome sequence; an extraction and counting module, configured to extract the reads containing Soft Clip marks from the initial alignment result file and count the number of Soft Clip bases in each read; a marking and extraction module, configured to record the reads whose number of Soft Clip bases is greater than a threshold T1 as candidate processing sequences, and to extract the alignment position of each candidate processing sequence on the reference genome and the Soft Clip base sequence in each candidate processing sequence; an extension and similarity alignment module, configured to extend a length D before and after the alignment position of each candidate processing sequence on the reference genome to obtain an extension region, to search each extension region for a sequence similar to the Soft Clip base sequence, and, if the similarity of the similar sequence is greater than a threshold T2, to regard the read as a read containing enzymatic fragmentation noise and store it in a removal file; and a noise removal module, configured to filter the reads contained in the removal file out of the initial alignment result file to obtain an alignment file free of the noise introduced by enzymatic fragmentation library construction; wherein the similarity refers to the alignment match rate of the Soft Clip base sequence within the extension region.

Further, the extraction and counting module comprises a plurality of extraction and counting sub-modules that run in parallel. Preferably, the work is divided among the sub-modules as follows: the size of each split file is calculated as int(M/N) + 1, where M is the number of lines in the initial alignment result file and N is the number of processes, so that equally sized split alignment files are obtained.

Further, the marking and extraction module comprises: a first marking module, configured, for Soft Clip bases located at only the front end or only the rear end of a read, to record the read as a candidate processing sequence when the number of Soft Clip bases is greater than the threshold T1; a second marking module, configured, for Soft Clip bases present at both the front end and the rear end of a read, to record the read as a candidate processing sequence when the number of Soft Clip bases at at least one end is greater than the threshold T1; and a third marking module, configured, for Soft Clip bases present at both the front end and the rear end of a read, not to record the read as a candidate processing sequence when the front-end and rear-end Soft Clip base numbers are each smaller than the threshold T1, even if their sum is greater than the threshold T1.

Further, the extension and similarity alignment module comprises: a first extension module, configured to extend only as far as the chromosome start when the alignment position lies near the chromosome start and its distance from the chromosome start is less than D; a second extension module, configured to extend only as far as the chromosome end when the alignment position lies near the chromosome end and its distance from the chromosome end is less than D; and a third extension module, configured to extend the length D both forwards and backwards when the alignment position lies in the middle of the chromosome and its distances from both the chromosome start and the chromosome end are greater than D. Preferably, D is 200 to 400 bp, more preferably 250 to 350 bp.

Further, the extension and similarity alignment module comprises a local alignment module. Preferably, the scoring scheme of the local alignment module is as follows: an identical base scores +2, a mismatched base scores -3, the opening of a gap scores -10, and consecutive gap positions after the first are not scored; the optimal similar sequence of the Soft Clip base sequence in the extension region is the one with the best score.

Further, the extension and similarity alignment module further comprises: a first noise sequence judging module, configured, for a read with Soft Clip bases at only the front end or only the rear end whose number is greater than the threshold T1, or a read with Soft Clip bases at both ends of which only one end exceeds the threshold T1, to regard the read as a sequence containing enzymatic fragmentation noise when the extension region contains a similar sequence whose similarity to that Soft Clip base sequence is greater than the threshold T2; and a second noise sequence judging module, configured, for a read with Soft Clip bases at both the front end and the rear end, both exceeding the threshold T1, to regard the read as a sequence containing enzymatic fragmentation noise when similar sequences with similarity greater than the threshold T2 can be found in the extension region for the Soft Clip sequences at both ends.

According to a third aspect of the present application, a computer-readable storage medium is provided. The storage medium includes a stored program, and when the program runs, the device on which the storage medium resides is controlled to execute the above method for filtering noise introduced by enzymatic fragmentation library construction.

According to a fourth aspect of the present application, a processor is provided. The processor is configured to run a program, and the program, when run, executes the above method for filtering noise introduced by enzymatic fragmentation library construction.

By applying the technical scheme of the application, the reads containing Soft Clip marks are extracted from the initial alignment result file, and candidate processing sequences are screened out based on the number of Soft Clip bases they contain. The extension region, which extends a certain length on both sides of the alignment position of each candidate processing sequence, is then searched for a sequence whose similarity to the Soft Clip base sequence of that candidate exceeds a threshold. If such a similar sequence exists, the read is considered to contain enzymatic fragmentation noise. All reads containing noise are thus filtered out of the initial alignment result file, and a noise-free alignment file is obtained. The method effectively filters the Artifact sequences introduced by enzymatic fragmentation library construction and thereby improves detection accuracy.

Drawings

The accompanying drawings may provide further details of the invention, and the exemplary embodiments and descriptions thereof are provided for the purpose of illustration and not limitation of the invention. In the drawings:

FIG. 1 is a flow chart of a method for filtering noise introduced by enzymatic fragmentation library construction according to the present invention;

FIG. 2 is a histogram of the proportions of Soft Clip reads, Hard Clip reads and Artifact reads in library data built by ultrasonic fragmentation and by different enzymatic fragmentation methods, before filtering with the filtering method of the present invention, in an embodiment of the present invention;

FIG. 3 is a histogram of the proportions of Soft Clip reads, Hard Clip reads and Artifact reads in library data built by different enzymatic fragmentation methods and by ultrasonic fragmentation, after filtering with the filtering method of the present invention, in an embodiment of the present invention;

FIG. 4 is a schematic diagram of the Artifact noise introduced artificially by enzymatic fragmentation library construction.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.

Interpretation of terms:

Alignment: the short sequences (also called reads) generated by next-generation sequencing are aligned to a reference genome, and the resulting alignment file is a SAM (Sequence Alignment/Map) or BAM (Binary Alignment/Map) file, which contains CIGAR (Compact Idiosyncratic Gapped Alignment Report) information. The CIGAR string is the alignment result information; it describes how all the bases of one read align and is located in the sixth column. It generally comprises the seven operation types MIDNSHP, wherein M represents Match; I represents Insertion; D represents Deletion; N represents Skipped bases on the reference; S represents Soft clipping; H represents Hard clipping; P represents Padding.

Soft Clip: a sequence that is not aligned to the genome but is still present in SEQ (segment SEQuence); the corresponding symbol in the CIGAR column is S (Soft). In other words, although this part of the read does not align to the reference genome, its sequence is still kept on the read in the BAM/SAM alignment file, i.e., it is not truncated and discarded.

Hard Clip: a sequence that does not align and is not present in the SAM/BAM alignment file, i.e., a sequence that is truncated and discarded. The CIGAR column (sixth column) contains the symbol H (Hard), but the SEQ column (tenth column) holds no corresponding sequence.

Examples are as follows:

Reference genomic sequence: AGCTAGCATCGTGTGTGTGTGACCGGTCTAGGAAGCAGGAATCTGCG

Sequencing read: aggGTGTAACC-GACTAGtttt

In the above example alignment, capital letters indicate aligned bases (not necessarily perfect matches; some bases are mismatches), "-" indicates a deletion, and lowercase letters indicate unaligned sequence at the ends, which is the clipped sequence. If the read aligns to only one position of the genome, the CIGAR information is 3S8M1D6M4S; if the read aligns to several positions of the genome, the CIGAR information of the alignment is 3H8M1D6M4H. Besides this difference in alignment positions, S and H differ in the sequence that is output: a sequence labelled S is kept in the BAM file, whereas a sequence labelled H is deleted. For example, 3S8M1D6M4S outputs the sequence aggGTGTAACCGACTAGtttt in the BAM file, and 3H8M1D6M4H outputs the sequence GTGTAACCGACTAG.
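The difference between S and H in the stored sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `parse_cigar` and `stored_seq` are hypothetical helper names, not part of the described method or of any library, and the read is the 21-base example read with its deletion mark removed.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def stored_seq(full_read, cigar):
    """Return the bases kept in the SEQ column for a given CIGAR.

    Hard-clipped (H) bases are dropped from the record, while
    soft-clipped (S) bases stay in SEQ.
    """
    ops = parse_cigar(cigar)
    lead = ops[0][0] if ops and ops[0][1] == "H" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "H" else 0
    return full_read[lead:len(full_read) - trail]

def seq_length_from_cigar(cigar):
    """Operations M, I, S, = and X consume bases of the stored read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

read = "aggGTGTAACCGACTAGtttt"  # example read without the '-' deletion mark
soft = stored_seq(read, "3S8M1D6M4S")
hard = stored_seq(read, "3H8M1D6M4H")
```

Here `soft` keeps all 21 bases (3 + 8 + 6 + 4), while `hard` keeps only the 14 aligned bases, which agrees with the lengths implied by the two CIGAR strings.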

As mentioned in the Background, the noise introduced by enzymatic fragmentation library construction affects the accuracy of low-frequency mutation detection, yet no effective removal method currently exists. The present application provides a new scheme to improve this situation.

Example 1

In this embodiment, a method for filtering noise introduced by enzymatic fragmentation library construction is provided. As shown in FIG. 1, the method comprises the following steps:

S101, acquiring an initial alignment result file obtained by aligning paired-end raw sequencing data from an enzymatic fragmentation library to a reference genome sequence;

S102, extracting the reads containing Soft Clip marks from the initial alignment result file, and counting the number of Soft Clip bases in each read;

S103, recording the reads whose number of Soft Clip bases is greater than a threshold T1 as candidate processing sequences, and extracting the alignment position of each candidate processing sequence on the reference genome and the Soft Clip base sequence in each candidate processing sequence;

S104, extending a length D before and after the alignment position of each candidate processing sequence on the reference genome to obtain an extension region, searching each extension region for a sequence similar to the Soft Clip base sequence, and, if the similarity of the similar sequence is greater than a threshold T2, regarding the read as a read containing enzymatic fragmentation noise and storing it in a removal file;

S105, filtering the reads contained in the removal file out of the initial alignment result file to obtain an alignment file free of the noise introduced by enzymatic fragmentation library construction;

wherein the similarity refers to the alignment match rate of the Soft Clip base sequence within the extension region.

In the above embodiment, the reads containing Soft Clip marks are extracted from the initial alignment result file, and candidate processing sequences are screened out based on the number of Soft Clip bases they contain. The extension region, which extends a certain length on both sides of the alignment position of each candidate processing sequence, is then searched for a sequence whose similarity to the Soft Clip base sequence of that candidate exceeds a threshold. If such a similar sequence exists, the read is considered to contain enzymatic fragmentation noise, so that all reads containing noise are filtered out of the initial alignment result file and a noise-free alignment file is obtained. The method effectively filters the Artifact sequences introduced by enzymatic fragmentation library construction and thereby improves detection accuracy.

Specifically, the raw sequencing data are aligned to a reference genome, and several SAM files are obtained after compression, sorting, grouping and decompression according to conventional preprocessing steps.

These SAM files contain Soft Clip information. Some of these Soft Clips are Artifact noise introduced artificially by enzymatic fragmentation library construction (FIG. 4). In some embodiments, this noise arises as follows:

First, owing to the nature of the enzymes used, sticky ends with a palindromic structure may be generated during digestion, and an Artifact may arise on one of the strands (FIG. 4).

Second, owing to the nature of the palindrome, in this embodiment a palindrome on one strand may cause the strand to bend and pair complementarily with itself, forming a stem-loop structure.

Third, during end repair and PCR, fill-in repair of the positive strand joins the ends of the stem-loop structure on that strand, and the stem-loop structure breaks; as a result, a stretch of sequence in the strand is split into two parts, one of which becomes attached to the other strand.

Fourth, of the two strands, the sequence on the negative strand is original and therefore matches the reference genome perfectly, whereas a stretch of sequence on the positive strand comes from the negative strand, so that this part of the sequence cannot be fully aligned to the reference genome in the global alignment. This stretch of sequence is a Soft Clip, and it is the Artifact noise that needs to be filtered.

In the embodiment of the invention, the alignment result file to be processed is very large and processing takes a long time. To shorten the processing time, in a preferred embodiment the initial alignment result file is split into several sub-files that are processed in parallel by multiple processes. The splitting scheme is as follows: the size of each split file is calculated as int(M/N) + 1, where M is the number of lines in the alignment file and N is the number of processes, so that equally sized split alignment files are obtained.
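The splitting rule can be illustrated with a short Python sketch. The function names `chunk_size` and `split_lines` are hypothetical, and the sketch operates on an in-memory list of lines rather than on real SAM files; integer division `M // N` is used in place of `int(M/N)`, which gives the same value for positive integers.

```python
def chunk_size(total_lines, n_processes):
    """Lines per split file: int(M/N) + 1, written with integer division."""
    return total_lines // n_processes + 1

def split_lines(lines, n_processes):
    """Cut a list of lines into consecutive chunks of chunk_size lines each.

    The last chunk may be shorter than the others.
    """
    size = chunk_size(len(lines), n_processes)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

# Ten records split for three worker processes.
parts = split_lines([f"read{i}" for i in range(10)], 3)
part_sizes = [len(p) for p in parts]
```

With M = 10 and N = 3 the chunk size is 4, so the three parts hold 4, 4 and 2 lines. Each part could then be handed to a separate worker, for example through `multiprocessing.Pool`.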

In step S102, the reads containing Soft Clip marks are extracted from the initial alignment result file and the number of Soft Clip bases in each read is counted, specifically as follows:

Reads containing Soft Clips are extracted from each SAM file. If a read aligns to only one unique position of the genome and some of its bases cannot be matched to the genome, those bases are marked as Soft Clip. The sixth column of the initial alignment result file (SAM file) contains the alignment result, where S indicates that a clipped sequence is present in the read, i.e., Soft Clip information that needs to be extracted. Reads that do not contain a Soft Clip are considered non-Artifact and free of enzymatic fragmentation noise.

The reads whose number of Soft Clip bases is greater than the threshold T1 are recorded as candidate processing sequences. A specific example of the operation is as follows: a threshold T1 is set for the Soft Clip length; if the Soft Clip length in a read is smaller than the threshold, the Soft Clip is judged not to be an Artifact; conversely, if the Soft Clip length in a read is greater than the threshold, the read is judged to be a sequence to be processed.
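As a rough illustration of this extraction step, the following Python sketch reads the CIGAR from the sixth column and the sequence from the tenth column of a SAM record and returns the clipped bases at each end. The function names and the sample record are made up for the example; real pipelines would more likely use a SAM/BAM library.

```python
import re

def soft_clip_lengths(cigar):
    """Return (front, rear) Soft Clip lengths; 0 where an end is not soft-clipped."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    core = [x for x in ops if x[1] != "H"]  # hard clips may flank soft clips
    front = core[0][0] if core and core[0][1] == "S" else 0
    rear = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return front, rear

def soft_clip_sequences(sam_line):
    """Return (front_seq, rear_seq) for one tab-separated SAM alignment record."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar, seq = fields[5], fields[9]  # sixth column: CIGAR, tenth column: SEQ
    front, rear = soft_clip_lengths(cigar)
    return seq[:front], seq[len(seq) - rear:]

# Hypothetical record: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
record = "\t".join([
    "read1", "99", "chr1", "100", "60", "3S8M1D6M4S", "=", "300", "250",
    "AGGGTGTAACCGACTAGTTTT", "I" * 21,
])
clips = soft_clip_sequences(record)
```

For the record above, `clips` holds the three leading and four trailing clipped bases; a read without any S operation yields two empty strings and would be skipped.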

It is worth noting that one or two Soft Clips may appear in the same read, located at its two ends, and these cases need to be identified and handled differently. Reads containing different numbers of Soft Clip segments are distinguished and screened as follows:

In a preferred embodiment, recording the reads whose number of Soft Clip bases is greater than the threshold T1 as candidate processing sequences covers the following cases: (i) for Soft Clip bases located at only the front end or only the rear end of a read, the read is recorded as a candidate processing sequence if the number of Soft Clip bases is greater than the threshold T1; (ii) for Soft Clip bases present at both the front end and the rear end of a read, the read is recorded as a candidate processing sequence when the number of Soft Clip bases at at least one end is greater than the threshold T1; (iii) for Soft Clip bases present at both the front end and the rear end of a read, when the front-end and rear-end Soft Clip base numbers are each smaller than the threshold T1, the read is not recorded as a candidate processing sequence even if their sum is greater than the threshold T1.

Preferably, reads with different numbers of Soft Clip segments are stored in separate files.

Files storing reads with one Soft Clip segment and files storing reads with two Soft Clip segments are processed separately.

1) For files containing reads with only one Soft Clip segment, in this embodiment the base sequence of the Soft Clip segment is extracted and then locally aligned to a reference sequence to obtain the alignment rate;

2) For files containing reads with two Soft Clip segments, each Soft Clip sequence is judged and aligned one by one. Specifically, if the left-end length is greater than the threshold T1 and the right-end length is smaller than the threshold T1, or the left-end length is smaller than the threshold T1 and the right-end length is greater than the threshold T1, only the Soft Clip sequence exceeding the threshold is aligned to the reference sequence; if the sequences at both ends are longer than the threshold T1, both are aligned to obtain their alignment rates.
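The three cases reduce to a per-end comparison against T1, which the following Python sketch makes explicit. The function name `is_candidate` and the value T1 = 10 used in the examples are assumptions for illustration only; no numeric value for T1 is given in this passage.

```python
def is_candidate(front, rear, t1):
    """Decide whether a read is a candidate processing sequence.

    front, rear: Soft Clip base counts at the two ends (0 if absent).
    Each end is compared with T1 on its own; the two ends are never summed,
    so case (iii) (both ends below T1, sum above T1) is rejected.
    """
    return front > t1 or rear > t1

T1 = 10  # hypothetical threshold, for illustration only

one_end_long = is_candidate(15, 0, T1)   # case (i)
two_ends_one_long = is_candidate(15, 4, T1)  # case (ii)
both_short = is_candidate(7, 6, T1)      # case (iii): 7 + 6 > 10, still rejected
```

The first two calls evaluate to true and the third to false, matching cases (i) to (iii).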

It should be noted that the reference sequence used for the local alignment is the base sequence obtained by extending a length D before and after the alignment position of the candidate processing sequence on the reference genome. The extension rules are as follows:

(i) when the alignment position lies near the chromosome start and its distance from the chromosome start is less than D, the extension stops at the chromosome start;

(ii) when the alignment position lies near the chromosome end and its distance from the chromosome end is less than D, the extension stops at the chromosome end;

(iii) when the alignment position lies in the middle of the chromosome and its distances from both the chromosome start and the chromosome end are greater than D, the length D is extended both forwards and backwards;

preferably, D is 200 to 400 bp, more preferably 250 to 350 bp.

In particular, the local alignment software in the embodiment of the invention was developed in-house based on the Smith-Waterman algorithm, and the optimal similar sequence of the Soft Clip base sequence in the extension sequence is found according to the best score.
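The extension rules amount to clamping the window to the chromosome boundaries. A minimal Python sketch follows, assuming 1-based inclusive coordinates and a single alignment coordinate per read; the function name `extension_region` and the default D = 300 bp (inside the preferred 250 to 350 bp range) are illustrative choices.

```python
def extension_region(pos, chrom_length, d=300):
    """Return the (start, end) of the extension region around pos.

    Coordinates are 1-based and inclusive. The window reaches D bases to
    each side but never past the chromosome start (1) or end (chrom_length).
    """
    start = max(1, pos - d)
    end = min(chrom_length, pos + d)
    return start, end

near_start = extension_region(100, 1000)   # rule (i): clipped at the start
near_end = extension_region(900, 1000)     # rule (ii): clipped at the end
in_middle = extension_region(500, 1000)    # rule (iii): full D on both sides
```

For a hypothetical 1000 bp chromosome these calls give (1, 400), (600, 1000) and (200, 800) respectively.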
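The in-house software is not disclosed, so the following is only a sketch of a Smith-Waterman style local alignment with affine gaps (Gotoh's variant), using the preferred scoring stated earlier in this document: +2 for a match, -3 for a mismatch, -10 for opening a gap and no charge for continuing it. The function names are hypothetical, and the similarity is assumed here to be the number of matched bases on the best path divided by the Soft Clip length; the document only calls it the alignment match rate.

```python
NEG = float("-inf")

def local_align(query, ref, match=2, mismatch=-3, gap_open=-10, gap_extend=0):
    """Return (best_score, matched_bases) of the best local alignment.

    Each cell stores (score, matches along the path), so no traceback is needed.
    """
    n = len(ref)
    h_prev = [(0, 0)] * (n + 1)    # H: best alignment ending at this cell
    f_prev = [(NEG, 0)] * (n + 1)  # F: path ends in a gap that skips a query base
    best = (0, 0)
    for i in range(1, len(query) + 1):
        h_cur = [(0, 0)] * (n + 1)
        f_cur = [(NEG, 0)] * (n + 1)
        e = (NEG, 0)               # E: path ends in a gap that skips a reference base
        for j in range(1, n + 1):
            e = max((h_cur[j - 1][0] + gap_open, h_cur[j - 1][1]),
                    (e[0] + gap_extend, e[1]))
            f_cur[j] = max((h_prev[j][0] + gap_open, h_prev[j][1]),
                           (f_prev[j][0] + gap_extend, f_prev[j][1]))
            same = query[i - 1] == ref[j - 1]
            diag = (h_prev[j - 1][0] + (match if same else mismatch),
                    h_prev[j - 1][1] + (1 if same else 0))
            cell = max(diag, e, f_cur[j])
            if cell[0] <= 0:       # local alignment: restart instead of going negative
                cell = (0, 0)
            h_cur[j] = cell
            best = max(best, cell)
        h_prev, f_prev = h_cur, f_cur
    return best

def similarity(soft_clip, region):
    """Assumed match rate: matched bases on the best path / Soft Clip length."""
    return local_align(soft_clip, region)[1] / len(soft_clip)

exact = local_align("ACGT", "TTACGTTT")
rate = similarity("ACGTACGT", "GGACGAACGTGG")
```

In the first call the four query bases match a stretch of the region exactly (score 8, 4 matches). In the second, the best path aligns all eight query bases with one mismatch (score 11), giving a match rate of 7/8. The quadratic running time is acceptable here because both the Soft Clip and the extension region are short.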

A threshold T2 is set for the alignment rate described above, and each alignment rate obtained is compared with T2. The specific implementation is as follows:

(i) For a read with Soft Clip bases at only the front end or only the rear end, in a number greater than the threshold T1, or a read with Soft Clip bases at both the front end and the rear end of which only one end exceeds T1: when the extension region contains a similar sequence whose similarity to the Soft Clip base sequence is greater than the threshold T2, the read is regarded as a sequence containing enzyme-digestion noise;

(ii) For a read with Soft Clip bases at both the front end and the rear end, both of which exceed the threshold T1: when similar sequences with similarity greater than the threshold T2 can be found in the extension region for the Soft Clip sequences at both ends, the read is regarded as a sequence containing enzyme-digestion noise.
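
The two judgment rules above reduce to one condition: every Soft Clip end longer than T1 must have a similar sequence with similarity above T2. A minimal sketch follows; the function name `is_digestion_noise`, the use of `None` for an end that was not aligned, and the example values T1 = 20 and T2 = 0.9 are hypothetical and are not taken from the embodiment.

```python
from typing import Optional


def is_digestion_noise(
    left_len: int,
    right_len: int,
    left_sim: Optional[float],
    right_sim: Optional[float],
    t1: int,
    t2: float,
) -> bool:
    """Decide whether a read is regarded as containing enzyme-digestion noise.

    left_sim / right_sim are the similarities found in the extension region
    for each Soft Clip end, or None if that end was not aligned.
    """
    # Only the ends whose Soft Clip length exceeds T1 take part in the decision.
    checked = [
        sim
        for length, sim in ((left_len, left_sim), (right_len, right_sim))
        if length > t1
    ]
    if not checked:
        return False  # not a candidate processing sequence
    # Rule (i): one end over T1 -> that end must exceed T2.
    # Rule (ii): both ends over T1 -> both ends must exceed T2.
    return all(sim is not None and sim > t2 for sim in checked)


# Hypothetical thresholds for illustration only.
T1, T2 = 20, 0.9
print(is_digestion_noise(30, 5, 0.95, None, T1, T2))   # rule (i)
print(is_digestion_noise(30, 40, 0.95, 0.50, T1, T2))  # rule (ii), one end fails
```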

Finally, the sequences regarded as containing enzyme-digestion noise are stored in a removal (Remove) file. Using the Remove files produced by the multiple processes that handle the split files in parallel, the reads listed in the Remove files are removed from the initial alignment result file, yielding an alignment file from which the noise introduced by enzyme-digestion library construction has been removed.
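
The final removal step can be sketched as a set lookup over the alignment records. The sketch below assumes, without support from the source, that each Remove file lists one read name per line and that the initial alignment result is SAM-formatted text, in which header lines start with `@` and the read name (QNAME) is the first tab-separated field. Filtering by read name alone would also drop the mate of a flagged read, which may or may not match the embodiment. The function names are hypothetical.

```python
from typing import Iterable, Iterator


def merge_remove_lists(remove_lists: Iterable[Iterable[str]]) -> set[str]:
    """Merge the per-process Remove lists into one set of read names."""
    ids: set[str] = set()
    for lines in remove_lists:
        # Skip blank lines and strip trailing newlines.
        ids.update(line.strip() for line in lines if line.strip())
    return ids


def filter_alignments(sam_lines: Iterable[str], remove_ids: set[str]) -> Iterator[str]:
    """Yield header lines and every alignment whose read name is not flagged."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # SAM header line, always kept
            continue
        qname = line.split("\t", 1)[0]
        if qname not in remove_ids:
            yield line


# Toy example with two per-process Remove lists.
sam = ["@HD\tVN:1.6", "r1\t99\tchr1\t100", "r2\t99\tchr1\t200", "r3\t0\tchr1\t300"]
flagged = merge_remove_lists([["r1\n"], ["r3\n", "\n"]])
for kept in filter_alignments(sam, flagged):
    print(kept)
```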

It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments and that the acts involved are not necessarily all required by the invention.

As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus necessary hardware such as detection instruments. Based on such understanding, the data processing part of the technical solutions of the present application may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments, or of some parts of the embodiments, of the present application.

The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

It is obvious to those skilled in the art that some of the modules or steps of the present application described above can be implemented in a general-purpose computing device, they can be centralized on a single computing device or distributed on a network composed of a plurality of computing devices, and alternatively, they can be implemented by program codes executable by the computing devices, so that they can be stored in a storage device and executed by the computing devices, or they can be separately manufactured as individual integrated circuit modules, or a plurality of modules or steps in them can be manufactured as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

Example 2

This embodiment provides a device for filtering noise introduced by enzyme-digestion library construction. The device includes an acquisition module, an extraction statistics module, a marker extraction module, an extension similarity alignment module, and a noise removal module, wherein:

the acquisition module is configured to acquire an initial alignment result file obtained by aligning paired-end sequencing output data from enzyme-digestion library construction to the reference genome sequence;

the extraction statistics module is configured to extract reads containing Soft Clip marks from the initial alignment result file and count the number of Soft Clip bases in each read;

the marker extraction module is configured to mark reads whose Soft Clip base number is greater than the threshold T1 as candidate processing sequences, and to extract the alignment position of each candidate processing sequence on the reference genome and the Soft Clip base sequence in each candidate processing sequence;

the extension similarity alignment module is configured to extend a length D before and after the alignment position of each candidate processing sequence on the reference genome to obtain an extension region, and to search each extension region for a sequence similar to the Soft Clip base sequence; if the similarity of the similar sequence is greater than the threshold T2, the read is regarded as a read containing enzyme-digestion noise and is stored in a removal file;

the noise removal module is configured to filter out the reads contained in the removal file from the initial alignment result file to obtain an alignment file from which the noise introduced by enzyme-digestion library construction has been removed.

Optionally, the extraction statistics module includes a plurality of extraction statistics sub-modules that run in parallel. Preferably, the work is divided among the sub-modules as follows: given the number of lines M of the initial alignment result file and the number of processes N, the size of each split file is calculated as int(M/N) + 1, yielding equally split alignment files.
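
As a worked illustration of the int(M/N) + 1 rule, the sketch below computes the number of lines in each split file. The function name `split_sizes` is hypothetical, and how the embodiment handles the final, shorter file is an assumption here (it simply takes the remaining lines).

```python
def split_sizes(total_lines: int, processes: int) -> list[int]:
    """Return the number of lines in each split file for M lines and N processes."""
    if total_lines < 0 or processes < 1:
        raise ValueError("need total_lines >= 0 and processes >= 1")
    # int(M/N) + 1, written with integer division to avoid float rounding.
    chunk = total_lines // processes + 1
    sizes = []
    remaining = total_lines
    while remaining > 0:
        take = min(chunk, remaining)  # the last file holds whatever is left
        sizes.append(take)
        remaining -= take
    return sizes


print(split_sizes(10, 3))  # [4, 4, 2]
```

With M = 10 and N = 3 the chunk size is 4, so the files hold 4, 4, and 2 lines. Because of the added 1, the number of files never exceeds N; when M is an exact multiple of N it can be smaller than N (for example, M = 6 and N = 3 gives two files of 3 lines).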

Optionally, the marker extraction module comprises: a first marking module, configured to mark a read as a candidate processing sequence when its Soft Clip bases are located at only the front end or only the rear end and their number is greater than the threshold T1; a second marking module, configured to mark a read as a candidate processing sequence when Soft Clip bases are present at both the front end and the rear end and the number at at least one end is greater than the threshold T1; and a third marking module, configured not to mark a read as a candidate processing sequence when Soft Clip bases are present at both the front end and the rear end, the number at each end is less than the threshold T1, and the sum of the two is greater than the threshold T1.
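
The marking rules can be illustrated by reading the Soft Clip lengths from a SAM CIGAR string. This is a sketch under assumptions: the source does not say how the Soft Clip base number is obtained, the function names and the example T1 = 20 are hypothetical, and the handling of a read whose two ends are each below T1 follows claim 4 (iii), under which such a read is not recorded as a candidate even if the sum exceeds T1.

```python
import re

# CIGAR operations defined by the SAM format: a length followed by an op code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar: str) -> tuple[int, int]:
    """Return (front, rear) Soft Clip lengths encoded in a CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips (H) sit outside soft clips and carry no read bases; ignore them.
    ops = [item for item in ops if item[1] != "H"]
    if not ops:
        return 0, 0
    front = ops[0][0] if ops[0][1] == "S" else 0
    rear = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return front, rear


def is_candidate(cigar: str, t1: int) -> bool:
    """A read is a candidate when at least one Soft Clip end exceeds T1."""
    front, rear = soft_clip_lengths(cigar)
    # Two short ends are not added together: both below T1 means no candidate.
    return front > t1 or rear > t1


T1 = 20  # hypothetical threshold for illustration
print(soft_clip_lengths("30S70M"), is_candidate("30S70M", T1))
print(soft_clip_lengths("12S76M12S"), is_candidate("12S76M12S", T1))
```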

Optionally, the extension similarity alignment module comprises: a first extension module, configured to extend only to the chromosome start position when the alignment position lies in the chromosome start region and its distance from the chromosome start position is less than D; a second extension module, configured to extend only to the chromosome end position when the alignment position lies in the chromosome end region and its distance from the chromosome end position is less than D; and a third extension module, configured to extend a length D both forward and backward when the alignment position lies in the middle of the chromosome and its distances from both the chromosome start position and the chromosome end position are greater than D. Preferably, D is 200 to 400 bp, more preferably 250 to 350 bp.

Optionally, the extension similarity alignment module comprises a local alignment module. Preferably, the scoring mechanism of the local alignment module is as follows: a matching base scores 2 points, a mismatched base scores -3 points, opening a gap scores -10 points, and extending an existing gap adds no score; the best-matching sequence for the Soft Clip base sequence in the extension region is found according to the optimal score.
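
The scoring mechanism above corresponds to a Smith-Waterman local alignment with affine gaps (gap opening -10, gap extension 0). The sketch below computes only the optimal local score using the standard Gotoh recurrences; it is not the in-house software of the embodiment, and the function name `local_align_score` is hypothetical. How the optimal score is converted into the similarity compared against T2 is not stated in this passage, so the sketch stops at the score.

```python
def local_align_score(
    query: str,
    ref: str,
    match: int = 2,
    mismatch: int = -3,
    gap_open: int = -10,
    gap_extend: int = 0,
) -> int:
    """Best local alignment score of query against ref with affine gap scoring."""
    neg = -10**9  # stands in for minus infinity
    width = len(ref) + 1
    prev_h = [0] * width    # best score ending at (i-1, j)
    prev_f = [neg] * width  # best score ending at (i-1, j) with a gap in ref
    best = 0
    for i in range(1, len(query) + 1):
        cur_h = [0] * width
        cur_f = [neg] * width
        e = neg  # best score ending at (i, j) with a gap in query
        for j in range(1, width):
            # Open a new gap (-10) or extend the current one (0).
            e = max(e + gap_extend, cur_h[j - 1] + gap_open)
            cur_f[j] = max(prev_f[j] + gap_extend, prev_h[j] + gap_open)
            step = match if query[i - 1] == ref[j - 1] else mismatch
            diag = prev_h[j - 1] + step
            # Local alignment: scores never drop below zero.
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


# An exact 4-base match scores 4 * 2 = 8.
print(local_align_score("ACGT", "TTACGTTT"))
# 20 matches (40) minus one opened gap (10); the 3-base gap length adds nothing.
print(local_align_score("A" * 10 + "C" * 10, "A" * 10 + "TTT" + "C" * 10))
```

Only two rows of the matrix are kept in memory, which is sufficient when the score alone is needed; recovering the aligned subsequence would require a traceback matrix.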

Optionally, the extension similarity alignment module further comprises: a first noise sequence judging module, configured to, for a read with Soft Clip bases at only the front end or only the rear end in a number greater than the threshold T1, or a read with Soft Clip bases at both ends of which only one end exceeds T1, regard the read as a sequence containing enzyme-digestion noise when the extension region contains a similar sequence whose similarity to the Soft Clip base sequence is greater than the threshold T2; and a second noise sequence judging module, configured to, for a read with Soft Clip bases at both the front end and the rear end, both exceeding the threshold T1, regard the read as a sequence containing enzyme-digestion noise when similar sequences with similarity greater than the threshold T2 can be found in the extension region for the Soft Clip sequences at both ends.

Example 3

This embodiment provides a computer-readable storage medium that stores a program; when the program runs, the device on which the storage medium is located is controlled to execute the method for filtering noise introduced by enzyme-digestion library construction.

This embodiment further provides a processor configured to run a program, wherein the program, when run, executes the method for filtering noise introduced by enzyme-digestion library construction.

Example 4

Libraries were constructed from an intestinal cancer sample using several different enzyme-digestion library construction methods and an ultrasonic fragmentation library construction method. The sequencing data obtained with each library construction method were analyzed for the proportions of SoftClip reads, HardClip reads, and Artifact reads before and after processing by the noise filtering method; the results are shown in FIGS. 2 and 3.

FIG. 2 shows histograms of the proportions of SoftClip reads, HardClip reads, and Artifact reads in the data from the different enzyme-digestion and ultrasonic fragmentation library construction methods before the filtering method is applied. The abscissa is the sample, and the ordinates are the proportions of SoftClip reads, HardClip reads, and Artifact reads, respectively. V-Enfrag, K-Enfrag, S-Enfrag, and Nad-Enfrag denote samples fragmented with different enzymes, and Nad-covaris denotes the sample fragmented ultrasonically. It can be seen that every enzyme-digestion library construction method introduces a certain amount of Artifact reads, i.e. enzyme-digestion noise. Although the proportion of this noise is not high, it can cause false-positive mutations and thereby affect mutation analysis.

FIG. 3 shows histograms of the proportions of SoftClip reads, HardClip reads, and Artifact reads in the data from the different enzyme-digestion and ultrasonic fragmentation library construction methods after filtering by the method of the present invention. The abscissa is the sample, and the ordinates are the proportions of SoftClip reads, HardClip reads, and Artifact reads, respectively. V-Enfrag, K-Enfrag, S-Enfrag, and Nad-Enfrag denote samples fragmented with different enzymes, and Nad-covaris denotes the sample fragmented ultrasonically. The results show that the proportion of Artifact reads is reduced to 0 for every enzyme-digestion library construction method, and even for ultrasonic fragmentation, and that the proportions of SoftClip reads and HardClip reads are also markedly reduced.

From the above description, it can be seen that the method of the present invention can effectively and rapidly remove, to the greatest extent, the noise introduced by enzyme-digestion library construction, and also has good compatibility across library construction methods.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for filtering noise introduced by enzyme-digestion library construction, characterized by comprising the following steps:

acquiring an initial alignment result file obtained by aligning paired-end sequencing output data from enzyme-digestion library construction to a reference genome sequence;

extracting reads containing Soft Clip marks from the initial alignment result file, and counting the number of Soft Clip bases in each read;

recording reads whose Soft Clip base number is greater than a threshold T1 as candidate processing sequences, and extracting the alignment position of each candidate processing sequence on the reference genome and the Soft Clip base sequence in each candidate processing sequence;

extending a length D before and after the alignment position of each candidate processing sequence on the reference genome to obtain an extension region, searching each extension region for a sequence similar to the Soft Clip base sequence, and, if the similarity of the similar sequence is greater than a threshold T2, regarding the read as a read containing enzyme-digestion noise and storing it in a removal file;

filtering out the reads contained in the removal file from the initial alignment result file to obtain an alignment file from which the noise introduced by enzyme-digestion library construction has been removed.

2. The method of claim 1, wherein in the step of extracting reads containing Soft Clip marks from the initial alignment result file and counting the number of Soft Clip bases in each read, the initial alignment result file is divided into several parts for multi-process parallel processing.

3. The method of claim 2, wherein when the initial alignment result file is divided into several parts for multi-process parallel processing, the size of each split file is calculated as int(M/N) + 1 from the number of lines M of the initial alignment result file and the number of processes N, so as to obtain equally split alignment files.

4. The method of claim 1, wherein recording reads whose Soft Clip base number is greater than the threshold T1 as candidate processing sequences comprises:

(i) for a read with Soft Clip bases at only the front end or only the rear end, recording the read as a candidate processing sequence if the Soft Clip base number is greater than the threshold T1;

(ii) for a read with Soft Clip bases at both the front end and the rear end, recording the read as a candidate processing sequence when the Soft Clip base number at at least one end is greater than the threshold T1;

(iii) for a read with Soft Clip bases at both the front end and the rear end, not recording the read as a candidate processing sequence when the Soft Clip base numbers at the front end and the rear end are each less than the threshold T1 and their sum is greater than the threshold T1.

5. The method of claim 1, wherein the extension by the length D before and after the alignment position of each candidate processing sequence on the reference genome follows these rules:

(i) When the alignment position lies in the chromosome start region and its distance from the chromosome start position is less than D, only extending to the chromosome start position;

(ii) When the alignment position lies in the chromosome end region and its distance from the chromosome end position is less than D, only extending to the chromosome end position;

(iii) When the alignment position lies in the middle of the chromosome and its distances from both the chromosome start position and the chromosome end position are greater than D, extending a length D both forward and backward.

6. The method of claim 5, wherein D is 200 to 400bp.

7. The method according to claim 5, wherein D is 250bp to 350bp.

8. The method as claimed in claim 1, wherein a sequence similar to the Soft Clip base sequence is found in each of the extended regions by local alignment.

9. The method of claim 8, wherein the scoring mechanism of the local alignment is as follows: a matching base scores 2 points, a mismatched base scores -3 points, opening a gap scores -10 points, and extending an existing gap adds no score; the best-matching sequence for the Soft Clip base sequence in the extension region is found according to the optimal score.

10. The method of claim 4, wherein searching each extension region for a sequence similar to the Soft Clip base sequence and, if the similarity of the similar sequence is greater than the threshold T2, regarding the read as a read containing enzyme-digestion noise comprises:

(i) for a read with Soft Clip bases at only the front end or only the rear end in a number greater than the threshold T1, or a read with Soft Clip bases at both the front end and the rear end of which only one end exceeds the threshold T1, regarding the read as a sequence containing enzyme-digestion noise when the extension region contains a similar sequence whose similarity to the Soft Clip base sequence is greater than the threshold T2;

(ii) for a read with Soft Clip bases at both the front end and the rear end, both exceeding the threshold T1, regarding the read as a sequence containing enzyme-digestion noise when similar sequences with similarity greater than the threshold T2 can be found in the extension region for the Soft Clip sequences at both ends.

11. A device for filtering noise introduced by enzyme-digestion library construction, characterized by comprising:

an acquisition module, configured to acquire an initial alignment result file obtained by aligning paired-end sequencing output data from enzyme-digestion library construction to a reference genome sequence;

an extraction statistics module, configured to extract reads containing Soft Clip marks from the initial alignment result file and count the number of Soft Clip bases in each read;

a marker extraction module, configured to mark reads whose Soft Clip base number is greater than a threshold T1 as candidate processing sequences, and to extract the alignment position of each candidate processing sequence on the reference genome and the Soft Clip base sequence in each candidate processing sequence;

an extension similarity alignment module, configured to extend a length D before and after the alignment position of each candidate processing sequence on the reference genome to obtain an extension region, to search each extension region for a sequence similar to the Soft Clip base sequence, and, if the similarity of the similar sequence is greater than a threshold T2, to regard the read as a read containing enzyme-digestion noise and store it in a removal file;

a noise removal module, configured to filter out the reads contained in the removal file from the initial alignment result file to obtain an alignment file from which the noise introduced by enzyme-digestion library construction has been removed.

12. The apparatus of claim 11, wherein the extraction statistics module comprises a plurality of extraction statistics sub-modules, and the plurality of extraction statistics sub-modules run in parallel.

13. The apparatus of claim 12, wherein the work is divided among the plurality of extraction statistics sub-modules as follows: the size of each split file is calculated as int(M/N) + 1 from the number of lines M of the initial alignment result file and the number of processes N, so as to obtain equally split alignment files.

14. The apparatus of claim 11, wherein the marker extraction module comprises:

a first marking module, configured to mark a read as a candidate processing sequence when its Soft Clip bases are located at only the front end or only the rear end and their number is greater than the threshold T1;

a second marking module, configured to mark a read as a candidate processing sequence when Soft Clip bases are present at both the front end and the rear end of the read and the Soft Clip base number at at least one end is greater than the threshold T1;

and a third marking module, configured not to mark a read as a candidate processing sequence when Soft Clip bases are present at both the front end and the rear end of the read, the Soft Clip base numbers at the front end and the rear end are each less than the threshold T1, and their sum is greater than the threshold T1.

15. The apparatus of claim 11, wherein the extension similarity alignment module comprises:

a first extension module, configured to extend only to the chromosome start position when the alignment position lies in the chromosome start region and its distance from the chromosome start position is less than D;

a second extension module, configured to extend only to the chromosome end position when the alignment position lies in the chromosome end region and its distance from the chromosome end position is less than D;

and a third extension module, configured to extend a length D both forward and backward when the alignment position lies in the middle of the chromosome and its distances from both the chromosome start position and the chromosome end position are greater than D.

16. The device of claim 15, wherein D is 200 to 400bp.

17. The apparatus of claim 15, wherein D is 250bp to 350bp.

18. The apparatus of claim 11, wherein the extended similarity alignment module comprises a local alignment module.

19. The apparatus of claim 18, wherein the scoring mechanism of the local alignment module is as follows: a matching base scores 2 points, a mismatched base scores -3 points, opening a gap scores -10 points, and extending an existing gap adds no score; the best-matching sequence for the Soft Clip base sequence in the extension region is found according to the optimal score.

20. The apparatus of claim 14, wherein the extension similarity alignment module further comprises:

a first noise sequence judging module, configured to, for a read with Soft Clip bases at only the front end or only the rear end in a number greater than the threshold T1, or a read with Soft Clip bases at both the front end and the rear end of which only one end exceeds the threshold T1, regard the read as a sequence containing enzyme-digestion noise when the extension region contains a similar sequence whose similarity to the Soft Clip base sequence is greater than the threshold T2;

and a second noise sequence judging module, configured to, for a read with Soft Clip bases at both the front end and the rear end, both exceeding the threshold T1, regard the read as a sequence containing enzyme-digestion noise when similar sequences with similarity greater than the threshold T2 can be found in the extension region for the Soft Clip sequences at both ends.

21. A computer-readable storage medium, comprising a stored program, wherein when the program runs, it controls the device on which the storage medium is located to execute the method for filtering noise introduced by enzyme-digestion library construction according to any one of claims 1 to 10.

22. A processor, wherein the processor is configured to run a program, and the program, when run, executes the method for filtering noise introduced by enzyme-digestion library construction according to any one of claims 1 to 10.