CN113782097B - Anchor point screening method and device based on bloom filter and computer equipment - Google Patents

Publication number
CN113782097B
Authority
CN
China
Prior art keywords
segment
sequence
sub
query sequence
reference sequence
Legal status
Active
Application number
CN202111041904.5A
Other languages
Chinese (zh)
Other versions
CN113782097A (en)
Inventor
张昂
廖湘科
崔英博
杨灿群
黄春
唐滔
彭林
夏泽宇
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN202111041904.5A
Publication of CN113782097A
Application granted
Publication of CN113782097B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a bloom filter-based anchor point screening method and device and to computer equipment. The method comprises: according to anchor points positioned in advance, selecting the segments of a query sequence and of a reference sequence that lie between two anchor points as a query sequence segment and a reference sequence segment; generating, for each of the two segments, a plurality of consecutive overlapping sub-segments of a preset length; establishing an index by mapping the reference sequence sub-segments into the bit vector of a bloom filter through a plurality of preset hash functions; querying the index and, when a query sequence sub-segment does not exist in the reference sequence, judging that the sub-segment does not pass the screening; traversing all query sequence sub-segments in the query sequence segment, counting the total number of sub-segments that do not pass the screening, and rejecting the left anchor point when the total is larger than a preset threshold; and traversing all anchor points until the screening of the anchor points is completed. The invention can improve the precision and speed of DNA sequence alignment.

Description

Anchor point screening method and device based on bloom filter and computer equipment
Technical Field
The application relates to the field of computer technology, and in particular to a bloom filter-based anchor point screening method and device for long-read DNA sequence alignment, and to computer equipment.
Background
Third-generation sequencing is a single-molecule detection technology. It does not require template amplification and therefore avoids the base bias introduced by the polymerase chain reaction. Its reads are also longer, so it can reveal information such as repetitive genome segments and structural variations that second-generation sequencing cannot detect, and it has brought new breakthroughs in fields such as genome assembly, structural variation detection and genome resequencing.
Sequence alignment is a fundamental step in sequencing data analysis, and its result is a prerequisite for the subsequent steps. Unlike alignment algorithms designed for second-generation short reads, fast and accurate alignment of third-generation long reads faces challenges such as longer read length and a higher sequencing error rate. To address this, third-generation long-read alignment mostly adopts a heuristic "seed-and-extend" approach: some short segments are selected from the read and the reference genome as seeds; anchor points are then located through exact matching of the seeds, which narrows the alignment range from the whole genome to a set of candidate regions; finally, base-level alignment is performed on the candidate regions by dynamic programming to refine the alignment result and carry out extension verification. The sequence alignment algorithm therefore mainly comprises three steps: seed generation, anchor point positioning and base alignment.
Because of sequencing errors and the local homology of the genome itself, existing alignment tools locate some wrong anchor points during global positioning, and aligning the segments between wrong anchor points produces sub-optimal or even wrong results. Moreover, these wrong anchor points must also undergo extension verification, which incurs a large amount of useless computation and reduces speed as well as alignment accuracy.
Disclosure of Invention
In view of the above, there is a need to provide a bloom filter-based anchor point screening method, apparatus, computer device and storage medium for long-read DNA sequence alignment that are capable of eliminating wrong anchor points.
A bloom filter-based anchor point screening method, the method comprising:
acquiring a query sequence to be aligned, a reference sequence and a plurality of anchor points obtained by positioning in advance, the query sequence being a long-read DNA sequence;
selecting a segment of the query sequence between a first anchor point and a second anchor point as a query sequence segment, and selecting a segment of the reference sequence between the first anchor point and the second anchor point as a reference sequence segment;
generating a plurality of consecutive overlapping reference sequence sub-segments from the reference sequence segment according to a preset length, and generating a plurality of consecutive overlapping query sequence sub-segments from the query sequence segment according to the preset length;
establishing an index by mapping the reference sequence sub-segments into the bit vector of a bloom filter through a plurality of preset hash functions;
querying, according to the index, whether each query sequence sub-segment exists in the reference sequence, and judging that the query sequence sub-segment does not pass the screening when it does not exist in the reference sequence;
traversing all query sequence sub-segments in the query sequence segment, accumulating the number of query sequence sub-segments that do not pass the screening, and rejecting the first anchor point when the accumulated value is larger than a preset threshold;
and traversing all anchor points until the screening of all anchor points is completed.
In one embodiment, the method further comprises: before generating the reference sequence sub-segments and the query sequence sub-segments, deleting the parts at both ends of the query sequence segment that are identical to the corresponding ends of the reference sequence segment, and updating the query sequence segment and the reference sequence segment.
In one embodiment, the method further comprises: before deleting the identical parts at both ends, aligning the query sequence segment with the reference sequence segment; extending from both ends of the segments toward the middle while comparing the bases one by one; and thereby obtaining the parts at both ends that are identical in the query sequence segment and the reference sequence segment.
In one embodiment, the method further comprises: if the two bases currently being compared in the query sequence segment and the reference sequence segment are the same, continuing to extend to the next base; if they are different, stopping the extension.
In one embodiment, the method further comprises: obtaining a plurality of hash values for each reference sequence sub-segment using the plurality of preset hash functions; and setting to 1 the bits of the bloom filter's bit vector at the positions given by those hash values.
In one embodiment, the method further comprises: obtaining a plurality of hash values for each query sequence sub-segment using the plurality of preset hash functions; querying the bit vector of the bloom filter at the positions given by those hash values; if the bits at all of these positions are 1, judging that the query sequence sub-segment passes the screening; if not all of them are 1, judging that the query sequence sub-segment does not pass the screening.
In one embodiment, the method further comprises: taking the second anchor point as the new first anchor point and the anchor point immediately to the right of the second anchor point as the new second anchor point, and carrying out the next round of anchor point screening.
A bloom filter-based anchor point screening device, the device comprising:
a sequence acquisition module, configured to acquire a query sequence to be aligned, a reference sequence and a plurality of anchor points obtained by positioning in advance, the query sequence being a long-read DNA sequence;
a segment selection module, configured to select a segment of the query sequence between a first anchor point and a second anchor point as a query sequence segment, and select a segment of the reference sequence between the first anchor point and the second anchor point as a reference sequence segment;
a sub-segment generation module, configured to generate a plurality of consecutive overlapping reference sequence sub-segments from the reference sequence segment according to a preset length, and to generate a plurality of consecutive overlapping query sequence sub-segments from the query sequence segment according to the preset length;
an index establishing module, configured to establish an index by mapping the reference sequence sub-segments into the bit vector of a bloom filter through a plurality of preset hash functions;
a query module, configured to query, according to the index, whether each query sequence sub-segment exists in the reference sequence, and to judge that the query sequence sub-segment does not pass the screening when it does not exist in the reference sequence;
an anchor point removing module, configured to traverse all query sequence sub-segments in the query sequence segment, accumulate the number of query sequence sub-segments that do not pass the screening, and reject the first anchor point when the accumulated value is larger than a preset threshold;
and a traversing module, configured to traverse all anchor points until the screening of all anchor points is completed.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a query sequence to be aligned, a reference sequence and a plurality of anchor points obtained by positioning in advance, the query sequence being a long-read DNA sequence;
selecting a segment of the query sequence between a first anchor point and a second anchor point as a query sequence segment, and selecting a segment of the reference sequence between the first anchor point and the second anchor point as a reference sequence segment;
generating a plurality of consecutive overlapping reference sequence sub-segments from the reference sequence segment according to a preset length, and generating a plurality of consecutive overlapping query sequence sub-segments from the query sequence segment according to the preset length;
establishing an index by mapping the reference sequence sub-segments into the bit vector of a bloom filter through a plurality of preset hash functions;
querying, according to the index, whether each query sequence sub-segment exists in the reference sequence, and judging that the query sequence sub-segment does not pass the screening when it does not exist in the reference sequence;
traversing all query sequence sub-segments in the query sequence segment, accumulating the number of query sequence sub-segments that do not pass the screening, and rejecting the first anchor point when the accumulated value is larger than a preset threshold;
and traversing all anchor points until the screening of all anchor points is completed.
A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, carrying out the following steps:
acquiring a query sequence to be aligned, a reference sequence and a plurality of anchor points obtained by positioning in advance, the query sequence being a long-read DNA sequence;
selecting a segment of the query sequence between a first anchor point and a second anchor point as a query sequence segment, and selecting a segment of the reference sequence between the first anchor point and the second anchor point as a reference sequence segment;
generating a plurality of consecutive overlapping reference sequence sub-segments from the reference sequence segment according to a preset length, and generating a plurality of consecutive overlapping query sequence sub-segments from the query sequence segment according to the preset length;
establishing an index by mapping the reference sequence sub-segments into the bit vector of a bloom filter through a plurality of preset hash functions;
querying, according to the index, whether each query sequence sub-segment exists in the reference sequence, and judging that the query sequence sub-segment does not pass the screening when it does not exist in the reference sequence;
traversing all query sequence sub-segments in the query sequence segment, accumulating the number of query sequence sub-segments that do not pass the screening, and rejecting the first anchor point when the accumulated value is larger than a preset threshold;
and traversing all anchor points until the screening of all anchor points is completed.
With the above bloom filter-based anchor point screening method, device, computer equipment and storage medium, according to anchor points positioned in advance, a segment of the query sequence between a first anchor point and a second anchor point is selected as a query sequence segment, and a segment of the reference sequence between the same two anchor points is selected as a reference sequence segment. A plurality of consecutive overlapping sub-segments of a preset length are generated for each of the two segments. An index is established by mapping the reference sequence sub-segments into the bit vector of the bloom filter through a plurality of preset hash functions, and the index is queried to determine whether each query sequence sub-segment exists in the reference sequence; a query sequence sub-segment that does not exist in the reference sequence is judged not to pass the screening. All query sequence sub-segments in the query sequence segment are traversed, the number of sub-segments that do not pass the screening is accumulated, and the first anchor point is rejected when the accumulated value is larger than a preset threshold. All anchor points are traversed until the screening of all anchor points is completed. The invention thus screens DNA sequence alignment anchor points with a bloom filter, eliminates wrong anchor points, and can improve the precision and speed of DNA sequence alignment.
Drawings
FIG. 1 is a schematic flow diagram of a bloom filter based anchor point screening method in one embodiment;
FIG. 2 is a diagram showing the generation of sub-segments from a sequence segment in one embodiment;
FIG. 3 is a schematic flow chart of a bloom filter-based anchor point screening method in another embodiment;
FIG. 4 is a block diagram of an anchor point screening device based on a bloom filter in one embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The bloom filter-based anchor point screening method provided by the application can be applied in the following environment. A terminal executes the method: according to anchor points positioned in advance, it selects a segment of the query sequence between a first anchor point and a second anchor point as a query sequence segment and a segment of the reference sequence between the same two anchor points as a reference sequence segment; generates a plurality of consecutive overlapping sub-segments of a preset length for each of the two segments; establishes an index by mapping the reference sequence sub-segments into the bit vector of the bloom filter through a plurality of preset hash functions; queries the index to determine whether each query sequence sub-segment exists in the reference sequence, judging that a query sequence sub-segment does not pass the screening when it does not exist in the reference sequence; and traverses all query sequence sub-segments in the query sequence segment, accumulates the number of sub-segments that do not pass the screening, and rejects the first anchor point when the accumulated value is larger than a preset threshold. The terminal may be, but is not limited to, a personal computer, a notebook computer or a tablet computer.
In one embodiment, as shown in FIG. 1, there is provided a bloom filter-based anchor point screening method, including the steps of:
Step 102: obtaining a query sequence to be aligned, a reference sequence and a plurality of anchor points obtained by positioning in advance.
The query sequence is a long-read DNA sequence. Third-generation long-read alignment mostly adopts the heuristic "seed-and-extend" approach, and a long-read DNA sequence alignment algorithm mainly comprises three steps: seed generation, anchor point positioning and base alignment.
Step 104: selecting a segment of the query sequence between the first anchor point and the second anchor point as a query sequence segment, and selecting a segment of the reference sequence between the first anchor point and the second anchor point as a reference sequence segment.
Sequence alignment proceeds interval by interval, on the sequences between adjacent anchor points. The invention screens anchor points in the same way: the anchor points delimit the intervals, and each anchor point is screened using the sequence between it and the next anchor point.
Step 106: generating a plurality of consecutive overlapping reference sequence sub-segments from the reference sequence segment according to a preset length, and generating a plurality of consecutive overlapping query sequence sub-segments from the query sequence segment according to the preset length.
As shown in FIG. 2, the sub-segments overlap consecutively and together cover the entire sequence segment.
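As a minimal sketch of the sub-segment generation in step 106, the Python snippet below slides a window of the preset length along a segment. The function name `overlapping_subsegments` and the shift of one base between consecutive windows are assumptions made for illustration; the text only states that the sub-segments overlap consecutively and cover the whole segment.

```python
def overlapping_subsegments(segment: str, length: int) -> list[str]:
    """Return every window of `length` bases, shifting by one base each time.

    The one-base shift is an assumption; a segment shorter than `length`
    yields no sub-segments.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [segment[i:i + length] for i in range(len(segment) - length + 1)]


print(overlapping_subsegments("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

A segment of n bases yields n - length + 1 sub-segments under this assumption, so the number of sub-segments grows linearly with the distance between two anchor points.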
Step 108: establishing an index by mapping the reference sequence sub-segments into the bit vector of the bloom filter through a plurality of preset hash functions.
A bloom filter is a bit vector. To map a value into the bloom filter, a plurality of hash values are generated for it using a plurality of different hash functions, and for each generated hash value the bit at the position it points to is set to 1.
Step 110: querying, according to the index, whether each query sequence sub-segment exists in the reference sequence, and judging that the query sequence sub-segment does not pass the screening when it does not exist in the reference sequence.
When a query sequence sub-segment is not present in the reference sequence, the query sequence segment contains a stretch of sequence that is absent from the reference sequence segment, i.e., the query sequence segment differs from the reference sequence segment.
Step 112: traversing all query sequence sub-segments in the query sequence segment, accumulating the number of query sequence sub-segments that do not pass the screening, and rejecting the first anchor point when the accumulated value is larger than a preset threshold.
When a considerable number of the query sequence sub-segments do not exist in the reference sequence segment, the query sequence segment differs greatly from the reference sequence segment, and the anchor point can be judged to be a wrong anchor point.
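The counting rule of step 112 might be sketched as follows. The function name `anchor_is_rejected` and the `maybe_in_reference` callback are hypothetical, and a plain Python set stands in for the bloom filter query so that the example stays short; it illustrates only the threshold logic, not the patented implementation.

```python
from typing import Callable, Iterable


def anchor_is_rejected(
    query_subsegments: Iterable[str],
    maybe_in_reference: Callable[[str], bool],
    threshold: int,
) -> bool:
    """Return True when more than `threshold` query sub-segments fail the lookup."""
    failed = 0
    for sub in query_subsegments:
        if not maybe_in_reference(sub):
            failed += 1
            # The decision cannot change once the threshold is exceeded.
            if failed > threshold:
                return True
    return False


# Stand-in for the bloom filter: an exact set of reference sub-segments.
reference_subsegments = {"ACG", "CGT", "GTA"}
query_subsegments = ["ACG", "CGA", "GAT", "ATA"]  # three of these are absent

print(anchor_is_rejected(query_subsegments, reference_subsegments.__contains__, 2))  # True
print(anchor_is_rejected(query_subsegments, reference_subsegments.__contains__, 3))  # False
```

Stopping as soon as the count exceeds the threshold gives the same decision as counting every sub-segment first, because the rule only asks whether the threshold is exceeded.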
Step 114: traversing all anchor points until the screening of all anchor points is completed.
In this bloom filter-based anchor point screening method, according to anchor points positioned in advance, a segment of the query sequence between a first anchor point and a second anchor point is selected as a query sequence segment, and a segment of the reference sequence between the same two anchor points is selected as a reference sequence segment. A plurality of consecutive overlapping sub-segments of a preset length are generated for each of the two segments. An index is established by mapping the reference sequence sub-segments into the bit vector of the bloom filter through a plurality of preset hash functions, and the index is queried to determine whether each query sequence sub-segment exists in the reference sequence; a query sequence sub-segment that does not exist in the reference sequence is judged not to pass the screening. All query sequence sub-segments in the query sequence segment are traversed, the number of sub-segments that do not pass the screening is accumulated, and the first anchor point is rejected when the accumulated value is larger than a preset threshold. All anchor points are traversed until the screening of all anchor points is completed. The invention thus screens DNA sequence alignment anchor points with a bloom filter, eliminates wrong anchor points, and can improve the precision and speed of DNA sequence alignment.
In one embodiment, the method further comprises: before generating the reference sequence sub-segments and the query sequence sub-segments, deleting the parts at both ends of the query sequence segment that are identical to the corresponding ends of the reference sequence segment, and updating the query sequence segment and the reference sequence segment.
Deleting the identical parts at both ends of the query sequence segment and the reference sequence segment, and updating the two segments accordingly, reduces redundant data.
In one embodiment, the method further comprises: before deleting the identical parts at both ends, aligning the query sequence segment with the reference sequence segment; extending from both ends of the segments toward the middle while comparing the bases one by one; and thereby obtaining the parts at both ends that are identical in the query sequence segment and the reference sequence segment.
In one embodiment, the method further comprises: if the two bases currently being compared in the query sequence segment and the reference sequence segment are the same, continuing to extend to the next base; if they are different, stopping the extension.
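The end-trimming described in the preceding embodiments could look roughly like the Python sketch below. The name `trim_identical_ends` is hypothetical, and the choice to scan the left end first and to stop the right-end scan before it overlaps the already matched prefix is an assumption, since the text does not say how the two extensions interact on very similar segments.

```python
def trim_identical_ends(qseq: str, tseq: str) -> tuple[str, str]:
    """Strip the matching prefix and matching suffix shared by the two segments."""
    limit = min(len(qseq), len(tseq))

    # Extend from the left end while the bases agree.
    left = 0
    while left < limit and qseq[left] == tseq[left]:
        left += 1

    # Extend from the right end, without crossing the matched prefix (assumption).
    right = 0
    while right < limit - left and qseq[-1 - right] == tseq[-1 - right]:
        right += 1

    return qseq[left:len(qseq) - right], tseq[left:len(tseq) - right]


print(trim_identical_ends("ACGTTTGA", "ACGAAGA"))  # ('TTT', 'AA')
```

Only the differing middle parts remain, so fewer sub-segments have to be hashed and queried afterwards.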
In one embodiment, the method further comprises: obtaining a plurality of hash values for each reference sequence sub-segment using the plurality of preset hash functions; and setting to 1 the bits of the bloom filter's bit vector at the positions given by those hash values.
In one embodiment, the method further comprises: obtaining a plurality of hash values for each query sequence sub-segment using the plurality of preset hash functions; querying the bit vector of the bloom filter at the positions given by those hash values; if the bits at all of these positions are 1, judging that the query sequence sub-segment passes the screening; if not all of them are 1, judging that the query sequence sub-segment does not pass the screening.
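The insertion and query rules of the two embodiments above can be illustrated with a small bloom filter in Python. This is a sketch under stated assumptions: the class name `BloomFilter`, the default sizes, and the use of salted BLAKE2b digests from the standard `hashlib` module as the "plurality of preset hash functions" are illustrative choices, not details taken from the patent.

```python
import hashlib


class BloomFilter:
    """Bit vector with several hash functions: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # must stay below 256 for the one-byte salt below
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        # One salted digest per hash function (assumed hash family).
        return [
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=bytes([k])).digest(),
                "little",
            ) % self.num_bits
            for k in range(self.num_hashes)
        ]

    def add(self, item: str) -> None:
        """Set to 1 every bit addressed by the item's hash values."""
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, item: str) -> bool:
        """Pass only if all addressed bits are 1; any 0 bit means the item was never added."""
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


reference_filter = BloomFilter()
for sub in ("ACGT", "CGTA", "GTAC"):  # reference sequence sub-segments
    reference_filter.add(sub)

print(reference_filter.might_contain("CGTA"))  # True: it was inserted
```

A sub-segment that was inserted always passes, while a sub-segment that was not inserted may occasionally pass by coincidence (a false positive); a failed lookup is therefore conclusive evidence of absence, which is the property the screening relies on.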
In one embodiment, the method further comprises: taking the second anchor point as the new first anchor point and the anchor point immediately to the right of the second anchor point as the new second anchor point, and carrying out the next round of anchor point screening.
It should be understood that, although the steps in the flowchart of FIG. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In a specific embodiment, as shown in fig. 3, there is provided a bloom filter-based anchor point screening method, including:
Step 1: aligning the query sequence with the reference sequence (target) according to the anchor points, the number of anchor points being N;
Step 2: selecting the segments between the first anchor point P1 from the left and the second anchor point P2 from the left, the segment of the query sequence being qseq1 and the segment of the reference sequence being tseq1;
Step 3: removing the identical parts at both ends of the query sequence segment qseq1 and the reference sequence segment tseq1;
Step 4: indexing the remainder of the reference sequence segment tseq1;
Step 5: querying the index with the remainder of qseq1 to determine whether to retain the left anchor point P1;
Step 6: completing the screening of anchor point P1;
Step 7: returning to step 2 and processing the segments between anchor point P2 and anchor point P3, and so on until the screening of the N anchor points is completed.
The method comprises the following more specific steps:
step 1: aligning a query sequence with a reference sequence target according to anchor points, wherein the number of the anchor points is N;
step 2: selecting a segment between a first anchor point P1 on the left side and a second anchor point P2 on the left side, wherein the segment of the query sequence is qseq1, and the segment of the reference sequence is tseq 1;
and step 3: removing the same parts at both ends of the query sequence segment qseq1 and the reference sequence segment tseq 1;
step 3.1: aligning two ends of qseq1 and tseq1, extending from the two ends to the middle, and aligning bases one by one;
step 3.1.1: if the two bases are the same, continuing to extend to the next base;
step 3.1.2: if the two bases are different, stopping the extension;
step 3.2: removing the same parts at both ends;
step 4: indexing the remainder of the reference sequence segment tseq1;
step 4.1: generating contiguous overlapping sub-segments of length q (q-mers) from the remainder of the reference sequence segment tseq1;
step 4.2: calculating the hash value of the q-mer by using a hash function;
step 4.3: inserting the calculated hash value of the q-mer into the bit vector of the bloom filter;
repeating steps 4.1 to 4.3 until all q-mers are processed;
step 5: querying the index with the remainder of qseq1 to determine whether to retain the left anchor point P1;
step 5.1: generating contiguous overlapping sub-segments of length q (q-mers) from the remainder of the query sequence segment qseq1;
step 5.2: calculating the hash value of the q-mer by using a hash function;
step 5.3: looking up the hash value calculated in step 5.2 in the bit vector of the bloom filter;
step 5.3.1: if all bits at the positions corresponding to the hash values are 1, the q-mer passes the screening;
step 5.3.2: if at least one bit at the positions corresponding to the hash values is 0, the q-mer does not pass the screening, and the cumulative number Count of q-mers that do not pass the screening is increased by 1;
step 5.3.3: judging whether Count is greater than a filtering threshold T;
step 5.3.3.1: when Count is greater than the filtering threshold T, the anchor point P1 is incorrectly positioned and is removed;
step 5.3.3.2: when Count is not greater than the filtering threshold T, the anchor point P1 is correctly positioned and is retained;
step 6: completing the screening of the anchor point P1;
step 7: returning to step 2 and processing the segment between anchor point P2 and anchor point P3, and so on, until the screening of all N anchor points is completed.
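The control flow of steps 2 to 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: anchors are represented as hypothetical (query position, reference position) pairs sorted from left to right, `count_failures` is a hypothetical callable standing in for steps 3 to 5.3.2 (it returns Count for one pair of segments), and the last anchor, which has no right-hand neighbour, is simply kept because the text does not say how it is handled.

```python
from typing import Callable, List, Tuple

# Hypothetical anchor layout: (position in query, position in reference).
Anchor = Tuple[int, int]


def screen_anchors(
    query: str,
    target: str,
    anchors: List[Anchor],
    count_failures: Callable[[str, str], int],
    threshold: int,
) -> List[Anchor]:
    """Return the anchors retained after screening each adjacent pair."""
    kept: List[Anchor] = []
    # Steps 2 and 7: slide over adjacent pairs (P1, P2), (P2, P3), ...
    for (q1, t1), (q2, t2) in zip(anchors, anchors[1:]):
        qseq = query[q1:q2]   # query segment between the two anchors
        tseq = target[t1:t2]  # reference segment between the two anchors
        count = count_failures(qseq, tseq)  # stands in for steps 3 to 5.3.2
        # Step 5.3.3: the left anchor is retained unless Count exceeds T.
        if count <= threshold:
            kept.append((q1, t1))
    # Assumption: the final anchor has no segment to its right, so it is kept.
    if anchors:
        kept.append(anchors[-1])
    return kept
```

Because the per-segment comparison is passed in as a function, the same loop applies whether Count comes from a bloom filter lookup or from any other q-mer comparison.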
In one embodiment, as shown in fig. 4, there is provided a bloom filter-based anchor point screening apparatus, including: a sequence acquisition module 402, a segment selection module 404, a sub-segment generation module 406, an index creation module 408, a query module 410, an anchor culling module 412, and a traversal module 414, where:
a sequence acquisition module 402, configured to acquire a query sequence to be compared, a reference sequence, and a plurality of anchor points obtained by pre-positioning, the query sequence being a long-read DNA sequence;
a segment selection module 404, configured to select the segment of the query sequence between a first anchor point and a second anchor point as a query sequence segment, and select the segment of the reference sequence between the first anchor point and the second anchor point as a reference sequence segment;
a sub-segment generation module 406, configured to generate a plurality of contiguous overlapping reference sequence sub-segments of a preset length from the reference sequence segment, and generate a plurality of contiguous overlapping query sequence sub-segments of the preset length from the query sequence segment;
an index creation module 408, configured to establish an index through a plurality of preset hash functions by mapping the reference sequence sub-segments into the bit vector of the bloom filter;
a query module 410, configured to query, according to the index, whether a query sequence sub-segment exists in the reference sequence, and to determine that the query sequence sub-segment fails the screening when it does not exist in the reference sequence;
an anchor culling module 412, configured to traverse all query sequence sub-segments in the query sequence segment, count the accumulated number of query sequence sub-segments that fail the screening, and remove the first anchor point when the accumulated number is greater than a preset threshold;
and a traversal module 414, configured to traverse all anchor points until the screening of all anchor points is completed.
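The sub-segment generation performed by module 406 can be illustrated with a short sketch. The function name `qmers` is hypothetical, and the sketch assumes that sequences are plain strings and that consecutive sub-segments are offset by one base.

```python
from typing import List


def qmers(segment: str, q: int) -> List[str]:
    """Return every contiguous, overlapping substring of length q."""
    if q <= 0:
        raise ValueError("q must be positive")
    # A segment of length n yields n - q + 1 sub-segments (none if n < q).
    return [segment[i:i + q] for i in range(len(segment) - q + 1)]
```

For instance, `qmers("ACGTA", 3)` yields `ACG`, `CGT`, and `GTA`, and a segment shorter than q yields no sub-segments at all.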
The segment selection module 404 is further configured to delete the identical portions at both ends of the query sequence segment and the reference sequence segment, and to update the query sequence segment and the reference sequence segment accordingly.
The segment selection module 404 is further configured to align the query sequence segment with the reference sequence segment, extend from both ends of the sequences toward the middle while comparing the bases one by one, and thereby obtain the identical portions at both ends of the query sequence segment and the reference sequence segment. If the two bases being compared in the query sequence segment and the reference sequence segment are the same, the extension continues to the next base; if they are different, the extension stops.
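The end-trimming described above can be sketched as follows. This is an illustrative sketch only: the function name `trim_identical_ends` is hypothetical, and the guard that keeps the right-hand extension from crossing the left-hand one is an assumption added so that fully identical segments trim to empty strings.

```python
from typing import Tuple


def trim_identical_ends(qseq: str, tseq: str) -> Tuple[str, str]:
    """Remove the matching prefix and suffix shared by both segments."""
    limit = min(len(qseq), len(tseq))
    # Extend from the left end while the bases agree.
    left = 0
    while left < limit and qseq[left] == tseq[left]:
        left += 1
    # Extend from the right end, never crossing the left boundary.
    right = 0
    while right < limit - left and qseq[-1 - right] == tseq[-1 - right]:
        right += 1
    return qseq[left:len(qseq) - right], tseq[left:len(tseq) - right]
```

With `qseq = "ACGTTGCA"` and `tseq = "ACGATGCA"`, the shared prefix `ACG` and the shared suffix `TGCA` are removed, leaving `T` and `A` as the remainders to be indexed and queried.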
The index creation module 408 is further configured to obtain a plurality of hash values of each reference sequence sub-segment according to the plurality of preset hash functions, and to set the bits at the corresponding positions of the bit vector of the bloom filter to 1 according to the hash values of the reference sequence sub-segment.
The query module 410 is further configured to obtain a plurality of hash values of each query sequence sub-segment according to the plurality of preset hash functions; look up the corresponding positions of the bit vector of the bloom filter according to the hash values of the query sequence sub-segment; if the values at all positions corresponding to the plurality of hash values of the query sequence sub-segment are 1, determine that the query sequence sub-segment passes the screening; and if they are not all 1, determine that the query sequence sub-segment does not pass the screening.
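The indexing and lookup behaviour of modules 408 and 410 can be sketched with a small bloom filter. Everything named here is an assumption made for illustration: the patent does not name its hash functions, so the sketch derives k functions by salting BLAKE2b, and the class name `BloomFilter`, the helper `count_failed_qmers`, and the default sizes `m` and `k` are hypothetical. A bloom filter never reports an inserted item as absent, but it can occasionally report an absent item as present, so Count may slightly undercount true mismatches.

```python
import hashlib
from typing import List


class BloomFilter:
    """Bit vector of m positions probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per position, for readability

    def _positions(self, item: str) -> List[int]:
        # Assumption: k hash functions obtained by salting BLAKE2b with i.
        return [
            int.from_bytes(
                hashlib.blake2b(
                    item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
                ).digest(),
                "little",
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        # Index step: set every probed position to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Query step: the item passes only if all probed positions are 1.
        return all(self.bits[pos] for pos in self._positions(item))


def count_failed_qmers(qseq: str, tseq: str, q: int,
                       m: int = 1024, k: int = 3) -> int:
    """Index the q-mers of tseq, then count q-mers of qseq that fail."""
    index = BloomFilter(m, k)
    for i in range(len(tseq) - q + 1):
        index.add(tseq[i:i + q])
    return sum(
        not index.might_contain(qseq[i:i + q])
        for i in range(len(qseq) - q + 1)
    )
```

The returned count plays the role of the accumulated value that is compared against the preset threshold when deciding whether to remove the first anchor point.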
The query module 410 is further configured to perform a next round of anchor screening by using the second anchor as the first anchor and using an anchor adjacent to the right side of the second anchor as the second anchor.
For specific limitations of the bloom filter-based anchor point screening device, reference may be made to the above limitations of the bloom filter-based anchor point screening method, which are not repeated here. The modules in the above bloom filter-based anchor point screening device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a bloom filter based anchor point screening method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A bloom filter-based anchor point screening method, comprising:
acquiring a query sequence to be compared, a reference sequence and a plurality of anchor points obtained by pre-positioning; the query sequence is a long-read DNA sequence;
selecting a segment of the query sequence between a first anchor point and a second anchor point as a query sequence segment, and selecting a segment of the reference sequence between the first anchor point and the second anchor point as a reference sequence segment;
generating a plurality of consecutive overlapping reference sequence sub-segments according to a preset length from the reference sequence segment, and generating a plurality of consecutive overlapping query sequence sub-segments according to the preset length from the query sequence segment;
establishing an index through a plurality of preset hash functions, and mapping the reference sequence sub-segments into a bit vector of a bloom filter;
querying, according to the index, whether the query sequence sub-segment exists in the reference sequence, and determining that the query sequence sub-segment does not pass the screening when the query sequence sub-segment does not exist in the reference sequence;
traversing all query sequence sub-segments in the query sequence segments, counting the accumulated values of the query sequence sub-segments which do not pass the screening, and rejecting the first anchor point when the accumulated values are larger than a preset threshold value;
and traversing all anchor points until the screening of all anchor points is completed.
2. The method of claim 1, further comprising, prior to generating a plurality of consecutive overlapping reference sequence sub-segments according to a preset length from the reference sequence segment and a plurality of consecutive overlapping query sequence sub-segments according to the preset length from the query sequence segment:
deleting the same parts at both ends of the query sequence segment and the reference sequence segment, and updating the query sequence segment and the reference sequence segment.
3. The method of claim 2, further comprising, before deleting the same portions at both ends of the query sequence segment and the reference sequence segment and updating the query sequence segment and the reference sequence segment:
aligning the query sequence segment and the reference sequence segment;
extending from two ends of the sequence to the middle, and comparing the bases one by one;
and obtaining the identical parts at both ends of the query sequence segment and the reference sequence segment.
4. The method of claim 3, wherein the base-by-base alignment extending from both ends of the sequence towards the middle comprises:
if the two bases of the query sequence segment and the reference sequence segment which are being compared are the same, continuing to extend to the next base;
stopping extension if the two bases of the query sequence fragment and the reference sequence fragment being aligned are different.
5. The method of claim 4, wherein the mapping the reference sequence sub-segments into bit vectors of a bloom filter by indexing through a predetermined plurality of hash functions comprises:
obtaining a plurality of hash values of each reference sequence sub-segment according to a plurality of preset hash functions;
and setting the value of the position corresponding to the bit vector of the bloom filter to be 1 according to the hash value of the sub-segment of the reference sequence.
6. The method of claim 5, wherein querying the reference sequence for the presence of the query sequence sub-segment according to the index, and determining that the query sequence sub-segment fails to be filtered when the query sequence sub-segment is not present in the reference sequence comprises:
obtaining a plurality of hash values of each query sequence sub-segment according to a plurality of preset hash functions;
querying the corresponding positions of the bit vector of the bloom filter according to the hash values of the query sequence sub-segment;
if all the values of the positions corresponding to the plurality of hash values of the query sequence sub-segment are 1, judging that the query sequence sub-segment passes screening;
if not all are 1, judging that the query sequence sub-segment does not pass the screening.
7. The method of any one of claims 1 to 6, after culling the first anchor point, comprising:
and taking the second anchor point as a first anchor point and taking the anchor point adjacent to the right side of the second anchor point as a second anchor point, and carrying out the next round of anchor point screening.
8. An anchor point screening device based on a bloom filter, the device comprising:
a sequence acquisition module, configured to acquire a query sequence to be compared, a reference sequence, and a plurality of anchor points obtained by pre-positioning; the query sequence is a long-read DNA sequence;
a segment selection module, configured to select a segment of the query sequence between a first anchor point and a second anchor point as a query sequence segment, and select a segment of the reference sequence between the first anchor point and the second anchor point as a reference sequence segment;
a sub-segment generation module, configured to generate a plurality of consecutive overlapping reference sequence sub-segments according to a preset length from the reference sequence segment, and generate a plurality of consecutive overlapping query sequence sub-segments according to the preset length from the query sequence segment;
an index creation module, configured to establish an index through a plurality of preset hash functions and map the reference sequence sub-segments into a bit vector of the bloom filter;
a query module, configured to query, according to the index, whether the query sequence sub-segment exists in the reference sequence, and determine that the query sequence sub-segment does not pass the screening when the query sequence sub-segment does not exist in the reference sequence;
an anchor culling module, configured to traverse all query sequence sub-segments in the query sequence segment, count the accumulated value of the query sequence sub-segments that do not pass the screening, and cull the first anchor point when the accumulated value is greater than a preset threshold;
and a traversal module, configured to traverse all anchor points until the screening of all anchor points is completed.
9. The apparatus of claim 8, wherein the segment selection module is further configured to delete the identical portions at both ends of the query sequence segment and the reference sequence segment, and update the query sequence segment and the reference sequence segment.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111041904.5A CN113782097B (en) 2021-09-07 2021-09-07 Anchor point screening method and device based on bloom filter and computer equipment

Publications (2)

Publication Number Publication Date
CN113782097A CN113782097A (en) 2021-12-10
CN113782097B true CN113782097B (en) 2022-06-24

Family

ID=78841456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111041904.5A Active CN113782097B (en) 2021-09-07 2021-09-07 Anchor point screening method and device based on bloom filter and computer equipment

Country Status (1)

Country Link
CN (1) CN113782097B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665772B (en) * 2023-05-30 2024-02-13 之江实验室 Genome map analysis method, device and medium based on memory calculation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320654A (en) * 2014-05-28 2016-02-10 中国科学院深圳先进技术研究院 Dynamic bloom filter and element operating method based on same
CN111292805A (en) * 2020-03-19 2020-06-16 山东大学 Third-generation sequencing data overlapping detection method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3663890B1 (en) * 2017-08-02 2024-04-10 GeneMind Biosciences Company Limited Alignment method, device and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《Cell 深度| 一套普遍适用于各类单细胞测序数据集的锚定整合方案 (qq.com)》;生信宝典;《https://mp.weixin.qq.com/s?__biz=MzI5MTcwNjA4NQ%3D%3D&idx=1&mid=2247488563&scene=21&sn=89fc2f3f8762c276c950331f82993fad#wechat_redirect》;20190612;全文 *
《基于锚点的多基因组序列比对算法》;苗素超;《万方学位论文数据库》;20110803;全文 *

Also Published As

Publication number Publication date
CN113782097A (en) 2021-12-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant