CN115719618A

CN115719618A - Reverse weighted sequence alignment seed generation method, device, equipment and memory

Info

Publication number: CN115719618A
Application number: CN202211422257.7A
Authority: CN
Inventors: 崔英博; 张昂; 李发; 唐滔; 彭林; 郭宇飞; 彭晨晨; 夏泽宇; 郭逸飞
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2022-11-14
Filing date: 2022-11-14
Publication date: 2023-02-28

Abstract

The application relates to a method, a device, equipment and a storage medium for generating reverse weighted sequence alignment seeds, in the technical field of computers. The method comprises the following steps: dividing the read reference sequence into N windows, and generating consecutive overlapping k-mers for each window; counting the number of occurrences of each k-mer across all windows, and calculating the occurrence frequency of each k-mer from that count; assigning a weight value to each k-mer according to its occurrence frequency, wherein the weight value of a k-mer decreases as its occurrence frequency increases; calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and generating the minimum sub-seed (minimizer) of each window according to the hash values of the k-mers in that window. The method realizes reverse weighted seed generation for repetitive regions of the genome and supports the design and implementation of accurate alignment algorithms.

Description

Reverse weighted sequence alignment seed generation method, device, equipment and memory

Technical Field

The present application relates to the field of computer technologies, and in particular to a method, an apparatus, a device, and a storage medium for generating a reverse weighted sequence alignment seed.

Background

Sequence alignment compares the reads output by a sequencer against a reference genome to find the most likely location of origin of each read on the genome. It is a fundamental step in sequencing data analysis, and its result is a prerequisite for the subsequent steps. Seed-and-extend sequence alignment algorithms can effectively narrow the range of base-level alignment, reducing the amount of computation and improving alignment speed. However, the large number of repeated segments in repetitive regions of the genome can cause seeds to be mislocated, generating false positive hit sites and reducing the accuracy of sequence alignment.

Among existing seed generation methods, the minimum sub-seed (minimizer) method is efficient in both time and space and is increasingly popular. The method divides the genome uniformly into windows of equal length, generates consecutive k-mers (segments of k bases) in each window, calculates the hash value of each k-mer with a predefined hash function, and selects the k-mer with the smallest hash value as the minimum sub-seed of each window. However, eukaryotic genomes contain many repetitive fragments, such as the tandem repeats of satellite DNA near the centromere, so the minimum sub-seeds selected in repetitive regions occur at a higher frequency than those in other regions. During alignment, these high-frequency minimum sub-seeds lead the mapping algorithm to repetitive regions, producing many false positive sites and wasting considerable computation and space at the base-level alignment stage.

Heng Li, in the paper "Minimap2: pairwise alignment for nucleotide sequences", disclosed a method that counts the frequency of all minimum sub-seeds and discards the 0.02% with the highest frequency. Chirag Jain et al., in the article "A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases", disclosed a method that discards the 0.001% of minimum sub-seeds with the highest frequency. Although both methods avoid false positive sites to some extent, they compromise the integrity of the minimum sub-seed algorithm, reduce alignment accuracy in repetitive regions of the genome, and affect downstream data analysis.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, an apparatus, a device and a storage medium for generating a reverse weighted sequence alignment seed.

A method of reverse weighted sequence alignment seed generation, the method comprising:

reading a reference sequence, and dividing the reference sequence into N windows, wherein each of the first N-1 windows comprises w bases, the number of bases in the N-th window is not more than w, and both N and w are integers greater than 1;

generating consecutive overlapping k-mers for each window;

counting the occurrences of each k-mer across all windows to obtain the number of occurrences of each k-mer;

calculating the occurrence frequency of each k-mer according to its number of occurrences;

calculating the weight value of each k-mer with a preset weight function according to the occurrence frequency of the k-mer, wherein the weight value of a k-mer decreases as its occurrence frequency increases;

calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and

comparing the hash values of all k-mers within each window to generate the minimum sub-seed of each window.

In one embodiment, the step of calculating the hash value of each k-mer by using a preset hash function according to the base sequence information and the occurrence frequency of each k-mer comprises:

wherein h(k_i) is the hash value of the i-th k-mer, taking values in the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight value of the i-th k-mer, taking values in the open interval (0, 1); and g(k_i) is an ordinary hash function that does not consider weight information, taking values in the open interval (0, 1).

In one embodiment, the step of comparing the hash values of all k-mers in each window to generate the minimum sub-seed of each window comprises:

taking the first window as the current window; taking the first k-mer in the current window as the current k-mer; setting the minimum hash value of the current window to 1; comparing the hash value of the current k-mer with the minimum hash value to obtain a comparison result; when the hash value of the current k-mer is smaller than the minimum hash value, updating the minimum hash value to the hash value of the current k-mer and recording that k-mer as the candidate minimum sub-seed of the current window; otherwise leaving the minimum hash value and the candidate unchanged; if the current k-mer is not the last k-mer in the current window, processing the next k-mer in the same way; if the current k-mer is the last k-mer in the current window, taking the candidate minimum sub-seed as the minimum sub-seed of the current window, updating the current window to the next window, and performing the next iteration; and repeating until all windows have been traversed, thereby completing the generation of the minimum sub-seeds of all windows.

In one embodiment, counting the occurrences of each k-mer across all windows to obtain the number of occurrences of each k-mer comprises:

setting a frequency record table to be empty, wherein each row of the frequency record table holds a k-mer and the number of occurrences of that k-mer; reading a k-mer, denoted k_i, and checking whether the frequency record table already contains a k-mer with the same sequence as k_i; if it does not, adding a row to the frequency record table, recording k_i in that row, and recording the number of occurrences of k_i as 1; if it does, adding 1 to the number of occurrences of the corresponding k-mer in the frequency record table; and reading the next k-mer and continuing in the same way until all k-mers have been traversed, thereby obtaining the number of occurrences of each k-mer.

An apparatus for reverse weighted sequence alignment seed generation, the apparatus comprising:

and the reference sequence reading module is used for reading the reference sequence and dividing the reference sequence into N windows, wherein the first N-1 windows comprise w bases, the number of the bases in the Nth window is not more than w, and N and w are integers more than 1.

The k-mer occurrence frequency statistical module is used for generating consecutive overlapping k-mers for each window; counting the occurrences of each k-mer across all windows to obtain the number of occurrences of each k-mer; and calculating the occurrence frequency of each k-mer according to its number of occurrences.

The k-mer weight calculation module is used for calculating the weight value of each k-mer by adopting a preset weight function according to the occurrence frequency of each k-mer; the weight value of the k-mer decreases as the frequency of occurrence of the k-mer increases.

And the k-mer hash value calculation module is used for calculating the hash value of each k-mer by adopting a preset hash function according to the base sequence information and the occurrence frequency of each k-mer.

And the minimum sub-seed generation module is used for respectively comparing the hash values of all the k-mers in each window to generate the minimum sub-seed of each window.

In one embodiment, the predetermined hash function in the k-mer hash value calculation module is:

wherein h(k_i) is the hash value of the i-th k-mer, taking values in the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight value of the i-th k-mer, taking values in the open interval (0, 1); and g(k_i) is an ordinary hash function that does not consider weight information, taking values in the open interval (0, 1).

In one embodiment, the minimum sub-seed generation module is further configured to use the first window as a current window; taking the first k-mer in the current window as the current k-mer; setting the minimum hash value of the k-mer in the current window as 1; comparing the hash value of the current k-mer in the current window with the minimum hash value to obtain a comparison result; when the comparison result is that the hash value of the current k-mer is smaller than the minimum hash value, updating the minimum hash value to the hash value of the current k-mer, and recording the corresponding k-mer as the candidate minimum seed of the current window; when the comparison result is that the hash value of the current k-mer is larger than the minimum hash value, if the current k-mer is not the last k-mer in the current window, processing the next k-mer; and if the current k-mer is the last k-mer in the current window, updating the current window to the next window by taking the candidate minimum sub-seed of the current window as the minimum sub-seed of the current window, and continuing to perform the next iteration until all windows are traversed, thereby completing the generation process of the minimum sub-seeds of all windows.

In one embodiment, the k-mer occurrence frequency statistics module is further configured to: set a frequency record table to be empty, wherein each row of the frequency record table holds a k-mer and the number of occurrences of that k-mer; read a k-mer, denoted k_i, and check whether the frequency record table already contains a k-mer with the same sequence as k_i; if it does not, add a row to the frequency record table, record k_i in that row, and record the number of occurrences of k_i as 1; if it does, add 1 to the number of occurrences of the corresponding k-mer in the frequency record table; and read the next k-mer and continue in the same way until all k-mers have been traversed, thereby obtaining the number of occurrences of each k-mer.

A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.

The above method, apparatus, device and storage medium for generating reverse weighted sequence alignment seeds comprise: reading the reference sequence and dividing it into N windows, each window comprising at most w bases; generating consecutive overlapping k-mers for each window; counting the occurrences of each k-mer across all windows to obtain the number of occurrences of each k-mer; calculating the occurrence frequency of each k-mer according to its number of occurrences; calculating the weight value of each k-mer with a preset weight function according to its occurrence frequency, wherein the weight value of a k-mer decreases as its occurrence frequency increases; calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and comparing the hash values of all k-mers within each window to generate the minimum sub-seed of each window. The method reduces false positive sites without deleting high-frequency seeds, preserves the integrity of the seed generation algorithm, realizes reverse weighted seed generation for repetitive regions of the genome, and supports the design and implementation of accurate alignment algorithms.

Drawings

FIG. 1 is a schematic flow diagram of a method for generating a reverse weighted sequence alignment seed according to one embodiment;

FIG. 2 is a block diagram of an apparatus for reverse weighted sequence alignment seed generation in one embodiment;

FIG. 3 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

False positive locations in repetitive regions of the genome arise because the minimum sub-seeds of those regions occur at too high a frequency; the problem can be solved if k-mers with excessively high frequency are not selected as minimum sub-seeds. The invention therefore adopts a weighted seed selection method in which each k-mer is given a weight: the higher the weight, the higher the probability that the k-mer is selected as a minimum sub-seed, and the lower the weight, the lower that probability. K-mers with higher frequency are given lower weights, which realizes reverse weighting, so that high-frequency k-mers are less likely to be selected as minimum sub-seeds and do not introduce excessive false positive sites.

In one embodiment, as shown in FIG. 1, a method for generating a reverse weighted sequence alignment seed is provided, the method comprising the following steps:

step 100: reading the reference sequence, and dividing the reference sequence into N windows, wherein the first N-1 windows all comprise w bases, the number of the bases of the Nth window is not more than w, and both N and w are integers more than 1.

Specifically, in the field of gene comparison, the reference sequence is a gene base sequence template which is accumulated and established for many years, also called a standard gene library, which represents the corresponding relationship between the currently known gene and the gene effect, and the gene effect of the sequence to be compared can be predicted by comparing the sequence to be compared with the reference sequence.

Step 102: from each window, successive overlapping k-mers are generated for the corresponding window.

Specifically, a window is read and consecutive overlapping k-mers are generated for it. If the window is not the N-th window, this yields w-k+1 k-mers, denoted k_i, i = 1, 2, ..., w-k+1; if it is the N-th window, this yields v-k+1 k-mers, where v is the number of bases in the N-th window. Consecutive overlapping k-mers are then generated for the next window, and so on until all N windows of the reference sequence have been traversed and consecutive overlapping k-mers have been generated for every window.
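Steps 100 and 102 can be illustrated with a minimal Python sketch. It assumes that the windows are consecutive and non-overlapping, as the description suggests; the function names `split_windows` and `window_kmers` and the toy reference string are hypothetical and do not come from the application.

```python
def split_windows(reference: str, w: int) -> list:
    """Split the reference into consecutive windows of w bases.

    The last window may hold fewer than w bases.
    """
    return [reference[i:i + w] for i in range(0, len(reference), w)]


def window_kmers(window: str, k: int) -> list:
    """Return the consecutive overlapping k-mers of one window.

    A window of length L yields L - k + 1 k-mers (none if L < k).
    """
    return [window[i:i + k] for i in range(len(window) - k + 1)]


reference = "ACGTACGTAC"  # toy reference of 10 bases (illustrative only)
windows = split_windows(reference, w=4)
kmers_per_window = [window_kmers(win, k=3) for win in windows]
```

With w = 4 and k = 3, each full window yields w-k+1 = 2 k-mers, while the final two-base window is shorter than k and yields none.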

Step 104: and counting the occurrence times of each k-mer in all windows to obtain the occurrence times of each k-mer.

Step 106: calculating the occurrence frequency of each k-mer according to its number of occurrences.

Specifically, a k-mer and its number of occurrences n_i are read from the frequency record table; the occurrence frequency f(k_i) of the k-mer is calculated from its number of occurrences; the next k-mer and its number of occurrences are then read and its occurrence frequency calculated, and so on until all k-mers in the frequency record table have been traversed, giving the occurrence frequency of each k-mer.
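The text does not state how f(k_i) is derived from n_i. The following sketch assumes, purely for illustration, that the frequency is the count divided by the total number of k-mer occurrences; the function name `occurrence_frequencies` and the sample counts are hypothetical.

```python
def occurrence_frequencies(counts: dict) -> dict:
    """Convert occurrence counts n_i into frequencies f(k_i).

    Assumption (not stated in the source): f(k_i) = n_i / sum of all counts.
    """
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


counts = {"ACG": 2, "CGT": 2, "GTA": 1}  # made-up counts
freqs = occurrence_frequencies(counts)
```

Under this assumption every frequency lies in (0, 1] and the frequencies sum to 1, which makes them convenient inputs for a weight function defined on that range.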

Step 108: calculating the weight value of each k-mer by adopting a preset weight function according to the occurrence frequency of each k-mer; the weight value of a k-mer decreases with increasing frequency of occurrence of the k-mer.

Specifically, a k-mer and its occurrence frequency f(k_i) are read from the frequency record table, and the weight value of the k-mer is calculated with the preset weight function; the next k-mer in the frequency record table that has no weight value yet is then read together with its occurrence frequency f(k_i), and its weight value is calculated with the preset weight function; and so on until all k-mers in the frequency record table have been traversed, giving the weight value of each k-mer.

The preset weight function is designed as follows: the input of the preset weight function φ is the occurrence frequency f(k_i) of a k-mer, and the output is a weight value in the open interval (0, 1). The preset weight function φ assigns lower weights to k-mers with very high frequency and higher weights to k-mers with lower frequency, thereby realizing reverse weighting.
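The application only constrains φ to map a frequency to a value in (0, 1) that decreases as the frequency grows; it does not give a concrete form. One function with these properties, chosen here only as an example, is 1 / (1 + scale · f). The name `reverse_weight` and the `scale` parameter are assumptions made for this sketch.

```python
def reverse_weight(frequency: float, scale: float = 10.0) -> float:
    """Example weight function: strictly decreasing in frequency, output in (0, 1).

    Assumed form (not from the source): phi(f) = 1 / (1 + scale * f),
    with 0 < f <= 1 and scale > 0.
    """
    if not 0.0 < frequency <= 1.0:
        raise ValueError("frequency must lie in (0, 1]")
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return 1.0 / (1.0 + scale * frequency)


low_freq_weight = reverse_weight(0.2)   # rarer k-mer, larger weight
high_freq_weight = reverse_weight(0.4)  # more frequent k-mer, smaller weight
```

Any other monotonically decreasing function with range inside (0, 1) would satisfy the stated requirement equally well.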

Step 110: and calculating the hash value of each k-mer by adopting a preset hash function according to the base sequence information and the occurrence frequency of each k-mer.

Specifically, a k-mer k_i and its occurrence frequency information are read from the frequency record table; its hash value H_i is calculated by the hash function h(k_i) according to the base sequence information and the occurrence frequency of k_i.

Step 112: and respectively comparing the hash values of all the k-mers in each window to generate the minimum sub-seed of each window.

The above method for generating reverse weighted sequence alignment seeds comprises: reading a reference sequence and dividing it into N windows; generating consecutive overlapping k-mers for each window; counting the occurrences of each k-mer across all windows to obtain the number of occurrences of each k-mer; calculating the occurrence frequency of each k-mer according to its number of occurrences; calculating the weight value of each k-mer with a preset weight function according to its occurrence frequency, wherein the weight value of a k-mer decreases as its occurrence frequency increases; calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and comparing the hash values of all k-mers within each window to generate the minimum sub-seed of each window. The method reduces false positive sites without deleting high-frequency seeds, preserves the integrity of the seed generation algorithm, realizes reverse weighted seed generation for repetitive regions of the genome, and supports the design and implementation of accurate alignment algorithms.

In one embodiment, the hash function preset in step 110 is:

wherein h(k_i) is the hash value of the i-th k-mer, taking values in the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight value of the i-th k-mer, taking values in the open interval (0, 1); and g(k_i) is an ordinary hash function that does not consider weight information, taking values in the open interval (0, 1).
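The formula of the preset hash function is not reproduced in this text, so the following sketch should not be read as the claimed function. It merely shows one way to combine an ordinary hash g(k_i) with a weight φ so that a larger weight gives a smaller hash value: the assumed form h = g ** φ. The functions `plain_hash` and `weighted_hash`, and the use of BLAKE2b for g, are illustrative choices.

```python
import hashlib


def plain_hash(kmer: str) -> float:
    """Ordinary hash g(k): depends only on the base sequence, value in (0, 1)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=6).digest()
    value = int.from_bytes(digest, "big")  # integer in [0, 2**48 - 1]
    return (value + 0.5) / 2**48           # strictly between 0 and 1


def weighted_hash(kmer: str, weight: float) -> float:
    """Assumed combination (not the patent's formula): h = g(k) ** weight.

    For g in (0, 1), a smaller weight pushes h toward 1, so low-weight
    (high-frequency) k-mers are less likely to have the window minimum.
    """
    if not 0.0 < weight < 1.0:
        raise ValueError("weight must lie in (0, 1)")
    return plain_hash(kmer) ** weight


h_rare = weighted_hash("ACG", 0.9)    # high weight, as for a rare k-mer
h_common = weighted_hash("ACG", 0.1)  # low weight, as for a frequent k-mer
```

For the same base sequence, the high-weight hash is smaller than the low-weight hash, which is the behaviour the weighting scheme relies on when the smallest hash value in a window is selected.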

In one embodiment, step 112 includes: taking the first window as the current window; taking the first k-mer in the current window as the current k-mer; setting the minimum hash value of the current window to 1; comparing the hash value of the current k-mer with the minimum hash value to obtain a comparison result; when the hash value of the current k-mer is smaller than the minimum hash value, updating the minimum hash value to the hash value of the current k-mer and recording that k-mer as the candidate minimum sub-seed of the current window; otherwise leaving the minimum hash value and the candidate unchanged; if the current k-mer is not the last k-mer in the current window, processing the next k-mer in the same way; if the current k-mer is the last k-mer in the current window, taking the candidate minimum sub-seed as the minimum sub-seed of the current window, updating the current window to the next window, and performing the next iteration; and repeating until all windows have been traversed, thereby completing the generation of the minimum sub-seeds of all windows.
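The per-window scan of step 112 can be sketched as follows. The hash values are supplied as a precomputed dictionary with made-up numbers, and the name `window_minimizer` is hypothetical; returning `None` for a window without k-mers is a choice made for this sketch, since the text does not cover that case.

```python
from typing import Optional


def window_minimizer(kmers: list, hash_of: dict) -> Optional[str]:
    """Scan one window and return the k-mer with the smallest hash value."""
    min_hash = 1.0    # hash values lie in (0, 1), so 1 is a safe initial bound
    candidate = None  # candidate minimum sub-seed of the current window
    for kmer in kmers:
        h = hash_of[kmer]
        if h < min_hash:  # strictly smaller: update minimum and candidate
            min_hash = h
            candidate = kmer
    return candidate


hash_of = {"ACG": 0.62, "CGT": 0.17, "GTA": 0.45}  # made-up hash values
kmers_per_window = [["ACG", "CGT"], ["GTA", "ACG"], []]
minimizers = [window_minimizer(kmers, hash_of) for kmers in kmers_per_window]
```

Because the comparison is strict, the first k-mer encountered wins when two k-mers in a window share the same hash value.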

In one embodiment, step 104 includes: setting a frequency record table to be empty, wherein each row of the frequency record table holds a k-mer and the number of occurrences of that k-mer; reading a k-mer, denoted k_i, and checking whether the frequency record table already contains a k-mer with the same sequence as k_i; if it does not, adding a row to the frequency record table, recording k_i in that row, and recording the number of occurrences of k_i as 1; if it does, adding 1 to the number of occurrences of the corresponding k-mer in the frequency record table; and reading the next k-mer and continuing in the same way until all k-mers have been traversed, thereby obtaining the number of occurrences of each k-mer. The frequency record table is shown in Table 1.

Table 1. Frequency record table

k-mer	Count n_i	Frequency f(k_i)
k₁	n₁	f₁
k₂	n₂	f₂
k₃	n₃	f₃
……	……	……
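The frequency record table of step 104 maps naturally onto a dictionary keyed by the k-mer sequence. The sketch below follows the insert-or-increment procedure described above; the function name `count_kmers` and the sample input are illustrative only.

```python
def count_kmers(kmers_per_window: list) -> dict:
    """Build the frequency record table: k-mer -> number of occurrences."""
    table = {}  # starts empty, as in step 104
    for kmers in kmers_per_window:
        for kmer in kmers:
            if kmer not in table:
                table[kmer] = 1   # new row with count 1
            else:
                table[kmer] += 1  # existing row: add 1 to its count
    return table


kmers_per_window = [["ACG", "CGT"], ["ACG", "CGT"], ["GTA"]]
table = count_kmers(kmers_per_window)
```

The standard library class `collections.Counter` provides the same counting behaviour in a single call; the explicit loop is kept here to mirror the described steps.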

It should be understood that, although the steps in the flowchart of FIG. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 2, there is provided an apparatus for generating a reverse weighted sequence alignment seed, comprising: a reference sequence reading module, a k-mer occurrence frequency statistical module, a k-mer weight calculation module, a k-mer hash value calculation module and a minimum sub-seed generation module, wherein:

and the reference sequence reading module is used for reading the reference sequence and dividing the reference sequence into N windows, wherein the first N-1 windows respectively comprise w bases, the number of the bases of the Nth window is not more than w, and both N and w are integers more than 1.

The k-mer weight calculation module is used for calculating the weight value of each k-mer by adopting a preset weight function according to the occurrence frequency of each k-mer; the weight value of a k-mer decreases with increasing frequency of occurrence of the k-mer.

The minimum sub-seed generation module is used for comparing the hash values of all k-mers within each window to generate the minimum sub-seed of each window.

In one embodiment, the preset hash function in the k-mer hash value calculation module is:

wherein h(k_i) is the hash value of the i-th k-mer, taking values in the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight value of the i-th k-mer, taking values in the open interval (0, 1); and g(k_i) is an ordinary hash function that does not consider weight information, taking values in the open interval (0, 1).

In one embodiment, the minimum sub-seed generation module is further configured to use the first window as a current window; taking the first k-mer in the current window as a current k-mer; setting the minimum hash value of the k-mer in the current window to be 1; comparing the hash value of the current k-mer in the current window with the minimum hash value to obtain a comparison result; when the comparison result is that the hash value of the current k-mer is smaller than the minimum hash value, updating the minimum hash value to the hash value of the current k-mer, and recording the corresponding k-mer as the candidate minimum seed of the current window; when the comparison result is that the hash value of the current k-mer is larger than the minimum hash value, if the current k-mer is not the last k-mer in the current window, processing the next k-mer; and if the current k-mer is the last k-mer in the current window, updating the current window to be the next window by taking the candidate minimum sub-seed of the current window as the minimum sub-seed of the current window, and continuing to perform the next iteration until all windows are traversed, thereby completing the generation process of the minimum sub-seeds of all windows.

In one embodiment, the k-mer occurrence frequency statistics module is further configured to: set a frequency record table to be empty, wherein each row of the frequency record table holds a k-mer and the number of occurrences of that k-mer; read a k-mer, denoted k_i, and check whether the frequency record table already contains a k-mer with the same sequence as k_i; if it does not, add a row to the frequency record table, record k_i in that row, and record the number of occurrences of k_i as 1; if it does, add 1 to the number of occurrences of the corresponding k-mer in the frequency record table; and read the next k-mer and continue in the same way until all k-mers have been traversed, thereby obtaining the number of occurrences of each k-mer.

For the specific limitations of the apparatus for generating the reverse weighted sequence alignment seed, reference may be made to the limitations of the method for generating the reverse weighted sequence alignment seed above, which are not repeated here. The modules in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 3. The computer device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for inverse weighted sequence alignment seed generation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the configuration shown in fig. 3 is a block diagram of only a portion of the configuration associated with the present application, and is not intended to limit the computing device to which the present application may be applied, and that a particular computing device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims

1. A method for generating reverse weighted sequence alignment seeds, the method comprising:

reading a reference sequence, and dividing the reference sequence into N windows, wherein the first N-1 windows comprise w bases, the number of the bases of the Nth window is not more than w, and both N and w are integers more than 1;

generating, for each window, the continuously overlapping k-mers of that window;

counting the occurrences of each k-mer in all windows to obtain the number of occurrences of each k-mer;

calculating the occurrence frequency of each k-mer according to the number of occurrences of each k-mer;

calculating the weight value of each k-mer by adopting a preset weight function according to the occurrence frequency of each k-mer; the weight value of the k-mer decreases as the frequency of occurrence of the k-mer increases;

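calculating the hash value of each k-mer by adopting a preset hash function according to the base sequence information and the occurrence frequency of each k-mer;

comparing the hash values of all k-mers in each window separately to generate the minimum sub-seed of each window.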
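
The steps of claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the function names (`split_windows`, `kmers`, `plain_hash`, `weight`, `weighted_hash`, `generate_seeds`) are hypothetical, the occurrence frequency is assumed to be the count divided by the total number of k-mers, the weight function `1 / (1 + f)` is only one possible function that decreases with frequency, and the combination `g ** weight` is an assumed stand-in for the preset hash function, whose formula is not reproduced in this text.

```python
import hashlib
from collections import Counter


def split_windows(reference, w):
    """Divide the reference into windows of w bases; the last may be shorter."""
    return [reference[i:i + w] for i in range(0, len(reference), w)]


def kmers(window, k):
    """Continuously overlapping k-mers of one window (empty if window < k)."""
    return [window[i:i + k] for i in range(len(window) - k + 1)]


def plain_hash(kmer):
    """Ordinary hash g(k_i) without weight information, mapped into (0, 1)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=4).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**32 + 1)


def weight(freq):
    """ASSUMED weight function: decreases as the frequency increases."""
    return 1.0 / (1.0 + freq)


def weighted_hash(kmer, freq):
    """ASSUMED combination: a higher weight yields a smaller hash value."""
    return plain_hash(kmer) ** weight(freq)


def generate_seeds(reference, w, k):
    """Return one minimum sub-seed per window (None if a window has no k-mer)."""
    per_window = [kmers(win, k) for win in split_windows(reference, w)]
    counts = Counter(km for kms in per_window for km in kms)
    total = sum(counts.values())
    seeds = []
    for kms in per_window:
        if not kms:
            seeds.append(None)
            continue
        seeds.append(min(kms, key=lambda km: weighted_hash(km, counts[km] / total)))
    return seeds
```

With this choice, a k-mer that occurs often receives a lower weight and therefore a hash value closer to 1, so it is less likely to be selected as the minimum sub-seed of its window.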

2. The method according to claim 1, wherein, in the step of calculating the hash value of each k-mer by adopting a preset hash function according to the base sequence information and the occurrence frequency of each k-mer, the preset hash function is:

wherein h(k_i) is the hash value of the i-th k-mer, and its value range is the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight value of the i-th k-mer, and its value range is the open interval (0, 1); g(k_i) is an ordinary hash function that does not take weight information into account, and its value range is the open interval (0, 1).

3. The method of claim 1, wherein comparing the hash values of all k-mers in each window separately to generate the minimum sub-seed of each window comprises:

taking the first window as a current window;

taking the first k-mer in the current window as a current k-mer;

setting the minimum hash value of the k-mer in the current window as 1;

comparing the hash value of the current k-mer in the current window with the minimum hash value to obtain a comparison result;

when the comparison result is that the hash value of the current k-mer is smaller than the minimum hash value, updating the minimum hash value to the hash value of the current k-mer, and recording the corresponding k-mer as the candidate minimum sub-seed of the current window;

when the comparison result is that the hash value of the current k-mer is larger than the minimum hash value: if the current k-mer is not the last k-mer in the current window, continuing to process the next k-mer; and if the current k-mer is the last k-mer in the current window, taking the candidate minimum sub-seed of the current window as the minimum sub-seed of the current window, updating the current window to the next window, and continuing with the next iteration until all windows have been traversed, thereby completing the generation of the minimum sub-seeds of all windows.
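
The window traversal of claim 3 can be sketched as follows. This is a minimal illustration, assuming that each window is already available as a list of `(k-mer, hash value)` pairs with hash values in the open interval (0, 1); the function name `minimum_sub_seeds` and this input layout are hypothetical.

```python
def minimum_sub_seeds(windows):
    """Pick, for every window, the k-mer with the smallest hash value.

    `windows` is a list of windows; each window is a list of
    (kmer, hash_value) pairs with hash values strictly below 1.
    """
    seeds = []
    for window in windows:
        min_hash = 1.0      # initial minimum hash value of the window
        candidate = None    # candidate minimum sub-seed of the window
        for kmer, hash_value in window:
            if hash_value < min_hash:
                min_hash = hash_value
                candidate = kmer
        # after the last k-mer, the candidate becomes the minimum sub-seed
        seeds.append(candidate)
    return seeds
```

Because every hash value is below 1, initializing the minimum to 1 guarantees that the first k-mer of a non-empty window always becomes the first candidate. In this sketch, a window without any k-mer yields `None`.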

4. The method of claim 1, wherein counting the occurrences of each k-mer in all windows to obtain the number of occurrences of each k-mer comprises:

setting a frequency record table to be empty, wherein the information in the frequency record table comprises: the k-mer and the number of occurrences of the k-mer;

reading a k-mer, denoted k_i, and checking whether the frequency record table already contains a k-mer with the same sequence as k_i;

if no k-mer with the same sequence as k_i exists in the frequency record table, adding a row to the frequency record table, recording k_i in that row, and recording the number of occurrences of k_i as 1;

if a k-mer with the same sequence as k_i exists in the frequency record table, adding 1 to the number of occurrences of the corresponding k-mer in the frequency record table;

and reading the next k-mer and continuing the processing until all k-mers have been traversed, so as to obtain the number of occurrences of each k-mer.
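
The frequency record table of claim 4 maps naturally onto a dictionary. The sketch below is illustrative only: the names `count_kmers` and `occurrence_frequencies` are hypothetical, and the normalization by the total number of k-mers is an assumption about how the occurrence frequency is derived from the number of occurrences.

```python
def count_kmers(kmers):
    """Build the frequency record table: k-mer -> number of occurrences."""
    table = {}  # the frequency record table starts empty
    for kmer in kmers:
        if kmer not in table:
            table[kmer] = 1      # new row, number of occurrences recorded as 1
        else:
            table[kmer] += 1     # existing row, add 1 to its count
    return table


def occurrence_frequencies(table):
    """ASSUMED normalization: count divided by the total number of k-mers."""
    total = sum(table.values())
    return {kmer: count / total for kmer, count in table.items()}
```

For example, `count_kmers(["AC", "CG", "AC"])` produces a table in which `AC` has 2 occurrences and `CG` has 1.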

5. An apparatus for generating reverse weighted sequence alignment seeds, the apparatus comprising:

a reference sequence reading module, configured to read a reference sequence and divide the reference sequence into N windows, wherein the first N-1 windows all comprise w bases, the number of bases of the Nth window is not more than w, and both N and w are integers more than 1;

a k-mer frequency statistics module, configured to generate, for each window, the continuously overlapping k-mers of that window; count the occurrences of each k-mer in all windows to obtain the number of occurrences of each k-mer; and calculate the occurrence frequency of each k-mer according to the number of occurrences of each k-mer;

a k-mer weight calculation module, configured to calculate the weight value of each k-mer by adopting a preset weight function according to the occurrence frequency of each k-mer, wherein the weight value of the k-mer decreases as the occurrence frequency of the k-mer increases;

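a k-mer hash value calculation module, configured to calculate the hash value of each k-mer by adopting a preset hash function according to the base sequence information and the occurrence frequency of each k-mer;

a minimum sub-seed generation module, configured to compare the hash values of all k-mers in each window separately to generate the minimum sub-seed of each window.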

6. The apparatus of claim 5, wherein the predetermined hash function in the k-mer hash value calculation module is:

wherein h(k_i) is the hash value of the i-th k-mer, and its value range is the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight value of the i-th k-mer, and its value range is the open interval (0, 1); g(k_i) is an ordinary hash function that does not take weight information into account, and its value range is the open interval (0, 1).

7. The apparatus of claim 5, wherein the minimum sub-seed generation module is further configured to: take the first window as a current window; take the first k-mer in the current window as a current k-mer; set the minimum hash value of the k-mer in the current window to 1; compare the hash value of the current k-mer in the current window with the minimum hash value to obtain a comparison result; when the comparison result is that the hash value of the current k-mer is smaller than the minimum hash value, update the minimum hash value to the hash value of the current k-mer, and record the corresponding k-mer as the candidate minimum sub-seed of the current window; when the comparison result is that the hash value of the current k-mer is larger than the minimum hash value: if the current k-mer is not the last k-mer in the current window, continue to process the next k-mer; and if the current k-mer is the last k-mer in the current window, take the candidate minimum sub-seed of the current window as the minimum sub-seed of the current window, update the current window to the next window, and continue with the next iteration until all windows have been traversed, thereby completing the generation of the minimum sub-seeds of all windows.

8. The apparatus of claim 5, wherein the k-mer frequency statistics module is further configured to: set a frequency record table to be empty, wherein the information in the frequency record table comprises: the k-mer and the number of occurrences of the k-mer; read a k-mer, denoted k_i, and check whether the frequency record table already contains a k-mer with the same sequence as k_i; if no k-mer with the same sequence as k_i exists in the frequency record table, add a row to the frequency record table, record k_i in that row, and record the number of occurrences of k_i as 1; if a k-mer with the same sequence as k_i exists in the frequency record table, add 1 to the number of occurrences of the corresponding k-mer in the frequency record table; and read the next k-mer and continue the processing until all k-mers have been traversed, so as to obtain the number of occurrences of each k-mer.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any of claims 1 to 4 when executing the computer program.

10. A computer-readable memory, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 4.