CN115719618A - Reverse weighted sequence alignment seed generation method, device, equipment and memory - Google Patents


Info

Publication number
CN115719618A
Authority
CN
China
Prior art keywords
mer
window
hash value
current
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211422257.7A
Other languages
Chinese (zh)
Inventor
崔英博
张昂
李发
唐滔
彭林
郭宇飞
彭晨晨
夏泽宇
郭逸飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202211422257.7A
Publication of CN115719618A
Legal status: Pending

Landscapes

  • Complex Calculations (AREA)

Abstract

The application relates to a method, an apparatus, a device and a storage medium for generating reverse weighted sequence alignment seeds, in the field of computer technology. The method comprises: dividing the read reference sequence into N windows and generating consecutive overlapping k-mers for each window; counting the number of occurrences of each k-mer across all windows and calculating the occurrence frequency of each k-mer from that count; assigning a weight to each k-mer according to its occurrence frequency, the weight decreasing as the occurrence frequency increases; calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and generating the minimizer of each window from the hash values of the k-mers in that window. The method realizes reverse weighted seed generation in repetitive regions of the genome and supports the design and implementation of accurate alignment algorithms.

Description

Reverse weighted sequence alignment seed generation method, device, equipment and memory
Technical Field
The present application relates to the field of computer technology, and in particular to a method, an apparatus, a device and a storage medium for generating reverse weighted sequence alignment seeds.
Background
Sequence alignment compares the reads output by a sequencer against a reference genome to find the most likely location of origin of each read on the genome. It is a fundamental step in sequencing data analysis, and its results are a prerequisite for the downstream steps. Seed-and-extend alignment algorithms effectively narrow the range of base-level alignment, reduce the amount of computation and improve alignment speed. However, the large number of repeated segments in repetitive regions of the genome can cause seeds to be mislocated, producing false positive hit sites and degrading alignment accuracy.
Among existing seed generation methods, the minimizer (minimum seed) method is increasingly popular because of its time and space advantages. The method divides the genome uniformly into windows of equal length, generates consecutive k-mers (segments of k bases) in each window, computes the hash value of each k-mer with a predefined hash function, and selects the k-mer with the smallest hash value as the minimizer of the window. However, eukaryotic genomes contain many repetitive segments, such as the tandem repeats of satellite DNA near the centromere, so the minimizers selected in repetitive regions occur far more frequently than those selected elsewhere. During alignment, these high-frequency minimizers steer the mapping algorithm toward the repetitive regions, introducing many false positive sites and wasting considerable computation and space in the base-level alignment stage.
Heng Li, in the paper "Minimap2: pairwise alignment for nucleotide sequences", disclosed a method that counts the frequency of all minimizers and discards the 0.02% of minimizers with the highest frequency. Chirag Jain et al., in the paper "A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases", disclosed a method that discards the 0.001% of minimizers with the highest frequency. Although both methods avoid false positive sites to some extent, they compromise the integrity of the minimizer algorithm, reduce alignment accuracy in repetitive regions of the genome, and affect downstream data analysis.
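As a rough illustration of the discarding strategy described above, the following Python sketch drops the most frequent fraction of distinct minimizers from a list of selected seeds. The function name drop_most_frequent, the rounding rule and the toy data are assumptions made for this example; it is not the implementation used in either cited paper.

```python
import math
from collections import Counter


def drop_most_frequent(minimizers, fraction):
    """Remove every occurrence of the most frequent distinct minimizers.

    `fraction` is the share of distinct minimizers to discard
    (for example 0.0002 for 0.02%). Rounding up is an assumption.
    """
    counts = Counter(minimizers)
    n_drop = math.ceil(len(counts) * fraction)
    banned = {kmer for kmer, _ in counts.most_common(n_drop)}
    return [m for m in minimizers if m not in banned]


# Toy data: one seed from a repeat region dominates the list.
seeds = ["ACG"] * 50 + ["TTA", "GGC", "CAT", "TGA"]
kept = drop_most_frequent(seeds, 0.2)
```

In the toy data the repeat-derived seed ACG is removed entirely, which is the loss of seeds in repetitive regions that the paragraph above criticizes.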
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a device and a storage medium for generating reverse weighted sequence alignment seeds.
A method for generating reverse weighted sequence alignment seeds, the method comprising:
reading a reference sequence and dividing it into N windows, wherein each of the first N-1 windows contains w bases, the N-th window contains no more than w bases, and N and w are both integers greater than 1;
generating consecutive overlapping k-mers for each window;
counting the occurrences of each k-mer across all windows to obtain the occurrence count of each k-mer;
calculating the occurrence frequency of each k-mer from its occurrence count;
calculating the weight of each k-mer from its occurrence frequency with a preset weight function, the weight of a k-mer decreasing as its occurrence frequency increases;
calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and
comparing the hash values of all k-mers within each window to generate the minimizer of each window.
In one embodiment, the step of calculating the hash value of each k-mer by using a preset hash function according to the base sequence information and the occurrence frequency of each k-mer comprises:
Figure BDA0003942254170000021
where h(k_i) is the hash value of the i-th k-mer, with values in the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight of the i-th k-mer, with values in the open interval (0, 1); and g(k_i) is an ordinary hash function that does not take weight information into account, with values in the open interval (0, 1).
In one embodiment, the step of comparing the hash values of all k-mers within each window to generate the minimizer of each window comprises:
taking the first window as the current window; taking the first k-mer in the current window as the current k-mer; setting the minimum hash value of the current window to 1; comparing the hash value of the current k-mer with the minimum hash value to obtain a comparison result; when the hash value of the current k-mer is smaller than the minimum hash value, updating the minimum hash value to the hash value of the current k-mer and recording that k-mer as the candidate minimizer of the current window; when the hash value of the current k-mer is larger than the minimum hash value, leaving the minimum hash value unchanged; then, if the current k-mer is not the last k-mer in the current window, processing the next k-mer; and if the current k-mer is the last k-mer in the current window, taking the candidate minimizer of the current window as its minimizer, updating the current window to the next window, and continuing with the next iteration until all windows have been traversed, thereby completing minimizer generation for all windows.
In one embodiment, counting the occurrences of each k-mer across all windows to obtain the occurrence count of each k-mer comprises:
initializing an empty frequency record table, each row of which records a k-mer and its occurrence count; reading a k-mer, denoted k_i, and checking whether the frequency record table already contains a k-mer with the same sequence as k_i; if not, adding a row to the frequency record table, recording k_i in that row and setting the occurrence count of k_i to 1; if so, adding 1 to the occurrence count of the corresponding k-mer in the frequency record table; and reading the next k-mer and repeating the process until all k-mers have been traversed, thereby obtaining the occurrence count of each k-mer.
An apparatus for generating reverse weighted sequence alignment seeds, the apparatus comprising:
a reference sequence reading module, configured to read a reference sequence and divide it into N windows, wherein each of the first N-1 windows contains w bases, the N-th window contains no more than w bases, and N and w are both integers greater than 1;
a k-mer occurrence frequency statistics module, configured to generate consecutive overlapping k-mers for each window, count the occurrences of each k-mer across all windows to obtain the occurrence count of each k-mer, and calculate the occurrence frequency of each k-mer from its occurrence count;
a k-mer weight calculation module, configured to calculate the weight of each k-mer from its occurrence frequency with a preset weight function, the weight of a k-mer decreasing as its occurrence frequency increases;
a k-mer hash value calculation module, configured to calculate the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and
a minimizer generation module, configured to compare the hash values of all k-mers within each window to generate the minimizer of each window.
In one embodiment, the preset hash function in the k-mer hash value calculation module is:
Figure BDA0003942254170000041
where h(k_i) is the hash value of the i-th k-mer, with values in the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight of the i-th k-mer, with values in the open interval (0, 1); and g(k_i) is an ordinary hash function that does not take weight information into account, with values in the open interval (0, 1).
In one embodiment, the minimizer generation module is further configured to: take the first window as the current window; take the first k-mer in the current window as the current k-mer; set the minimum hash value of the current window to 1; compare the hash value of the current k-mer with the minimum hash value to obtain a comparison result; when the hash value of the current k-mer is smaller than the minimum hash value, update the minimum hash value to the hash value of the current k-mer and record that k-mer as the candidate minimizer of the current window; when the hash value of the current k-mer is larger than the minimum hash value, leave the minimum hash value unchanged; then, if the current k-mer is not the last k-mer in the current window, process the next k-mer; and if the current k-mer is the last k-mer in the current window, take the candidate minimizer of the current window as its minimizer, update the current window to the next window, and continue with the next iteration until all windows have been traversed, thereby completing minimizer generation for all windows.
In one embodiment, the k-mer occurrence frequency statistics module is further configured to: initialize an empty frequency record table, each row of which records a k-mer and its occurrence count; read a k-mer, denoted k_i, and check whether the frequency record table already contains a k-mer with the same sequence as k_i; if not, add a row to the frequency record table, record k_i in that row and set the occurrence count of k_i to 1; if so, add 1 to the occurrence count of the corresponding k-mer in the frequency record table; and read the next k-mer and repeat the process until all k-mers have been traversed, thereby obtaining the occurrence count of each k-mer.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
In the above method, apparatus, device and storage medium for generating reverse weighted sequence alignment seeds, the method comprises: reading a reference sequence and dividing it into N windows of w bases each (the last window may contain fewer); generating consecutive overlapping k-mers for each window; counting the occurrences of each k-mer across all windows to obtain the occurrence count of each k-mer; calculating the occurrence frequency of each k-mer from its occurrence count; calculating the weight of each k-mer from its occurrence frequency with a preset weight function, the weight decreasing as the occurrence frequency increases; calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and comparing the hash values of all k-mers within each window to generate the minimizer of each window. The method reduces false positive sites without deleting high-frequency seeds, preserves the integrity of the seed generation algorithm, realizes reverse weighted seed generation in repetitive regions of the genome, and supports the design and implementation of accurate alignment algorithms.
Drawings
FIG. 1 is a schematic flow diagram of a method for generating reverse weighted sequence alignment seeds in one embodiment;
FIG. 2 is a block diagram of an apparatus for reverse weighted sequence alignment seed generation in one embodiment;
FIG. 3 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
False positive locations arise in repetitive regions of the genome because the minimizers of those regions occur too frequently; the problem can be solved if selecting overly frequent k-mers as minimizers is avoided. The invention therefore proposes a weighted seed selection method in which each k-mer is assigned a weight: the higher the weight, the more likely the k-mer is to be selected as a minimizer, and the lower the weight, the less likely. K-mers with higher frequencies are given lower weights, realizing reverse weighting, so that high-frequency k-mers are less likely to be selected as minimizers and do not introduce excessive false positive sites.
In one embodiment, as shown in FIG. 1, a method for generating reverse weighted sequence alignment seeds is provided, comprising the following steps:
Step 100: reading a reference sequence and dividing it into N windows, wherein each of the first N-1 windows contains w bases, the N-th window contains no more than w bases, and N and w are both integers greater than 1.
Specifically, in the field of sequence alignment, the reference sequence is a template of gene base sequences accumulated and established over many years, also called a standard gene library. It represents the correspondence between currently known genes and their effects, and the effect of a sequence to be aligned can be predicted by aligning it against the reference sequence.
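The window division of Step 100 can be sketched in Python as follows. The function name split_into_windows and the example sequence are assumptions introduced only for illustration; the sketch simply cuts the sequence into consecutive, non-overlapping pieces of w bases, leaving a shorter final piece when the length is not a multiple of w.

```python
def split_into_windows(reference, w):
    """Split `reference` into consecutive windows of w bases.

    The first N-1 windows hold exactly w bases; the last one holds
    at most w bases.
    """
    if w <= 1:
        raise ValueError("w must be an integer greater than 1")
    return [reference[i:i + w] for i in range(0, len(reference), w)]


windows = split_into_windows("ACGTACGTAC", 4)
```

For this 10-base example with w = 4, the first two windows contain 4 bases each and the third window contains the remaining 2 bases.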
Step 102: generating consecutive overlapping k-mers for each window.
Specifically, a window is read and consecutive overlapping k-mers are generated for it. If the window is not the N-th window, it yields w-k+1 k-mers, denoted k_i, i = 1, 2, ..., w-k+1; if it is the N-th window, it yields v-k+1 k-mers, where v is the number of bases in the N-th window. The same is done for the next window, and so on, until all N windows of the reference sequence have been traversed and consecutive overlapping k-mers have been generated for every window.
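A minimal Python sketch of Step 102 is given below. The function name window_kmers is an assumption for illustration. It slides a length-k frame one base at a time across a single window, so a window of w bases yields w-k+1 k-mers; returning an empty list for a window shorter than k is a choice of this sketch, since the text does not address that case.

```python
def window_kmers(window, k):
    """Return the consecutive overlapping k-mers of one window.

    A window of length L yields L - k + 1 k-mers; a window shorter
    than k yields none (an assumption of this sketch).
    """
    return [window[i:i + k] for i in range(len(window) - k + 1)]


kmers = window_kmers("ACGTA", 3)
```

Here w = 5 and k = 3, so the window produces 5 - 3 + 1 = 3 k-mers.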
Step 104: counting the occurrences of each k-mer across all windows to obtain the occurrence count of each k-mer.
Step 106: calculating the occurrence frequency of each k-mer from its occurrence count.
Specifically, a k-mer and its occurrence count n_i are read from the frequency record table; the occurrence frequency f(k_i) of the k-mer is calculated from its occurrence count; the next k-mer and its occurrence count are then read and its occurrence frequency is calculated, and so on, until all k-mers in the frequency record table have been traversed and the occurrence frequency of each k-mer has been obtained.
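The text does not state how f(k_i) is derived from n_i. The Python sketch below assumes the relative frequency, that is, n_i divided by the total number of k-mer occurrences; both this formula and the function name frequencies are assumptions made for illustration.

```python
def frequencies(counts):
    """Map each k-mer to an assumed relative frequency n_i / sum(n).

    `counts` is a dict from k-mer to occurrence count, playing the
    role of the frequency record table.
    """
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


freq = frequencies({"ACG": 3, "CGT": 1})
```

With this definition every frequency lies in (0, 1] and the frequencies of all distinct k-mers sum to 1.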
Step 108: calculating the weight of each k-mer from its occurrence frequency with a preset weight function; the weight of a k-mer decreases as its occurrence frequency increases.
Specifically, a k-mer and its occurrence frequency f(k_i) are read from the frequency record table and the weight of the k-mer is calculated with the preset weight function; the next k-mer in the frequency record table whose weight has not yet been calculated is then read together with its occurrence frequency f(k_i) and its weight is calculated with the preset weight function, and so on, until all k-mers in the frequency record table have been traversed and the weight of each k-mer has been obtained.
The preset weight function is designed as follows: the input of the preset weight function φ is the occurrence frequency f(k_i) of a k-mer and its output is a weight in the open interval (0, 1). The preset weight function φ assigns lower weights to k-mers with very high frequency and higher weights to k-mers with lower frequency, realizing reverse weighting.
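The text constrains φ only by its range and its monotonicity, without giving a concrete form. One hypothetical choice that satisfies both constraints for frequencies in (0, 1] is φ(f) = 1 / (1 + α·f) with α > 0. The form, the parameter alpha and its default value in the Python sketch below are assumptions for illustration, not the weight function of the application.

```python
def inverse_weight(frequency, alpha=100.0):
    """Hypothetical weight function: phi(f) = 1 / (1 + alpha * f).

    Strictly decreasing in `frequency`, so frequent k-mers get lower
    weights. `alpha` controls how strongly they are penalized.
    """
    if not 0.0 < frequency <= 1.0:
        raise ValueError("frequency must lie in (0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return 1.0 / (1.0 + alpha * frequency)


rare = inverse_weight(0.01)    # low-frequency k-mer
common = inverse_weight(0.5)   # high-frequency k-mer
```

A larger alpha separates rare and common k-mers more sharply; any other strictly decreasing function with values in (0, 1) could be substituted.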
Step 110: calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer.
Specifically, a k-mer k_i and its occurrence frequency information are read from the frequency record table; its hash value H_i is calculated by the hash function h(k_i) according to the base sequence information and the occurrence frequency of k_i.
Step 112: comparing the hash values of all k-mers within each window to generate the minimizer of each window.
The above method for generating reverse weighted sequence alignment seeds comprises: reading a reference sequence and dividing it into N windows; generating consecutive overlapping k-mers for each window; counting the occurrences of each k-mer across all windows to obtain the occurrence count of each k-mer; calculating the occurrence frequency of each k-mer from its occurrence count; calculating the weight of each k-mer from its occurrence frequency with a preset weight function, the weight decreasing as the occurrence frequency increases; calculating the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and comparing the hash values of all k-mers within each window to generate the minimizer of each window. The method reduces false positive sites without deleting high-frequency seeds, preserves the integrity of the seed generation algorithm, realizes reverse weighted seed generation in repetitive regions of the genome, and supports the design and implementation of accurate alignment algorithms.
In one embodiment, the hash function preset in step 110 is:
Figure BDA0003942254170000071
where h(k_i) is the hash value of the i-th k-mer, with values in the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight of the i-th k-mer, with values in the open interval (0, 1); and g(k_i) is an ordinary hash function that does not take weight information into account, with values in the open interval (0, 1).
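The formula itself appears in this text only as a figure reference, so the Python sketch below does not reproduce it. It shows one hypothetical way of combining an ordinary hash g with a weight φ so that a lower weight pushes the hash value toward 1: h = 1 - g^(1/φ). The combination, the use of BLAKE2b for g, and the names plain_hash and weighted_hash are all assumptions for illustration.

```python
import hashlib


def plain_hash(kmer):
    """Ordinary hash g: maps a k-mer to a float strictly inside (0, 1)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    x = int.from_bytes(digest, "big") >> 12  # keep 52 bits
    return (x + 0.5) / 2**52                 # exact in double precision


def weighted_hash(kmer, weight):
    """Hypothetical weighted hash: 1 - g(kmer) ** (1 / weight).

    For a fixed k-mer, a smaller weight never gives a smaller hash
    value, so low-weight (high-frequency) k-mers are less likely to
    be the window minimum.
    """
    if not 0.0 < weight < 1.0:
        raise ValueError("weight must lie in the open interval (0, 1)")
    return 1.0 - plain_hash(kmer) ** (1.0 / weight)


low = weighted_hash("ACGTACGT", 0.9)   # high weight, rare k-mer
high = weighted_hash("ACGTACGT", 0.1)  # low weight, frequent k-mer
```

Because the minimizer is the k-mer with the smallest hash value in its window, raising the hash value of frequent k-mers lowers their chance of selection without removing them from consideration.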
In one embodiment, step 112 includes: taking the first window as the current window; taking the first k-mer in the current window as the current k-mer; setting the minimum hash value of the current window to 1; comparing the hash value of the current k-mer with the minimum hash value to obtain a comparison result; when the hash value of the current k-mer is smaller than the minimum hash value, updating the minimum hash value to the hash value of the current k-mer and recording that k-mer as the candidate minimizer of the current window; when the hash value of the current k-mer is larger than the minimum hash value, leaving the minimum hash value unchanged; then, if the current k-mer is not the last k-mer in the current window, processing the next k-mer; and if the current k-mer is the last k-mer in the current window, taking the candidate minimizer of the current window as its minimizer, updating the current window to the next window, and continuing with the next iteration until all windows have been traversed, thereby completing minimizer generation for all windows.
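The selection loop of step 112 can be sketched in Python as follows. Each window is represented here as a list of (k-mer, hash value) pairs, which is a data layout assumed for this example, as are the function names. The text describes only the smaller and larger cases; keeping the earlier k-mer when two hash values are equal is an assumption of the sketch.

```python
def window_minimizer(hashed_kmers):
    """Return the k-mer with the smallest hash value in one window."""
    min_hash = 1.0      # initial minimum, as in the described procedure
    candidate = None    # candidate minimizer of the current window
    for kmer, h in hashed_kmers:
        if h < min_hash:  # strictly smaller: update the candidate
            min_hash = h
            candidate = kmer
    return candidate


def all_minimizers(windows):
    """Process the windows in order and collect one minimizer each."""
    return [window_minimizer(w) for w in windows]


example = [
    [("ACG", 0.7), ("CGT", 0.2), ("GTA", 0.2)],
    [("TAC", 0.9)],
]
result = all_minimizers(example)
```

In the first window CGT and GTA share the smallest hash value and the earlier one, CGT, is kept; the second window has a single k-mer, which becomes its minimizer.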
In one embodiment, step 104 includes: initializing an empty frequency record table, each row of which records a k-mer and its occurrence count; reading a k-mer, denoted k_i, and checking whether the frequency record table already contains a k-mer with the same sequence as k_i; if not, adding a row to the frequency record table, recording k_i in that row and setting the occurrence count of k_i to 1; if so, adding 1 to the occurrence count of the corresponding k-mer in the frequency record table; and reading the next k-mer and repeating the process until all k-mers have been traversed, thereby obtaining the occurrence count of each k-mer. The frequency record table is shown in Table 1.
Table 1 Frequency record table
k-mer    Count n_i    Frequency f(k_i)
k_1      n_1          f_1
k_2      n_2          f_2
k_3      n_3          f_3
……       ……           ……
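The counting procedure that fills the Count column of the frequency record table can be sketched in Python with a dictionary standing in for the table. The function name count_kmers is an assumption for illustration.

```python
def count_kmers(kmers):
    """Build the frequency record table: k-mer -> occurrence count."""
    table = {}                # the table starts empty
    for kmer in kmers:
        if kmer not in table:
            table[kmer] = 1   # first occurrence: add a new row
        else:
            table[kmer] += 1  # seen before: add 1 to its count
    return table


table = count_kmers(["ACG", "CGT", "ACG"])
```

Python's collections.Counter performs the same bookkeeping in one call; the explicit loop is used here only to mirror the steps described in the text.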
It should be understood that, although the steps in the flowchart of FIG. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 2, an apparatus for generating reverse weighted sequence alignment seeds is provided, comprising a reference sequence reading module, a k-mer occurrence frequency statistics module, a k-mer weight calculation module, a k-mer hash value calculation module and a minimizer generation module, wherein:
the reference sequence reading module is configured to read a reference sequence and divide it into N windows, wherein each of the first N-1 windows contains w bases, the N-th window contains no more than w bases, and N and w are both integers greater than 1;
the k-mer occurrence frequency statistics module is configured to generate consecutive overlapping k-mers for each window, count the occurrences of each k-mer across all windows to obtain the occurrence count of each k-mer, and calculate the occurrence frequency of each k-mer from its occurrence count;
the k-mer weight calculation module is configured to calculate the weight of each k-mer from its occurrence frequency with a preset weight function, the weight of a k-mer decreasing as its occurrence frequency increases;
the k-mer hash value calculation module is configured to calculate the hash value of each k-mer with a preset hash function according to the base sequence information and the occurrence frequency of the k-mer; and
the minimizer generation module is configured to compare the hash values of all k-mers within each window to generate the minimizer of each window.
In one embodiment, the preset hash function in the k-mer hash value calculation module is:
Figure BDA0003942254170000091
where h(k_i) is the hash value of the i-th k-mer, with values in the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight of the i-th k-mer, with values in the open interval (0, 1); and g(k_i) is an ordinary hash function that does not take weight information into account, with values in the open interval (0, 1).
In one embodiment, the minimizer generation module is further configured to: take the first window as the current window; take the first k-mer in the current window as the current k-mer; set the minimum hash value of the current window to 1; compare the hash value of the current k-mer with the minimum hash value to obtain a comparison result; when the hash value of the current k-mer is smaller than the minimum hash value, update the minimum hash value to the hash value of the current k-mer and record that k-mer as the candidate minimizer of the current window; when the hash value of the current k-mer is larger than the minimum hash value, leave the minimum hash value unchanged; then, if the current k-mer is not the last k-mer in the current window, process the next k-mer; and if the current k-mer is the last k-mer in the current window, take the candidate minimizer of the current window as its minimizer, update the current window to the next window, and continue with the next iteration until all windows have been traversed, thereby completing minimizer generation for all windows.
In one embodiment, the k-mer occurrence frequency statistics module is further configured to: initialize an empty frequency record table, each row of which records a k-mer and its occurrence count; read a k-mer, denoted k_i, and check whether the frequency record table already contains a k-mer with the same sequence as k_i; if not, add a row to the frequency record table, record k_i in that row and set the occurrence count of k_i to 1; if so, add 1 to the occurrence count of the corresponding k-mer in the frequency record table; and read the next k-mer and repeat the process until all k-mers have been traversed, thereby obtaining the occurrence count of each k-mer.
For the specific limitations of the apparatus for generating reverse weighted sequence alignment seeds, reference may be made to the limitations of the method for generating reverse weighted sequence alignment seeds above, which are not repeated here. The modules of the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 3. The computer device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for inverse weighted sequence alignment seed generation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the configuration shown in fig. 3 is a block diagram of only a portion of the configuration associated with the present application, and is not intended to limit the computing device to which the present application may be applied, and that a particular computing device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of the present patent application shall be subject to the appended claims.

Claims (10)

1. A method for generating a reverse weighted sequence alignment seed, the method comprising:
reading a reference sequence, and dividing the reference sequence into N windows, wherein each of the first N-1 windows comprises w bases, the number of bases in the Nth window is not more than w, and both N and w are integers greater than 1;
generating consecutive overlapping k-mers for each window;
counting the occurrences of each k-mer in all windows to obtain the number of occurrences of each k-mer;
calculating the occurrence frequency of each k-mer according to the number of occurrences of each k-mer;
calculating the weight value of each k-mer by adopting a preset weight function according to the occurrence frequency of each k-mer, wherein the weight value of a k-mer decreases as the occurrence frequency of the k-mer increases;
calculating the hash value of each k-mer by adopting a preset hash function according to the base sequence information and the occurrence frequency of each k-mer;
and comparing the hash values of all the k-mers in each window, respectively, to generate the minimum sub-seed of each window.
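The windowing and k-mer generation steps of claim 1 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function names `split_into_windows` and `kmers_in_window` are hypothetical, and treating "continuously overlapped" as a sliding step of one base is an assumption.

```python
def split_into_windows(reference: str, w: int) -> list[str]:
    """Split the reference into consecutive windows of w bases.

    Every window except possibly the last holds exactly w bases;
    the last window holds at most w bases.
    """
    if w <= 1:
        raise ValueError("w must be an integer greater than 1")
    return [reference[i:i + w] for i in range(0, len(reference), w)]


def kmers_in_window(window: str, k: int) -> list[str]:
    """Return the overlapping k-mers of one window (assumed step: one base).

    A window shorter than k yields an empty list.
    """
    return [window[i:i + k] for i in range(len(window) - k + 1)]


if __name__ == "__main__":
    for win in split_into_windows("ACGTACGTAC", 4):
        print(win, kmers_in_window(win, 3))
```

Under these assumptions, a final window shorter than k simply contributes no k-mers; how the method itself treats such a window is not stated in the claim.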
2. The method according to claim 1, wherein, in the step of calculating the hash value of each k-mer by adopting a preset hash function according to the base sequence information and the occurrence frequency of each k-mer, the preset hash function is:
Figure FDA0003942254160000011
wherein h(k_i) is the hash value of the i-th k-mer, and the value range of the hash value is the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight value of the i-th k-mer, and the value range of the weight is the open interval (0, 1); g(k_i) is an ordinary hash function that does not consider weight information, and the value range of its hash value is the open interval (0, 1).
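The roles of g, φ and h can be illustrated with a short sketch. The claimed formula is given only as a figure, so everything concrete below is an assumption and not the patent's formula: `base_hash` uses SHA-256 as a stand-in for the ordinary hash g, `inverse_weight` uses a hypothetical weight function 1 / (1 + scale · f), and `weighted_hash` combines them as g raised to the power φ purely for illustration. Reading f(k_i) as the occurrence frequency is also an assumption.

```python
import hashlib


def base_hash(kmer: str) -> float:
    """Weight-free hash g: maps a k-mer into the open interval (0, 1)."""
    digest = hashlib.sha256(kmer.encode("ascii")).digest()
    value = int.from_bytes(digest[:8], "big") >> 12   # keep 52 bits
    return (value + 0.5) / 2**52                      # strictly inside (0, 1)


def inverse_weight(frequency: float, scale: float = 100.0) -> float:
    """Hypothetical weight phi: decreases as the frequency increases."""
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / (1.0 + scale * frequency)            # in (0, 1) for f > 0


def weighted_hash(kmer: str, frequency: float, scale: float = 100.0) -> float:
    """Illustrative combination h = g ** phi (an assumption, not the claim).

    For a fixed g in (0, 1), a smaller weight gives a larger h, so a
    frequent k-mer is less likely to have the smallest hash in its window.
    """
    return base_hash(kmer) ** inverse_weight(frequency, scale)


if __name__ == "__main__":
    print(weighted_hash("ACGT", 0.001), weighted_hash("ACGT", 0.5))
```

With this stand-in, the same k-mer receives a larger hash value when it is treated as frequent than when it is treated as rare, which matches the stated intent that the weight falls as the occurrence frequency rises.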
3. The method of claim 1, wherein comparing the hash values of all the k-mers in each window, respectively, to generate the minimum sub-seed of each window comprises:
taking the first window as the current window;
taking the first k-mer in the current window as the current k-mer;
setting the minimum hash value of the k-mers in the current window to 1;
comparing the hash value of the current k-mer in the current window with the minimum hash value to obtain a comparison result;
when the comparison result is that the hash value of the current k-mer is smaller than the minimum hash value, updating the minimum hash value to the hash value of the current k-mer, and recording the corresponding k-mer as the candidate minimum sub-seed of the current window;
when the comparison result is that the hash value of the current k-mer is greater than the minimum hash value: if the current k-mer is not the last k-mer in the current window, continuing to process the next k-mer; and if the current k-mer is the last k-mer in the current window, taking the candidate minimum sub-seed of the current window as the minimum sub-seed of the current window, updating the current window to the next window, and continuing with the next iteration until all windows have been traversed, thereby completing the generation of the minimum sub-seeds of all windows.
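The per-window scan of claim 3 can be sketched as a running-minimum loop. This is a hedged illustration: the function name `min_seed_per_window` and the input layout (one list of `(k-mer, hash)` pairs per window) are hypothetical, and advancing to the next k-mer after either comparison outcome is an assumed reading of the claim.

```python
from typing import Optional


def min_seed_per_window(
    windows: list[list[tuple[str, float]]],
) -> list[Optional[str]]:
    """Pick, for each window, the k-mer with the smallest hash value.

    Hash values are assumed to lie in the open interval (0, 1), so the
    running minimum can start at 1, as in the claim.
    """
    seeds: list[Optional[str]] = []
    for window in windows:
        min_hash = 1.0          # initial minimum hash value of the window
        candidate = None        # candidate minimum sub-seed
        for kmer, h in window:
            if h < min_hash:    # strictly smaller: update the candidate
                min_hash = h
                candidate = kmer
        seeds.append(candidate)  # None if the window had no k-mers
    return seeds


if __name__ == "__main__":
    demo = [[("ACG", 0.7), ("CGT", 0.2), ("GTA", 0.5)], [("TAC", 0.9)]]
    print(min_seed_per_window(demo))
```

Because the comparison is strict, ties keep the earliest k-mer in the window; the claim does not say how equal hash values are handled, so this tie-breaking is a design choice of the sketch.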
4. The method of claim 1, wherein counting the occurrences of each k-mer in all windows to obtain the number of occurrences of each k-mer comprises:
setting a frequency record table to be empty, wherein the information in the frequency record table comprises: the k-mer and the number of occurrences of the k-mer;
reading a k-mer, denoted k_i, and checking whether the frequency record table already contains a k-mer with the same sequence as k_i;
if no k-mer with the same sequence as k_i exists in the frequency record table, adding a new row to the frequency record table, recording k_i in the row, and recording the number of occurrences of k_i as 1;
if a k-mer with the same sequence as k_i exists in the frequency record table, adding 1 to the number of occurrences of the corresponding k-mer in the frequency record table;
and reading the next k-mer and continuing the processing until all the k-mers have been traversed, so as to obtain the number of occurrences of each k-mer.
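The frequency record table of claim 4 maps naturally onto a dictionary. The sketch below is illustrative only: the function names are hypothetical, and computing the occurrence frequency as the count divided by the total number of k-mers is an assumption, since the claims state only that the frequency is derived from the number of occurrences.

```python
def count_kmers(kmers: list[str]) -> dict[str, int]:
    """Build the frequency record table: k-mer -> number of occurrences."""
    table: dict[str, int] = {}         # the table starts empty
    for kmer in kmers:
        if kmer not in table:
            table[kmer] = 1            # first sighting: add a row with count 1
        else:
            table[kmer] += 1           # seen before: increment its count
    return table


def occurrence_frequencies(table: dict[str, int]) -> dict[str, float]:
    """Assumed definition: count of a k-mer divided by the total count."""
    total = sum(table.values())
    return {kmer: count / total for kmer, count in table.items()}


if __name__ == "__main__":
    table = count_kmers(["ACG", "CGT", "ACG", "GTA"])
    print(table)
    print(occurrence_frequencies(table))
```

In practice `collections.Counter` from the Python standard library performs the same counting in one call; the explicit loop is kept here so that each branch lines up with a step of the claim.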
5. An apparatus for generating a reverse weighted sequence alignment seed, the apparatus comprising:
a reference sequence reading module, configured to read a reference sequence and divide the reference sequence into N windows, wherein each of the first N-1 windows comprises w bases, the number of bases in the Nth window is not more than w, and both N and w are integers greater than 1;
a k-mer occurrence frequency statistics module, configured to generate consecutive overlapping k-mers for each window; count the occurrences of each k-mer in all windows to obtain the number of occurrences of each k-mer; and calculate the occurrence frequency of each k-mer according to the number of occurrences of each k-mer;
a k-mer weight calculation module, configured to calculate the weight value of each k-mer by adopting a preset weight function according to the occurrence frequency of each k-mer, wherein the weight value of a k-mer decreases as the occurrence frequency of the k-mer increases;
a k-mer hash value calculation module, configured to calculate the hash value of each k-mer by adopting a preset hash function according to the base sequence information and the occurrence frequency of each k-mer;
and a minimum sub-seed generation module, configured to compare the hash values of all the k-mers in each window, respectively, to generate the minimum sub-seed of each window.
6. The apparatus of claim 5, wherein the preset hash function in the k-mer hash value calculation module is:
Figure FDA0003942254160000031
wherein h(k_i) is the hash value of the i-th k-mer, and the value range of the hash value is the open interval (0, 1); k_i is the base sequence information of the i-th k-mer; φ(f(k_i)) is the weight value of the i-th k-mer, and the value range of the weight is the open interval (0, 1); g(k_i) is an ordinary hash function that does not consider weight information, and the value range of its hash value is the open interval (0, 1).
7. The apparatus of claim 5, wherein the minimum sub-seed generation module is further configured to: take the first window as the current window; take the first k-mer in the current window as the current k-mer; set the minimum hash value of the k-mers in the current window to 1; compare the hash value of the current k-mer in the current window with the minimum hash value to obtain a comparison result; when the comparison result is that the hash value of the current k-mer is smaller than the minimum hash value, update the minimum hash value to the hash value of the current k-mer, and record the corresponding k-mer as the candidate minimum sub-seed of the current window; when the comparison result is that the hash value of the current k-mer is greater than the minimum hash value: if the current k-mer is not the last k-mer in the current window, process the next k-mer; and if the current k-mer is the last k-mer in the current window, take the candidate minimum sub-seed of the current window as the minimum sub-seed of the current window, update the current window to the next window, and continue with the next iteration until all windows have been traversed, thereby completing the generation of the minimum sub-seeds of all windows.
8. The apparatus of claim 5, wherein the k-mer occurrence frequency statistics module is further configured to: set a frequency record table to be empty, wherein the information in the frequency record table comprises: the k-mer and the number of occurrences of the k-mer; read a k-mer, denoted k_i, and check whether the frequency record table already contains a k-mer with the same sequence as k_i; if no k-mer with the same sequence as k_i exists in the frequency record table, add a new row to the frequency record table, record k_i in the row, and record the number of occurrences of k_i as 1; if a k-mer with the same sequence as k_i exists in the frequency record table, add 1 to the number of occurrences of the corresponding k-mer in the frequency record table; and read the next k-mer and continue the processing until all the k-mers have been traversed, so as to obtain the number of occurrences of each k-mer.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any of claims 1 to 4 when executing the computer program.
10. A computer-readable memory, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 4.
CN202211422257.7A 2022-11-14 2022-11-14 Reverse weighted sequence alignment seed generation method, device, equipment and memory Pending CN115719618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211422257.7A CN115719618A (en) 2022-11-14 2022-11-14 Reverse weighted sequence alignment seed generation method, device, equipment and memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211422257.7A CN115719618A (en) 2022-11-14 2022-11-14 Reverse weighted sequence alignment seed generation method, device, equipment and memory

Publications (1)

Publication Number Publication Date
CN115719618A true CN115719618A (en) 2023-02-28

Family

ID=85255079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211422257.7A Pending CN115719618A (en) 2022-11-14 2022-11-14 Reverse weighted sequence alignment seed generation method, device, equipment and memory

Country Status (1)

Country Link
CN (1) CN115719618A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168765A (en) * 2023-04-25 2023-05-26 山东大学 Gene sequence generation method and system based on improved stroboemer
CN116168765B (en) * 2023-04-25 2023-08-18 山东大学 Gene sequence generation method and system based on improved stroboemer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination