WO2018000174A1 - Storage-oriented parallel fast matching method for DNA sequences and its *** - Google Patents

Storage-oriented parallel fast matching method for DNA sequences and its ***

Info

Publication number
WO2018000174A1
Authority
WO
WIPO (PCT)
Prior art keywords
matching
kmer
dna sequence
hash
threads
Prior art date
Application number
PCT/CN2016/087407
Other languages
English (en)
French (fr)
Inventor
朱泽轩
邓清津
储颖
孙怡雯
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Priority to PCT/CN2016/087407
Publication of WO2018000174A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B99/00 Subject matter not provided for in other groups of this subclass

Definitions

  • The present invention relates to the field of data compression, and in particular to a storage-oriented parallel fast matching method for DNA sequences and a system thereof.
  • Next-generation sequencing technology has driven the generation of high-throughput DNA sequencing data, whose volume is growing exponentially and faster than the capacity of computer microprocessors and storage devices. Compression of high-throughput DNA sequencing data is an effective way to address the storage and transmission of DNA sequences.
  • Before compressed storage, a common practice is to match the high-throughput sequencing data, stored as a FASTQ sequence file, against an existing genome, i.e. the reference genome, which is stored in the FASTA file format.
  • Storing the matching result between the target sequence and the reference genome in place of the original sequence achieves compressed storage: redundant information is eliminated and only the differing parts are retained. Base matching has therefore become an important problem in the compressed storage of DNA sequences.
  • The current mainstream FASTQ matching software includes the BWA (Burrows-Wheeler Aligner) tool and the Bowtie tool, both of which are based on the BWT (Burrows-Wheeler transform) matching method.
  • The BWT-based BWA tool involves a large amount of computation, requires considerable storage space, and consumes a lot of memory; in both memory usage and running speed it is at a disadvantage compared with sparse indexing algorithms.
  • The Bowtie tool further increases matching time and memory consumption because it takes quality scores into account; it considers at most three mismatches and does not allow gaps between the sequence and the reference genome, such as insertion and deletion errors.
  • The above methods are all matching tools designed for downstream sequence analysis. They pursue the completeness and accuracy of the matching results, but are not well suited to compressed storage of DNA, especially for high-noise DNA data.
  • Compressed storage of DNA requires fast matching, while the accuracy requirements on the matching results can be relaxed.
  • In addition, such programs do not use thread-level parallel processing, so they cannot take full advantage of the multiple execution cores of a multi-core processor and cannot execute more tasks within a given time.
  • As a result, problems such as slow running speed, high time consumption, and large memory consumption often occur, resulting in very low matching efficiency.
  • In view of this, an object of the present invention is to provide a storage-oriented parallel fast matching method for DNA sequences and a system thereof, aiming to solve the problem of low DNA sequence matching efficiency in the prior art.
  • The invention provides a storage-oriented parallel fast matching method for DNA sequences, applied to compressed storage of DNA sequences, wherein the method comprises:
  • Hash index construction step: construct a hash index based on a reference genome in FASTA format by finding all kmers with the specified prefix and building a hash index table keyed by their values, each entry storing the positions at which the corresponding kmer occurs;
  • File blocking step: input a DNA sequence file in FASTQ format, and perform block processing on the DNA sequence file;
  • Multi-thread processing step: start multiple threads to process a number of tasks determined by the number of threads; the sub-blocks simultaneously call a matching function based on fast positioning via the kmer hash index, so that the sub-blocks are matched in parallel against the target reference genome in FASTA format, and compressed storage is achieved by storing the matching result in place of the original DNA sequence.
  • The hash index construction step specifically includes:
  • Defining a kmer as k consecutive bases prefixed by a base combination such as "AT" or "CG";
  • Defining the binary hash values of the four bases "A", "C", "G", and "T" as 00, 01, 10, and 11, respectively; the small number of bases such as "N" and "W" that are present are converted into corresponding "A", "C", "G", or "T" bases by approximation rules;
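
The two-bit base codes above can be illustrated with a minimal sketch. The function names `encode_kmer` and `decode_kmer` are hypothetical and not taken from the patent; the sketch assumes the kmer contains only the four unambiguous bases.

```python
# Two-bit codes as stated in the text: A=00, C=01, G=10, T=11.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer: str) -> int:
    """Pack the bases of a kmer into one integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Inverse of encode_kmer for a kmer of known length k."""
    bases = "ACGT"  # index in this string equals the two-bit code
    return "".join(bases[(value >> (2 * (k - 1 - i))) & 0b11] for i in range(k))
```

With the default k of 8 mentioned later in the description, every kmer fits in 16 bits, so the packed value can serve directly as a table key.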
  • The file blocking step further comprises:
  • Calculating the number of sequences contained in each sub-block.
  • The multi-thread processing step specifically includes:
  • Taking the matching position returned by the hash table as the starting point, performing bidirectional extension matching between the DNA sequence and the reference genome, with a fault tolerance rate set to allow some bases to be mismatched, inserted, or deleted;
  • Converting the unmatched portion into a palindrome structure and repeating the above process for rematching;
  • Outputting the final matching result, which includes the matching position, matching type, matching length, mismatch position, and mismatch content; the multiple matching results obtained by performing the matching operation on each of the sub-blocks are archived and merged.
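
The text does not define the "palindrome structure" conversion. One plausible reading, assumed here and not confirmed by the patent, is the reverse complement of the unmatched portion, which lets a read from the opposite strand be matched against the same reference. A minimal sketch under that assumption, with the hypothetical name `reverse_complement`:

```python
# Assumption: "palindrome structure" is read as the reverse complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base (A<->T, C<->G) and reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

A sequence such as "ACGT" is its own reverse complement, which is the usual meaning of a palindromic DNA sequence.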
  • The present invention also provides a storage-oriented parallel fast matching system for DNA sequences, the system comprising:
  • A hash index construction module, configured to construct a hash index based on a reference genome in FASTA format by finding all kmers with the specified prefix and building a hash index table keyed by their values, each entry storing the positions at which the corresponding kmer occurs;
  • A file blocking module, configured to input a DNA sequence file in FASTQ format and perform block processing on the DNA sequence file;
  • A multi-thread processing module, configured to start multiple threads to process a number of tasks determined by the number of threads, the sub-blocks simultaneously calling a matching function based on fast positioning via the kmer hash index so that the sub-blocks are matched in parallel against the target reference genome in FASTA format;
  • A specific matching module, configured to find all kmers with the specified prefix in the DNA sequence to be matched, convert them into hash values, query the hash table to obtain the positions of the corresponding kmers in the reference genome, and use these positions as starting points for bidirectional extension matching.
  • The file blocking module is further configured to:
  • Calculate the number of sequences contained in each sub-block.
  • The multi-thread processing module is specifically configured to:
  • Call the corresponding matching function according to the thread function processing command, with all threads executing synchronously until every thread has finished.
  • The hash index construction module is specifically configured to:
  • Define a kmer as k consecutive bases prefixed by a base combination such as "AT" or "CG";
  • Define the binary hash values of the four bases "A", "C", "G", and "T" as 00, 01, 10, and 11, respectively; the small number of bases such as "N" and "W" that are present are converted into corresponding "A", "C", "G", or "T" bases by approximation rules;
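
The approximation rules for bases such as "N" and "W" are not specified in the text. The sketch below only shows where such a rule would plug in; the mapping in `APPROX_RULES` (both "N" and "W" replaced by "A") and the name `normalize_bases` are hypothetical placeholders, not the patent's rules.

```python
# Hypothetical approximation rules: the patent does not list them.
# In IUPAC notation N means any base and W means A or T, so mapping
# both to "A" is one permissible (lossy) choice.
APPROX_RULES = {"N": "A", "W": "A"}


def normalize_bases(seq: str) -> str:
    """Upper-case the sequence and replace ambiguous bases by rule."""
    return "".join(APPROX_RULES.get(base, base) for base in seq.upper())
```

After normalization every base has a two-bit code, so the kmer hashing never meets an unknown character.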
  • The system further comprises a specific matching module configured to:
  • Take the matching position returned by the hash table as the starting point and perform bidirectional extension matching between the DNA sequence and the reference genome, with a fault tolerance rate set to allow some bases to be mismatched, inserted, or deleted;
  • Convert the unmatched part into a palindrome structure and repeat the above process for rematching; the final output matching result includes the matching position, matching type, matching length, mismatch position, and mismatch content;
  • Archive and merge the multiple matching results obtained by performing the matching operation on each of the sub-blocks.
  • The technical solution provided by the invention applies to reference-genome-based methods of DNA sequence compression: the base sequences in a FASTQ file are subjected to lightweight parallel matching against a reference genome, and the resulting matching output is then further processed by a compression tool.
  • The parallel processing is mainly based on thread-level parallel programming with Pthreads (POSIX Threads), a set of application programming interfaces for creating and managing threads. Running multiple tasks in parallel on multiple threads parallelizes the lightweight matching, effectively improves the efficiency of processing DNA data in FASTQ format, and makes the program run much faster.
  • FIG. 1 is a flowchart of a storage-oriented parallel fast matching method for DNA sequences according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of the parallel execution of matching according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of the process of merging multiple sub-block matching results according to an embodiment of the present invention.
  • FIG. 4 is a general flowchart of a parallel matching model according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of the specific matching according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram showing the internal structure of a storage-oriented parallel fast matching system 10 for DNA sequences according to an embodiment of the present invention.
  • A specific embodiment of the present invention provides a storage-oriented parallel fast matching method for DNA sequences, applied to compressed storage of DNA sequences, wherein the method mainly comprises the following steps:
  • Hash index construction step: construct a hash index based on a reference genome in FASTA format by finding all kmers with the specified prefix and building a hash index table keyed by their values, each entry storing the positions at which the corresponding kmer occurs;
  • Multi-thread processing step: start multiple threads to process a number of tasks determined by the number of threads; the sub-blocks simultaneously call a matching function based on fast positioning via the kmer hash index, so that the sub-blocks are matched in parallel against the target reference genome in FASTA format, and compressed storage is achieved by storing the matching result in place of the original DNA sequence.
  • The storage-oriented parallel fast matching method for DNA sequences uses multiple parallel threads to run multiple tasks, parallelizes the lightweight DNA sequence matching used for compressed storage, and effectively improves the efficiency of processing DNA data in FASTQ format. The matching speed increases accordingly, so the whole program runs faster and takes less time, and the method becomes more usable.
  • FIG. 1 is a flowchart of a storage-oriented parallel fast matching method for DNA sequences according to an embodiment of the present invention.
  • In step S11, the hash index construction step: construct a hash index based on a reference genome in FASTA format by finding all kmers with the specified prefix and building a hash index table keyed by their values, each entry storing the positions at which the corresponding kmer occurs.
  • The FASTQ format is a text format for storing biological sequences and is one of the commonly used storage formats for DNA sequences.
  • DNA sequencing produces thousands of DNA sequences, which are stored in FASTQ files together with all the information generated by sequencing.
  • Each sequence record consists of four lines separated by newline characters. The first line begins with the character '@' followed by metadata that uniquely identifies the DNA sequence.
  • The second line is the base data, a sequence over the five characters {'A', 'T', 'C', 'G', 'N'}, where the character 'N' denotes an undetermined base that may be any of the characters {'A', 'T', 'C', 'G'}.
  • The third line begins with the character '+' followed by the same DNA sequence identifier as the first line.
  • The last line is the quality score line, which corresponds one-to-one with the bases and indicates the sequencing confidence of the base at each position.
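
The four-line record layout described above can be sketched as a small reader. The names `FastqRecord` and `parse_fastq` are hypothetical, and the sketch assumes each record occupies exactly four lines with no wrapped sequences.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after '@' on line 1
    bases: str       # line 2, characters from {A, T, C, G, N}
    plus: str        # line 3, begins with '+'
    quality: str     # line 4, one score character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines, with basic validation."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, bases, plus, quality = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(quality):
            raise ValueError(f"bases and quality differ in length at line {i + 1}")
        yield FastqRecord(header[1:], bases, plus, quality)
```

Reading in fixed groups of four avoids misinterpreting a quality line that happens to begin with '@'.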
  • In step S12, the file blocking step: input a DNA sequence file in FASTQ format and perform block processing on it.
  • The FASTQ DNA sequence file needs to be preprocessed as follows. Since each DNA sequence record consists of four lines of key information, the file is divided into blocks at multiples of four lines. The number of sub-blocks is set to 10 by default and can be adjusted as needed, for example to 8, 9, 11, or 12. Before the blocking, the original FASTQ file is traversed with a file pointer and the number of sequences contained in each sub-block is calculated.
  • That is, the file blocking step S12 further includes: presetting the required number of blocks before performing the blocking; traversing the original DNA sequence file with a file pointer; and calculating the number of sequences contained in each sub-block.
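
A minimal sketch of the blocking step, working on an in-memory list of lines rather than a file pointer. The name `split_into_blocks` is hypothetical, and distributing the remainder over the first blocks is an assumption; the text only requires that block boundaries fall on multiples of four lines.

```python
from typing import List


def split_into_blocks(lines: List[str], num_blocks: int = 10) -> List[List[str]]:
    """Split FASTQ lines into num_blocks sub-blocks of whole 4-line records."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    num_records = len(lines) // 4
    # Records per block: the first `extra` blocks take one more record.
    base, extra = divmod(num_records, num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        count = base + (1 if i < extra else 0)
        end = start + 4 * count
        blocks.append(lines[start:end])
        start = end
    return blocks
```

Because every boundary is a multiple of four, no record is ever split between two sub-blocks, and concatenating the blocks in order reproduces the input.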
  • In step S13, the multi-thread processing step: start multiple threads to process a number of tasks determined by the number of threads; the sub-blocks simultaneously call a matching function based on fast positioning via the kmer hash index so that they are matched in parallel against the target reference genome in FASTA format, and compressed storage is achieved by storing the matching results in place of the original DNA sequence.
  • The idea of the parallel algorithm is to use thread-level parallel programming: multiple threads are started to process a number of tasks determined by the number of threads, and the tasks are then handed to the background for processing, with the thread function calling the matching function. Multiple sub-blocks execute the thread function simultaneously, so the matching operations run in parallel.
  • The multi-thread processing step S13 uses thread-level parallel programming with Pthreads (i.e., POSIX Threads) to parallelize the matching operation and effectively improve the efficiency of processing DNA data in FASTQ format. It specifically includes: starting a number of threads corresponding to the number of sub-blocks, and issuing the processing command for the thread function that calls the matching function based on fast positioning via the kmer hash index; calling the corresponding matching function according to the thread function processing command, with all threads executing synchronously until every thread has finished; in the matching function, finding all kmers with the "AT" or "CG" prefix in the DNA sequence to be matched and converting them into binary hash values; taking the matching position returned by the hash table as the starting point and performing bidirectional extension matching between the DNA sequence and the reference genome, with a fault tolerance rate set to allow mismatches, insertions, and deletions of some bases; declaring the match successful when the matching length is greater than the set minimum effective matching length and the mismatch rate is less than the set fault tolerance rate; if the sequence matches only partially, converting the unmatched part into a palindrome structure and repeating the above process for rematching; and outputting the final matching result, which includes the matching position, matching type, matching length, mismatch position, and mismatch content.
  • FIG. 2 is a flowchart of the parallel execution of matching in an embodiment of the present invention.
  • A number of threads corresponding to the number of sub-blocks is started (i.e., n threads for n sub-blocks), each sub-block executes the thread function that calls the matching function based on fast positioning via the kmer hash index, and all threads then execute synchronously until every thread has finished.
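
The one-thread-per-sub-block pattern can be sketched as follows. The patent uses Pthreads; this sketch swaps in Python's standard `concurrent.futures.ThreadPoolExecutor` purely for illustration, and in CPython such threads do not give CPU-bound speedups the way native POSIX threads do. `match_block` is a hypothetical stand-in for the real matching function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def match_block(block: List[str]) -> List[int]:
    """Placeholder worker: stands in for the kmer-index matching function."""
    return [len(read) for read in block]


def run_parallel(blocks: List[List[str]],
                 worker: Callable[[List[str]], list] = match_block) -> List[list]:
    """Start one thread per sub-block and wait until all have finished."""
    if not blocks:
        return []
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        # map() returns results in block order, whichever thread ends first.
        return list(pool.map(worker, blocks))
```

Leaving the `with` block plays the role of joining every thread, which corresponds to "all threads execute synchronously until every thread has finished".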
  • The hash index construction step S11 further includes: defining a prefix, i.e. a base combination such as "AT" or "CG", with the k bases that follow it called a kmer; finding all occurrences of the prefix in the reference genome and converting the kmer base sequence that follows each of them into a hash value; defining the binary hash values of the four bases "A", "C", "G", and "T" as 00, 01, 10, and 11, respectively, while the small number of bases such as "N" and "W" can be converted into corresponding "A", "C", "G", or "T" bases by approximation rules; storing the converted hash values in the hash table to construct the hash index; and adjusting the error matching rate parameter, the minimum effective matching length parameter, and the matching thread number parameter as needed, the adjusted parameters being applied in the subsequent matching operation step.
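
The index construction can be sketched as below. The name `build_index` is hypothetical. Two choices are assumptions rather than statements from the patent: the stored position is the index of the first base of the prefix, and kmers containing characters other than A, C, G, T are skipped instead of being approximated.

```python
from collections import defaultdict
from typing import Dict, List

BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer: str) -> int:
    """Pack a kmer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def build_index(reference: str, prefix: str = "CG", k: int = 8) -> Dict[int, List[int]]:
    """Map the hash of the k bases after each prefix to the prefix positions."""
    index: Dict[int, List[int]] = defaultdict(list)
    start = reference.find(prefix)
    while start != -1:
        kmer = reference[start + len(prefix): start + len(prefix) + k]
        if len(kmer) == k and all(base in BASE_CODE for base in kmer):
            index[encode_kmer(kmer)].append(start)
        start = reference.find(prefix, start + 1)
    return dict(index)
```

Indexing only the kmers that follow a fixed prefix keeps the table sparse compared with indexing every position of the reference.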
  • The specific matching process includes: finding all prefixes in the DNA sequence to be matched and converting the k bases after each prefix into a hash value, and, if none is found, directly outputting the mismatch information; otherwise performing bidirectional extension matching from the best matching position returned by the previously constructed hash table, with a fault tolerance rate set to allow some bases to be mismatched, inserted, or deleted; declaring the match successful when the matching length is greater than the set minimum effective matching length and the mismatch rate is less than the set fault tolerance rate; if the first match does not satisfy these conditions, outputting the matched part of the sequence, converting the unmatched part into a palindrome structure, and matching it again; and finally outputting the matching result, which includes information such as the matching position, matching type, and matching length.
  • FIG. 5 is the flowchart of the specific matching in an embodiment of the present invention.
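
The bidirectional extension and the success test can be sketched in a simplified form. This sketch handles substitutions only (no insertions or deletions), which is a deliberate simplification of the process described above. The name `extend_from_seed`, the result fields, and the way the mismatch budget is derived from the read length are all assumptions.

```python
def extend_from_seed(read, ref, read_pos, ref_pos, seed_len,
                     max_error=0.05, min_length=30):
    """Ungapped extension to both sides of an exactly matching seed.

    read[read_pos:read_pos + seed_len] is assumed to equal
    ref[ref_pos:ref_pos + seed_len] (seed_len > 0), as found via the index.
    """
    budget = int(max_error * len(read))  # mismatches tolerated in total
    mismatches = []  # (position in read, base in read)

    # Extend to the right of the seed.
    r, g = read_pos + seed_len, ref_pos + seed_len
    while r < len(read) and g < len(ref):
        if read[r] != ref[g]:
            if len(mismatches) >= budget:
                break
            mismatches.append((r, read[r]))
        r, g = r + 1, g + 1
    right_end = r

    # Extend to the left of the seed.
    left, ref_left = read_pos - 1, ref_pos - 1
    while left >= 0 and ref_left >= 0:
        if read[left] != ref[ref_left]:
            if len(mismatches) >= budget:
                break
            mismatches.append((left, read[left]))
        left, ref_left = left - 1, ref_left - 1
    left_start = left + 1

    length = right_end - left_start
    rate = len(mismatches) / length
    return {
        "ref_start": ref_pos - (read_pos - left_start),
        "read_start": left_start,
        "length": length,
        "mismatches": sorted(mismatches),
        # Success rule from the text: long enough and few enough mismatches.
        "success": length > min_length and rate < max_error,
    }
```

The parts of the read outside `read_start` to `read_start + length` are what the text calls the unmatched portion, which would then be converted and matched again.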
  • After the file is divided into blocks, the parallel matching process generates intermediate files, one matching result per block.
  • These matching results are to be used in the subsequent compressed storage work. However, the compressed storage process must be based on the overall matching result for the whole FASTQ DNA sequence file, so the multiple matching results need to be archived and merged for subsequent work; at the same time, key performance-related statistics are output.
  • FIG. 3 is a flowchart of the process of merging the matching results of multiple sub-blocks in an embodiment of the present invention.
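
The merge of per-block results can be sketched as a concatenation in block order plus a few summary figures. The name `merge_results`, the dictionary layout of a result, and the particular statistics are hypothetical; the text does not say which statistics are reported.

```python
from typing import Dict, List, Tuple


def merge_results(block_results: List[List[dict]]) -> Tuple[List[dict], Dict[str, float]]:
    """Concatenate per-block results in block order and summarize them."""
    merged = [result for block in block_results for result in block]
    matched = sum(1 for result in merged if result["matched"])
    stats = {
        "reads": len(merged),
        "matched": matched,
        "match_rate": matched / len(merged) if merged else 0.0,
    }
    return merged, stats
```

Keeping block order means the merged output lists reads in the same order as the original FASTQ file, which a downstream compressor can rely on.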
  • The storage-oriented parallel fast matching method for DNA sequences of the present invention further includes:
  • Parameter adjustment step: find all prefixes in the FASTA-format reference genome and create hash index values from the k bases after each prefix; adjust the error matching rate parameter, the minimum effective matching length parameter, and the matching thread number parameter as needed; and apply the adjusted parameters in the matching operation step.
  • Some key parameters required by the matching process have default values: the matching prefix P (default "CG"), the matching error tolerance e (default 0.05), which is the error matching rate, the minimum effective matching length L (default 30), and the number k of bases taken after the prefix to create the hash index value (default 8).
  • Step S12 preprocesses the FASTQ DNA sequence file into sub-blocks, which reduces the time cost of matching the whole file serially in one pass.
  • Step S13 parallelizes the lightweight matching process used for compressed storage of the DNA sequence file: multiple threads are started to process the sub-blocks, the sub-blocks are matched quickly, and the matching results are archived and merged.
  • The parameter settings are adjustable, so the parameters can be tuned to obtain the desired matching result.
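
The adjustable parameters and their stated defaults (P = "CG", e = 0.05, L = 30, k = 8, and 10 threads) can be gathered in one place, as in this sketch. The class name `MatchParams`, the field names, and the validation rules are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MatchParams:
    prefix: str = "CG"        # matching prefix P
    max_error: float = 0.05   # matching error tolerance e
    min_length: int = 30      # minimum effective matching length L
    k: int = 8                # bases taken after the prefix
    threads: int = 10         # number of matching threads

    def __post_init__(self):
        if not 0.0 <= self.max_error < 1.0:
            raise ValueError("max_error must be in [0, 1)")
        if self.k <= 0 or self.min_length <= 0 or self.threads <= 0:
            raise ValueError("k, min_length and threads must be positive")


# Adjusting a parameter produces a new, validated parameter set.
fast = replace(MatchParams(), threads=4, min_length=20)
```

Keeping the defaults in a frozen record makes each run's settings explicit and prevents a worker thread from changing them midway.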
  • the overall flowchart of the parallel matching model is shown in FIG.
  • The invention provides a storage-oriented parallel fast matching method for DNA sequences that performs lightweight parallel matching of the base sequences in a FASTQ file against a reference genome; the resulting matching output is then further processed by a compression tool.
  • The parallel processing is mainly based on thread-level parallel programming with Pthreads (POSIX Threads), a set of application programming interfaces for creating and managing threads. Running multiple tasks in parallel on multiple threads parallelizes the lightweight matching, effectively improves the efficiency of processing DNA data in FASTQ format, and makes the program run much faster.
  • A specific embodiment of the present invention further provides a storage-oriented parallel fast matching system 10 for DNA sequences, which mainly includes:
  • A hash index construction module 11, configured to construct a hash index based on a reference genome in FASTA format by finding all kmers with the specified prefix and building a hash index table keyed by their values, each entry storing the positions at which the corresponding kmer occurs;
  • A file blocking module 12, configured to input a DNA sequence file in FASTQ format and perform block processing on the DNA sequence file;
  • A multi-thread processing module 13, configured to start multiple threads to process a number of tasks determined by the number of threads, the sub-blocks simultaneously calling a matching function based on fast positioning via the kmer hash index so that the sub-blocks are matched in parallel against the target reference genome in FASTA format;
  • A specific matching module 14, configured to find all kmers with the specified prefix in the DNA sequence to be matched, convert them into hash values, query the hash table to obtain the positions of the corresponding kmers in the reference genome, and use these positions as starting points for bidirectional extension matching.
  • The storage-oriented parallel fast matching system 10 for DNA sequences uses multiple parallel threads to run multiple tasks, parallelizes the lightweight DNA sequence matching used for compressed storage, and effectively improves the efficiency of processing DNA data in FASTQ format. The program and the matching both run much faster, less time is consumed, and the method becomes more usable.
  • FIG. 6 is a schematic structural diagram of a storage-oriented parallel fast matching system 10 for DNA sequences according to an embodiment of the present invention.
  • The storage-oriented parallel fast matching system 10 for DNA sequences is applied to compressed storage of DNA sequences and mainly includes a hash index construction module 11, a file blocking module 12, a multi-thread processing module 13, and a specific matching module 14.
  • The hash index construction module 11 is configured to construct a hash index based on a reference genome in FASTA format by finding all kmers with the specified prefix and building a hash index table keyed by their values, each entry storing the positions at which the corresponding kmer occurs.
  • Specifically, the hash index construction module 11 is configured to: define a kmer as k consecutive bases prefixed by a base combination such as "AT" or "CG"; find all kmers with the "AT" or "CG" prefix in the FASTA-format reference genome and convert them into hash values; define the binary hash values of the four bases "A", "C", "G", and "T" as 00, 01, 10, and 11, respectively, while the small number of bases such as "N" and "W" can be converted into corresponding "A", "C", "G", or "T" bases by approximation rules; convert each base of a kmer into its binary hash value and combine these values into the key of the hash table, the hash table entry associated with a key storing all positions of the corresponding kmer in the reference genome, thereby constructing the hash index; and adjust the error matching rate parameter, the minimum effective matching length parameter, and the matching thread number parameter as needed, the adjusted parameters being applied in the subsequent matching operation step.
  • The specific matching module 14 is configured to find all kmers with the specified prefix in the DNA sequence to be matched, convert them into hash values, query the hash table to obtain the positions of the corresponding kmers in the reference genome, and use these positions as starting points for bidirectional extension matching.
  • Specifically, the specific matching module 14 is configured to: find all kmers with the "AT" or "CG" prefix in the DNA sequence to be matched and convert them into binary hash values; if the key does not exist in the hash table, report that there is no match; if the key exists, take the matching position returned by the hash table as the starting point and perform bidirectional extension matching between the DNA sequence and the reference genome, with a fault tolerance rate set to allow some bases to be mismatched, inserted, or deleted; declare the match successful when the matching length is greater than the set minimum effective matching length and the mismatch rate is less than the set fault tolerance rate; and, if the sequence matches only partially, convert the unmatched portion into a palindrome structure and repeat the above process for rematching.
  • The matching result includes the matching position, matching type, matching length, mismatch position, and mismatch content, and the multiple matching results obtained by performing the matching operation on each of the sub-blocks are archived and merged.
  • The file blocking module 12 is configured to input a DNA sequence file in FASTQ format and perform block processing on the DNA sequence file.
  • The FASTQ format is a text format for storing biological sequences and is one of the commonly used storage formats for DNA sequences.
  • DNA sequencing produces thousands of DNA sequences, which are stored in FASTQ files together with all the information generated by sequencing.
  • Each sequence record consists of four lines separated by newline characters. The first line begins with the character '@' followed by metadata that uniquely identifies the DNA sequence.
  • The second line is the base data, a sequence over the five characters {'A', 'T', 'C', 'G', 'N'}, where the character 'N' denotes an undetermined base that may be any of the characters {'A', 'T', 'C', 'G'}.
  • The third line begins with the character '+' followed by the same DNA sequence identifier as the first line.
  • The last line is the quality score line, which corresponds one-to-one with the bases and indicates the sequencing confidence of the base at each position.
  • The file blocking module 12 is further configured to: preset the required number of blocks before performing the blocking; traverse the original DNA sequence file with a file pointer; and calculate the number of sequences contained in each sub-block.
  • The multi-thread processing module 13 is configured to start multiple threads to process a number of tasks determined by the number of threads, the sub-blocks simultaneously calling a matching function based on fast positioning via the kmer hash index so that the sub-blocks are matched in parallel against the target reference genome in FASTA format.
  • The idea of the parallel algorithm is to use thread-level parallel programming: multiple threads are started to process a number of tasks determined by the number of threads, and the tasks are then handed to the background for processing, with the thread function calling the matching function. Multiple sub-blocks execute the thread function simultaneously, so the matching operations run in parallel.
  • The multi-thread processing module 13 uses thread-level parallel programming with Pthreads (i.e., POSIX Threads) to parallelize the matching operation and effectively improve the efficiency of processing DNA data in FASTQ format. It is specifically configured to: start a number of threads corresponding to the number of sub-blocks and issue the processing command for the thread function that calls the matching operation; and call the corresponding matching function according to the thread function processing command, with all threads executing synchronously until every thread has finished.
  • After the file is divided into blocks, the parallel matching process produces intermediate files, one matching result per block.
  • These matching results are to be used in the compressed storage work, but the compressed storage process must be based on the overall matching result for the whole FASTQ DNA sequence file, so a matching result merging module is required to archive and merge the multiple matching results for subsequent work and to output key performance-related statistics. This module is not described in detail here.
  • the storage-oriented parallel fast matching system 10 for DNA sequences further includes:
  • a parameter adjustment module for finding all prefixes (such as "CG" and "AT") in the FASTA-format reference genome, taking the k bases after each prefix to create a hash index value, adjusting the error matching rate parameter, the minimum valid match length parameter, and the number of matching threads as needed, and applying the adjusted parameters to the matching operation step.
  • some key parameters are required during matching, for example the matching prefix P (default "CG"), the matching error tolerance e (default 0.05), and the minimum valid match length L (default 30). The k bases after the prefix are taken to create the hash index value, with k defaulting to 8, and the matching error tolerance e is the error matching rate. These would otherwise be fixed defaults; the present invention makes the parameters adjustable, so that users can tune them as needed to maximize performance.
  • the matching threads are likewise adjustable: the thread count b (default 10) can be adjusted as needed to start the corresponding number of threads for fast matching.
  • the storage-oriented parallel fast matching system 10 for DNA sequences performs lightweight, reference-genome-based parallel matching on the base sequences in the FASTQ format, and the resulting matching results are further processed with a compression tool.
  • the parallel algorithm is based mainly on thread-level parallel programming with Pthreads (POSIX threads), a set of application programming interfaces for creating threads. Using multiple threads to process multiple tasks in parallel parallelizes the lightweight matching, improves the efficiency of processing DNA data in FASTQ format, and makes the program run much faster.
  • the units included are divided only according to functional logic; the division is not limited to the one above, as long as the corresponding functions can be implemented. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the scope of protection of the present invention.

Abstract

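A storage-oriented parallel fast matching method for DNA sequences, applied to the compressed storage of DNA sequences, the method comprising: a hash index construction step of building a hash index over a reference genome in FASTA format based on prefixes, finding all kmers with a specified prefix and using them as keys to build a hash index table, each entry storing the positions at which the corresponding kmer occurs; a file blocking step of inputting a DNA sequence file in FASTQ format and splitting the DNA sequence file into blocks; and a multi-thread processing step of starting multiple threads that each handle one of several tasks determined by the thread count, the sub-blocks simultaneously calling a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format, compressed storage being achieved by storing the matching results in place of the original DNA sequences.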

Description

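Storage-oriented parallel fast matching method for DNA sequences and system thereof
Technical Field
The present invention relates to the field of data compression, and in particular to a storage-oriented parallel fast matching method for DNA sequences and a system thereof.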
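Background Art
Advances in next-generation sequencing have driven the production of high-throughput DNA sequencing data, and the exponential growth of this data has outpaced the growth of computer microprocessors and storage devices. Compression of high-throughput DNA sequencing data is an effective way to address the storage and transmission of DNA sequences. Before compression, a common practice is to match the high-throughput sequencing data, stored as FASTQ sequence files, to an existing genome, the reference genome, whose file is in FASTA format. Storing the matching results between the target sequences and the reference genome in place of the original sequences achieves compressed storage: redundant information is removed and only the differing parts are recorded. Base matching has therefore become an important problem for the compressed storage of DNA sequences.
Mainstream FASTQ matching software currently includes the BWA (Burrows-Wheeler Aligner) tool and the Bowtie tool, both of which are matching methods based on the BWT. However, the BWT-based BWA tool is computationally expensive, demands considerable storage space, and consumes a large amount of memory; it is also slower than sparse-index algorithms, so it has no advantage over them in either memory footprint or speed. Bowtie, because it takes quality scores into account, needs still more matching time and memory; it also considers only three or fewer mismatches and does not allow gaps between the sequence and the reference genome, such as insertion and deletion errors. These tools were designed for downstream sequence analysis and pursue complete and accurate matching results, but they are not well suited to compressed DNA storage, especially on highly noisy DNA data. Compressed DNA storage requires fast matching, while the accuracy requirement on the matching results can be relaxed somewhat.
In addition, most of these existing matching tools are single-threaded. Because they do not parallelize their work across threads, they cannot fully exploit multi-core processors, cannot make full use of multiple execution cores, and cannot complete more tasks within a given time. Throughout matching they often run slowly and consume much time and memory, so matching efficiency is very low.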
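Technical Problem
In view of this, the object of the present invention is to provide a storage-oriented parallel fast matching method for DNA sequences and a system thereof, aiming to solve the low efficiency of DNA sequence matching in the prior art.
Technical Solution
The present invention provides a storage-oriented parallel fast matching method for DNA sequences, applied to the compressed storage of DNA sequences, wherein the method comprises:
a hash index construction step: building a hash index over a reference genome in FASTA format based on prefixes, finding all kmers with a specified prefix and using them as keys to build a hash index table, each entry storing the positions at which the corresponding kmer occurs;
a file blocking step: inputting a DNA sequence file in FASTQ format and splitting the DNA sequence file into blocks;
a multi-thread processing step: starting multiple threads that each handle one of several tasks determined by the thread count, the sub-blocks simultaneously calling a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format, compressed storage being achieved by storing the matching results in place of the original DNA sequences.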
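Preferably, the hash index construction step specifically comprises:
defining k consecutive bases prefixed with the base combination "AT" or "CG" as a kmer;
finding all kmers with the prefix "AT" or "CG" in the reference genome in FASTA format and converting them into hash values;
defining the binary hash values of the four bases "A", "C", "G" and "T" as 00, 01, 10 and 11 respectively, and converting the few "N" and "W" bases present into the corresponding "A", "C", "G" or "T" bases using an approximation rule;
converting each base of a kmer into its binary hash value and concatenating these values into one key of the hash table, the hash table entry associated with a key storing all positions at which the corresponding kmer occurs in the reference genome, thereby constructing the hash index;
and adjusting the error matching rate parameter, the minimum valid match length parameter and the matching thread count parameter as needed, and applying the adjusted parameters to the subsequent matching operation step.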
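The two-bit encoding and the position table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function names `encode_kmer` and `build_index` are hypothetical, the kmer is taken to be the k bases that follow the prefix (the text is ambiguous on whether the prefix is included), and mapping "N" and "W" to "A" is a placeholder because the "approximation rule" is not specified.

```python
from collections import defaultdict

# Two-bit codes stated in the text: A=00, C=01, G=10, T=11.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
# Assumption: the source does not define the approximation rule for
# ambiguous bases; mapping them to "A" is only a placeholder.
AMBIGUOUS_TO_BASE = {"N": "A", "W": "A"}


def encode_kmer(kmer):
    """Pack a kmer into an integer key, two bits per base."""
    key = 0
    for base in kmer.upper():
        base = AMBIGUOUS_TO_BASE.get(base, base)
        key = (key << 2) | BASE_CODE[base]
    return key


def build_index(reference, prefixes=("CG",), k=8):
    """Map each encoded kmer that follows a prefix to every prefix position."""
    ref = reference.upper()
    index = defaultdict(list)
    for i in range(len(ref)):
        for prefix in prefixes:
            start = i + len(prefix)
            # Only index kmers that fit entirely inside the reference.
            if ref.startswith(prefix, i) and start + k <= len(ref):
                index[encode_kmer(ref[start:start + k])].append(i)
    return index


index = build_index("CGACGTCGACGT", prefixes=("CG",), k=4)
print(dict(index))
```

With k = 8 a key needs only 16 bits, so the keys could also index a flat array directly instead of a dictionary; the dictionary is used here only for brevity.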
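Preferably, the file blocking step further comprises:
presetting the required number of blocks before blocking;
traversing the original DNA sequence file with a file pointer;
calculating the number of sequences contained in each sub-block.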
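Preferably, the multi-thread processing step specifically comprises:
starting a number of threads corresponding to the number of sub-blocks, and executing the thread function, which calls the matching function that uses the kmer hash index for fast positioning, to process commands;
calling the corresponding matching function according to the thread function processing command, and running all threads concurrently until every thread has finished;
in the matching function, finding all kmers with the prefix "AT" or "CG" in the DNA sequence to be matched and converting them into binary hash values, no matching being performed if the key does not exist in the hash table;
if the key exists, performing bidirectional extension matching between the DNA sequence and the reference genome starting from the matching position returned by the hash table, with an error tolerance rate set so that some bases may be mismatched, inserted or deleted;
the match succeeding when the match length is greater than the preset minimum valid match length and the mismatch rate is less than the set error tolerance rate;
if the sequence matches only partially, converting the unmatched part into a palindromic structure and repeating the above process to match again;
finally outputting the matching result, which includes the match position, match type, match length, mismatch positions and mismatch content, and archiving and merging the matching results obtained by the sub-blocks' respective matching operations.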
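The output fields listed above (match position, match type, match length, mismatch positions, mismatch content) are what allows a stored result to stand in for the original read. The Python sketch below shows that round trip under simplifying assumptions: the `MatchRecord` class and both function names are hypothetical, only substitutions are handled, and insertions and deletions, which the method also tolerates, are left out.

```python
from dataclasses import dataclass, field


@dataclass
class MatchRecord:
    position: int                  # start of the match in the reference
    length: int                    # number of matched bases
    match_type: str = "forward"    # hypothetical label for the match type
    mismatches: dict = field(default_factory=dict)  # read offset -> read base


def encode_read(read, reference, position):
    """Describe `read` relative to the reference (substitutions only)."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    diffs = {i: r for i, (r, g) in enumerate(zip(read, window)) if r != g}
    return MatchRecord(position, len(read), "forward", diffs)


def decode_read(record, reference):
    """Rebuild the original read from the reference and the stored record."""
    start = record.position
    bases = list(reference[start:start + record.length])
    for offset, base in record.mismatches.items():
        bases[offset] = base
    return "".join(bases)


reference = "TTCGACGTACGGA"
record = encode_read("ACGTTCGG", reference, position=4)
print(record)
print(decode_read(record, reference))
```

Storing one position, one length and a short list of differences is usually far smaller than storing the bases themselves, which is the compression gain the method relies on.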
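In another aspect, the present invention further provides a storage-oriented parallel fast matching system for DNA sequences, the system comprising:
a hash index construction module, configured to build a hash index over a reference genome in FASTA format based on prefixes, finding all kmers with a specified prefix and using them as keys to build a hash index table, each entry storing the positions at which the corresponding kmer occurs;
a file blocking module, configured to input a DNA sequence file in FASTQ format and split the DNA sequence file into blocks;
a multi-thread processing module, configured to start multiple threads that each handle one of several tasks determined by the thread count, the sub-blocks simultaneously calling a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format;
a specific matching module, configured to find all kmers with the specified prefix in the DNA sequence to be matched, convert them into hash values, and query the hash table for the positions of the corresponding kmers in the reference genome, from which bidirectional extension matching starts.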
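Preferably, the file blocking module is further configured to:
preset the required number of blocks before blocking;
traverse the original DNA sequence file with a file pointer;
calculate the number of sequences contained in each sub-block.
Preferably, the multi-thread processing module is specifically configured to:
start a number of threads corresponding to the number of sub-blocks, and execute the thread function that calls the matching operation to process commands;
call the corresponding matching function according to the thread function processing command, and run all threads concurrently until every thread has finished.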
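Preferably, the hash index construction module is specifically configured to:
define k consecutive bases prefixed with the base combination "AT" or "CG" as a kmer;
find all kmers with the prefix "AT" or "CG" in the reference genome in FASTA format and convert them into hash values;
define the binary hash values of the four bases "A", "C", "G" and "T" as 00, 01, 10 and 11 respectively, and convert the few "N" and "W" bases present into the corresponding "A", "C", "G" or "T" bases using an approximation rule;
convert each base of a kmer into its binary hash value and concatenate these values into one key of the hash table, the hash table entry associated with a key storing all positions at which the corresponding kmer occurs in the reference genome, thereby constructing the hash index;
and adjust the error matching rate parameter, the minimum valid match length parameter and the matching thread count parameter as needed, and apply the adjusted parameters to the subsequent matching operation step.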
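Preferably, the system further comprises a specific matching module configured to:
find all kmers with the prefix "AT" or "CG" in the DNA sequence to be matched and convert them into binary hash values, no matching being performed if the key does not exist in the hash table;
if the key exists, perform bidirectional extension matching between the DNA sequence and the reference genome starting from the matching position returned by the hash table, with an error tolerance rate set so that some bases may be mismatched, inserted or deleted;
treat the match as successful when the match length is greater than the preset minimum valid match length and the mismatch rate is less than the set error tolerance rate;
if the sequence matches only partially, convert the unmatched part into a palindromic structure and repeat the above process to match again;
finally output the matching result, which includes the match position, match type, match length, mismatch positions and mismatch content, and archive and merge the matching results obtained by the sub-blocks' respective matching operations.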
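Beneficial Effects
The technical solution provided by the present invention applies to reference-genome-based compression among DNA sequence compression methods: the base sequences in the FASTQ format are matched to a reference genome by lightweight parallel matching, and the final matching results are then further processed with a compression tool. The parallel algorithm is based mainly on thread-level parallel programming with Pthreads (POSIX threads), a set of application programming interfaces for creating threads. Using multiple threads to process multiple tasks in parallel parallelizes the lightweight matching, improves the efficiency of processing DNA data in FASTQ format, and greatly speeds up the program.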
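Brief Description of the Drawings
FIG. 1 is a flowchart of the storage-oriented parallel fast matching method for DNA sequences in an embodiment of the present invention;
FIG. 2 is a flowchart of the parallel execution of matching in an embodiment of the present invention;
FIG. 3 is a flowchart of the merging of multi-sub-block matching results in an embodiment of the present invention;
FIG. 4 is an overall flowchart of the parallel matching model in an embodiment of the present invention;
FIG. 5 is a flowchart of the specific matching procedure in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the internal structure of the storage-oriented parallel fast matching system 10 for DNA sequences in an embodiment of the present invention.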
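Embodiments of the Invention
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
A specific embodiment of the present invention provides a storage-oriented parallel fast matching method for DNA sequences, applied to the compressed storage of DNA sequences, wherein the method mainly comprises the following steps:
S11, a hash index construction step: building a hash index over a reference genome in FASTA format based on prefixes, finding all kmers with a specified prefix and using them as keys to build a hash index table, each entry storing the positions at which the corresponding kmer occurs;
S12, a file blocking step: inputting a DNA sequence file in FASTQ format and splitting the DNA sequence file into blocks;
S13, a multi-thread processing step: starting multiple threads that each handle one of several tasks determined by the thread count, the sub-blocks simultaneously calling a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format, compressed storage being achieved by storing the matching results in place of the original DNA sequences.
The storage-oriented parallel fast matching method for DNA sequences provided by the present invention uses multiple threads to process multiple tasks in parallel, parallelizing lightweight DNA sequence matching for compressed storage. This improves the efficiency of processing DNA data in FASTQ format and speeds up matching, which in turn greatly speeds up the whole program, reduces time consumption, and improves the usability of the method.
The storage-oriented parallel fast matching method for DNA sequences provided by the present invention is described in detail below.
Refer to FIG. 1, a flowchart of the storage-oriented parallel fast matching method for DNA sequences in an embodiment of the present invention.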
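In step S11, the hash index construction step: a hash index is built over the reference genome in FASTA format based on prefixes; all kmers with a specified prefix are found and used as keys to build a hash index table, and each entry stores the positions at which the corresponding kmer occurs.
In this embodiment, FASTQ is a text format for storing biological sequences and one of the common storage formats for DNA sequences. DNA sequencing produces thousands upon thousands of DNA sequences, which are stored in FASTQ files together with all the information produced by sequencing. In the widely used FASTQ format, each sequence occupies four lines separated by newline characters. Each sequence begins with the character '@', followed by metadata; this first line uniquely identifies the DNA sequence. The second line holds the base data, a sequence over the five common characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base that may stand for any of {'A', 'T', 'C', 'G'}. The third line begins with the character '+', followed by the same DNA sequence identifier as the first line. The last line is the quality score line, in one-to-one correspondence with the bases; it indicates the sequencing confidence of the base at each position.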
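The four-line record layout described above can be read with a short generator. The following Python sketch is illustrative only: `FastqRecord` and `read_fastq` are hypothetical names, and the reader assumes strictly four lines per record (no wrapped sequence lines), as the text describes.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str   # text after '@' on line 1
    bases: str        # line 2
    plus: str         # text after '+' on line 3 (may be empty)
    quality: str      # line 4, one character per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield one record per four lines of a FASTQ text stream."""
    while True:
        header = handle.readline()
        if not header:            # end of file
            return
        if not header.strip():    # tolerate blank lines between records
            continue
        header = header.rstrip("\r\n")
        bases = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(bases) != len(quality):
            raise ValueError("quality line length differs from base line")
        yield FastqRecord(header[1:], bases, plus[1:], quality)


sample = io.StringIO("@read1\nACGTN\n+read1\nIIIII\n")
for record in read_fastq(sample):
    print(record.identifier, record.bases, record.quality)
```

Only the second line of each record takes part in matching; the identifier and quality lines would be handled separately by whatever compressor consumes the matching results.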
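In step S12, the file blocking step: a DNA sequence file in FASTQ format is input and split into blocks.
In this embodiment, the FASTQ-format DNA sequence file needs to be preprocessed. Because each DNA sequence record contains four lines of key information, the FASTQ file is split into blocks at multiples of four lines. The required number of blocks is preset to 10 but can be adjusted, for example to 8, 9, 11 or 12 as needed. Before blocking, the original FASTQ file is traversed with a file pointer, after which the number of sequences contained in each sub-block is calculated.
In this embodiment, the file blocking step S12 further comprises: presetting the required number of blocks before blocking; traversing the original DNA sequence file with a file pointer; and calculating the number of sequences contained in each sub-block.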
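Splitting at multiples of four lines keeps every record intact inside one sub-block. A minimal Python sketch of that bookkeeping follows; the function names are hypothetical, the file is assumed to be already loaded as a list of lines, and giving the leftover records to the first blocks is an assumption, since the text does not say how a remainder is distributed.

```python
def records_per_block(total_records, num_blocks=10):
    """Number of whole records assigned to each sub-block."""
    base, extra = divmod(total_records, num_blocks)
    # Assumption: the first `extra` blocks each take one additional record.
    return [base + (1 if i < extra else 0) for i in range(num_blocks)]


def split_fastq_lines(lines, num_blocks=10):
    """Split FASTQ lines into sub-blocks whose sizes are multiples of four."""
    if len(lines) % 4 != 0:
        raise ValueError("a FASTQ file must contain four lines per record")
    blocks, start = [], 0
    for count in records_per_block(len(lines) // 4, num_blocks):
        end = start + 4 * count
        blocks.append(lines[start:end])
        start = end
    return blocks


print(records_per_block(23, num_blocks=10))
```

On a real file the same counts would be turned into byte offsets during the file-pointer traversal, so that each thread can seek straight to its own sub-block.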
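In step S13, the multi-thread processing step: multiple threads are started, each handling one of several tasks determined by the thread count; the sub-blocks simultaneously call a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format, and compressed storage is achieved by storing the matching results in place of the original DNA sequences.
In this embodiment, the parallel algorithm uses thread-level parallel programming: multiple threads are started to handle the tasks determined by the thread count, and the tasks are then processed in the background. The thread function calls the matching function, and the sub-blocks execute the thread function simultaneously, so that the matching operations run in parallel.
In this embodiment, the multi-thread processing step S13 uses thread-level parallel programming with Pthreads (i.e., POSIX threads) to parallelize the matching operation, which improves the efficiency of processing DNA data in FASTQ format. Specifically: a number of threads corresponding to the number of sub-blocks is started, and the thread function, which calls the matching function that uses the kmer hash index for fast positioning, is executed to process commands; the corresponding matching function is called according to the thread function processing command, and all threads run concurrently until every thread has finished. In the matching function, all kmers with the prefix "AT" or "CG" in the DNA sequence to be matched are found and converted into binary hash values; if the key does not exist in the hash table, no matching is performed. If the key exists, bidirectional extension matching between the DNA sequence and the reference genome starts from the matching position returned by the hash table, with an error tolerance rate set so that some bases may be mismatched, inserted or deleted. The match succeeds when the match length is greater than the preset minimum valid match length and the mismatch rate is less than the set error tolerance rate. If the sequence matches only partially, the unmatched part is converted into a palindromic structure and the above process is repeated to match again. Finally, the matching result is output, including the match position, match type, match length, mismatch positions and mismatch content, and the matching results obtained by the sub-blocks' respective matching operations are archived and merged. The parallel execution of matching is shown in FIG. 2, a flowchart of the parallel execution of matching in an embodiment of the present invention.
As shown in FIG. 2, once the number of sub-blocks has been determined, for example n sub-blocks, the corresponding number of threads is started (that is, n threads). Each sub-block executes the thread function, which calls the matching function that uses the kmer hash index for fast positioning, and all threads then run concurrently until every thread has finished.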
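The one-thread-per-sub-block pattern of FIG. 2 maps onto a create-then-join loop. The Python sketch below mirrors that structure with the standard `threading` module rather than Pthreads; `match_block` is a hypothetical stand-in for the real matching function, and in CPython the global interpreter lock limits the speed-up for CPU-bound work, so this shows the control flow rather than the performance.

```python
import threading


def match_block(block):
    """Placeholder for the matching function: count reads containing 'CG'."""
    return sum(1 for read in block if "CG" in read)


def run_blocks_in_threads(blocks, worker=match_block):
    """Start one thread per sub-block and wait until all have finished."""
    results = [None] * len(blocks)

    def thread_function(index):
        # Each thread writes only to its own slot, so no lock is needed.
        results[index] = worker(blocks[index])

    threads = [
        threading.Thread(target=thread_function, args=(i,))
        for i in range(len(blocks))
    ]
    for thread in threads:   # analogous to pthread_create
        thread.start()
    for thread in threads:   # analogous to pthread_join
        thread.join()
    return results


print(run_blocks_in_threads([["ACGT", "TTTT"], ["CGCG"], []]))
```

Keeping the results in a list indexed by block number preserves the original block order, which matters later when the per-block results are merged.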
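In this embodiment, the hash index construction step S11 further comprises: defining the prefixes, i.e. base combinations such as "AT" and "CG", the k bases that follow a prefix being called a kmer; finding all prefixes in the reference genome and converting the kmer base sequence that follows each into a hash value; defining the binary hash values of the four bases "A", "C", "G" and "T" as 00, 01, 10 and 11 respectively, while the few bases such as "N" and "W" that are present can be converted into the corresponding "A", "C", "G" or "T" bases using an approximation rule; storing the converted hash values in the hash table, thereby constructing the hash index; and adjusting the error matching rate parameter, the minimum valid match length parameter and the matching thread count parameter as needed, and applying the adjusted parameters to the subsequent matching operation step.
In this embodiment, the specific matching procedure comprises: finding all prefixes in the DNA sequence to be matched and converting the k bases (kmer) that follow each prefix into a hash value, and directly outputting non-match information if none exists; performing bidirectional matching from the best matching position returned by the previously constructed hash table, with an error tolerance rate set so that some bases may be mismatched, inserted or deleted; the match succeeding when the match length is greater than the preset minimum valid match length and the mismatch rate is less than the set error tolerance rate; if the first match does not satisfy these conditions, outputting the matched part of the sequence and converting the unmatched part into a palindromic structure to match again; and finally outputting the matching result according to the matching process, the result including information such as the match position, match type and match length. FIG. 5 is a flowchart of the specific matching procedure in an embodiment of the present invention.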
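The extension and acceptance test described above can be illustrated with a deliberately simplified Python sketch. It assumes an ungapped extension (substitutions only), whereas the method also tolerates insertions and deletions; `extend_seed` and `is_valid_match` are hypothetical names, and the defaults follow the stated values L = 30 and e = 0.05.

```python
def extend_seed(read, reference, read_pos, ref_pos):
    """Extend left and right from a seed hit and list the mismatch offsets.

    `read_pos` and `ref_pos` are the positions of the same seed in the read
    and in the reference. Returns (read_start, ref_start, length, mismatches).
    """
    left = min(read_pos, ref_pos)
    right = min(len(read) - read_pos, len(reference) - ref_pos)
    read_start, ref_start = read_pos - left, ref_pos - left
    length = left + right
    mismatches = [
        i for i in range(length)
        if read[read_start + i] != reference[ref_start + i]
    ]
    return read_start, ref_start, length, mismatches


def is_valid_match(length, mismatch_count, min_length=30, max_error_rate=0.05):
    """Accept when the match is longer than L and the mismatch rate is below e."""
    return length > min_length and mismatch_count / length < max_error_rate


core = "ACGTACGTAC" * 4
reference = "GG" + core + "TT"
read = core[:10] + "C" + core[11:]      # one substitution at offset 10
hit = extend_seed(read, reference, read_pos=5, ref_pos=7)
print(hit, is_valid_match(hit[2], len(hit[3])))
```

A gapped extension would replace the position-by-position comparison with a banded alignment, but the acceptance rule would stay the same.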
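After all thread tasks have finished, the parallel matching performed on the blocked file has produced intermediate files that record one matching result per block. In principle these matching results could be used directly in the subsequent compressed-storage stage, but compressed storage must be based on the overall matching result of the whole FASTQ-format DNA sequence file, so the per-block matching results need to be archived and merged for subsequent work. Key performance statistics are output at the same time; these are not described in detail here. The merging of the multi-sub-block matching results is shown in FIG. 3, a flowchart of the merging process in an embodiment of the present invention.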
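Merging amounts to concatenating the per-block results in block order and summarizing them. The Python sketch below works on in-memory lists instead of intermediate files to stay self-contained; the record layout (a dictionary with an `id` and a `matched` flag) and the chosen statistics are assumptions, since the text does not specify either.

```python
def merge_block_results(block_results):
    """Concatenate per-block results in block order and compute statistics."""
    merged = [record for block in block_results for record in block]
    matched = sum(1 for record in merged if record["matched"])
    stats = {
        "reads": len(merged),
        "matched": matched,
        # Guard against an empty input to avoid dividing by zero.
        "match_rate": matched / len(merged) if merged else 0.0,
    }
    return merged, stats


blocks = [
    [{"id": "r1", "matched": True}],
    [{"id": "r2", "matched": False}, {"id": "r3", "matched": True}],
]
merged, stats = merge_block_results(blocks)
print([record["id"] for record in merged], stats)
```

Because the blocks were cut from the file in order, concatenating them in block order restores the original read order, which a downstream decompressor would need.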
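In this embodiment, the storage-oriented parallel fast matching method for DNA sequences of the present invention further comprises:
a parameter adjustment step: finding all prefixes in the reference genome in FASTA format, taking the k bases after each prefix to create a hash index value, adjusting the error matching rate parameter, the minimum valid match length parameter and the matching thread count parameter as needed, and applying the adjusted parameters to the matching operation step.
In this embodiment, some key parameters are required during matching, for example the matching prefix P (default "CG"), the matching error tolerance e (default 0.05), and the minimum valid match length L (default 30). The k bases after the prefix are taken to create the hash index value, with k defaulting to 8, and the matching error tolerance e is the error matching rate. These would otherwise be fixed defaults; the present invention makes the parameters adjustable, so that users can tune them as needed to maximize performance. The matching threads are likewise adjustable: the thread count b (default 10) can be adjusted as needed to start the corresponding number of threads for fast matching.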
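The adjustable parameters and their stated defaults (P = "CG", e = 0.05, L = 30, k = 8, b = 10) fit naturally into one small configuration object. The Python sketch below is illustrative: the class name `MatchParams`, the field names and the validation bounds are assumptions, while the default values come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MatchParams:
    prefix: str = "CG"              # P: matching prefix
    error_tolerance: float = 0.05   # e: error matching rate
    min_match_length: int = 30      # L: minimum valid match length
    k: int = 8                      # bases taken after the prefix
    threads: int = 10               # b: number of matching threads

    def __post_init__(self):
        # The bounds below are assumptions; the source gives only defaults.
        if not self.prefix or set(self.prefix) - set("ACGT"):
            raise ValueError("prefix must consist of the bases A, C, G, T")
        if not 0.0 <= self.error_tolerance < 1.0:
            raise ValueError("error_tolerance must be in [0, 1)")
        if min(self.min_match_length, self.k, self.threads) < 1:
            raise ValueError("min_match_length, k and threads must be >= 1")


defaults = MatchParams()
tuned = replace(defaults, threads=4, error_tolerance=0.1)
print(defaults)
print(tuned)
```

Making the object immutable means the same parameter set can be shared by all matching threads without synchronization.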
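In this embodiment, step S12 preprocesses the FASTQ-format DNA sequence file by splitting it into sub-blocks, which reduces the time taken by a single serial matching pass. Step S13 parallelizes the lightweight matching process used for compressed storage of DNA sequence files: multiple threads are started to process the sub-blocks, so that the sub-blocks are matched quickly and concurrently, and the matching results are archived and merged. The parameter settings are also adjustable, so the parameters can be tuned to obtain the desired matching results. The overall flowchart of the parallel matching model in this embodiment is shown in FIG. 4.
The storage-oriented parallel fast matching method for DNA sequences provided by the present invention performs lightweight, reference-genome-based parallel matching on the base sequences in the FASTQ format, and the resulting matching results are further processed with a compression tool. The parallel algorithm is based mainly on thread-level parallel programming with Pthreads (POSIX threads), a set of application programming interfaces for creating threads. Using multiple threads to process multiple tasks in parallel parallelizes the lightweight matching, improves the efficiency of processing DNA data in FASTQ format, and greatly speeds up the program.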
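A specific embodiment of the present invention further provides a storage-oriented parallel fast matching system 10 for DNA sequences, mainly comprising:
a hash index construction module 11, configured to build a hash index over a reference genome in FASTA format based on prefixes, finding all kmers with a specified prefix and using them as keys to build a hash index table, each entry storing the positions at which the corresponding kmer occurs;
a file blocking module 12, configured to input a DNA sequence file in FASTQ format and split the DNA sequence file into blocks;
a multi-thread processing module 13, configured to start multiple threads that each handle one of several tasks determined by the thread count, the sub-blocks simultaneously calling a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format;
a specific matching module 14, configured to find all kmers with the specified prefix in the DNA sequence to be matched, convert them into hash values, and query the hash table for the positions of the corresponding kmers in the reference genome, from which bidirectional extension matching starts.
The storage-oriented parallel fast matching system 10 for DNA sequences provided by the present invention uses multiple threads to process multiple tasks in parallel, parallelizing lightweight DNA sequence matching for compressed storage. This improves the efficiency of processing FASTQ-format DNA data for compression, greatly speeds up the program and the matching, reduces time consumption, and improves usability.
Refer to FIG. 6, a schematic structural diagram of the storage-oriented parallel fast matching system 10 for DNA sequences in an embodiment of the present invention.
In this embodiment, the storage-oriented parallel fast matching system 10 for DNA sequences, applied to the compressed storage of DNA sequences, mainly comprises the hash index construction module 11, the file blocking module 12, the multi-thread processing module 13 and the specific matching module 14.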
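The hash index construction module 11 is configured to build a hash index over a reference genome in FASTA format based on prefixes, finding all kmers with a specified prefix and using them as keys to build a hash index table, each entry storing the positions at which the corresponding kmer occurs.
In this embodiment, the hash index construction module 11 is configured to: define k consecutive bases prefixed with the base combination "AT" or "CG" as a kmer; find all kmers with the prefix "AT" or "CG" in the reference genome in FASTA format and convert them into hash values; define the binary hash values of the four bases "A", "C", "G" and "T" as 00, 01, 10 and 11 respectively, while the few bases such as "N" and "W" that are present can be converted into the corresponding "A", "C", "G" or "T" bases using an approximation rule; convert each base of a kmer into its binary hash value and concatenate these values into one key of the hash table, the hash table entry associated with a key storing all positions at which the corresponding kmer occurs in the reference genome, thereby constructing the hash index; and adjust the error matching rate parameter, the minimum valid match length parameter and the matching thread count parameter as needed, and apply the adjusted parameters to the subsequent matching operation step.
In this embodiment, the specific matching module 14 is configured to find all kmers with the specified prefix in the DNA sequence to be matched, convert them into hash values, and query the hash table for the positions of the corresponding kmers in the reference genome, from which bidirectional extension matching starts.
Specifically, the specific matching module 14 is configured to find all kmers with the prefix "AT" or "CG" in the DNA sequence to be matched and convert them into binary hash values; if the key does not exist in the hash table, no matching is performed. If the key exists, bidirectional extension matching between the DNA sequence and the reference genome starts from the matching position returned by the hash table, with an error tolerance rate set so that some bases may be mismatched, inserted or deleted. The match succeeds when the match length is greater than the preset minimum valid match length and the mismatch rate is less than the set error tolerance rate. If the sequence matches only partially, the unmatched part is converted into a palindromic structure and the above process is repeated to match again. Finally, the matching result is output, including the match position, match type, match length, mismatch positions and mismatch content, and the matching results obtained by the sub-blocks' respective matching operations are archived and merged.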
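The retry on the "palindromic structure" can be sketched as a second attempt on the reverse complement. That reading is an assumption: the text does not define the term, and reverse-complement matching is simply the usual meaning of a palindromic match in DNA tools. The function names and the `try_match` callback in the Python sketch below are hypothetical.

```python
# Translation table for complementing bases; other characters such as 'N'
# are left unchanged by str.translate.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def match_with_retry(sequence, try_match):
    """Try a forward match first, then the reverse complement.

    `try_match` is any callable that returns a result, or None on failure.
    Returns (match_type, result), or None when both attempts fail.
    """
    result = try_match(sequence)
    if result is not None:
        return "forward", result
    result = try_match(reverse_complement(sequence))
    if result is not None:
        return "palindrome", result
    return None


reference = "TTTTCGTTAAAA"


def exact_find(seq):
    # Toy matcher: exact substring search in place of seed-and-extend.
    pos = reference.find(seq)
    return pos if pos >= 0 else None


print(match_with_retry("AACG", exact_find))
```

Recording the match type alongside the position is what lets a decoder know whether to complement and reverse the reference segment when rebuilding the read.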
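The file blocking module 12 is configured to input a DNA sequence file in FASTQ format and split the DNA sequence file into blocks.
In this embodiment, FASTQ is a text format for storing biological sequences and one of the common storage formats for DNA sequences. DNA sequencing produces thousands upon thousands of DNA sequences, which are stored in FASTQ files together with all the information produced by sequencing. In the widely used FASTQ format, each sequence occupies four lines separated by newline characters. Each sequence begins with the character '@', followed by metadata; this first line uniquely identifies the DNA sequence. The second line holds the base data, a sequence over the five common characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base that may stand for any of {'A', 'T', 'C', 'G'}. The third line begins with the character '+', followed by the same DNA sequence identifier as the first line. The last line is the quality score line, in one-to-one correspondence with the bases; it indicates the sequencing confidence of the base at each position.
In this embodiment, the file blocking module 12 is further configured to: preset the required number of blocks before blocking; traverse the original DNA sequence file with a file pointer; and calculate the number of sequences contained in each sub-block.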
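The multi-thread processing module 13 is configured to start multiple threads that each handle one of several tasks determined by the thread count; the sub-blocks simultaneously call a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format.
In this embodiment, the parallel algorithm uses thread-level parallel programming: multiple threads are started to handle the tasks determined by the thread count, and the tasks are then processed in the background. The thread function calls the matching function, and the sub-blocks execute the thread function simultaneously, so that the matching operations run in parallel.
In this embodiment, the multi-thread processing step uses thread-level parallel programming with Pthreads (i.e., POSIX threads) to parallelize the matching operation, which improves the efficiency of processing DNA data in FASTQ format. Specifically, it starts a number of threads corresponding to the number of sub-blocks and executes the thread function that calls the matching operation; the thread function then calls the corresponding matching function, and all threads run concurrently until every thread has finished.
After all thread tasks have finished, the parallel matching performed on the blocked file has produced intermediate files holding one matching result per block. In principle these matching results feed the compressed-storage stage, but compressed storage must be based on the overall matching result of the whole FASTQ-format DNA sequence file, so a matching-result merging module is needed to archive and merge the per-block results for subsequent work. Key performance statistics are output at the same time; these are not described in detail here.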
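In this embodiment, the storage-oriented parallel fast matching system 10 for DNA sequences further comprises:
a parameter adjustment module, configured to find all prefixes (such as "CG" and "AT") in the reference genome in FASTA format, take the k bases after each prefix to create a hash index value, adjust the error matching rate parameter, the minimum valid match length parameter and the number of matching threads as needed, and apply the adjusted parameters to the matching operation step.
In this embodiment, some key parameters are required during matching, for example the matching prefix P (default "CG"), the matching error tolerance e (default 0.05), and the minimum valid match length L (default 30). The k bases after the prefix are taken to create the hash index value, with k defaulting to 8, and the matching error tolerance e is the error matching rate. These would otherwise be fixed defaults; the present invention makes the parameters adjustable, so that users can tune them as needed to maximize performance. The matching threads are likewise adjustable: the thread count b (default 10) can be adjusted as needed to start the corresponding number of threads for fast matching.
The storage-oriented parallel fast matching system 10 for DNA sequences provided by the present invention performs lightweight, reference-genome-based parallel matching on the base sequences in the FASTQ format, and the resulting matching results are further processed with a compression tool. The parallel algorithm is based mainly on thread-level parallel programming with Pthreads (POSIX threads), a set of application programming interfaces for creating threads. Using multiple threads to process multiple tasks in parallel parallelizes the lightweight matching, improves the efficiency of processing DNA data in FASTQ format, and greatly speeds up the program.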
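It is worth noting that, in the above embodiments, the units included are divided only according to functional logic; the division is not limited to the one above, as long as the corresponding functions can be implemented. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the scope of protection of the present invention.
In addition, a person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.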

Claims (9)

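  1. A storage-oriented parallel fast matching method for DNA sequences, applied to the compressed storage of DNA sequences, characterized in that the method comprises:
    a hash index construction step: building a hash index over a reference genome in FASTA format based on prefixes, finding all kmers with a specified prefix and using them as keys to build a hash index table, each entry storing the positions at which the corresponding kmer occurs;
    a file blocking step: inputting a DNA sequence file in FASTQ format and splitting the DNA sequence file into blocks;
    a multi-thread processing step: starting multiple threads that each handle one of several tasks determined by the thread count, the sub-blocks simultaneously calling a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format, compressed storage being achieved by storing the matching results in place of the original DNA sequences.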
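  2. The storage-oriented parallel fast matching method for DNA sequences according to claim 1, characterized in that the hash index construction step specifically comprises:
    defining k consecutive bases prefixed with the base combination "AT" or "CG" as a kmer;
    finding all kmers with the prefix "AT" or "CG" in the reference genome in FASTA format and converting them into hash values;
    defining the binary hash values of the four bases "A", "C", "G" and "T" as 00, 01, 10 and 11 respectively, and converting the few "N" and "W" bases present into the corresponding "A", "C", "G" or "T" bases using an approximation rule;
    converting each base of a kmer into its binary hash value and concatenating these values into one key of the hash table, the hash table entry associated with a key storing all positions at which the corresponding kmer occurs in the reference genome, thereby constructing the hash index;
    and adjusting the error matching rate parameter, the minimum valid match length parameter and the matching thread count parameter as needed, and applying the adjusted parameters to the subsequent matching operation step.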
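  3. The storage-oriented parallel fast matching method for DNA sequences according to claim 1, characterized in that the file blocking step further comprises:
    presetting the required number of blocks before blocking;
    traversing the original DNA sequence file with a file pointer;
    calculating the number of sequences contained in each sub-block.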
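  4. The storage-oriented parallel fast matching method for DNA sequences according to claim 1, characterized in that the multi-thread processing step specifically comprises:
    starting a number of threads corresponding to the number of sub-blocks, and executing the thread function, which calls the matching function that uses the kmer hash index for fast positioning, to process commands;
    calling the corresponding matching function according to the thread function processing command, and running all threads concurrently until every thread has finished;
    in the matching function, finding all kmers with the prefix "AT" or "CG" in the DNA sequence to be matched and converting them into binary hash values, no matching being performed if the key does not exist in the hash table;
    if the key exists, performing bidirectional extension matching between the DNA sequence and the reference genome starting from the matching position returned by the hash table, with an error tolerance rate set so that some bases may be mismatched, inserted or deleted;
    the match succeeding when the match length is greater than the preset minimum valid match length and the mismatch rate is less than the set error tolerance rate;
    if the sequence matches only partially, converting the unmatched part into a palindromic structure and repeating the above process to match again;
    finally outputting the matching result, which includes the match position, match type, match length, mismatch positions and mismatch content, and archiving and merging the matching results obtained by the sub-blocks' respective matching operations.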
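  5. A storage-oriented parallel fast matching system for DNA sequences, characterized in that the system comprises:
    a hash index construction module, configured to build a hash index over a reference genome in FASTA format based on prefixes, finding all kmers with a specified prefix and using them as keys to build a hash index table, each entry storing the positions at which the corresponding kmer occurs;
    a file blocking module, configured to input a DNA sequence file in FASTQ format and split the DNA sequence file into blocks;
    a multi-thread processing module, configured to start multiple threads that each handle one of several tasks determined by the thread count, the sub-blocks simultaneously calling a matching function that uses the kmer hash index for fast positioning, so that the sub-blocks are matched in parallel to the target reference genome in FASTA format;
    a specific matching module, configured to find all kmers with the specified prefix in the DNA sequence to be matched, convert them into hash values, and query the hash table for the positions of the corresponding kmers in the reference genome, from which bidirectional extension matching starts.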
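  6. The storage-oriented parallel fast matching system for DNA sequences according to claim 5, characterized in that the file blocking module is further configured to:
    preset the required number of blocks before blocking;
    traverse the original DNA sequence file with a file pointer;
    calculate the number of sequences contained in each sub-block.
  7. The storage-oriented parallel fast matching system for DNA sequences according to claim 6, characterized in that the multi-thread processing module is specifically configured to:
    start a number of threads corresponding to the number of sub-blocks, and execute the thread function that calls the matching operation to process commands;
    call the corresponding matching function according to the thread function processing command, and run all threads concurrently until every thread has finished.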
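  8. The storage-oriented parallel fast matching system for DNA sequences according to claim 5, characterized in that the hash index construction module is specifically configured to:
    define k consecutive bases prefixed with the base combination "AT" or "CG" as a kmer;
    find all kmers with the prefix "AT" or "CG" in the reference genome in FASTA format and convert them into hash values;
    define the binary hash values of the four bases "A", "C", "G" and "T" as 00, 01, 10 and 11 respectively, and convert the few "N" and "W" bases present into the corresponding "A", "C", "G" or "T" bases using an approximation rule;
    convert each base of a kmer into its binary hash value and concatenate these values into one key of the hash table, the hash table entry associated with a key storing all positions at which the corresponding kmer occurs in the reference genome, thereby constructing the hash index;
    and adjust the error matching rate parameter, the minimum valid match length parameter and the matching thread count parameter as needed, and apply the adjusted parameters to the subsequent matching operation step.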
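  9. The storage-oriented parallel fast matching system for DNA sequences according to claim 5, characterized in that the specific matching module is specifically configured to:
    find all kmers with the prefix "AT" or "CG" in the DNA sequence to be matched and convert them into binary hash values, no matching being performed if the key does not exist in the hash table;
    if the key exists, perform bidirectional extension matching between the DNA sequence and the reference genome starting from the matching position returned by the hash table, with an error tolerance rate set so that some bases may be mismatched, inserted or deleted;
    treat the match as successful when the match length is greater than the preset minimum valid match length and the mismatch rate is less than the set error tolerance rate;
    if the sequence matches only partially, convert the unmatched part into a palindromic structure and repeat the above process to match again;
    finally output the matching result, which includes the match position, match type, match length, mismatch positions and mismatch content, and archive and merge the matching results obtained by the sub-blocks' respective matching operations.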
PCT/CN2016/087407 2016-06-28 2016-06-28 Storage-oriented parallel fast matching method for DNA sequences and system thereof WO2018000174A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/087407 WO2018000174A1 (zh) 2016-06-28 2016-06-28 Storage-oriented parallel fast matching method for DNA sequences and system thereof

Publications (1)

Publication Number Publication Date
WO2018000174A1 true WO2018000174A1 (zh) 2018-01-04

Family

ID=60785678

Country Status (1)

Country Link
WO (1) WO2018000174A1 (zh)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16906578

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 06.05.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 16906578

Country of ref document: EP

Kind code of ref document: A1