CN102521529A - Distributed gene sequence alignment method based on Basic Local Alignment Search Tool (BLAST)

Info

Publication number: CN102521529A
Application number: CN2011104102015A
Authority: CN
Inventors: 吴一雷; 闫鹏程; 刘充; 李国锐; 陈禹保; 黄劲松; 谢威
Original assignee: BEIJING COMPUTING CENTER
Current assignee: BEIJING COMPUTING CENTER
Priority date: 2011-12-09
Filing date: 2011-12-09
Publication date: 2012-06-27

Abstract

The invention relates to the technical fields of computing and bioinformatics and discloses a distributed gene sequence alignment method based on the Basic Local Alignment Search Tool (BLAST). The method comprises the following steps. S1: the program parses the user parameters, determines the number of MPI threads, reads the query sequence file and splits it according to the number of tasks, and each MPI thread reads its own MPI thread serial number. S2: according to the MPI thread serial number, the program determines whether the current MPI thread is the head node. If it is the head node, it waits for communication requests from the other MPI threads; when a request arrives, it responds, assigns the current task to the requesting thread, and continues assigning tasks. If it is not the head node, it requests a task serial number from the head node, reads the query sequence file fragment for that task serial number, and runs BLAST to obtain a BLAST alignment result; the task serial number is decremented by 1, and once BLAST has finished the thread requests a task serial number again. S3: the program merges all BLAST alignment results. The method can reduce the hardware cost of bioinformatics research.

Description

Distributed gene sequence alignment method based on BLAST

Technical field

The present invention relates to the technical fields of computing and bioinformatics, and in particular to a distributed gene sequence alignment method based on BLAST.

Background art

Next-generation sequencing (NGS) technology has brought enormous change to biological research and, over the past few years, has developed remarkably in sequencing principles, operational details, and technical extensions. Compared with traditional Sanger sequencing, NGS platforms avoid the cloning step and use adapters directly for parallel PCR (polymerase chain reaction) and sequencing reactions, so data throughput is greatly increased and more DNA can be sequenced in less time. For example, drawing the first human genome map with Sanger sequencing took 13 years and several hundred sequencers in total, whereas NGS can now complete the same work in a few months. In addition, the cost of NGS has fallen sharply; if the current pace of development is maintained, the cost of sequencing an individual genome could drop below 1000 US dollars within a few years, at which point the research and clinical application environment for NGS will be further strengthened.

Functional annotation against published gene databases is one of the basic methods of analyzing sequencing data. The BLAST (Basic Local Alignment Search Tool) [1] software suite is a sequence similarity search program released by NCBI (National Center for Biotechnology Information) and is currently the open-source functional annotation software most commonly used in academia. Unlike exact matching algorithms, BLAST uses a seed-and-extend approximate matching technique to quickly find similar segments between sequences. In addition, BLAST can run multithreaded on symmetric multiprocessor (SMP) machines to improve computational efficiency.

In general, NGS data consist of millions of short DNA reads and are characterized by large scale and high data volume, so bioinformatics research commonly uses high-performance computer clusters to annotate and analyze NGS data. Although BLAST implements multithreading, it still runs only on a single machine and is limited in scale; for example, on SMP machines with more than 4 cores the processor resources cannot be fully used. To cope with exponentially growing bioinformatics data volumes, further improve the running efficiency of BLAST, and speed up bioinformatics analysis and research, developers have produced several parallel BLAST versions for cluster environments, such as mpiBlast [2] and pBlast [3]. These parallel programs improve the scalability of the analysis algorithm and can easily extend to hundreds or even thousands of processors running simultaneously, but they share some shortcomings: 1) not every parallel version produces results consistent with a single-machine NCBI BLAST run [4], because different database partitioning or result merging methods are used; 2) traditional high-performance computing usually adopts a shared storage system, that is, the database, the BLAST binaries, the sequence files, and the intermediate results all reside on the same physical storage. This is convenient for system maintenance, but at high degrees of parallelism the aggregate IO of all nodes places a very large load on network resources [5] and seriously degrades the overall execution efficiency of the software, so IO bandwidth often becomes the bottleneck of multi-sequence alignment analysis; 3) these programs all require tightly coupled high-performance computer clusters and high-performance storage systems, so hardware costs are high.

The references cited above are as follows:

[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic Local Alignment Search Tool," Journal of Molecular Biology, vol. 215, pp. 403-410, 1990.

[2] A. E. Darling, L. Carey, and W. C. Feng, "The design, implementation, and evaluation of mpiBlast," in Proceedings of the 4th International Conference on Linux Clusters: The HPC Revolution 2003, 2003.

[3] D. R. Mathog, "Parallel Blast on Split Databases," in Bioinformatics Applications Note, vol. 19, no. 14, pp. 1865-1866, 2003.

[4] dBlast, http://www.cmbi.kun.nl/software/dBlast.

[5] M. C. Schatz, "BlastReduce: High Performance Short Read Mapping with MapReduce," 2008.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the present invention is how to design a distributed gene sequence alignment method based on the traditional single-machine NCBI BLAST such that, on the one hand, the parallel annotation results are fully consistent with the single-machine results and, on the other hand, the aggregate IO bandwidth of the whole system is increased to far exceed the bandwidth of a shared storage network, relieving the IO bottleneck of gene function annotation analysis.

(2) Technical solution

To solve the above technical problem, the invention provides a distributed gene sequence alignment method based on BLAST, comprising the following steps:

S1: the program parses the user parameters, determines the number of MPI threads, reads the query sequence file, and splits the query sequence file according to the number of tasks to obtain query sequence file fragments; each MPI thread then reads its own MPI thread serial number;

S2: whether the current MPI thread is the head node is determined from the MPI thread serial number. If the current MPI thread is the head node, it waits for communication requests from the other MPI threads; when a request arrives, it responds to the request, assigns the current task to the requesting thread, decrements the task serial number by 1, and continues assigning tasks until all tasks have been assigned, at which point the current MPI thread ends. If the current MPI thread is not the head node, it first requests a task serial number from the head node, reads the query sequence file fragment corresponding to that task serial number, and executes BLAST against the database to obtain a BLAST alignment result; the task serial number is then decremented by 1, and once BLAST has finished the thread requests a task serial number again and continues executing BLAST until the last task;

S3: all BLAST alignment results are merged.

Preferably, when the query sequence file is split according to the number of tasks in step S1, it is split into more pieces than there are nodes.

Preferably, in step S2, an MPI thread that is not the head node is a compute node, and the database and the BLAST intermediate results produced when BLAST is executed are all stored in the local storage of the compute node.

Preferably, the number of tasks is greater than or equal to the number of MPI threads.

Preferably, the query sequence file is in FASTA format.

Preferably, the database is a gene database.

(3) Beneficial effects

The present invention develops a distributed gene sequence alignment method (also called an annotation method) based on the traditional single-machine NCBI BLAST. On the one hand, because it relies on the original BLAST program, the parallel annotation results are fully consistent with the single-machine results. On the other hand, by storing the gene database in a distributed manner and splitting the query sequence file, it greatly increases the aggregate IO bandwidth of the whole system in the sequence alignment method, far exceeding the bandwidth of a shared storage network, and relieves the IO bottleneck of gene function annotation analysis. Further, because the distributed algorithm of the invention is suited to inexpensive storage and computer systems, a loosely coupled, highly heterogeneous network topology built from ordinary servers can replace a high-performance computer cluster, so that gene function annotation analysis can run on an ordinary PC cluster, which reduces the hardware cost of bioinformatics research. In addition, a load-balancing method is introduced into the splitting of the BLAST query sequences, which further improves resource utilization and speeds up overall execution.

Description of the drawings

Fig. 1 is a flowchart of the method of the present invention;

Fig. 2 shows the distributed BLAST network topology.

Detailed description of the embodiments

Specific embodiments of the invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the invention but do not limit its scope.

The flowchart of the method of the present invention is shown in Fig. 1, and the method comprises the following steps:

S1: First, the program parses the user parameters and determines the number of MPI (Message Passing Interface) threads, reads the query sequence file (FASTA format), and splits it according to the number of tasks (i.e. query sequence files; the number of tasks is greater than or equal to the number of MPI threads) to obtain query sequence file fragments; each MPI thread then reads its own MPI thread serial number. The user parameters are mainly the BLAST parameters; BLAST is open-source software, and its parameters can be looked up on the NCBI website. Parsing user parameters here means that the program parses the parameters entered by the user and passes them on to BLAST.
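
The splitting part of step S1 can be sketched as follows. This is a minimal illustration, not the patented program: the function names `read_fasta_records` and `split_into_tasks`, the contiguous near-equal split, and the sample sequences are all assumptions made for the example; the text only states that a FASTA query file is split into at least as many tasks as there are MPI threads.

```python
from typing import Iterable, List


def read_fasta_records(lines: Iterable[str]) -> List[str]:
    """Group FASTA lines into records; each record starts with a '>' header."""
    records: List[str] = []
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def split_into_tasks(records: List[str], num_tasks: int) -> List[List[str]]:
    """Split records into num_tasks contiguous, nearly equal fragments (assumed policy)."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be positive")
    base, extra = divmod(len(records), num_tasks)
    fragments: List[List[str]] = []
    start = 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        fragments.append(records[start:start + size])
        start += size
    return fragments


# Hypothetical five-sequence query file.
fasta = """>seq1
ACGT
>seq2
GGCC
TTAA
>seq3
ATAT
>seq4
CGCG
>seq5
TTTT
""".splitlines()

records = read_fasta_records(fasta)
fragments = split_into_tasks(records, num_tasks=2)
print([len(f) for f in fragments])  # prints [3, 2]
```

Splitting on record boundaries keeps every sequence intact inside one fragment, which is what allows each fragment to be passed to an unmodified BLAST run.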

S2: Whether the current MPI thread is the head node is determined from the MPI thread serial number. If the MPI thread is the head node (MPI_RANK==1), it waits for communication requests from the other MPI threads; when a request arrives, it responds to the request, assigns the current task to the requesting thread, decrements the task serial number by 1, and continues assigning tasks until all tasks have been assigned, at which point the current MPI thread ends. If the current MPI thread is not the head node (that is, it is a compute node), it first requests a task serial number from the head node, reads the query sequence file fragment corresponding to that task serial number, and executes BLAST against the gene database to obtain a BLAST alignment result; the task serial number is then decremented by 1, and once BLAST has finished the thread requests a task serial number again and continues executing BLAST until the last task. The database and the BLAST intermediate results produced when BLAST is executed are all stored in the local storage of the compute node. A load-balancing algorithm is used in this step.
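
The request/response protocol of step S2 can be illustrated with a small simulation. The sketch below uses Python threads and queues in place of MPI messages, so it shows only the control flow, not an MPI implementation. The function `run_head_worker`, the `do_task` callback standing in for a BLAST run, and the use of task serial number 0 as a "no tasks left" reply are assumptions made for the example; the text does not say how compute nodes learn that the tasks are exhausted.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_head_worker(num_tasks: int, num_workers: int,
                    do_task: Callable[[int], str]) -> Dict[int, str]:
    """Simulate the head/compute-node protocol with threads instead of MPI."""
    requests: "queue.Queue[int]" = queue.Queue()  # compute node -> head
    replies: List["queue.Queue[int]"] = [queue.Queue() for _ in range(num_workers)]
    results: Dict[int, str] = {}
    lock = threading.Lock()

    def head() -> None:
        task_no = num_tasks  # counts down as tasks are handed out
        finished = 0
        while finished < num_workers:
            worker_id = requests.get()  # wait for a communication request
            if task_no > 0:
                replies[worker_id].put(task_no)  # assign the current task
                task_no -= 1
            else:
                replies[worker_id].put(0)  # assumed sentinel: no tasks left
                finished += 1

    def worker(worker_id: int) -> None:
        while True:
            requests.put(worker_id)  # request a task serial number
            task_no = replies[worker_id].get()
            if task_no == 0:
                break
            output = do_task(task_no)  # stand-in for running BLAST on a fragment
            with lock:
                results[task_no] = output

    threads = [threading.Thread(target=head)]
    threads += [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_head_worker(num_tasks=8, num_workers=3,
                          do_task=lambda n: f"result-of-task-{n}")
print(sorted(results))  # prints [1, 2, 3, 4, 5, 6, 7, 8]
```

Because a compute node asks for more work only after finishing its current task, faster nodes naturally end up processing more fragments than slower ones.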

S3: All BLAST alignment results are merged. Merging here means simple text concatenation: the alignment results of all nodes are result files in text format, so the final overall result is obtained simply by concatenating all the text files in processing order. For example, on a Linux system this can be done with a few commands such as cat.
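
The text merge in step S3 amounts to ordered concatenation. The sketch below shows one way to do it; the function name `merge_results`, the file names, and the file contents are hypothetical, and the example writes its own temporary files so that it is self-contained.

```python
import os
import tempfile
from typing import List


def merge_results(result_paths: List[str], merged_path: str) -> None:
    """Concatenate text result files in the given order, like `cat a b > out`."""
    with open(merged_path, "w", encoding="utf-8") as out:
        for path in result_paths:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    out.write(line)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for task_no in (1, 2, 3):
        # Hypothetical per-task result files.
        path = os.path.join(tmp, f"task_{task_no}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"hits for task {task_no}\n")
        paths.append(path)
    merged = os.path.join(tmp, "merged.txt")
    merge_results(paths, merged)
    with open(merged, encoding="utf-8") as f:
        merged_text = f.read()

print(merged_text, end="")
```

The caller decides the order of `result_paths`; passing the files in the original fragment order reproduces the order of a single-machine run over the unsplit query file.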

The method of the invention uses the following two techniques to improve program efficiency and to overcome the bottlenecks and shortcomings of other parallel versions:

1. For sequence alignment tasks with large data volumes and high concurrency, IO is always the biggest bottleneck of the whole system in a shared-storage architecture. For BLAST in particular, the large number of database accesses, the reading of query sequences, and the generation of intermediate results all consume storage bandwidth. Yet even a high-performance disk array, such as the HDS 3080 system used in the tests of the present invention, has a maximum access bandwidth of only about 2 GB/s (peak). With a degree of parallelism of 200 threads, and assuming an InfiniBand (IB) network with sufficient physical bandwidth, the average storage bandwidth per thread is only 10 MB/s. From the standpoint of single-machine storage this is very low; even an ordinary PC hard disk reaches an average access speed of 70-80 MB/s. Considering that consumer-grade hard disks are now extremely cheap, that a single disk can hold 2 TB, and that the public gene databases commonly used from NCBI occupy less than 200 GB, the present invention places the database and the BLAST intermediate results entirely in the local storage of the compute nodes, with each compute node keeping a complete copy. Overall, the aggregate storage bandwidth during program execution can then far exceed the bandwidth of existing shared storage. The distributed storage network topology of the invention is shown in Fig. 2 (in Fig. 2, the gene database and the BLAST intermediate results are stored in a distributed manner under the same directory structure in the local storage of each node). In addition, because this scheme suits an architecture of ordinary servers plus consumer-grade hard disks while keeping performance comparable to that of a high-performance computer cluster, it can greatly cut operating costs, reduce hardware investment, and lower the threshold for bioinformatics research.
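
The bandwidth argument above can be checked with a few lines of arithmetic. The shared-storage figures (about 2 GB/s peak, 200 threads) and the 70 MB/s local disk speed come from the text; treating 2 GB/s as 2000 MB/s, assuming one independent local disk per node, and the break-even node count derived from them are assumptions made for this illustration.

```python
import math

SHARED_PEAK_MB_S = 2000.0  # about 2 GB/s peak, taken as 2000 MB/s
THREADS = 200
LOCAL_DISK_MB_S = 70.0     # low end of the 70-80 MB/s range for a PC hard disk

# Shared storage: every thread competes for the same array.
per_thread = SHARED_PEAK_MB_S / THREADS
print(per_thread)  # prints 10.0


def aggregate_local_mb_s(nodes: int, per_disk_mb_s: float = LOCAL_DISK_MB_S) -> float:
    """Aggregate bandwidth when each node reads from its own local disk (assumed independent)."""
    return nodes * per_disk_mb_s


# Smallest node count whose combined local disks match the shared array's peak.
break_even_nodes = math.ceil(SHARED_PEAK_MB_S / LOCAL_DISK_MB_S)
print(break_even_nodes)                        # prints 29
print(aggregate_local_mb_s(break_even_nodes))  # prints 2030.0
```

Under these assumptions the aggregate local bandwidth grows linearly with the number of nodes, while the shared array's peak stays fixed regardless of how many threads read from it.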

2. So that the program can be applied in heterogeneous computing environments, such as the above architecture of ordinary servers plus consumer-grade hard disks, an adaptive job assignment algorithm is adopted. The task (the query sequences) is divided into blocks of finer granularity than the number of nodes, and load balancing is then used to raise the resource utilization of each node. For example, with 100 nodes running in parallel, the task is divided into 200 parts and 100 parts are run simultaneously at first; threads that finish first are then assigned further tasks. This solves the problem of inconsistent finishing times caused by assigning the whole task in a single pass.
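
The benefit of handing out fine-grained tasks on demand can be seen in a small deterministic simulation. The sketch below compares a fixed up-front assignment with first-idle-node dispatch; the node speeds, task costs, round-robin static policy, and function names are hypothetical values chosen for illustration and are not taken from the tests described in this document.

```python
import heapq
from typing import List


def static_makespan(task_costs: List[float], speeds: List[float]) -> float:
    """Each node receives a fixed round-robin share of the tasks up front."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(task_costs):
        node = i % len(speeds)
        finish[node] += cost / speeds[node]
    return max(finish)


def dynamic_makespan(task_costs: List[float], speeds: List[float]) -> float:
    """Each task goes to whichever node becomes idle first."""
    heap = [(0.0, node) for node in range(len(speeds))]  # (time node is free, node)
    heapq.heapify(heap)
    makespan = 0.0
    for cost in task_costs:
        free_at, node = heapq.heappop(heap)
        done = free_at + cost / speeds[node]
        makespan = max(makespan, done)
        heapq.heappush(heap, (done, node))
    return makespan


speeds = [1.0, 1.0, 2.0, 4.0]  # hypothetical heterogeneous nodes
tasks = [4.0] * 8              # 8 equal tasks, twice the node count

print(static_makespan(tasks, speeds))   # prints 8.0
print(dynamic_makespan(tasks, speeds))  # prints 4.0
```

In this toy setting the slow nodes each process one task under dynamic dispatch instead of two, and the faster nodes absorb the remainder, which halves the overall finishing time.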

The query sequence splitting, job scheduling and assignment, thread communication, and job result assembly parts of the method can be implemented in various programming languages (such as C), while the single-threaded sequence alignment itself is performed by calling BLAST directly. The program of the invention was tested on a distributed PC cluster of 10 nodes, each with 4 cores and 4 GB of memory, using the distributed storage scheme; the alignment was run against the NCBI NT database (version 2010-10-09), and the 100,000 test sample sequences were divided into 40 tasks. For comparison, a traditional parallel BLAST method was also tested in a high-performance computing environment, allocated 40 cores and 80 GB of memory in total and using a shared storage scheme based on HNAS 3080. The results show that completing the task in the high-performance environment took 893 machine-minutes, an average of 22.32 minutes per core, while completing it in the ordinary server environment took 871 machine-minutes, an average of 21.78 minutes per core. Thus, while achieving similar computing performance, the technical solution of the invention greatly reduces operating costs and offers high economic benefit.
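
The statement that the alignment step directly calls BLAST can be sketched as a thin wrapper that builds the command line for one fragment. This is an assumption-laden illustration: it targets the `blastn` program of NCBI BLAST+ with its `-query`, `-db`, `-out`, and `-evalue` options, whereas the document does not say which BLAST program, version, or parameters were used; the directory layout and the function name `build_blast_command` are hypothetical.

```python
from typing import List


def build_blast_command(fragment_path: str, db_path: str, out_path: str,
                        extra_args: List[str]) -> List[str]:
    """Build a BLAST+ blastn command line for one query fragment (assumed tool)."""
    return ["blastn", "-query", fragment_path, "-db", db_path,
            "-out", out_path] + list(extra_args)


local_dir = "/data/blast"  # hypothetical local storage directory on a compute node
task_no = 7                # hypothetical task serial number

cmd = build_blast_command(
    fragment_path=f"{local_dir}/query/task_{task_no}.fasta",
    db_path=f"{local_dir}/db/nt",
    out_path=f"{local_dir}/out/task_{task_no}.txt",
    extra_args=["-evalue", "1e-5"],  # example of a user parameter passed through
)
print(" ".join(cmd))

# On a node where BLAST+ is installed, the command could be run with:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the database, fragments, and outputs under the same local directory on every node matches the layout described for Fig. 2, and passing `extra_args` through unchanged mirrors the way user parameters are handed on to BLAST in step S1.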

The above is only one embodiment of the present invention. It should be noted that those skilled in the art can make improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A distributed gene sequence alignment method based on BLAST, characterized in that it comprises the following steps:

S1: the program parses the user parameters, determines the number of MPI threads, reads the query sequence file, and splits the query sequence file according to the number of tasks to obtain query sequence file fragments; each MPI thread then reads its own MPI thread serial number;

S2: whether the current MPI thread is the head node is determined from the MPI thread serial number. If the current MPI thread is the head node, it waits for communication requests from the other MPI threads; when a request arrives, it responds to the request, assigns the current task to the requesting thread, decrements the task serial number by 1, and continues assigning tasks until all tasks have been assigned, at which point the current MPI thread ends. If the current MPI thread is not the head node, it first requests a task serial number from the head node, reads the query sequence file fragment corresponding to that task serial number, and executes BLAST against the database to obtain a BLAST alignment result; the task serial number is then decremented by 1, and once BLAST has finished the thread requests a task serial number again and continues executing BLAST until the last task;

S3: all BLAST alignment results are merged.

2. the method for claim 1 is characterized in that, when cutting apart the search sequence file according to the task number among the step S1, is with respect to the more piece of node number with the search sequence file division.

3. the method for claim 1; It is characterized in that, among the step S2, if the MPI thread is not a head node; Then be computing node, resulting Blast intermediate result all is stored in the local storage of said computing node when said database and execution BLAST.

4. the method for claim 1 is characterized in that, said task number is more than or equal to the MPI number of threads.

5. the method for claim 1 is characterized in that, said search sequence file is the FASTA form.

6. like each described method in the claim 1～5, it is characterized in that said database is a gene database.