CN107015868B - Distributed parallel construction method of universal suffix tree - Google Patents


Publication number
CN107015868B
CN107015868B (application CN201710232797.1A)
Authority
CN
China
Prior art keywords
suffix
construction
tree
suffixes
subtree
Prior art date
Legal status
Active
Application number
CN201710232797.1A
Other languages
Chinese (zh)
Other versions
CN107015868A (en)
Inventor
顾荣
黄宜华
郭晨
朱光辉
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority application: CN201710232797.1A
Publication of CN107015868A
Application granted
Publication of CN107015868B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G06F 9/5088: Techniques for rebalancing the load in a distributed system involving task migration

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed parallel construction method for a universal suffix tree, comprising the following steps: first, the input sequences are concatenated and distributed evenly across the compute nodes; second, subsequence frequencies are counted in parallel and all subtree construction tasks are determined; third, the subtree construction tasks are assigned to the compute nodes as evenly as possible according to their size; fourth, all subtrees are constructed in batched rounds. Each round of batch construction comprises three steps: first, the suffixes required by the round's construction tasks are located by a parallel scan and sorted locally, and the sorted results are gathered at the compute node responsible for each construction task; second, the results are multi-way merged into a globally ordered suffix ordering; third, the corresponding suffix subtree is generated from this ordering. The invention constructs universal suffix trees efficiently in parallel and overcomes the shortcomings of traditional construction methods, which depend excessively on I/O or main-memory capacity, lack generality, and struggle with large-scale input.

Description

Distributed parallel construction method of universal suffix tree
Technical Field
The invention relates to the technical field of sequence processing and parallel computing, and in particular to a distributed parallel construction method for a universal suffix tree.
Background
Sequences are a common form of data organization with wide application in text processing, time-series analysis, biotechnology, and other fields. The universal suffix tree is a powerful data structure for sequence processing that effectively solves many common sequence-analysis problems, such as matching, searching, and frequent-pattern mining. Its construction process, however, is intricate, and no prior work constructs a universal suffix tree directly: research has instead focused on constructing its special case, the suffix tree, which is then converted into a universal suffix tree by a further transformation. Single-machine suffix tree construction methods, represented by Ukkonen's algorithm, can be quite efficient in theory, but in practice they place strict demands on a machine's storage and computing performance and cannot construct large-scale suffix trees or universal suffix trees. In the big-data era, as sequence data grows in scale, single-machine computation hits a hard upper limit, and existing methods struggle to construct large universal suffix trees efficiently. To meet the big-data challenge, suffix tree construction methods based on supercomputers have been proposed. Compared with a supercomputer, however, a general-purpose cluster is cheap to build and easy to use and maintain; with the emergence and rapid development of data-parallel computing frameworks represented by Apache Spark and distributed storage systems represented by HDFS, general-purpose clusters offer good fault tolerance and conveniently scalable computing and storage capacity.
General-purpose clusters have therefore become one of the mainstream platforms for many large-scale problems. However, no existing method directly addresses the problem of building a universal suffix tree on such a platform. Designing an efficient construction method that solves the large-scale universal suffix tree problem on a general-purpose cluster computing platform thus poses a parallel-algorithm design challenge.
In the related art, the best-known single-machine suffix tree construction method is Ukkonen's algorithm. It reads the sequence elements in order and, with each element read, extends the implicit suffix tree built so far according to fixed rules. Its advantage is that a single scan of the sequence input suffices to construct the corresponding suffix tree. However, it requires both the sequence input and the suffix tree to reside entirely in main memory, and single-machine main memory is often limited, so this method cannot construct large-scale suffix trees or universal suffix trees.
The best currently recognized suffix tree construction method is ERa, which is designed around frequent I/O: using a dynamic-range technique, it repeatedly fetches a different range of each suffix and extends the suffix tree from that information. The method has two drawbacks. First, it depends heavily on I/O: if the main-memory block used to temporarily store sequence elements is too small, the method becomes very inefficient. Second, it does not solve the problem of sharing intermediate results during parallel computation.
Disclosure of Invention
Purpose of the invention: in view of the problems and deficiencies in the prior art, the invention aims to provide a method for constructing a universal suffix tree in parallel on a general-purpose cluster, addressing the problems of existing approaches: costly and hard-to-port platforms, insufficient generality (the universal suffix tree cannot be constructed directly), poor scalability, and low construction efficiency on large-scale sequence data.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution. The distributed parallel construction method of the universal suffix tree comprises the following steps:
(1) dividing the input according to the number of compute nodes, with each compute node responsible for processing one contiguous portion of the input;
(2) determining a frequency threshold D for subsequences according to the maximum allowed size of a single suffix subtree, scanning the original input in parallel, counting the frequency of each subsequence in the original input, and selecting as construction tasks all subtrees whose root subsequence frequency does not exceed D;
(3) integrating all subtree construction tasks selected in step (2) into several construction task groups according to the capacity of the compute nodes, so that the load of each group is similar, and distributing the groups to different compute nodes so that each compute node receives the same number of task groups;
(4) according to the assignment of step (3), executing steps (5) to (7) over multiple rounds, each round completing one construction task group on every compute node, until all suffix subtrees have been constructed in batches;
(5) for a given subtree construction task, every compute node locates the positions of the required suffixes within its portion of the input, sorts those suffixes, and produces an ordered suffix ordering over that portion;
(6) every compute node sends the sorted results of step (5) to the compute node that actually performs the construction; after receiving all results, that node produces a globally ordered suffix ordering by multi-way merging combined with unified I/O;
(7) constructing the corresponding suffix subtree from the globally ordered suffix ordering of step (6), using an auxiliary stack.
Further, in step (1), the original input is combined into one file, the start and end positions of each sequence are recorded, the file is divided evenly, and each compute node is made responsible for one contiguous file block.
Further, in step (2), each node uses a trie to count subsequence frequencies over its portion of the input.
Further, in step (3), an integration scheme combining the bin-packing problem and the multiway number partitioning problem is adopted, so that the load of each subtree construction task group is similar.
Further, in step (5), each compute node obtains the subsequences corresponding to the roots of all subtrees to be constructed in the current round, scans its portion of the input once with a multi-pattern matching method, locates the positions of all suffixes of each root within that portion, and sorts them separately.
Further, in step (6), a loser tree is used to multi-way merge the sorted suffix results sent by the compute nodes, yielding a suffix ordering that is close to globally ordered; the remaining unordered fragments are then resolved with a dynamic-range unified-I/O method, finally producing a globally ordered suffix ordering.
Further, in step (7), the globally ordered suffix ordering produced in step (6) is traversed once with the aid of a stack, generating the corresponding suffix subtree.
Beneficial effects: the invention constructs a universal suffix tree efficiently on a general-purpose cluster. First, the invention decomposes the universal suffix tree construction problem into the above steps, which are highly independent of one another; every theoretically parallelizable step is designed in a data-parallel form, which is easy to implement on a data-parallel computing engine. Second, unlike conventional methods that must first build a suffix tree and then convert it, the invention solves the universal suffix tree construction problem directly, and thus enjoys the performance gains and scalability brought by data parallelism and distributed storage. Third, the invention does not depend on any specific data-parallel computing framework or distributed storage system, is easy to implement on any such system, and is therefore highly portable.
Drawings
FIG. 1 is a schematic flow diagram of the overall process of the present invention;
FIG. 2 is a diagram illustrating a sub-tree division stage according to the present invention;
FIG. 3 is a diagram of multi-way merging using a loser tree in the present invention;
FIG. 4 is a diagram illustrating the generation of suffix subtrees in the present invention using suffix ordering.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention; after reading this specification, equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.
The invention provides a fully parallelized and efficient universal suffix tree construction method: it solves the construction problem by dividing subtrees in parallel and building each subtree through lcp-range multi-way merging, together with an I/O optimization scheme and a load-balancing scheme designed for this setting. The method is further decomposed into mutually independent steps, and every theoretically parallelizable step is designed in a data-parallel form, so the method can conveniently handle ever larger inputs by virtue of the scalability of a general-purpose cluster, without being tied to a specific underlying compute engine or distributed storage system. The invention unifies the construction of suffix trees and universal suffix trees, thereby solving both construction problems at once.
As shown in FIG. 1, the complete process of the invention comprises six stages: input integration, subtree division, subtree construction task assignment, suffix location, multi-way merging with unified I/O, and subtree generation. The specific embodiments are described below.
The input integration stage corresponds to step (1) of the technical solution. The specific embodiment is as follows. First, the lengths of all input sequences and the number of compute nodes are obtained, and the sequence length each compute node is responsible for is computed by averaging. A sequence-length counter and an intermediate file are created; all inputs are opened in order, all input sequences are written into this file, and the counter is updated as they are written. Whenever the counter reaches the length one compute node is responsible for, the processing task for that file block is assigned to a compute node, and the correspondence between the file block and the original input sequences is recorded. When this stage finishes, it has produced an intermediate file that concatenates all inputs, a mapping from the contents of the intermediate file to the original sequence inputs, and a mapping from each compute node's file block to the original sequence inputs.
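A minimal single-process sketch of this stage follows. The function name `integrate_and_split` is ours, and the concatenation is held in memory purely for illustration; in a real deployment the intermediate file would live on a distributed store such as HDFS and the blocks would be handed to cluster tasks.

```python
def integrate_and_split(sequences, num_nodes):
    """Concatenate all input sequences into one 'intermediate file'
    and split it into near-equal contiguous blocks, one per compute
    node. Also return the (start, end) offsets of every original
    sequence, i.e. the mapping back to the original inputs."""
    boundaries, offset = [], 0
    for s in sequences:
        boundaries.append((offset, offset + len(s)))
        offset += len(s)
    concatenated = "".join(sequences)
    base, extra = divmod(len(concatenated), num_nodes)
    blocks, pos = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)  # spread the remainder
        blocks.append(concatenated[pos:pos + size])
        pos += size
    return blocks, boundaries
```

For the example input below (sequences S and M of the embodiment), the two blocks differ in length by at most one element.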
The subtree division stage corresponds to step (2) of the technical solution. The specific embodiment is as follows. First, a frequency threshold D is chosen according to the resources of the compute nodes, such that a node's resources suffice to construct any subtree whose root subsequence has frequency at most D. In addition, an initial statistical-window length a_0 and an increasing recurrence a_i = f(a_{i-1}) for the window length are given. In round i, each compute node receives the set P of subsequences of length a_{i-1} whose frequency still exceeds the threshold D. It then counts, over its own portion of the input, the subsequences of length a_i that begin with an element of P; when counting finishes, the frequencies are aggregated at the master node and filtered. If any subsequence frequency is still greater than D, round i+1 proceeds; otherwise the parallel counting stage ends. After the parallel counting stage, results that share a common prefix whose combined frequency still does not exceed D are merged as an optimization. This stage is then complete, as illustrated in FIG. 2. The execution flow of this stage is illustrated with a concrete example. Suppose the input is two DNA sequences S = GATTACATTGT and M = AATCCG; the frequency threshold is D = 3; two compute nodes participate in the counting; and a_0 = 1, a_i = a_{i-1} + 2. Round 0 counts subsequences of length 1, and the aggregated result is A = 5, C = 3, G = 3, T = 6. The subsequences A and T exceed the frequency threshold, so another round is required.
Round 1 therefore counts subsequences of length 3 beginning with A or T, and the aggregated result is ATT = 2, TTA = 1, TAC = 1, ACA = 1, TTG = 1, TGT = 1, AAT = 1, ATC = 1, TCC = 1. No result exceeds the frequency threshold, so the parallel counting stage ends. The merge optimization stage then looks for common prefixes in the result. ATT and ATC share the common prefix AT, and ATT + ATC = 3 ≤ D, so the two results can be merged into AT = 3. The remaining results are merged in the same way. The final result of the subtree division stage is {G = 3, C = 3, AT = 3, TT = 2, TA = 1, AC = 1, TG = 1, AA = 1, TC = 1}. Each subsequence in this set represents one subtree construction task.
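The counting rounds and the merge optimization can be sketched as follows. This is a hedged single-process illustration: a `collections.Counter` stands in for the per-node trie of step (2), and the names `split_subtrees` and `merge_common_prefixes` are ours. It reproduces the worked example above.

```python
from collections import Counter

def count_windows(seqs, length, prefixes=None):
    """Count substrings of the given length; if `prefixes` is given,
    count only substrings starting with one of the over-threshold
    survivors of the previous round."""
    c = Counter()
    for s in seqs:
        for i in range(len(s) - length + 1):
            sub = s[i:i + length]
            if prefixes is None or any(sub.startswith(p) for p in prefixes):
                c[sub] += 1
    return c

def split_subtrees(seqs, D, a0=1, step=2):
    """Grow the statistics window (here a_i = a_{i-1} + step) until
    every surviving subsequence occurs at most D times; each survivor
    becomes the root of one subtree construction task."""
    length, tasks = a0, {}
    counts = count_windows(seqs, length)
    while True:
        tasks.update({k: v for k, v in counts.items() if v <= D})
        over = [k for k, v in counts.items() if v > D]
        if not over:
            return tasks
        length += step
        counts = count_windows(seqs, length, prefixes=over)

def merge_common_prefixes(tasks, D):
    """Merge optimization: repeatedly shorten task keys by one element
    whenever all tasks sharing the shorter prefix still sum to at most D."""
    merged, changed = dict(tasks), True
    while changed:
        changed = False
        groups = {}
        for k in merged:
            if len(k) > 1:
                groups.setdefault(k[:-1], []).append(k)
        for prefix, keys in groups.items():
            total = sum(merged[k] for k in keys)
            if total <= D and prefix not in merged:
                for k in keys:
                    del merged[k]
                merged[prefix] = total
                changed = True
    return merged
```

Running it on S and M with D = 3 yields exactly the set {G = 3, C = 3, AT = 3, TT = 2, TA = 1, AC = 1, TG = 1, AA = 1, TC = 1} derived above.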
The subtree construction task assignment stage corresponds to step (3) of the technical solution. The specific embodiment is as follows. On the result of the subtree division stage, a bin-packing solver is run, where a "bin" is a space of capacity D that can hold subsequences whose frequencies sum to at most D. By construction, in the later subtree building phase the resources of each compute node suffice to build all subtrees in one bin at the same time; in other words, a bin is a subtree construction task group. The bin-packing strategy finds the minimum number of bins, num, needed to hold all tree roots. The number of bins is then increased to the nearest integer multiple k of the number of compute nodes, and a k-way number-partitioning solver is run once on the result of the subtree division stage so that the load of each bin becomes more uniform. This completes the subtree construction task assignment stage. The execution flow of this stage is illustrated by continuing the example above. Running the bin-packing solver yields six bins, B1 through B6. The minimum bin count 6 is already an integer multiple of the node count 2, so no expansion is needed. Under this assignment the subtree building process takes three rounds; however, because the load of B6 differs from that of the other bins, there must be some round in which the loads of the two compute nodes are unbalanced. The 6-way number-partitioning strategy is therefore run once more to balance the loads of the bins. The result after this run may be: B1 = {G}, B2 = {C}, B3 = {AT}, B4 = {TT, TA}, B5 = {AC, TG}, B6 = {TC, AA}.
If a simple modulo assignment policy is now applied, the even-numbered bins go to compute node 0 and the odd-numbered bins to compute node 1, and the loads of the two compute nodes are balanced in all three rounds of subtree building.
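The two-phase allocation can be sketched as follows. The patent does not fix particular solvers, so first-fit-decreasing stands in for the bin-packing step and a greedy longest-processing-time pass stands in for the number-partitioning step (both are standard heuristics); `pack_tasks` is our name.

```python
def pack_tasks(tasks, D, num_nodes):
    """Phase 1: first-fit-decreasing bin packing estimates the minimum
    number of capacity-D bins. The count is rounded up to a multiple
    of num_nodes. Phase 2: a greedy LPT pass redistributes the tasks
    into exactly that many bins to even out the loads."""
    items = sorted(tasks.items(), key=lambda kv: -kv[1])
    bins = []
    for key, freq in items:
        for b in bins:
            if sum(f for _, f in b) + freq <= D:
                b.append((key, freq))
                break
        else:
            bins.append([(key, freq)])
    num_bins = -(-len(bins) // num_nodes) * num_nodes  # ceil to multiple
    groups = [[] for _ in range(num_bins)]
    loads = [0] * num_bins
    for key, freq in items:  # LPT: heaviest item to the lightest bin
        i = loads.index(min(loads))
        groups[i].append(key)
        loads[i] += freq
    return groups, loads
```

On the example task set this yields six groups whose loads differ by at most one, matching the balance the number-partitioning pass is meant to achieve (the exact group contents may differ from the B1..B6 listed above, since several balanced partitions exist).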
Step (4) of the technical solution repeats the per-round batch subtree construction process, which comprises the suffix location stage, the multi-way merging and unified I/O stage, and the subtree generation stage described below.
The suffix location stage corresponds to step (5) of the technical solution. The specific embodiment is as follows. Each compute node obtains the subsequences corresponding to the roots of all subtrees to be constructed in the current round, locates all their suffixes within its portion of the input using a multi-pattern matching method, and sorts them with a sequence sorting method, generating an lcp-range between each pair of adjacent sequences. The lcp-range is a data structure designed by the invention to characterize the difference between two sequences; it can be read as a "first difference segment". Here lcp abbreviates Longest Common Prefix and marks the position of the first differing element of the two sequences; "range" means that the next range elements of both sequences, starting from that position, are cached in the lcp-range structure, where they play an important role in the subsequent multi-way merging stage. The execution flow of this stage is illustrated by continuing the example above, taking the first round of subtree construction: compute node 0 processes B2 and compute node 1 processes B1. First, both compute nodes obtain the subsequences {G, C} corresponding to all subtrees to be constructed this round. Suppose compute node 0 is responsible for sequence S and compute node 1 for sequence M. Running the multi-pattern matching method on compute node 0 locates the suffixes G1 = GATTACATTGT, G2 = GT, and C1 = CATTGT.
If the order A < C < G < T < $ (terminator) and range = 1 are specified, sorting the suffixes yields G1 < G2, and the lcp-range of G2 relative to G1 is <A, T, 2>; that is, the first differing elements of G1 and G2 are at position 2, where the element of G1 is A and the element of G2 is T. Similarly, on compute node 1 the suffixes G3 = G, C2 = CCG, and C3 = CG are located; sorting yields C2 < C3, and the lcp-range of C3 relative to C2 is <C, G, 2>.
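The lcp-range computation for range = 1 can be sketched as below, matching the worked example (under the stated order A < C < G < T < $ the terminator sorts last; the names `lcp_range` and `suffix_less` are ours, and distinct suffixes are assumed).

```python
def lcp_range(a, b, r=1):
    """Return the lcp-range of suffix b relative to suffix a as
    (x, y, p): p is the 1-based position of the first mismatch, and
    x, y are the next r elements of a and b from that position. A '$'
    terminator is appended so comparisons past a suffix end work."""
    pa, pb = a + "$", b + "$"
    p = 0
    while p < min(len(pa), len(pb)) and pa[p] == pb[p]:
        p += 1
    return pa[p:p + r], pb[p:p + r], p + 1

ORDER = {c: i for i, c in enumerate("ACGT$")}  # A < C < G < T < $

def suffix_less(a, b):
    """Order two suffixes using only their cached lcp-range elements."""
    x, y, _ = lcp_range(a, b)
    return ORDER[x[0]] < ORDER[y[0]]
```

This reproduces the values of the example: the lcp-range of G2 = GT relative to G1 = GATTACATTGT is <A, T, 2>, and that of C3 = CG relative to C2 = CCG is <C, G, 2>.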
The multi-way merging and unified I/O stage corresponds to step (6) of the technical solution. The specific embodiment is as follows. The compute node responsible for building a subtree receives the sorted suffix information for that subtree from all compute nodes and performs a multi-way merge using a loser tree. FIG. 3 shows an example of the invention using the loser-tree structure together with lcp-range for multi-way merging. A loser tree for k-way merging has 2k nodes; the root has exactly one child, the subtree below it is a complete binary tree, and each of the k leaf nodes corresponds to one input run. When the loser tree is built, the first element of each run enters the tree and moves from its leaf toward the root; wherever two data elements meet they are compared, the loser is stored at the internal node where they met and advances no further, and the winner continues upward. The element that reaches the root is the overall winner among all data currently being merged. Each time an overall winner is output, a loser-tree adjustment is triggered: the next element of the winner's run enters the tree, moves toward the root, and is compared with the elements stored at the nodes along the way; at each comparison the loser stays at the node and the winner advances, so the element that reaches the root is the next overall winner. This repeats until every element of every run has left the loser tree, completing the multi-way merge. The loser tree is a classical solution to the multi-way merging problem, but it has a drawback: the merge requires a large number of comparison operations, and these comparisons must execute strictly serially.
If each comparison is expensive, loser-tree multi-way merging therefore becomes very inefficient. Since the original sequence input may be too large to load into memory, comparing sequences directly would require accessing the storage system to fetch the original input so that every comparison has a definite result, which is very costly. To avoid this, the invention does not read the original sequences from storage for comparison; instead it uses the lcp-range as the comparison criterion, updating the winner's lcp-range after each comparison. If an lcp-range cannot determine the order of two suffixes, they are provisionally treated as equal, which leaves unordered intervals in the merged result. Note that the more elements are cached in an lcp-range, the stronger its comparisons are and the fewer suffix pairs it fails to order. Afterwards, each compute node collects the suffixes of all unordered intervals across all subtree construction tasks of the current round, accesses the file system in a unified way using the dynamic-range suffix sorting method of the ERa algorithm, and fetches the original data to sort them until all suffixes are ordered. The execution flow of this stage is illustrated by continuing the example above. Compute node 1 is responsible for building the subtree rooted at G and receives two runs: run 0 = <G1, G2> and run 1 = <G3>. A two-way loser tree is built, and the first elements of the two runs, G1 and G3, are compared. The result is G1 < G3, and the lcp-range of G3 relative to G1 is <A, $, 2>. Run 0 wins, and its next element G2 then triggers the loser-tree adjustment.
Since the lcp-range of G2 relative to G1 is <A, T, 2> and the lcp-range of G3 relative to G1 is <A, $, 2>, comparing G2 and G3 yields G2 < G3, and the lcp-range of G3 relative to G2 is <T, $, 2>. The suffixes of the subtree rooted at G are thereby fully sorted, and no access to the storage system is needed.
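The merge itself can be sketched as follows. For brevity a binary heap replaces the loser tree (both realize k-way merging; the loser tree of the patent merely needs fewer comparisons per output element), and suffixes are compared via precomputed keys rather than incrementally updated lcp-ranges, so this sketch also omits the unordered-interval fallback; `kway_merge` is our name.

```python
import heapq

def kway_merge(runs):
    """k-way merge of locally sorted suffix runs into one globally
    sorted list, under the order A < C < G < T < $ of the example."""
    order = {c: i for i, c in enumerate("ACGT$")}
    key = lambda s: [order[c] for c in s + "$"]
    heap = [(key(run[0]), i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        _, i, j = heapq.heappop(heap)      # pop the overall winner
        out.append(runs[i][j])
        if j + 1 < len(runs[i]):           # next element of that run enters
            heapq.heappush(heap, (key(runs[i][j + 1]), i, j + 1))
    return out
```

Merging the two runs of the example, <G1, G2> from node 0 and <G3> from node 1, yields the globally ordered result <G1, G2, G3>.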
The subtree generation stage corresponds to step (7) of the technical solution. The specific embodiment is as follows. The globally ordered suffix ordering produced in the previous stage is traversed once, and the corresponding suffix subtree is generated with the help of a stack and the lcp-range information. The execution flow of this stage is illustrated by continuing the example above, constructing the suffix subtree rooted at G. The previous stage produced the globally ordered suffix ordering <G1, G2, G3>. First an edge E1 = GATTACATTGT representing suffix G1 is generated and pushed onto the stack. The lcp-range of G2 relative to G1, <A, T, 2>, is read, and edges are popped until the total length of the edges remaining on the stack does not exceed the value 2 in the lcp-range. By the definition of lcp-range, the 2 here indicates that G1 and G2 share a common prefix of length 2 - 1. The last popped edge E1 is then split into the common prefix E11 = G and the remainder E12 = ATTACATTGT, and a new edge E3 = T is created at the split point. The new edge E3 and the upper half E11 of the split edge are pushed onto the stack. This process is illustrated in FIG. 4. Repeating the process for the following suffix completes the construction of the corresponding suffix subtree.
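The stack-driven generation can be sketched as below. Suffixes carry an explicit '$' terminator, the tree is represented as nested dicts from edge label to child node, and `build_subtree` is our name; lcp values are recomputed from the strings here rather than taken from stored lcp-ranges.

```python
def _lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def build_subtree(suffixes):
    """Build a suffix subtree from suffixes given in globally sorted
    order. The stack holds (depth, node, parent, edge) frames for the
    rightmost path; each new suffix pops frames deeper than its lcp
    with the previous suffix, splitting the last popped edge if the
    lcp boundary falls inside it (the E1 -> E11/E12 split of FIG. 4)."""
    root = {}
    stack = [(0, root, None, None)]
    prev = None
    for s in suffixes:
        l = _lcp(prev, s) if prev is not None else 0
        popped = None
        while stack[-1][0] > l:
            popped = stack.pop()
        depth, node, _, _ = stack[-1]
        if depth < l:
            # the lcp boundary falls inside the last popped edge: split it
            _, child, parent, edge = popped
            head, tail = edge[:l - depth], edge[l - depth:]
            del parent[edge]
            mid = {tail: child}
            parent[head] = mid
            stack.append((l, mid, parent, head))
            node = mid
        leaf = {}
        node[s[l:]] = leaf                  # hang the new suffix's remainder
        stack.append((len(s), leaf, node, s[l:]))
        prev = s
    return root
```

On the ordering <G1, G2, G3> of the example (with terminators), the first edge GATTACATTGT$ is split at G when G2 arrives, exactly as in the E11/E12 split above, and G3 then hangs a bare terminator edge under the same node.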
Based on existing open-source software, the invention has been implemented as a prototype system, LMdst (lcp multi-way merging distributed universal suffix tree). The underlying data store uses HDFS and the compute engine uses Apache Spark; these software components are not themselves part of the invention.
The prototype system was tested by constructing a universal suffix tree over human genome fragments. Table 1 compares, under identical hardware conditions, the performance of the present method against the best existing method, ERa, modified to construct a universal suffix tree. The table shows that as the input scale grows, the method's lead over ERa widens from 1.5x to 4.4x, demonstrating good data scalability and confirming the beneficial effects of the method.
Table 1: performance testing of universal suffix tree constructed from human genome fragments
[Table 1 appears only as an image in the original publication.]

Claims (7)

1. A distributed parallel construction method of a universal suffix tree comprises the following steps:
(1) dividing the input according to the number of the computing nodes, wherein each computing node is responsible for processing a part of continuous input;
(2) determining a frequency threshold D of the subsequences according to the maximum scale of a single suffix subtree, scanning the original input in parallel, counting the frequency of each subsequence in the original input, and screening all subtrees not exceeding the frequency threshold D to construct tasks;
(3) integrating all subtree construction tasks screened in the step (2) into a plurality of construction task groups according to the capacity of the computing nodes, so that the loads of each group of tasks are similar, and distributing the tasks to different computing nodes to ensure that the number of the task groups distributed to each computing node is the same;
(4) according to the distribution result of the step (3), executing the step (5) to the step (7) in multiple rounds, completing a construction task group on each computing node in each round, and completing batch construction of all suffix subtrees;
(5) for a certain subtree construction task, all the computing nodes are responsible for searching the positions of suffixes needed to be used on the corresponding partial inputs, sequencing the suffixes, and generating ordered suffix sequencing on the partial inputs;
(6) all the computing nodes send the sequencing results generated in the step (5) to the computing nodes which are specifically implemented and constructed, and after receiving all the results, the computing nodes generate globally ordered suffix sequencing in a multi-path merging and unified I/O mode;
(7) constructing the corresponding suffix subtree with the aid of an auxiliary stack, according to the globally ordered suffix ranking generated in step (6);
wherein in step (3), the minimum number of subtree construction rounds is first obtained by solving a bin-packing problem, after which a construction task grouping scheme with relatively balanced load is obtained by solving an integer partition problem.
2. The method of claim 1, wherein in step (1), all input sequences are first concatenated into a single file, and a mapping is maintained between the file block handled by each computing node and the corresponding original input sequences.
3. The method of claim 1, wherein in step (2), an initial value a_0 for the subsequence statistics window length and an increasing recurrence function a_i = f(a_{i-1}) for that window length are given; in the i-th round, each computing node counts the subsequences of length a_i in its portion of the input, and the statistics are aggregated at the master node for threshold filtering.
4. The method of claim 1, wherein in step (4), each round of subtree construction consists of the three data-parallel phases of steps (5), (6) and (7).
5. The method of claim 1, wherein in step (5), a multi-pattern matching method is used to locate, within each node's portion of the input, the suffixes required by all tree roots being constructed in the current round; the suffixes are then sorted separately, and the lcp-range between every two adjacent suffixes is generated.
6. The method of claim 1, wherein in step (6), the multiway merging operation is performed using a Patricia tree; during construction and rebuilding of the Patricia tree, suffixes are not compared directly, and the order of two suffixes is determined by comparing their lcp-ranges; if incomparable suffixes remain after the multiway merging, each computing node collects such suffixes across all subtree construction tasks on that node, accesses the storage system in a unified manner to fetch the original sequence elements, and compares them with a dynamic-range method until all suffixes are ordered.
7. The method of claim 1, wherein in step (7), the corresponding suffix subtree is constructed on the basis of the suffix sorting result of step (6), by traversing the sorting result together with the lcp-range information using an auxiliary stack.
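The stack-based subtree construction of step (7) can be illustrated on a single machine as follows. This is a minimal sketch of the standard sorted-suffixes-plus-lcp construction, not the patented distributed implementation; the names `Node` and `build_subtree` are hypothetical, and the lcp-range of the claims is simplified here to a plain lcp length between adjacent suffixes.

```python
class Node:
    def __init__(self, depth):
        self.depth = depth      # length of the path label from the root
        self.children = []      # child nodes, in lexicographic order
        self.suffix = None      # suffix start index (set on leaves only)

def build_subtree(text, suffixes, lcp):
    """suffixes: suffix start indices in lexicographic order;
       lcp[i]: longest-common-prefix length of suffixes[i-1] and suffixes[i]."""
    root = Node(0)
    stack = [root]              # auxiliary stack: nodes on the rightmost path
    for i, suf in enumerate(suffixes):
        depth = 0 if i == 0 else lcp[i]
        # pop nodes deeper than the common-prefix depth with the new suffix
        while stack[-1].depth > depth:
            stack.pop()
        if stack[-1].depth < depth:
            # split the rightmost edge: insert an internal node at lcp depth
            inner = Node(depth)
            inner.children.append(stack[-1].children.pop())
            stack[-1].children.append(inner)
            stack.append(inner)
        # attach the new suffix as a leaf on the rightmost path
        leaf = Node(len(text) - suf)
        leaf.suffix = suf
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root
```

Because each node is pushed and popped at most once, the construction is linear in the number of suffixes once the globally ordered ranking of step (6) is available, which is what makes the per-round tree building cheap relative to the sorting phases.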
CN201710232797.1A 2017-04-11 2017-04-11 Distributed parallel construction method of universal suffix tree Active CN107015868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710232797.1A CN107015868B (en) 2017-04-11 2017-04-11 Distributed parallel construction method of universal suffix tree


Publications (2)

Publication Number Publication Date
CN107015868A CN107015868A (en) 2017-08-04
CN107015868B true CN107015868B (en) 2020-05-01

Family

ID=59446506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710232797.1A Active CN107015868B (en) 2017-04-11 2017-04-11 Distributed parallel construction method of universal suffix tree

Country Status (1)

Country Link
CN (1) CN107015868B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362736B (en) * 2018-04-03 2024-06-18 北京京东尚科信息技术有限公司 Information pushing method, device, electronic equipment and computer readable medium
CN108595624A (en) * 2018-04-23 2018-09-28 南京大学 A kind of large-scale distributed functional dependence discovery method
CN109375989B (en) * 2018-09-10 2022-04-08 中山大学 Parallel suffix ordering method and system
CN111191103B (en) * 2019-12-30 2021-08-24 河南拓普计算机网络工程有限公司 Method, device and storage medium for identifying and analyzing enterprise subject information from internet
CN112015734B (en) * 2020-08-06 2021-05-07 华东师范大学 Block chain-oriented compact Merkle multi-value proof parallel generation and verification method
CN113128592B (en) * 2021-04-20 2022-10-18 重庆邮电大学 Medical instrument identification analysis method and system for isomerism and storage medium
CN113670609B (en) * 2021-07-21 2022-10-04 广州大学 Fault detection method, system, device and medium based on wolf optimization algorithm

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102819569A (en) * 2012-07-18 2012-12-12 中国科学院软件研究所 Matching method for data in distributed interactive simulation system
CN103678695A (en) * 2013-12-27 2014-03-26 中国科学院深圳先进技术研究院 Concurrent processing method and device
CN103810228A (en) * 2012-11-01 2014-05-21 辉达公司 System, method, and computer program product for parallel reconstruction of a sampled suffix array
US8914415B2 (en) * 2010-01-29 2014-12-16 International Business Machines Corporation Serial and parallel methods for I/O efficient suffix tree construction

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8515961B2 (en) * 2010-01-19 2013-08-20 Electronics And Telecommunications Research Institute Method and apparatus for indexing suffix tree in social network


Non-Patent Citations (2)

Title
Research on Chinese New Word Recognition Technology Based on Large-scale Corpora; Zhang Haijun; China Doctoral Dissertations Full-text Database; 2011-09-15 (No. 9); full text *
Research on Methods for Building a Knowledge-Service-Oriented Medical Literature Relevance Database; Yu Xitian; China Master's Theses Full-text Database; 2009-07-15; full text *

Also Published As

Publication number Publication date
CN107015868A (en) 2017-08-04

Similar Documents

Publication Publication Date Title
CN107015868B (en) Distributed parallel construction method of universal suffix tree
Guo et al. Gpu-accelerated subgraph enumeration on partitioned graphs
Davidson et al. Efficient parallel merge sort for fixed and variable length keys
Zhou et al. Balanced parallel fp-growth with mapreduce
EP2011035A1 (en) System based method for content-based partitioning and mining
Halim et al. A MapReduce-based maximum-flow algorithm for large small-world network graphs
Jain et al. An adaptive parallel algorithm for computing connected components
Ngu et al. B+-tree construction on massive data with Hadoop
CN103440246A (en) Intermediate result data sequencing method and system for MapReduce
Chang et al. A novel incremental data mining algorithm based on fp-growth for big data
Selvitopi et al. Distributed many-to-many protein sequence alignment using sparse matrices
Hendrix et al. A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures
JP4758429B2 (en) Shared memory multiprocessor system and information processing method thereof
Kolb et al. Iterative computation of connected graph components with MapReduce
CN102207935A (en) Method and system for establishing index
Durad et al. Performance analysis of parallel sorting algorithms using MPI
Ou et al. Parallel remapping algorithms for adaptive problems
Zeng et al. Htc: Hybrid vertex-parallel and edge-parallel triangle counting
Ediger et al. Computational graph analytics for massive streaming data
Zou et al. An efficient data structure for dynamic graph on GPUS
Abdolazimi et al. Connected components of big graphs in fixed mapreduce rounds
Gottesbüren Parallel and Flow-Based High Quality Hypergraph Partitioning
Gupta et al. Distributed Incremental Graph Analysis
Ma et al. Parallel exact inference on multicore using mapreduce
CN111309786A (en) Parallel frequent item set mining method based on MapReduce

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 210093 Nanjing, Gulou District, Jiangsu, No. 22 Hankou Road

Patentee after: NANJING University

Address before: 210093 No. 22, Hankou Road, Suzhou, Jiangsu

Patentee before: NANJING University
