CN107015868B - Distributed parallel construction method of universal suffix tree - Google Patents


Publication number
CN107015868B
CN107015868B (application CN201710232797.1A)
Authority
CN
China
Prior art keywords
suffix
construction
tree
suffixes
subtree
Prior art date
Legal status
Active
Application number
CN201710232797.1A
Other languages
Chinese (zh)
Other versions
CN107015868A (en)
Inventor
顾荣
黄宜华
郭晨
朱光辉
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority application: CN201710232797.1A
Publication of CN107015868A
Application granted
Publication of CN107015868B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G06F 9/5088: Techniques for rebalancing the load in a distributed system involving task migration

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed parallel construction method for a universal suffix tree, comprising the following steps: first, the input sequences are concatenated and distributed evenly across the compute nodes; second, subsequence frequencies are counted in parallel and all subtree construction tasks are determined; third, the subtree construction tasks are assigned to the compute nodes as evenly as possible according to their size; fourth, all subtrees are constructed in batched rounds. Each round of batch construction comprises three steps: first, the suffixes required by the round's construction tasks are located by a parallel scan and sorted locally, and the sorted results are gathered at the compute node responsible for each construction task; second, the results are multi-way merged into a globally ordered suffix ordering; third, the corresponding suffix subtree is generated from this ordering. The invention constructs universal suffix trees efficiently in parallel and overcomes the shortcomings of traditional construction methods, which depend excessively on I/O or main-memory capacity, lack generality, and struggle with large-scale input.

Description

Distributed parallel construction method of universal suffix tree
Technical Field
The invention relates to the technical field of sequence processing and parallel computing, and in particular to a distributed parallel construction method for a universal suffix tree.
Background
Sequences are a common form of data organization with wide application in text processing, time-series analysis, biotechnology, and other fields. The universal suffix tree is a powerful data structure for sequence processing that effectively solves many common sequence-analysis problems, such as matching, searching, and frequent-pattern mining. Its construction process, however, is intricate, and no prior work constructs a universal suffix tree directly: research has instead focused on constructing its special case, the suffix tree, which is then converted into a universal suffix tree by a further transformation. Single-machine suffix tree construction methods, represented by Ukkonen's algorithm, can be quite efficient in theory, but in practice they place strict demands on a machine's storage and computing performance and cannot construct large-scale suffix trees or universal suffix trees. In the big-data era, as sequence data grows in scale, single-machine computation hits a hard upper limit, and existing methods struggle to construct large universal suffix trees efficiently. To meet the big-data challenge, suffix tree construction methods based on supercomputers have been proposed. Compared with a supercomputer, however, a general-purpose cluster is cheap to build and easy to use and maintain; with the emergence and rapid development of data-parallel computing frameworks represented by Apache Spark and distributed storage systems represented by HDFS, general-purpose clusters offer good fault tolerance and conveniently scalable computing and storage capacity.
General-purpose clusters have therefore become one of the mainstream platforms for many large-scale problems. However, no existing method directly addresses the problem of building a universal suffix tree on such a platform. Designing an efficient construction method that solves the large-scale universal suffix tree problem on a general-purpose cluster computing platform thus poses a parallel-algorithm design challenge.
In the related art, the best-known single-machine suffix tree construction method is Ukkonen's algorithm. It reads the sequence elements in order and, with each element read, extends the implicit suffix tree built so far according to fixed rules. Its advantage is that a single scan of the sequence input suffices to construct the corresponding suffix tree. However, it requires both the sequence input and the suffix tree to reside entirely in main memory, and single-machine main memory is often limited, so this method cannot construct large-scale suffix trees or universal suffix trees.
The best currently recognized suffix tree construction method is ERa, which is designed around frequent I/O: using a dynamic-range technique, it repeatedly fetches a different range of each suffix and extends the suffix tree from that information. The method has two drawbacks. First, it depends heavily on I/O: if the main-memory block used to temporarily store sequence elements is too small, the method becomes very inefficient. Second, it does not solve the problem of sharing intermediate results during parallel computation.
Disclosure of Invention
Purpose of the invention: in view of the problems and deficiencies in the prior art, the invention aims to provide a method for constructing a universal suffix tree in parallel on a general-purpose cluster, addressing the problems of existing approaches: costly and hard-to-port platforms, insufficient generality (the universal suffix tree cannot be constructed directly), poor scalability, and low construction efficiency on large-scale sequence data.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution. The distributed parallel construction method of the universal suffix tree comprises the following steps:
(1) dividing the input according to the number of compute nodes, with each compute node responsible for processing one contiguous portion of the input;
(2) determining a frequency threshold D for subsequences according to the maximum allowed size of a single suffix subtree, scanning the original input in parallel, counting the frequency of each subsequence in the original input, and selecting as construction tasks all subtrees whose root subsequence frequency does not exceed D;
(3) integrating all subtree construction tasks selected in step (2) into several construction task groups according to the capacity of the compute nodes, so that the load of each group is similar, and distributing the groups to different compute nodes so that each compute node receives the same number of task groups;
(4) according to the assignment of step (3), executing steps (5) to (7) over multiple rounds, each round completing one construction task group on every compute node, until all suffix subtrees have been constructed in batches;
(5) for a given subtree construction task, every compute node locates the positions of the required suffixes within its portion of the input, sorts those suffixes, and produces an ordered suffix ordering over that portion;
(6) every compute node sends the sorted results of step (5) to the compute node that actually performs the construction; after receiving all results, that node produces a globally ordered suffix ordering by multi-way merging combined with unified I/O;
(7) constructing the corresponding suffix subtree from the globally ordered suffix ordering of step (6), using an auxiliary stack.
Further, in step (1), the original input is combined into one file, the start and end positions of each sequence are recorded, the file is divided evenly, and each compute node is made responsible for one contiguous file block.
Further, in step (2), each node uses a trie to count subsequence frequencies over its portion of the input.
Further, in step (3), an integration scheme combining the bin-packing problem and the multiway number partitioning problem is adopted, so that the load of each subtree construction task group is similar.
Further, in step (5), each compute node obtains the subsequences corresponding to the roots of all subtrees to be constructed in the current round, scans its portion of the input once with a multi-pattern matching method, locates the positions of all suffixes of each root within that portion, and sorts them separately.
Further, in step (6), a loser tree is used to multi-way merge the sorted suffix results sent by the compute nodes, yielding a suffix ordering that is close to globally ordered; the remaining unordered fragments are then resolved with a dynamic-range unified-I/O method, finally producing a globally ordered suffix ordering.
Further, in step (7), the globally ordered suffix ordering produced in step (6) is traversed once with the aid of a stack, generating the corresponding suffix subtree.
Beneficial effects: the invention constructs a universal suffix tree efficiently on a general-purpose cluster. First, the invention decomposes the universal suffix tree construction problem into the above steps, which are highly independent of one another; every theoretically parallelizable step is designed in a data-parallel form, which is easy to implement on a data-parallel computing engine. Second, unlike conventional methods that must first build a suffix tree and then convert it, the invention solves the universal suffix tree construction problem directly, and thus enjoys the performance gains and scalability brought by data parallelism and distributed storage. Third, the invention does not depend on any specific data-parallel computing framework or distributed storage system, is easy to implement on any such system, and is therefore highly portable.
Drawings
FIG. 1 is a schematic flow diagram of the overall process of the present invention;
FIG. 2 is a diagram illustrating a sub-tree division stage according to the present invention;
FIG. 3 is a diagram of multi-way merging using a loser tree in the present invention;
FIG. 4 is a diagram illustrating the generation of suffix subtrees in the present invention using suffix ordering.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention; after reading this specification, equivalent modifications made by those skilled in the art fall within the scope defined by the appended claims.
The invention provides a fully parallelized and efficient universal suffix tree construction method: it solves the construction problem by dividing subtrees in parallel and building each subtree through lcp-range multi-way merging, together with an I/O optimization scheme and a load-balancing scheme designed for this setting. The method is further decomposed into mutually independent steps, and every theoretically parallelizable step is designed in a data-parallel form, so the method can conveniently handle ever larger inputs by virtue of the scalability of a general-purpose cluster, without being tied to a specific underlying compute engine or distributed storage system. The invention unifies the construction of suffix trees and universal suffix trees, thereby solving both construction problems at once.
As shown in FIG. 1, the complete process of the invention comprises six stages: input integration, subtree division, subtree construction task assignment, suffix location, multi-way merging with unified I/O, and subtree generation. The specific embodiments are described below.
The input integration stage corresponds to step (1) of the technical solution. The specific embodiment is as follows. First, the lengths of all input sequences and the number of compute nodes are obtained, and the sequence length each compute node is responsible for is computed by averaging. A sequence-length counter and an intermediate file are created; all inputs are opened in order, all input sequences are written into this file, and the counter is updated as they are written. Whenever the counter reaches the length one compute node is responsible for, the processing task for that file block is assigned to a compute node, and the correspondence between the file block and the original input sequences is recorded. When this stage finishes, it has produced an intermediate file that concatenates all inputs, a mapping from the contents of the intermediate file to the original sequence inputs, and a mapping from each compute node's file block to the original sequence inputs.
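A minimal single-process sketch of this stage follows. The function name `integrate_and_split` is ours, and the concatenation is held in memory purely for illustration; in a real deployment the intermediate file would live on a distributed store such as HDFS and the blocks would be handed to cluster tasks.

```python
def integrate_and_split(sequences, num_nodes):
    """Concatenate all input sequences into one 'intermediate file'
    and split it into near-equal contiguous blocks, one per compute
    node. Also return the (start, end) offsets of every original
    sequence, i.e. the mapping back to the original inputs."""
    boundaries, offset = [], 0
    for s in sequences:
        boundaries.append((offset, offset + len(s)))
        offset += len(s)
    concatenated = "".join(sequences)
    base, extra = divmod(len(concatenated), num_nodes)
    blocks, pos = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)  # spread the remainder
        blocks.append(concatenated[pos:pos + size])
        pos += size
    return blocks, boundaries
```

For the example input below (sequences S and M of the embodiment), the two blocks differ in length by at most one element.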
The subtree division stage corresponds to step (2) of the technical solution. The specific embodiment is as follows. First, a frequency threshold D is chosen according to the resources of the compute nodes, such that a node's resources suffice to construct any subtree whose root subsequence has frequency at most D. In addition, an initial statistical-window length a_0 and an increasing recurrence a_i = f(a_{i-1}) for the window length are given. In round i, each compute node receives the set P of subsequences of length a_{i-1} whose frequency still exceeds the threshold D. It then counts, over its own portion of the input, the subsequences of length a_i that begin with an element of P; when counting finishes, the frequencies are aggregated at the master node and filtered. If any subsequence frequency is still greater than D, round i+1 proceeds; otherwise the parallel counting stage ends. After the parallel counting stage, results that share a common prefix whose combined frequency still does not exceed D are merged as an optimization. This stage is then complete, as illustrated in FIG. 2. The execution flow of this stage is illustrated with a concrete example. Suppose the input is two DNA sequences S = GATTACATTGT and M = AATCCG; the frequency threshold is D = 3; two compute nodes participate in the counting; and a_0 = 1, a_i = a_{i-1} + 2. Round 0 counts subsequences of length 1, and the aggregated result is A = 5, C = 3, G = 3, T = 6. The subsequences A and T exceed the frequency threshold, so another round is required.
Round 1 therefore counts subsequences of length 3 beginning with A or T, and the aggregated result is ATT = 2, TTA = 1, TAC = 1, ACA = 1, TTG = 1, TGT = 1, AAT = 1, ATC = 1, TCC = 1. No result exceeds the frequency threshold, so the parallel counting stage ends. The merge optimization stage then looks for common prefixes in the result. ATT and ATC share the common prefix AT, and ATT + ATC = 3 ≤ D, so the two results can be merged into AT = 3. The remaining results are merged in the same way. The final result of the subtree division stage is {G = 3, C = 3, AT = 3, TT = 2, TA = 1, AC = 1, TG = 1, AA = 1, TC = 1}. Each subsequence in this set represents one subtree construction task.
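The counting rounds and the merge optimization can be sketched as follows. This is a hedged single-process illustration: a `collections.Counter` stands in for the per-node trie of step (2), and the names `split_subtrees` and `merge_common_prefixes` are ours. It reproduces the worked example above.

```python
from collections import Counter

def count_windows(seqs, length, prefixes=None):
    """Count substrings of the given length; if `prefixes` is given,
    count only substrings starting with one of the over-threshold
    survivors of the previous round."""
    c = Counter()
    for s in seqs:
        for i in range(len(s) - length + 1):
            sub = s[i:i + length]
            if prefixes is None or any(sub.startswith(p) for p in prefixes):
                c[sub] += 1
    return c

def split_subtrees(seqs, D, a0=1, step=2):
    """Grow the statistics window (here a_i = a_{i-1} + step) until
    every surviving subsequence occurs at most D times; each survivor
    becomes the root of one subtree construction task."""
    length, tasks = a0, {}
    counts = count_windows(seqs, length)
    while True:
        tasks.update({k: v for k, v in counts.items() if v <= D})
        over = [k for k, v in counts.items() if v > D]
        if not over:
            return tasks
        length += step
        counts = count_windows(seqs, length, prefixes=over)

def merge_common_prefixes(tasks, D):
    """Merge optimization: repeatedly shorten task keys by one element
    whenever all tasks sharing the shorter prefix still sum to at most D."""
    merged, changed = dict(tasks), True
    while changed:
        changed = False
        groups = {}
        for k in merged:
            if len(k) > 1:
                groups.setdefault(k[:-1], []).append(k)
        for prefix, keys in groups.items():
            total = sum(merged[k] for k in keys)
            if total <= D and prefix not in merged:
                for k in keys:
                    del merged[k]
                merged[prefix] = total
                changed = True
    return merged
```

Running it on S and M with D = 3 yields exactly the set {G = 3, C = 3, AT = 3, TT = 2, TA = 1, AC = 1, TG = 1, AA = 1, TC = 1} derived above.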
The subtree construction task assignment stage corresponds to step (3) of the technical solution. The specific embodiment is as follows. On the result of the subtree division stage, a bin-packing solver is run, where a "bin" is a space of capacity D that can hold subsequences whose frequencies sum to at most D. By construction, in the later subtree building phase the resources of each compute node suffice to build all subtrees in one bin at the same time; in other words, a bin is a subtree construction task group. The bin-packing strategy finds the minimum number of bins, num, needed to hold all tree roots. The number of bins is then increased to the nearest integer multiple k of the number of compute nodes, and a k-way number-partitioning solver is run once on the result of the subtree division stage so that the load of each bin becomes more uniform. This completes the subtree construction task assignment stage. The execution flow of this stage is illustrated by continuing the example above. Running the bin-packing solver yields six bins, B1 through B6. The minimum bin count 6 is already an integer multiple of the node count 2, so no expansion is needed. Under this assignment the subtree building process takes three rounds; however, because the load of B6 differs from that of the other bins, there must be some round in which the loads of the two compute nodes are unbalanced. The 6-way number-partitioning strategy is therefore run once more to balance the loads of the bins. The result after this run may be: B1 = {G}, B2 = {C}, B3 = {AT}, B4 = {TT, TA}, B5 = {AC, TG}, B6 = {TC, AA}.
If a simple modulo assignment policy is now applied, the even-numbered bins go to compute node 0 and the odd-numbered bins to compute node 1, and the loads of the two compute nodes are balanced in all three rounds of subtree building.
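The two-phase allocation can be sketched as follows. The patent does not fix particular solvers, so first-fit-decreasing stands in for the bin-packing step and a greedy longest-processing-time pass stands in for the number-partitioning step (both are standard heuristics); `pack_tasks` is our name.

```python
def pack_tasks(tasks, D, num_nodes):
    """Phase 1: first-fit-decreasing bin packing estimates the minimum
    number of capacity-D bins. The count is rounded up to a multiple
    of num_nodes. Phase 2: a greedy LPT pass redistributes the tasks
    into exactly that many bins to even out the loads."""
    items = sorted(tasks.items(), key=lambda kv: -kv[1])
    bins = []
    for key, freq in items:
        for b in bins:
            if sum(f for _, f in b) + freq <= D:
                b.append((key, freq))
                break
        else:
            bins.append([(key, freq)])
    num_bins = -(-len(bins) // num_nodes) * num_nodes  # ceil to multiple
    groups = [[] for _ in range(num_bins)]
    loads = [0] * num_bins
    for key, freq in items:  # LPT: heaviest item to the lightest bin
        i = loads.index(min(loads))
        groups[i].append(key)
        loads[i] += freq
    return groups, loads
```

On the example task set this yields six groups whose loads differ by at most one, matching the balance the number-partitioning pass is meant to achieve (the exact group contents may differ from the B1..B6 listed above, since several balanced partitions exist).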
Step (4) of the technical solution repeats the per-round batch subtree construction process, which comprises the suffix location stage, the multi-way merging and unified I/O stage, and the subtree generation stage described below.
The suffix location stage corresponds to step (5) of the technical solution. The specific embodiment is as follows. Each compute node obtains the subsequences corresponding to the roots of all subtrees to be constructed in the current round, locates all their suffixes within its portion of the input using a multi-pattern matching method, and sorts them with a sequence sorting method, generating an lcp-range between each pair of adjacent sequences. The lcp-range is a data structure designed by the invention to characterize the difference between two sequences; it can be read as a "first difference segment". Here lcp abbreviates Longest Common Prefix and marks the position of the first differing element of the two sequences; "range" means that the next range elements of both sequences, starting from that position, are cached in the lcp-range structure, where they play an important role in the subsequent multi-way merging stage. The execution flow of this stage is illustrated by continuing the example above, taking the first round of subtree construction: compute node 0 processes B2 and compute node 1 processes B1. First, both compute nodes obtain the subsequences {G, C} corresponding to all subtrees to be constructed this round. Suppose compute node 0 is responsible for sequence S and compute node 1 for sequence M. Running the multi-pattern matching method on compute node 0 locates the suffixes G1 = GATTACATTGT, G2 = GT, and C1 = CATTGT.
If the order A < C < G < T < $ (terminator) and range = 1 are specified, sorting the suffixes yields G1 < G2, and the lcp-range of G2 relative to G1 is <A, T, 2>; that is, the first differing elements of G1 and G2 are at position 2, where the element of G1 is A and the element of G2 is T. Similarly, on compute node 1 the suffixes G3 = G, C2 = CCG, and C3 = CG are located; sorting yields C2 < C3, and the lcp-range of C3 relative to C2 is <C, G, 2>.
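The lcp-range computation for range = 1 can be sketched as below, matching the worked example (under the stated order A < C < G < T < $ the terminator sorts last; the names `lcp_range` and `suffix_less` are ours, and distinct suffixes are assumed).

```python
def lcp_range(a, b, r=1):
    """Return the lcp-range of suffix b relative to suffix a as
    (x, y, p): p is the 1-based position of the first mismatch, and
    x, y are the next r elements of a and b from that position. A '$'
    terminator is appended so comparisons past a suffix end work."""
    pa, pb = a + "$", b + "$"
    p = 0
    while p < min(len(pa), len(pb)) and pa[p] == pb[p]:
        p += 1
    return pa[p:p + r], pb[p:p + r], p + 1

ORDER = {c: i for i, c in enumerate("ACGT$")}  # A < C < G < T < $

def suffix_less(a, b):
    """Order two suffixes using only their cached lcp-range elements."""
    x, y, _ = lcp_range(a, b)
    return ORDER[x[0]] < ORDER[y[0]]
```

This reproduces the values of the example: the lcp-range of G2 = GT relative to G1 = GATTACATTGT is <A, T, 2>, and that of C3 = CG relative to C2 = CCG is <C, G, 2>.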
The multi-way merging and unified I/O stage corresponds to step (6) of the technical solution. The specific embodiment is as follows. The compute node responsible for building a subtree receives the sorted suffix information for that subtree from all compute nodes and performs a multi-way merge using a loser tree. FIG. 3 shows an example of the invention using the loser-tree structure together with lcp-range for multi-way merging. A loser tree for k-way merging has 2k nodes; the root has exactly one child, the subtree below it is a complete binary tree, and each of the k leaf nodes corresponds to one input run. When the loser tree is built, the first element of each run enters the tree and moves from its leaf toward the root; wherever two data elements meet they are compared, the loser is stored at the internal node where they met and advances no further, and the winner continues upward. The element that reaches the root is the overall winner among all data currently being merged. Each time an overall winner is output, a loser-tree adjustment is triggered: the next element of the winner's run enters the tree, moves toward the root, and is compared with the elements stored at the nodes along the way; at each comparison the loser stays at the node and the winner advances, so the element that reaches the root is the next overall winner. This repeats until every element of every run has left the loser tree, completing the multi-way merge. The loser tree is a classical solution to the multi-way merging problem, but it has a drawback: the merge requires a large number of comparison operations, and these comparisons must execute strictly serially.
If each comparison is expensive, loser-tree multi-way merging therefore becomes very inefficient. Since the original sequence input may be too large to load into memory, comparing sequences directly would require accessing the storage system to fetch the original input so that every comparison has a definite result, which is very costly. To avoid this, the invention does not read the original sequences from storage for comparison; instead it uses the lcp-range as the comparison criterion, updating the winner's lcp-range after each comparison. If an lcp-range cannot determine the order of two suffixes, they are provisionally treated as equal, which leaves unordered intervals in the merged result. Note that the more elements are cached in an lcp-range, the stronger its comparisons are and the fewer suffix pairs it fails to order. Afterwards, each compute node collects the suffixes of all unordered intervals across all subtree construction tasks of the current round, accesses the file system in a unified way using the dynamic-range suffix sorting method of the ERa algorithm, and fetches the original data to sort them until all suffixes are ordered. The execution flow of this stage is illustrated by continuing the example above. Compute node 1 is responsible for building the subtree rooted at G and receives two runs: run 0 = <G1, G2> and run 1 = <G3>. A two-way loser tree is built, and the first elements of the two runs, G1 and G3, are compared. The result is G1 < G3, and the lcp-range of G3 relative to G1 is <A, $, 2>. Run 0 wins, and its next element G2 then triggers the loser-tree adjustment.
Since the lcp-range of G2 relative to G1 is <A, T, 2> and the lcp-range of G3 relative to G1 is <A, $, 2>, comparing G2 and G3 yields G2 < G3, and the lcp-range of G3 relative to G2 is <T, $, 2>. The suffixes of the subtree rooted at G are thereby fully sorted, and no access to the storage system is needed.
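The merge itself can be sketched as follows. For brevity a binary heap replaces the loser tree (both realize k-way merging; the loser tree of the patent merely needs fewer comparisons per output element), and suffixes are compared via precomputed keys rather than incrementally updated lcp-ranges, so this sketch also omits the unordered-interval fallback; `kway_merge` is our name.

```python
import heapq

def kway_merge(runs):
    """k-way merge of locally sorted suffix runs into one globally
    sorted list, under the order A < C < G < T < $ of the example."""
    order = {c: i for i, c in enumerate("ACGT$")}
    key = lambda s: [order[c] for c in s + "$"]
    heap = [(key(run[0]), i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        _, i, j = heapq.heappop(heap)      # pop the overall winner
        out.append(runs[i][j])
        if j + 1 < len(runs[i]):           # next element of that run enters
            heapq.heappush(heap, (key(runs[i][j + 1]), i, j + 1))
    return out
```

Merging the two runs of the example, <G1, G2> from node 0 and <G3> from node 1, yields the globally ordered result <G1, G2, G3>.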
The subtree generation stage corresponds to step (7) of the technical solution. The specific embodiment is as follows. The globally ordered suffix ordering produced in the previous stage is traversed once, and the corresponding suffix subtree is generated with the help of a stack and the lcp-range information. The execution flow of this stage is illustrated by continuing the example above, constructing the suffix subtree rooted at G. The previous stage produced the globally ordered suffix ordering <G1, G2, G3>. First an edge E1 = GATTACATTGT representing suffix G1 is generated and pushed onto the stack. The lcp-range of G2 relative to G1, <A, T, 2>, is read, and edges are popped until the total length of the edges remaining on the stack does not exceed the value 2 in the lcp-range. By the definition of lcp-range, the 2 here indicates that G1 and G2 share a common prefix of length 2 - 1. The last popped edge E1 is then split into the common prefix E11 = G and the remainder E12 = ATTACATTGT, and a new edge E3 = T is created at the split point. The new edge E3 and the upper half E11 of the split edge are pushed onto the stack. This process is illustrated in FIG. 4. Repeating the process for the following suffix completes the construction of the corresponding suffix subtree.
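The stack-driven generation can be sketched as below. Suffixes carry an explicit '$' terminator, the tree is represented as nested dicts from edge label to child node, and `build_subtree` is our name; lcp values are recomputed from the strings here rather than taken from stored lcp-ranges.

```python
def _lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def build_subtree(suffixes):
    """Build a suffix subtree from suffixes given in globally sorted
    order. The stack holds (depth, node, parent, edge) frames for the
    rightmost path; each new suffix pops frames deeper than its lcp
    with the previous suffix, splitting the last popped edge if the
    lcp boundary falls inside it (the E1 -> E11/E12 split of FIG. 4)."""
    root = {}
    stack = [(0, root, None, None)]
    prev = None
    for s in suffixes:
        l = _lcp(prev, s) if prev is not None else 0
        popped = None
        while stack[-1][0] > l:
            popped = stack.pop()
        depth, node, _, _ = stack[-1]
        if depth < l:
            # the lcp boundary falls inside the last popped edge: split it
            _, child, parent, edge = popped
            head, tail = edge[:l - depth], edge[l - depth:]
            del parent[edge]
            mid = {tail: child}
            parent[head] = mid
            stack.append((l, mid, parent, head))
            node = mid
        leaf = {}
        node[s[l:]] = leaf                  # hang the new suffix's remainder
        stack.append((len(s), leaf, node, s[l:]))
        prev = s
    return root
```

On the ordering <G1, G2, G3> of the example (with terminators), the first edge GATTACATTGT$ is split at G when G2 arrives, exactly as in the E11/E12 split above, and G3 then hangs a bare terminator edge under the same node.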
Based on existing open-source software, the invention has been implemented as a prototype system, LMdst (lcp multi-way merging distributed universal suffix tree). The underlying data store uses HDFS and the compute engine uses Apache Spark; these software components are not themselves part of the invention.
The prototype system was tested by constructing a universal suffix tree over human genome fragments. Table 1 compares, under identical hardware conditions, the performance of the present method against the best existing method, ERa, modified to construct a universal suffix tree. The table shows that as the input scale grows, the method's lead over ERa widens from 1.5x to 4.4x, demonstrating good data scalability and confirming the beneficial effects of the method.
Table 1: performance testing of universal suffix tree constructed from human genome fragments
[Table 1 appears only as an image in the original publication.]

Claims (7)

1. A distributed parallel construction method of a universal suffix tree comprises the following steps:
(1) dividing the input according to the number of the computing nodes, wherein each computing node is responsible for processing a part of continuous input;
(2) determining a frequency threshold D of the subsequences according to the maximum scale of a single suffix subtree, scanning the original input in parallel, counting the frequency of each subsequence in the original input, and screening all subtrees not exceeding the frequency threshold D to construct tasks;
(3) integrating all subtree construction tasks screened in the step (2) into a plurality of construction task groups according to the capacity of the computing nodes, so that the loads of each group of tasks are similar, and distributing the tasks to different computing nodes to ensure that the number of the task groups distributed to each computing node is the same;
(4) according to the distribution result of the step (3), executing the step (5) to the step (7) in multiple rounds, completing a construction task group on each computing node in each round, and completing batch construction of all suffix subtrees;
(5) for a certain subtree construction task, all the computing nodes are responsible for searching the positions of suffixes needed to be used on the corresponding partial inputs, sequencing the suffixes, and generating ordered suffix sequencing on the partial inputs;
(6) all the computing nodes send the sequencing results generated in the step (5) to the computing nodes which are specifically implemented and constructed, and after receiving all the results, the computing nodes generate globally ordered suffix sequencing in a multi-path merging and unified I/O mode;
(7) constructing the corresponding suffix subtree with the aid of an auxiliary stack, according to the globally ordered suffix ranking generated in step (6);
wherein in step (3), the minimum number of subtree construction rounds is first obtained by solving a bin-packing problem, after which a construction task grouping scheme with relatively balanced load is obtained by solving an integer partition problem.
2. The method of claim 1, wherein in step (1), all input sequences are first concatenated into a single file, and a mapping is maintained between the file block handled by each computing node and the corresponding original input sequences.
3. The method of claim 1, wherein in step (2), an initial value a_0 for the subsequence statistics window length and an increasing recurrence function a_i = f(a_{i-1}) for that window length are given; in the i-th round, each computing node counts the subsequences of length a_i in its portion of the input, and the statistics are aggregated at the master node for threshold filtering.
4. The method of claim 1, wherein in step (4), each round of subtree construction consists of the three data-parallel phases of steps (5), (6) and (7).
5. The method of claim 1, wherein in step (5), a multi-pattern matching method is used to locate, within each node's portion of the input, the suffixes required by all tree roots being constructed in the current round; the suffixes are then sorted separately, and the lcp-range between every two adjacent suffixes is generated.
6. The method of claim 1, wherein in step (6), the multiway merging operation is performed using a Patricia tree; during construction and rebuilding of the Patricia tree, suffixes are not compared directly, and the order of two suffixes is determined by comparing their lcp-ranges; if incomparable suffixes remain after the multiway merging, each computing node collects such suffixes across all subtree construction tasks on that node, accesses the storage system in a unified manner to fetch the original sequence elements, and compares them with a dynamic-range method until all suffixes are ordered.
7. The method of claim 1, wherein in step (7), the corresponding suffix subtree is constructed on the basis of the suffix sorting result of step (6), by traversing the sorting result together with the lcp-range information using an auxiliary stack.
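The stack-based subtree construction of step (7) can be illustrated on a single machine as follows. This is a minimal sketch of the standard sorted-suffixes-plus-lcp construction, not the patented distributed implementation; the names `Node` and `build_subtree` are hypothetical, and the lcp-range of the claims is simplified here to a plain lcp length between adjacent suffixes.

```python
class Node:
    def __init__(self, depth):
        self.depth = depth      # length of the path label from the root
        self.children = []      # child nodes, in lexicographic order
        self.suffix = None      # suffix start index (set on leaves only)

def build_subtree(text, suffixes, lcp):
    """suffixes: suffix start indices in lexicographic order;
       lcp[i]: longest-common-prefix length of suffixes[i-1] and suffixes[i]."""
    root = Node(0)
    stack = [root]              # auxiliary stack: nodes on the rightmost path
    for i, suf in enumerate(suffixes):
        depth = 0 if i == 0 else lcp[i]
        # pop nodes deeper than the common-prefix depth with the new suffix
        while stack[-1].depth > depth:
            stack.pop()
        if stack[-1].depth < depth:
            # split the rightmost edge: insert an internal node at lcp depth
            inner = Node(depth)
            inner.children.append(stack[-1].children.pop())
            stack[-1].children.append(inner)
            stack.append(inner)
        # attach the new suffix as a leaf on the rightmost path
        leaf = Node(len(text) - suf)
        leaf.suffix = suf
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root
```

Because each node is pushed and popped at most once, the construction is linear in the number of suffixes once the globally ordered ranking of step (6) is available, which is what makes the per-round tree building cheap relative to the sorting phases.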
CN201710232797.1A 2017-04-11 2017-04-11 Distributed parallel construction method of universal suffix tree Active CN107015868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710232797.1A CN107015868B (en) 2017-04-11 2017-04-11 Distributed parallel construction method of universal suffix tree


Publications (2)

Publication Number Publication Date
CN107015868A CN107015868A (en) 2017-08-04
CN107015868B true CN107015868B (en) 2020-05-01

Family

ID=59446506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710232797.1A Active CN107015868B (en) 2017-04-11 2017-04-11 Distributed parallel construction method of universal suffix tree

Country Status (1)

Country Link
CN (1) CN107015868B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362736B (en) * 2018-04-03 2024-06-18 北京京东尚科信息技术有限公司 Information pushing method, device, electronic equipment and computer readable medium
CN108595624A (en) * 2018-04-23 2018-09-28 南京大学 A kind of large-scale distributed functional dependence discovery method
CN109375989B (en) * 2018-09-10 2022-04-08 中山大学 Parallel suffix ordering method and system
CN111191103B (en) * 2019-12-30 2021-08-24 河南拓普计算机网络工程有限公司 Method, device and storage medium for identifying and analyzing enterprise subject information from internet
CN112015734B (en) * 2020-08-06 2021-05-07 华东师范大学 Block chain-oriented compact Merkle multi-value proof parallel generation and verification method
CN113128592B (en) * 2021-04-20 2022-10-18 重庆邮电大学 Medical instrument identification analysis method and system for isomerism and storage medium
CN113670609B (en) * 2021-07-21 2022-10-04 广州大学 Fault detection method, system, device and medium based on wolf optimization algorithm

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102819569A (en) * 2012-07-18 2012-12-12 中国科学院软件研究所 Matching method for data in distributed interactive simulation system
CN103678695A (en) * 2013-12-27 2014-03-26 中国科学院深圳先进技术研究院 Concurrent processing method and device
CN103810228A (en) * 2012-11-01 2014-05-21 辉达公司 System, method, and computer program product for parallel reconstruction of a sampled suffix array
US8914415B2 (en) * 2010-01-29 2014-12-16 International Business Machines Corporation Serial and parallel methods for I/O efficient suffix tree construction

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8515961B2 (en) * 2010-01-19 2013-08-20 Electronics And Telecommunications Research Institute Method and apparatus for indexing suffix tree in social network


Non-Patent Citations (2)

Title
Research on Chinese New Word Recognition Technology Based on Large-scale Corpora; Zhang Haijun; China Doctoral Dissertations Full-text Database; 2011-09-15 (No. 9); full text *
Research on Methods for Building a Knowledge-Service-Oriented Medical Literature Relevance Database; Yu Xitian; China Master's Theses Full-text Database; 2009-07-15; full text *

Also Published As

Publication number Publication date
CN107015868A (en) 2017-08-04

Similar Documents

Publication Publication Date Title
CN107015868B (en) Distributed parallel construction method of universal suffix tree
Guo et al. Gpu-accelerated subgraph enumeration on partitioned graphs
Davidson et al. Efficient parallel merge sort for fixed and variable length keys
Zhou et al. Balanced parallel fp-growth with mapreduce
EP2011035A1 (en) System based method for content-based partitioning and mining
Halim et al. A MapReduce-based maximum-flow algorithm for large small-world network graphs
Jain et al. An adaptive parallel algorithm for computing connected components
Ngu et al. B+-tree construction on massive data with Hadoop
CN103440246A (en) Intermediate result data sequencing method and system for MapReduce
Chang et al. A novel incremental data mining algorithm based on fp-growth for big data
Selvitopi et al. Distributed many-to-many protein sequence alignment using sparse matrices
Hendrix et al. A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures
JP4758429B2 (en) Shared memory multiprocessor system and information processing method thereof
Kolb et al. Iterative computation of connected graph components with MapReduce
CN102207935A (en) Method and system for establishing index
Durad et al. Performance analysis of parallel sorting algorithms using MPI
Ou et al. Parallel remapping algorithms for adaptive problems
Zeng et al. Htc: Hybrid vertex-parallel and edge-parallel triangle counting
Ediger et al. Computational graph analytics for massive streaming data
Zou et al. An efficient data structure for dynamic graph on GPUS
Abdolazimi et al. Connected components of big graphs in fixed mapreduce rounds
Gottesbüren Parallel and Flow-Based High Quality Hypergraph Partitioning
Gupta et al. Distributed Incremental Graph Analysis
Ma et al. Parallel exact inference on multicore using mapreduce
CN111309786A (en) Parallel frequent item set mining method based on MapReduce

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 210093 Nanjing, Gulou District, Jiangsu, No. 22 Hankou Road

Patentee after: NANJING University

Address before: 210093 No. 22, Hankou Road, Suzhou, Jiangsu

Patentee before: NANJING University
