CN112735528A

CN112735528A - Gene sequence comparison method and system

Info

Publication number: CN112735528A
Application number: CN202110024450.4A
Authority: CN
Inventors: 郑芳; 赵良; 田芳; 倪福川; 汪毅; 姚雅鹃; 姚娟; 章程
Original assignee: Huazhong Agricultural University
Current assignee: Huazhong Agricultural University
Priority date: 2021-01-08
Filing date: 2021-01-08
Publication date: 2021-04-30

Abstract

The invention discloses a gene sequence comparison method and a system, wherein the method comprises the following steps: storing the reference genomic sequence and the query genomic sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework: dividing the reference genome sequence according to the row offset, and preprocessing to obtain a plurality of preprocessed reference data sets; establishing indexes for each preprocessed reference data set by adopting a suffix array algorithm, and combining all preprocessed reference data sets after indexes are established to obtain a reference sequence index file; performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and a reference sequence index file by adopting a seed expansion algorithm, and determining position information of each fragment in the reference sequence index file; and combining the position information of all the fragments in the reference sequence index file to obtain the gene sequence comparison result. The invention improves the speed and the precision of the calculation of the large-scale sequence alignment algorithm.

Description

Gene sequence comparison method and system

Technical Field

The invention relates to the technical field of gene comparison, in particular to a gene sequence comparison method and system based on a heterogeneous cloud platform.

Background

Since the first generation of genome sequencing technology, genome sequencing technology has undergone several technical iterations. The first generation of genomic sequencing technology was also known as Sanger sequencing. The method is characterized by long Sequencing sequence, the length of the Sequencing sequence can reach 1kbp (kilo bases pair), and high accuracy, but the method is gradually replaced by Next Generation genome Sequencing (NGS) due to the defects of high price and low flux. Various scholars design various algorithms of sequence comparison, sequence splicing and sequence classification aiming at the characteristics of the second-generation sequencing data. Most of these studies of sequence alignment algorithms are based on Seed-and-extended frameworks, such as suffix arrays, BWT, etc. With the widespread use of next-generation sequencing technologies, computational molecular biologists have acquired valuable research data. The effect of second generation sequencing data on specific genomic problems is still not ideal due to the short length of the second generation sequencing sequences. The third generation sequencing technology uses a single molecule sequencing technology for sequencing, has the advantage of long sequencing sequence, and the sequencing sequence can be as long as 10kbp on average. Because the sequence characteristics generated by the third generation sequencing technology are obviously different from the first and second generation sequencing data, and the third generation sequencing technology has the characteristic of long sequence read length. Therefore, the whole genome sequencing can find more genetic variation information, and the sequence alignment software for the third generation sequencing is foreign: BLASR, Minimap2, and the like. BLASR is the first sequence alignment method specifically designed for third generation sequencing. Respectively adopting FM-index and suffix arrays as index structures, searching the longest common prefix of the sequencing fragment on the reference genome as a seed through indexing, and then positioning the candidate region through a chain-chaining method. However, since the index is designed for exact alignment, the computational performance is low when searching for sequences with mismatches. Minimap2 uses seed indexing (seed), seed chaining (chain) and sequence alignment (align) strategies. Minimap2 is one of the fastest software for comparing third generation sequences at present, is the main targeting software in third generation sequencing data analysis, but the sequence comparison precision is not high.

For the third generation sequencing technology, Chinese scholars also develop some sequence alignment tools reaching the international advanced level, such as LAMSA and rHAT [20] developed by the institute of bioinformatics of Hadamard, etc., the rHAT algorithm has good performance on small-scale genomes, but the rHAT has slower speed on large-scale genes and has a great promotion space.

In summary, although researchers at home and abroad propose different algorithms for the third generation sequencing sequence comparison, the algorithm implementation methods and implementation details are different. However, the method has the problems that the precision of the comparison result is not high due to insufficient memory during index establishment because of large comparison scale, or the matching error is large because of adoption of non-precise matching because of long sequence comparison time, and the like. Therefore, how to design an efficient sequence comparison algorithm according to the data characteristics of third-generation sequencing, accelerate the sequence comparison process, and reduce the time consumption of the comparison process is a main problem facing the third-generation sequencing sequence comparison.

The third generation data has unique advantages in processing sequence comparison of repeated regions and complex regions, detecting structural variation, reconstructing transcripts and the like. The third generation sequencing technology has longer reads than the second generation sequencing technology, and the average length of each read can reach 10 kbp. Algorithms oriented to second-generation sequencing data cannot meet the requirements of third-generation sequencing data analysis. Furthermore, due to the large amount of data and high computational complexity, some applications have very long runtime, such as SNP detection and GWAS alignment, which may take days or even months to complete the processing of a data set. How to process and analyze a large amount of genome data obtained by sequencing and design a high-performance processing algorithm becomes an urgent problem to be solved. The sequence alignment algorithm facing three generations of data is continuously developed in recent years, but has a great space for improvement in both speed and accuracy.

Disclosure of Invention

The invention aims to provide a gene sequence comparison method and a gene sequence comparison system so as to improve the speed and the precision of large-scale sequence comparison algorithm calculation.

In order to achieve the above object, the present invention provides a gene sequence alignment method, comprising:

storing the reference genomic sequence and the query genomic sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework:

dividing the reference genome sequence according to the row offset, and preprocessing to obtain a plurality of preprocessed reference data sets;

establishing indexes for each preprocessed reference data set by adopting a suffix array algorithm, and combining all preprocessed reference data sets after indexes are established to obtain a reference sequence index file;

performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and the reference sequence index file by adopting a seed expansion algorithm, and determining the position information of each fragment in the reference sequence index file;

and combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.

Optionally, the dividing the reference genome sequence by the row offset, and performing preprocessing to obtain a plurality of preprocessed reference data sets, specifically includes:

assigning a unique label to each character string ri by using a zip WithIndex function according to the row offset, and cutting the reference genome into a plurality of fragment sequences according to the unique label;

sequencing the fragment sequences according to the positions of the fragment sequences in the reference genome by using a sortByKey function operator to obtain an RDD1 data set; wherein the position of the fragment sequence in the reference genome is a key in a sortByKey function operator;

in parallel, sequencing the fragment sequences according to the line positions of the fragment sequences in the reference genome by using an RDD.

Performing connection operation on the RDD1 data set and the RDD2 data set by adopting an RDD.join ByKey function operator to obtain a connection data set;

determining the relative position of each fragment in a reference genome in the connection data set, and sequencing each fragment in the connection data set by taking the relative position as values to obtain a pre-processing initial data set;

and triggering persistence by using an Action operator to store the preprocessing initial data set into a distributed storage system to obtain a preprocessing reference data set.

Optionally, the creating an index for each preprocessed reference data set by using a suffix array algorithm, and merging all preprocessed reference data sets after the index is created to obtain a reference sequence index file, which specifically includes:

dividing each segment sequence in the pre-processing reference data set into a plurality of seeds by preset offset, wherein the length of each seed is k;

dividing each seed into four types according to A, G, C, T, and respectively establishing index arrays;

and sorting each index array by using the value of < key, value > to obtain a reference sequence index file.

Optionally, the performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and the reference sequence index file by using a seed expansion algorithm to determine the position information of each fragment in the reference sequence index file specifically includes:

slicing each of the fragments in the query genomic sequence into query fragment seeds by a maximum allowed mismatch k;

matching positions in the reference sequence index file according to the head and tail letters of the query segment seeds to serve as candidate positions;

expanding the query segment seeds at the candidate positions, carrying out approximate matching on the adjacent sequences before and after the candidate positions and the expanded query segment seeds, judging whether the candidate positions are real matching positions, if so, determining that the query segments are matched with reference sequence segments, and enumerating the matched query segments and the reference sequence segments; the nearby sequence is a reference genome sequence determined according to the expansion length of the query segment seed; the extension length is less than or equal to a maximum allowed mismatch k.

Optionally, the expanding the query segment seeds at the candidate position, performing approximate matching between the sequences near the candidate position and the expanded query segment seeds, and determining whether the candidate position is a true matching position specifically includes:

constructing a scoring matrix H of (n +1) × (m + 1); the elements of the scoring matrix H are comparison results, n is the length of the query segment, and m is the length of the reference sequence segment;

calculating element values in the scoring matrix according to values of elements of a left adjacent point, an upper adjacent point and a left upper corner adjacent point of the element values; that is, after the computation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, the H (i, j) values can be computed; recording a path for obtaining H (i, j) while recording the value of the matrix element H (i, j), and establishing a path matrix T;

and performing parallel calculation by using the element values on the g-1 anti-diagonal line and the g-2 anti-diagonal line to obtain the element value on the g-1 anti-diagonal line.

The invention also provides a gene sequence alignment system, which comprises:

the storage unit is used for storing the reference genome sequence and the query genome sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework:

the preprocessing unit is used for dividing the reference genome sequence according to the row offset and preprocessing the reference genome sequence to obtain a plurality of preprocessed reference data sets;

the index establishing unit is used for establishing an index for each preprocessed reference data set by adopting a suffix array algorithm and combining all preprocessed reference data sets after the indexes are established to obtain a reference sequence index file;

an alignment unit, configured to perform CUDA fine-grained sequence alignment on each fragment in the query genome sequence and the reference sequence index file by using a seed expansion algorithm, and determine position information of each fragment in the reference sequence index file;

and the result output unit is used for combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.

Optionally, the preprocessing unit specifically includes:

the segmentation module is used for allocating a unique label to each character string ri by using a zipWithIndex function according to the row offset, and segmenting the reference genome into a plurality of fragment sequences according to the unique label;

the first sequencing module is used for sequencing the fragment sequences according to the positions of the fragment sequences in the reference genome by using a sortByKey function operator to obtain an RDD1 data set; wherein the position of the fragment sequence in the reference genome is a key in a sortByKey function operator;

a second sequencing module, configured to sequence, in parallel, the plurality of fragment sequences according to the line positions of the fragment sequences in the reference genome by using an RDD.

A connection module, configured to perform a connection operation on the RDD1 data set and the RDD2 data set by using an RDD.

A third sorting module, configured to determine, in the concatenated data set, a relative position of each of the fragments in a reference genome, and sort each of the fragments in the concatenated data set with the relative position as a value, so as to obtain a preprocessed initial data set;

and the persistence module is used for triggering persistence by using an Action operator to store the preprocessing initial data set into a distributed storage system to obtain a preprocessing reference data set.

Optionally, the cable building unit specifically includes:

a seed segmentation module, configured to segment each segment sequence in the pre-processing reference data set into a plurality of seeds with a preset offset, where the length of each seed is k;

the classification module is used for classifying each seed into four classes according to A, G, C, T and respectively establishing index arrays;

and the integral sorting module is used for sorting each index array by using < key, value > according to the value of the value to obtain a reference sequence index file.

Optionally, the comparison unit specifically includes:

a query sequence segmentation module for segmenting each segment in the query genome sequence into query segment seeds according to a maximum allowable mismatch k;

the candidate matching module is used for matching positions in the reference sequence index file according to the head and tail letters of the query segment seeds to serve as candidate positions;

a true matching module, configured to expand the query segment seeds at the candidate positions, perform approximate matching between the sequences in the vicinity of the candidate positions and the expanded query segment seeds, determine whether the candidate positions are true matching positions, determine that the query segments are matched with reference sequence segments if the candidate positions are true matching positions, and enumerate the matched query segments and reference sequence segments; the nearby sequence is a reference genome sequence determined according to the expansion length of the query segment seed; the extension length is less than or equal to a maximum allowed mismatch k.

Optionally, the true matching module specifically includes:

a score matrix construction sub-module for constructing a score matrix H of (n +1) × (m + 1); the elements of the scoring matrix H are comparison results, n is the length of the query segment, and m is the length of the reference sequence segment;

the element value calculation operator module is used for calculating element values in the score matrix according to values of elements of a left adjacent point, an upper adjacent point and a left upper adjacent point of the element values; that is, after the computation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, the H (i, j) values can be computed; recording a path for obtaining H (i, j) while recording the value of the matrix element H (i, j), and establishing a path matrix T;

and the diagonal element value operator module is used for executing parallel calculation by using the element values on the g-1 th anti-diagonal line and the g-2 th anti-diagonal line obtained by calculation to obtain the element value on the g-th anti-diagonal line.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects: on the basis of in-depth analysis of the application requirements of current biological information processing, the invention aims to improve the calculation speed and precision of a large-scale sequence comparison algorithm, and mainly solves the practical problems encountered in sequence comparison in large-scale genome sequencing from the perspective of a heterogeneous distributed cloud processing technology. The method mainly researches a distributed heterogeneous cloud platform processing method and an implementation technology based on a seed expansion comparison method widely applied at present and discussing sequence comparison problems, provides a sequence comparison calculation method suitable for a distributed heterogeneous cloud platform, fully excavates the parallelism of related algorithms, discusses a calculation strategy on the algorithm heterogeneous cloud platform, and forms a set of new high-performance calculation scheme.

The sequence comparison in the invention is calculated under the condition of fusing a GPU heterogeneous framework under a Spark distributed computing platform, and a multi-resource mixed fine-grained distributed parallel computing framework is constructed. And the parallelism of the platform and the algorithm is fully mined to improve the computing efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

FIG. 1 is a flowchart of a gene sequence alignment method provided in an embodiment of the present invention;

FIG. 2 is a diagram of a distributed computing framework for sequence alignment under a Spark heterogeneous cloud platform according to an embodiment of the present invention;

FIG. 3 is a flow chart of parallel processing of index generation Spark in an embodiment of the present invention;

FIG. 4 is a diagram illustrating a dynamic seed-and-extended algorithm process in accordance with an embodiment of the present invention;

FIG. 5 is a diagram of dependencies between computed data in an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

At present, distributed computing frameworks mainly comprise Hadoop and Spark, and the new Spark is an open source cluster computing system based on memory computing. The calculation speed of Spark in the memory is 100 times higher than that of Hadoop. In addition, as the development of the GPU and CUDA technologies mature, CUDA parallel computing is also widely used in various fields. Spark and CUDA have natural parallel processing capability, and how to integrate CPU and GPU computing resources on Spark platform also becomes a new research hotspot. Because the calculation characteristics of sequence alignment are suitable for parallel calculation, the research of the sequence alignment algorithm in Spark and GPU heterogeneous calculation platforms has feasibility of implementation.

Therefore, on the basis of deeply analyzing the application requirements of the current biological information processing, the invention aims to improve the calculation speed and precision of a large-scale sequence alignment algorithm, and mainly solves the practical problems encountered in sequence alignment in large-scale genome sequencing from the perspective of the heterogeneous distributed cloud processing technology. The method mainly researches a distributed heterogeneous cloud platform processing method and an implementation technology based on a seed expansion comparison method widely applied at present and discussing sequence comparison problems, provides a sequence comparison calculation method suitable for a distributed heterogeneous cloud platform, fully excavates the parallelism of related algorithms, discusses a calculation strategy on the algorithm heterogeneous cloud platform, and forms a set of new high-performance calculation scheme.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

As shown in FIG. 1, data preprocessing is firstly performed on a reference genome Fasta file, and then a reference genome sequence index is established; and then based on Reads fragments in the query sequence, performing reading Mapping sequence comparison by adopting a seed expansion algorithm, and finally outputting a comparison result. Specifically, the gene sequence alignment method provided in this embodiment includes:

storing the reference genomic sequence and the query genomic sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework, as shown in fig. 2, the Spark heterogeneous distributed computing platform framework performs the following steps:

step S1: dividing the reference genome sequence according to the row offset, and preprocessing to obtain a plurality of preprocessed reference data sets;

at this stage the Reference genome sequence Reference is divided into n successive segments (r) of length l (l ═ strenlen (ri))₁,r₂,…r_i…r_n) The method comprises the steps of storing a reference genome sequence in an HDFS (Hadoop distributed file system), reading the reference genome sequence into a Spark cloud computing framework by using a functional operator, cutting the reference genome into a plurality of different elastic Distributed Data Sets (RDDs) according to row offset, caching the RDDs in a memory, and dividing the shuffling and sorting of Key Value pairs into two processes of shuffling and sorting according to Value values and shuffling and sorting according to Key values.

The specific processing procedure of this step S1 is described in detail below:

Step S2: establishing indexes for each preprocessed reference data set by adopting a suffix array algorithm, and combining all preprocessed reference data sets after indexes are established to obtain a reference sequence index file;

most gene sequence alignment methods are heuristic algorithms, and the algorithms have no difference from an index structure no matter whether the algorithms are precise alignment algorithms or non-precise alignment algorithms. Therefore, the establishment of the index is an important step of a gene sequence comparison algorithm, and the index is established by adopting an algorithm based on a suffix array. The construction of the index of a larger genome sequence (such as a human genome sequence) requires hours of serial calculation, so the invention parallelizes the index establishment process and improves the calculation efficiency.

Storing the pre-processed Reference genome Reference _ preIndex _ file on the HDFS as input, with k as seed length, r_iDividing the seed into a plurality of different seed seeds by the offset 1, storing the seed seeds in a memory through RDD, dividing the seed into four types according to A, G, C, T in the stage of REDUCE, and respectively establishing index arrays as indexes_a,index_t,index_c,index_gIn combination with each other<key,value>Sorting by value and finally outputting the reference genome sequenceIndex_file。

Specifically, as shown in fig. 3, the step S2 may include:

Step S3: performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and the reference sequence index file by adopting a seed expansion algorithm, and determining the position information of each fragment in the reference sequence index file;

after the index of the reference genome is established, reads of the query genome can be aligned to the index file of the reference genome.

And (3) inputting data which are reads segments in a FASTQ format, distributing a large number of input reads sequences to different RDD memories, matching the input reads sequences with the reference genome index established in the first step, determining coordinate information of the input reads sequences in the reference genome, outputting the coordinate information as a comparison result, and performing a parallel computing process in Spark.

For a given reads sequence read₁＝q₁q₂…q_nAnd Ref ═ d₁d₂…d_mAnd (4) sequencing. The whole comparison process can adopt a dynamic seed-and-extended method for matching, and the seed extension algorithm is divided into 2 steps:

cutting each segment in the query genome sequence reads into a query segment seed according to a maximum allowable mismatching k; according to the matching position of the head letter and the tail letter (A, T, C or G) of the query fragment seed in the reference sequence index file, taking the matching position as a candidate position;

expanding a query segment seed at a candidate position, performing approximate matching on a sequence near the front and back of the candidate position and the expanded query segment seed, judging whether the candidate position is a real matching position, if so, determining that the query segment is matched with a reference sequence segment, and enumerating the matched query segment and the reference sequence segment; the nearby sequence is a reference genome sequence determined according to the expansion length of the query fragment seed; the extension length is less than or equal to a maximum allowed mismatch k. Specifically, the dynamic expansion algorithm starts from both ends of the seed, expands outwards along the reference genome until the required length is met or the mismatching exceeds a threshold, and finally enumerates the matched combinations, and the calculation process is shown in fig. 4 below.

The second step in the seed expansion algorithm is a step with high computational intensity and complexity, the matching process is carried out by adopting a dynamically planned Smith-Waterman algorithm, and the algorithm has better task and data parallelism.

(1) Task parallelism

When the reference genome sequence and a plurality of seed are compared, a plurality of tasks can be executed simultaneously due to the independence of the genome sequence and the seed data in the comparison at one time, and the tasks can be executed in parallel, namely the coarse-grained parallelism of the algorithm.

(2) Data parallelism

For a given two sequences seed r₁r₂…r_nAnd Ref ═ d₁d₂…d_mThe length of the matrix is n and m, respectively, and a (n +1) × (m +1) scoring matrix H is required to be constructed for storing the comparison results. The element values in the score matrix in the calculation process depend on the values of the elements of the left adjacent point, the upper adjacent point and the upper left adjacent point, namely the H (i, j) value can be calculated only after the calculation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, and the dependency relationship is shown in the following FIG. 5.

At some point, multiple elements in the matrix H may be computed simultaneously, which indicates some parallelism in the computation of the scoring matrix. The elements on the same anti-diagonal line have no data correlation, and the matrix units on the anti-diagonal line have data independence and can be calculated in parallel. That is, as long as the elements on the g-1 anti-diagonals and the g-2 anti-diagonals have been computed, the elements on the g-th anti-diagonals can be concurrently computed in parallel.

(3) Operational parallelism

The value of the matrix element H (i, j) is obtained by taking the maximum value of one or more of the three adjacent elements H (i-1, j-1), H (i, j-1) and H (i-1, j), so that the path for obtaining H (i, j) is recorded while the value of the matrix element H (i, j) is recorded, and therefore, in order to facilitate the backtracking of the final result, the calculation of the path matrix T can be executed in parallel with the score matrix H. There is no data correlation between elements on the same anti-diagonal. Therefore, a reasonable thread organization mode can be designed according to the GPU architecture, and CUDA fine-grained parallelism is realized, so that the whole computing energy efficiency is improved.

Step S4: and combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.

The present invention also provides a system corresponding to the above gene sequence alignment method, the system comprising:

Wherein, the preprocessing unit specifically comprises:

Wherein, the establishing cable unit specifically comprises:

Wherein, the comparison unit specifically comprises:

Wherein, the real matching module specifically comprises:

For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method of gene sequence alignment, comprising:

2. The method of claim 1, wherein the dividing the reference genome sequence by row offset and preprocessing to obtain a plurality of preprocessed reference data sets comprises:

3. The method of claim 1, wherein the indexing of each preprocessed reference data set by using a suffix array algorithm and combining all the preprocessed reference data sets with indexes built to obtain a reference sequence index file, specifically comprises:

4. The method according to claim 1, wherein the step of performing CUDA fine-grained sequence alignment on each fragment in the query genome sequence with the reference sequence index file by using a seed expansion algorithm to determine the position information of each fragment in the reference sequence index file specifically comprises:

5. The method of claim 4, wherein the expanding the query segment seeds at the candidate positions, approximately matching the nearby sequences before and after the candidate positions with the expanded query segment seeds, and determining whether the candidate positions are true matching positions specifically comprises:

6. A system for aligning gene sequences, the system comprising:

7. The gene sequence alignment system of claim 6, wherein the preprocessing unit comprises:

8. The gene sequence alignment system of claim 6, wherein the building a cable unit specifically comprises:

9. The system according to claim 6, wherein the alignment unit comprises:

10. The system according to claim 9, wherein the true match module comprises: