CN112735528A - Gene sequence comparison method and system - Google Patents

Gene sequence comparison method and system Download PDF

Info

Publication number
CN112735528A
CN112735528A CN202110024450.4A CN202110024450A CN112735528A CN 112735528 A CN112735528 A CN 112735528A CN 202110024450 A CN202110024450 A CN 202110024450A CN 112735528 A CN112735528 A CN 112735528A
Authority
CN
China
Prior art keywords
sequence
query
data set
fragment
index file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110024450.4A
Other languages
Chinese (zh)
Inventor
郑芳
赵良
田芳
倪福川
汪毅
姚雅鹃
姚娟
章程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong Agricultural University
Original Assignee
Huazhong Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong Agricultural University filed Critical Huazhong Agricultural University
Priority to CN202110024450.4A priority Critical patent/CN112735528A/en
Publication of CN112735528A publication Critical patent/CN112735528A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a gene sequence comparison method and a system, wherein the method comprises the following steps: storing the reference genomic sequence and the query genomic sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework: dividing the reference genome sequence according to the row offset, and preprocessing to obtain a plurality of preprocessed reference data sets; establishing indexes for each preprocessed reference data set by adopting a suffix array algorithm, and combining all preprocessed reference data sets after indexes are established to obtain a reference sequence index file; performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and a reference sequence index file by adopting a seed expansion algorithm, and determining position information of each fragment in the reference sequence index file; and combining the position information of all the fragments in the reference sequence index file to obtain the gene sequence comparison result. The invention improves the speed and the precision of the calculation of the large-scale sequence alignment algorithm.

Description

Gene sequence comparison method and system
Technical Field
The invention relates to the technical field of gene comparison, in particular to a gene sequence comparison method and system based on a heterogeneous cloud platform.
Background
Since the first generation of genome sequencing technology, genome sequencing technology has undergone several technical iterations. The first generation of genomic sequencing technology was also known as Sanger sequencing. The method is characterized by long Sequencing sequence, the length of the Sequencing sequence can reach 1kbp (kilo bases pair), and high accuracy, but the method is gradually replaced by Next Generation genome Sequencing (NGS) due to the defects of high price and low flux. Various scholars design various algorithms of sequence comparison, sequence splicing and sequence classification aiming at the characteristics of the second-generation sequencing data. Most of these studies of sequence alignment algorithms are based on Seed-and-extended frameworks, such as suffix arrays, BWT, etc. With the widespread use of next-generation sequencing technologies, computational molecular biologists have acquired valuable research data. The effect of second generation sequencing data on specific genomic problems is still not ideal due to the short length of the second generation sequencing sequences. The third generation sequencing technology uses a single molecule sequencing technology for sequencing, has the advantage of long sequencing sequence, and the sequencing sequence can be as long as 10kbp on average. Because the sequence characteristics generated by the third generation sequencing technology are obviously different from the first and second generation sequencing data, and the third generation sequencing technology has the characteristic of long sequence read length. Therefore, the whole genome sequencing can find more genetic variation information, and the sequence alignment software for the third generation sequencing is foreign: BLASR, Minimap2, and the like. BLASR is the first sequence alignment method specifically designed for third generation sequencing. Respectively adopting FM-index and suffix arrays as index structures, searching the longest common prefix of the sequencing fragment on the reference genome as a seed through indexing, and then positioning the candidate region through a chain-chaining method. However, since the index is designed for exact alignment, the computational performance is low when searching for sequences with mismatches. Minimap2 uses seed indexing (seed), seed chaining (chain) and sequence alignment (align) strategies. Minimap2 is one of the fastest software for comparing third generation sequences at present, is the main targeting software in third generation sequencing data analysis, but the sequence comparison precision is not high.
For the third generation sequencing technology, Chinese scholars also develop some sequence alignment tools reaching the international advanced level, such as LAMSA and rHAT [20] developed by the institute of bioinformatics of Hadamard, etc., the rHAT algorithm has good performance on small-scale genomes, but the rHAT has slower speed on large-scale genes and has a great promotion space.
In summary, although researchers at home and abroad propose different algorithms for the third generation sequencing sequence comparison, the algorithm implementation methods and implementation details are different. However, the method has the problems that the precision of the comparison result is not high due to insufficient memory during index establishment because of large comparison scale, or the matching error is large because of adoption of non-precise matching because of long sequence comparison time, and the like. Therefore, how to design an efficient sequence comparison algorithm according to the data characteristics of third-generation sequencing, accelerate the sequence comparison process, and reduce the time consumption of the comparison process is a main problem facing the third-generation sequencing sequence comparison.
The third generation data has unique advantages in processing sequence comparison of repeated regions and complex regions, detecting structural variation, reconstructing transcripts and the like. The third generation sequencing technology has longer reads than the second generation sequencing technology, and the average length of each read can reach 10 kbp. Algorithms oriented to second-generation sequencing data cannot meet the requirements of third-generation sequencing data analysis. Furthermore, due to the large amount of data and high computational complexity, some applications have very long runtime, such as SNP detection and GWAS alignment, which may take days or even months to complete the processing of a data set. How to process and analyze a large amount of genome data obtained by sequencing and design a high-performance processing algorithm becomes an urgent problem to be solved. The sequence alignment algorithm facing three generations of data is continuously developed in recent years, but has a great space for improvement in both speed and accuracy.
Disclosure of Invention
The invention aims to provide a gene sequence comparison method and a gene sequence comparison system so as to improve the speed and the precision of large-scale sequence comparison algorithm calculation.
In order to achieve the above object, the present invention provides a gene sequence alignment method, comprising:
storing the reference genomic sequence and the query genomic sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework:
dividing the reference genome sequence according to the row offset, and preprocessing to obtain a plurality of preprocessed reference data sets;
establishing indexes for each preprocessed reference data set by adopting a suffix array algorithm, and combining all preprocessed reference data sets after indexes are established to obtain a reference sequence index file;
performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and the reference sequence index file by adopting a seed expansion algorithm, and determining the position information of each fragment in the reference sequence index file;
and combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.
Optionally, the dividing the reference genome sequence by the row offset, and performing preprocessing to obtain a plurality of preprocessed reference data sets, specifically includes:
assigning a unique label to each character string ri by using a zip WithIndex function according to the row offset, and cutting the reference genome into a plurality of fragment sequences according to the unique label;
sequencing the fragment sequences according to the positions of the fragment sequences in the reference genome by using a sortByKey function operator to obtain an RDD1 data set; wherein the position of the fragment sequence in the reference genome is a key in a sortByKey function operator;
in parallel, sequencing the fragment sequences according to the line positions of the fragment sequences in the reference genome by using an RDD.
Performing connection operation on the RDD1 data set and the RDD2 data set by adopting an RDD.join ByKey function operator to obtain a connection data set;
determining the relative position of each fragment in a reference genome in the connection data set, and sequencing each fragment in the connection data set by taking the relative position as values to obtain a pre-processing initial data set;
and triggering persistence by using an Action operator to store the preprocessing initial data set into a distributed storage system to obtain a preprocessing reference data set.
Optionally, the creating an index for each preprocessed reference data set by using a suffix array algorithm, and merging all preprocessed reference data sets after the index is created to obtain a reference sequence index file, which specifically includes:
dividing each segment sequence in the pre-processing reference data set into a plurality of seeds by preset offset, wherein the length of each seed is k;
dividing each seed into four types according to A, G, C, T, and respectively establishing index arrays;
and sorting each index array by using the value of < key, value > to obtain a reference sequence index file.
Optionally, the performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and the reference sequence index file by using a seed expansion algorithm to determine the position information of each fragment in the reference sequence index file specifically includes:
slicing each of the fragments in the query genomic sequence into query fragment seeds by a maximum allowed mismatch k;
matching positions in the reference sequence index file according to the head and tail letters of the query segment seeds to serve as candidate positions;
expanding the query segment seeds at the candidate positions, carrying out approximate matching on the adjacent sequences before and after the candidate positions and the expanded query segment seeds, judging whether the candidate positions are real matching positions, if so, determining that the query segments are matched with reference sequence segments, and enumerating the matched query segments and the reference sequence segments; the nearby sequence is a reference genome sequence determined according to the expansion length of the query segment seed; the extension length is less than or equal to a maximum allowed mismatch k.
Optionally, the expanding the query segment seeds at the candidate position, performing approximate matching between the sequences near the candidate position and the expanded query segment seeds, and determining whether the candidate position is a true matching position specifically includes:
constructing a scoring matrix H of (n +1) × (m + 1); the elements of the scoring matrix H are comparison results, n is the length of the query segment, and m is the length of the reference sequence segment;
calculating element values in the scoring matrix according to values of elements of a left adjacent point, an upper adjacent point and a left upper corner adjacent point of the element values; that is, after the computation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, the H (i, j) values can be computed; recording a path for obtaining H (i, j) while recording the value of the matrix element H (i, j), and establishing a path matrix T;
and performing parallel calculation by using the element values on the g-1 anti-diagonal line and the g-2 anti-diagonal line to obtain the element value on the g-1 anti-diagonal line.
The invention also provides a gene sequence alignment system, which comprises:
the storage unit is used for storing the reference genome sequence and the query genome sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework:
the preprocessing unit is used for dividing the reference genome sequence according to the row offset and preprocessing the reference genome sequence to obtain a plurality of preprocessed reference data sets;
the index establishing unit is used for establishing an index for each preprocessed reference data set by adopting a suffix array algorithm and combining all preprocessed reference data sets after the indexes are established to obtain a reference sequence index file;
an alignment unit, configured to perform CUDA fine-grained sequence alignment on each fragment in the query genome sequence and the reference sequence index file by using a seed expansion algorithm, and determine position information of each fragment in the reference sequence index file;
and the result output unit is used for combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.
Optionally, the preprocessing unit specifically includes:
the segmentation module is used for allocating a unique label to each character string ri by using a zipWithIndex function according to the row offset, and segmenting the reference genome into a plurality of fragment sequences according to the unique label;
the first sequencing module is used for sequencing the fragment sequences according to the positions of the fragment sequences in the reference genome by using a sortByKey function operator to obtain an RDD1 data set; wherein the position of the fragment sequence in the reference genome is a key in a sortByKey function operator;
a second sequencing module, configured to sequence, in parallel, the plurality of fragment sequences according to the line positions of the fragment sequences in the reference genome by using an RDD.
A connection module, configured to perform a connection operation on the RDD1 data set and the RDD2 data set by using an RDD.
A third sorting module, configured to determine, in the concatenated data set, a relative position of each of the fragments in a reference genome, and sort each of the fragments in the concatenated data set with the relative position as a value, so as to obtain a preprocessed initial data set;
and the persistence module is used for triggering persistence by using an Action operator to store the preprocessing initial data set into a distributed storage system to obtain a preprocessing reference data set.
Optionally, the cable building unit specifically includes:
a seed segmentation module, configured to segment each segment sequence in the pre-processing reference data set into a plurality of seeds with a preset offset, where the length of each seed is k;
the classification module is used for classifying each seed into four classes according to A, G, C, T and respectively establishing index arrays;
and the integral sorting module is used for sorting each index array by using < key, value > according to the value of the value to obtain a reference sequence index file.
Optionally, the comparison unit specifically includes:
a query sequence segmentation module for segmenting each segment in the query genome sequence into query segment seeds according to a maximum allowable mismatch k;
the candidate matching module is used for matching positions in the reference sequence index file according to the head and tail letters of the query segment seeds to serve as candidate positions;
a true matching module, configured to expand the query segment seeds at the candidate positions, perform approximate matching between the sequences in the vicinity of the candidate positions and the expanded query segment seeds, determine whether the candidate positions are true matching positions, determine that the query segments are matched with reference sequence segments if the candidate positions are true matching positions, and enumerate the matched query segments and reference sequence segments; the nearby sequence is a reference genome sequence determined according to the expansion length of the query segment seed; the extension length is less than or equal to a maximum allowed mismatch k.
Optionally, the true matching module specifically includes:
a score matrix construction sub-module for constructing a score matrix H of (n +1) × (m + 1); the elements of the scoring matrix H are comparison results, n is the length of the query segment, and m is the length of the reference sequence segment;
the element value calculation operator module is used for calculating element values in the score matrix according to values of elements of a left adjacent point, an upper adjacent point and a left upper adjacent point of the element values; that is, after the computation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, the H (i, j) values can be computed; recording a path for obtaining H (i, j) while recording the value of the matrix element H (i, j), and establishing a path matrix T;
and the diagonal element value operator module is used for executing parallel calculation by using the element values on the g-1 th anti-diagonal line and the g-2 th anti-diagonal line obtained by calculation to obtain the element value on the g-th anti-diagonal line.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects: on the basis of in-depth analysis of the application requirements of current biological information processing, the invention aims to improve the calculation speed and precision of a large-scale sequence comparison algorithm, and mainly solves the practical problems encountered in sequence comparison in large-scale genome sequencing from the perspective of a heterogeneous distributed cloud processing technology. The method mainly researches a distributed heterogeneous cloud platform processing method and an implementation technology based on a seed expansion comparison method widely applied at present and discussing sequence comparison problems, provides a sequence comparison calculation method suitable for a distributed heterogeneous cloud platform, fully excavates the parallelism of related algorithms, discusses a calculation strategy on the algorithm heterogeneous cloud platform, and forms a set of new high-performance calculation scheme.
The sequence comparison in the invention is calculated under the condition of fusing a GPU heterogeneous framework under a Spark distributed computing platform, and a multi-resource mixed fine-grained distributed parallel computing framework is constructed. And the parallelism of the platform and the algorithm is fully mined to improve the computing efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a flowchart of a gene sequence alignment method provided in an embodiment of the present invention;
FIG. 2 is a diagram of a distributed computing framework for sequence alignment under a Spark heterogeneous cloud platform according to an embodiment of the present invention;
FIG. 3 is a flow chart of parallel processing of index generation Spark in an embodiment of the present invention;
FIG. 4 is a diagram illustrating a dynamic seed-and-extended algorithm process in accordance with an embodiment of the present invention;
FIG. 5 is a diagram of dependencies between computed data in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, distributed computing frameworks mainly comprise Hadoop and Spark, and the new Spark is an open source cluster computing system based on memory computing. The calculation speed of Spark in the memory is 100 times higher than that of Hadoop. In addition, as the development of the GPU and CUDA technologies mature, CUDA parallel computing is also widely used in various fields. Spark and CUDA have natural parallel processing capability, and how to integrate CPU and GPU computing resources on Spark platform also becomes a new research hotspot. Because the calculation characteristics of sequence alignment are suitable for parallel calculation, the research of the sequence alignment algorithm in Spark and GPU heterogeneous calculation platforms has feasibility of implementation.
Therefore, on the basis of deeply analyzing the application requirements of the current biological information processing, the invention aims to improve the calculation speed and precision of a large-scale sequence alignment algorithm, and mainly solves the practical problems encountered in sequence alignment in large-scale genome sequencing from the perspective of the heterogeneous distributed cloud processing technology. The method mainly researches a distributed heterogeneous cloud platform processing method and an implementation technology based on a seed expansion comparison method widely applied at present and discussing sequence comparison problems, provides a sequence comparison calculation method suitable for a distributed heterogeneous cloud platform, fully excavates the parallelism of related algorithms, discusses a calculation strategy on the algorithm heterogeneous cloud platform, and forms a set of new high-performance calculation scheme.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, data preprocessing is firstly performed on a reference genome Fasta file, and then a reference genome sequence index is established; and then based on Reads fragments in the query sequence, performing reading Mapping sequence comparison by adopting a seed expansion algorithm, and finally outputting a comparison result. Specifically, the gene sequence alignment method provided in this embodiment includes:
storing the reference genomic sequence and the query genomic sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework, as shown in fig. 2, the Spark heterogeneous distributed computing platform framework performs the following steps:
step S1: dividing the reference genome sequence according to the row offset, and preprocessing to obtain a plurality of preprocessed reference data sets;
at this stage the Reference genome sequence Reference is divided into n successive segments (r) of length l (l ═ strenlen (ri))1,r2,…ri…rn) The method comprises the steps of storing a reference genome sequence in an HDFS (Hadoop distributed file system), reading the reference genome sequence into a Spark cloud computing framework by using a functional operator, cutting the reference genome into a plurality of different elastic Distributed Data Sets (RDDs) according to row offset, caching the RDDs in a memory, and dividing the shuffling and sorting of Key Value pairs into two processes of shuffling and sorting according to Value values and shuffling and sorting according to Key values.
The specific processing procedure of this step S1 is described in detail below:
assigning a unique label to each character string ri by using a zip WithIndex function according to the row offset, and cutting the reference genome into a plurality of fragment sequences according to the unique label;
sequencing the fragment sequences according to the positions of the fragment sequences in the reference genome by using a sortByKey function operator to obtain an RDD1 data set; wherein the position of the fragment sequence in the reference genome is a key in a sortByKey function operator;
in parallel, sequencing the fragment sequences according to the line positions of the fragment sequences in the reference genome by using an RDD.
Performing connection operation on the RDD1 data set and the RDD2 data set by adopting an RDD.join ByKey function operator to obtain a connection data set;
determining the relative position of each fragment in a reference genome in the connection data set, and sequencing each fragment in the connection data set by taking the relative position as values to obtain a pre-processing initial data set;
and triggering persistence by using an Action operator to store the preprocessing initial data set into a distributed storage system to obtain a preprocessing reference data set.
Step S2: establishing indexes for each preprocessed reference data set by adopting a suffix array algorithm, and combining all preprocessed reference data sets after indexes are established to obtain a reference sequence index file;
most gene sequence alignment methods are heuristic algorithms, and the algorithms have no difference from an index structure no matter whether the algorithms are precise alignment algorithms or non-precise alignment algorithms. Therefore, the establishment of the index is an important step of a gene sequence comparison algorithm, and the index is established by adopting an algorithm based on a suffix array. The construction of the index of a larger genome sequence (such as a human genome sequence) requires hours of serial calculation, so the invention parallelizes the index establishment process and improves the calculation efficiency.
Storing the pre-processed Reference genome Reference _ preIndex _ file on the HDFS as input, with k as seed length, riDividing the seed into a plurality of different seed seeds by the offset 1, storing the seed seeds in a memory through RDD, dividing the seed into four types according to A, G, C, T in the stage of REDUCE, and respectively establishing index arrays as indexesa,indext,indexc,indexgIn combination with each other<key,value>Sorting by value and finally outputting the reference genome sequenceIndex_file。
Specifically, as shown in fig. 3, the step S2 may include:
dividing each segment sequence in the pre-processing reference data set into a plurality of seeds by preset offset, wherein the length of each seed is k;
dividing each seed into four types according to A, G, C, T, and respectively establishing index arrays;
and sorting each index array by using the value of < key, value > to obtain a reference sequence index file.
Step S3: performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and the reference sequence index file by adopting a seed expansion algorithm, and determining the position information of each fragment in the reference sequence index file;
after the index of the reference genome is established, reads of the query genome can be aligned to the index file of the reference genome.
And (3) inputting data which are reads segments in a FASTQ format, distributing a large number of input reads sequences to different RDD memories, matching the input reads sequences with the reference genome index established in the first step, determining coordinate information of the input reads sequences in the reference genome, outputting the coordinate information as a comparison result, and performing a parallel computing process in Spark.
For a given reads sequence read1=q1q2…qnAnd Ref ═ d1d2…dmAnd (4) sequencing. The whole comparison process can adopt a dynamic seed-and-extended method for matching, and the seed extension algorithm is divided into 2 steps:
cutting each segment in the query genome sequence reads into a query segment seed according to a maximum allowable mismatching k; according to the matching position of the head letter and the tail letter (A, T, C or G) of the query fragment seed in the reference sequence index file, taking the matching position as a candidate position;
expanding a query segment seed at a candidate position, performing approximate matching on a sequence near the front and back of the candidate position and the expanded query segment seed, judging whether the candidate position is a real matching position, if so, determining that the query segment is matched with a reference sequence segment, and enumerating the matched query segment and the reference sequence segment; the nearby sequence is a reference genome sequence determined according to the expansion length of the query fragment seed; the extension length is less than or equal to a maximum allowed mismatch k. Specifically, the dynamic expansion algorithm starts from both ends of the seed, expands outwards along the reference genome until the required length is met or the mismatching exceeds a threshold, and finally enumerates the matched combinations, and the calculation process is shown in fig. 4 below.
The second step in the seed expansion algorithm is a step with high computational intensity and complexity, the matching process is carried out by adopting a dynamically planned Smith-Waterman algorithm, and the algorithm has better task and data parallelism.
(1) Task parallelism
When the reference genome sequence and a plurality of seed are compared, a plurality of tasks can be executed simultaneously due to the independence of the genome sequence and the seed data in the comparison at one time, and the tasks can be executed in parallel, namely the coarse-grained parallelism of the algorithm.
(2) Data parallelism
For a given two sequences seed r1r2…rnAnd Ref ═ d1d2…dmThe length of the matrix is n and m, respectively, and a (n +1) × (m +1) scoring matrix H is required to be constructed for storing the comparison results. The element values in the score matrix in the calculation process depend on the values of the elements of the left adjacent point, the upper adjacent point and the upper left adjacent point, namely the H (i, j) value can be calculated only after the calculation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, and the dependency relationship is shown in the following FIG. 5.
At some point, multiple elements in the matrix H may be computed simultaneously, which indicates some parallelism in the computation of the scoring matrix. The elements on the same anti-diagonal line have no data correlation, and the matrix units on the anti-diagonal line have data independence and can be calculated in parallel. That is, as long as the elements on the g-1 anti-diagonals and the g-2 anti-diagonals have been computed, the elements on the g-th anti-diagonals can be concurrently computed in parallel.
(3) Operational parallelism
The value of the matrix element H (i, j) is obtained by taking the maximum value of one or more of the three adjacent elements H (i-1, j-1), H (i, j-1) and H (i-1, j), so that the path for obtaining H (i, j) is recorded while the value of the matrix element H (i, j) is recorded, and therefore, in order to facilitate the backtracking of the final result, the calculation of the path matrix T can be executed in parallel with the score matrix H. There is no data correlation between elements on the same anti-diagonal. Therefore, a reasonable thread organization mode can be designed according to the GPU architecture, and CUDA fine-grained parallelism is realized, so that the whole computing energy efficiency is improved.
Step S4: and combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.
The present invention also provides a system corresponding to the above gene sequence alignment method, the system comprising:
the storage unit is used for storing the reference genome sequence and the query genome sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework:
the preprocessing unit is used for dividing the reference genome sequence according to the row offset and preprocessing the reference genome sequence to obtain a plurality of preprocessed reference data sets;
the index establishing unit is used for establishing an index for each preprocessed reference data set by adopting a suffix array algorithm and combining all preprocessed reference data sets after the indexes are established to obtain a reference sequence index file;
an alignment unit, configured to perform CUDA fine-grained sequence alignment on each fragment in the query genome sequence and the reference sequence index file by using a seed expansion algorithm, and determine position information of each fragment in the reference sequence index file;
and the result output unit is used for combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.
Wherein, the preprocessing unit specifically comprises:
the segmentation module is used for allocating a unique label to each character string ri by using a zipWithIndex function according to the row offset, and segmenting the reference genome into a plurality of fragment sequences according to the unique label;
the first sequencing module is used for sequencing the fragment sequences according to the positions of the fragment sequences in the reference genome by using a sortByKey function operator to obtain an RDD1 data set; wherein the position of the fragment sequence in the reference genome is a key in a sortByKey function operator;
a second sequencing module, configured to sequence, in parallel, the plurality of fragment sequences according to the line positions of the fragment sequences in the reference genome by using an RDD.
A connection module, configured to perform a connection operation on the RDD1 data set and the RDD2 data set by using an RDD.
A third sorting module, configured to determine, in the concatenated data set, a relative position of each of the fragments in a reference genome, and sort each of the fragments in the concatenated data set with the relative position as a value, so as to obtain a preprocessed initial data set;
and the persistence module is used for triggering persistence by using an Action operator to store the preprocessing initial data set into a distributed storage system to obtain a preprocessing reference data set.
Wherein, the establishing cable unit specifically comprises:
a seed segmentation module, configured to segment each segment sequence in the pre-processing reference data set into a plurality of seeds with a preset offset, where the length of each seed is k;
the classification module is used for classifying each seed into four classes according to A, G, C, T and respectively establishing index arrays;
and the integral sorting module is used for sorting each index array by using < key, value > according to the value of the value to obtain a reference sequence index file.
Wherein, the comparison unit specifically comprises:
a query sequence segmentation module for segmenting each segment in the query genome sequence into query segment seeds according to a maximum allowable mismatch k;
the candidate matching module is used for matching positions in the reference sequence index file according to the head and tail letters of the query segment seeds to serve as candidate positions;
a true matching module, configured to expand the query segment seeds at the candidate positions, perform approximate matching between the sequences in the vicinity of the candidate positions and the expanded query segment seeds, determine whether the candidate positions are true matching positions, determine that the query segments are matched with reference sequence segments if the candidate positions are true matching positions, and enumerate the matched query segments and reference sequence segments; the nearby sequence is a reference genome sequence determined according to the expansion length of the query segment seed; the extension length is less than or equal to a maximum allowed mismatch k.
Wherein, the real matching module specifically comprises:
a score matrix construction sub-module for constructing a score matrix H of (n +1) × (m + 1); the elements of the scoring matrix H are comparison results, n is the length of the query segment, and m is the length of the reference sequence segment;
the element value calculation operator module is used for calculating element values in the score matrix according to values of elements of a left adjacent point, an upper adjacent point and a left upper adjacent point of the element values; that is, after the computation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, the H (i, j) values can be computed; recording a path for obtaining H (i, j) while recording the value of the matrix element H (i, j), and establishing a path matrix T;
and the diagonal element value operator module is used for executing parallel calculation by using the element values on the g-1 th anti-diagonal line and the g-2 th anti-diagonal line obtained by calculation to obtain the element value on the g-th anti-diagonal line.
For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A method of gene sequence alignment, comprising:
storing the reference genomic sequence and the query genomic sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework:
dividing the reference genome sequence according to the row offset, and preprocessing to obtain a plurality of preprocessed reference data sets;
establishing indexes for each preprocessed reference data set by adopting a suffix array algorithm, and combining all preprocessed reference data sets after indexes are established to obtain a reference sequence index file;
performing CUDA fine-grained sequence comparison on each fragment in the query genome sequence and the reference sequence index file by adopting a seed expansion algorithm, and determining the position information of each fragment in the reference sequence index file;
and combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.
2. The method of claim 1, wherein the dividing the reference genome sequence by row offset and preprocessing to obtain a plurality of preprocessed reference data sets comprises:
assigning a unique label to each character string ri by using a zip WithIndex function according to the row offset, and cutting the reference genome into a plurality of fragment sequences according to the unique label;
sequencing the fragment sequences according to the positions of the fragment sequences in the reference genome by using a sortByKey function operator to obtain an RDD1 data set; wherein the position of the fragment sequence in the reference genome is a key in a sortByKey function operator;
in parallel, sequencing the fragment sequences according to the line positions of the fragment sequences in the reference genome by using an RDD.
Performing connection operation on the RDD1 data set and the RDD2 data set by adopting an RDD.join ByKey function operator to obtain a connection data set;
determining the relative position of each fragment in a reference genome in the connection data set, and sequencing each fragment in the connection data set by taking the relative position as values to obtain a pre-processing initial data set;
and triggering persistence by using an Action operator to store the preprocessing initial data set into a distributed storage system to obtain a preprocessing reference data set.
3. The method of claim 1, wherein the indexing of each preprocessed reference data set by using a suffix array algorithm and combining all the preprocessed reference data sets with indexes built to obtain a reference sequence index file, specifically comprises:
dividing each segment sequence in the pre-processing reference data set into a plurality of seeds by preset offset, wherein the length of each seed is k;
dividing each seed into four types according to A, G, C, T, and respectively establishing index arrays;
and sorting each index array by using the value of < key, value > to obtain a reference sequence index file.
4. The method according to claim 1, wherein the step of performing CUDA fine-grained sequence alignment on each fragment in the query genome sequence with the reference sequence index file by using a seed expansion algorithm to determine the position information of each fragment in the reference sequence index file specifically comprises:
slicing each of the fragments in the query genomic sequence into query fragment seeds by a maximum allowed mismatch k;
matching positions in the reference sequence index file according to the head and tail letters of the query segment seeds to serve as candidate positions;
expanding the query segment seeds at the candidate positions, carrying out approximate matching on the adjacent sequences before and after the candidate positions and the expanded query segment seeds, judging whether the candidate positions are real matching positions, if so, determining that the query segments are matched with reference sequence segments, and enumerating the matched query segments and the reference sequence segments; the nearby sequence is a reference genome sequence determined according to the expansion length of the query segment seed; the extension length is less than or equal to a maximum allowed mismatch k.
5. The method of claim 4, wherein the expanding the query segment seeds at the candidate positions, approximately matching the nearby sequences before and after the candidate positions with the expanded query segment seeds, and determining whether the candidate positions are true matching positions specifically comprises:
constructing a scoring matrix H of (n +1) × (m + 1); the elements of the scoring matrix H are comparison results, n is the length of the query segment, and m is the length of the reference sequence segment;
calculating element values in the scoring matrix according to values of elements of a left adjacent point, an upper adjacent point and a left upper corner adjacent point of the element values; that is, after the computation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, the H (i, j) values can be computed; recording a path for obtaining H (i, j) while recording the value of the matrix element H (i, j), and establishing a path matrix T;
and performing parallel calculation by using the element values on the g-1 anti-diagonal line and the g-2 anti-diagonal line to obtain the element value on the g-1 anti-diagonal line.
6. A system for aligning gene sequences, the system comprising:
the storage unit is used for storing the reference genome sequence and the query genome sequence in a distributed storage system; under the Spark heterogeneous distributed computing platform framework:
the preprocessing unit is used for dividing the reference genome sequence according to the row offset and preprocessing the reference genome sequence to obtain a plurality of preprocessed reference data sets;
the index establishing unit is used for establishing an index for each preprocessed reference data set by adopting a suffix array algorithm and combining all preprocessed reference data sets after the indexes are established to obtain a reference sequence index file;
an alignment unit, configured to perform CUDA fine-grained sequence alignment on each fragment in the query genome sequence and the reference sequence index file by using a seed expansion algorithm, and determine position information of each fragment in the reference sequence index file;
and the result output unit is used for combining the position information of all the fragments in the reference sequence index file to obtain a gene sequence comparison result.
7. The gene sequence alignment system of claim 6, wherein the preprocessing unit comprises:
the segmentation module is used for allocating a unique label to each character string ri by using a zipWithIndex function according to the row offset, and segmenting the reference genome into a plurality of fragment sequences according to the unique label;
the first sequencing module is used for sequencing the fragment sequences according to the positions of the fragment sequences in the reference genome by using a sortByKey function operator to obtain an RDD1 data set; wherein the position of the fragment sequence in the reference genome is a key in a sortByKey function operator;
a second sequencing module, configured to sequence, in parallel, the plurality of fragment sequences according to the line positions of the fragment sequences in the reference genome by using an RDD.
A connection module, configured to perform a connection operation on the RDD1 data set and the RDD2 data set by using an RDD.
A third sorting module, configured to determine, in the concatenated data set, a relative position of each of the fragments in a reference genome, and sort each of the fragments in the concatenated data set with the relative position as a value, so as to obtain a preprocessed initial data set;
and the persistence module is used for triggering persistence by using an Action operator to store the preprocessing initial data set into a distributed storage system to obtain a preprocessing reference data set.
8. The gene sequence alignment system of claim 6, wherein the building a cable unit specifically comprises:
a seed segmentation module, configured to segment each segment sequence in the pre-processing reference data set into a plurality of seeds with a preset offset, where the length of each seed is k;
the classification module is used for classifying each seed into four classes according to A, G, C, T and respectively establishing index arrays;
and the integral sorting module is used for sorting each index array by using < key, value > according to the value of the value to obtain a reference sequence index file.
9. The system according to claim 6, wherein the alignment unit comprises:
a query sequence segmentation module for segmenting each segment in the query genome sequence into query segment seeds according to a maximum allowable mismatch k;
the candidate matching module is used for matching positions in the reference sequence index file according to the head and tail letters of the query segment seeds to serve as candidate positions;
a true matching module, configured to expand the query segment seeds at the candidate positions, perform approximate matching between the sequences in the vicinity of the candidate positions and the expanded query segment seeds, determine whether the candidate positions are true matching positions, determine that the query segments are matched with reference sequence segments if the candidate positions are true matching positions, and enumerate the matched query segments and reference sequence segments; the nearby sequence is a reference genome sequence determined according to the expansion length of the query segment seed; the extension length is less than or equal to a maximum allowed mismatch k.
10. The system according to claim 9, wherein the true match module comprises:
a score matrix construction sub-module for constructing a score matrix H of (n +1) × (m + 1); the elements of the scoring matrix H are comparison results, n is the length of the query segment, and m is the length of the reference sequence segment;
the element value calculation operator module is used for calculating element values in the score matrix according to values of elements of a left adjacent point, an upper adjacent point and a left upper adjacent point of the element values; that is, after the computation of the matrix H (i-1, j-1), H (i, j-1) and H (i-1, j) values is finished, the H (i, j) values can be computed; recording a path for obtaining H (i, j) while recording the value of the matrix element H (i, j), and establishing a path matrix T;
and the diagonal element value operator module is used for executing parallel calculation by using the element values on the g-1 th anti-diagonal line and the g-2 th anti-diagonal line obtained by calculation to obtain the element value on the g-th anti-diagonal line.
CN202110024450.4A 2021-01-08 2021-01-08 Gene sequence comparison method and system Pending CN112735528A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110024450.4A CN112735528A (en) 2021-01-08 2021-01-08 Gene sequence comparison method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110024450.4A CN112735528A (en) 2021-01-08 2021-01-08 Gene sequence comparison method and system

Publications (1)

Publication Number Publication Date
CN112735528A true CN112735528A (en) 2021-04-30

Family

ID=75589854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110024450.4A Pending CN112735528A (en) 2021-01-08 2021-01-08 Gene sequence comparison method and system

Country Status (1)

Country Link
CN (1) CN112735528A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362901A (en) * 2021-05-14 2021-09-07 海南大学 Method and system for rapidly comparing whole genome annotation intervals
CN113488106A (en) * 2021-07-02 2021-10-08 苏州赛美科基因科技有限公司 Method for rapidly acquiring comparison result data of target genome region
CN113555061A (en) * 2021-07-23 2021-10-26 哈尔滨因极科技有限公司 Data workflow processing method for variation detection without reference genome
CN113903411A (en) * 2021-08-11 2022-01-07 东北林业大学 Genome assembly preprocessing method based on suffix array and monotonic stack
CN114550820A (en) * 2022-02-28 2022-05-27 桂林电子科技大学 WFA algorithm-based third-generation sequencing RNA-seq comparison method
CN114564306A (en) * 2022-02-28 2022-05-31 桂林电子科技大学 Third-generation sequencing RNA-seq comparison method based on GPU parallel computation
CN115602246A (en) * 2022-10-31 2023-01-13 哈尔滨工业大学(Cn) Sequence comparison method based on group genome
CN115862740A (en) * 2022-12-06 2023-03-28 中国人民解放军军事科学院军事医学研究院 Rapid distributed multi-sequence comparison method for large-scale virus genome data

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358061A (en) * 2017-07-31 2017-11-17 中国科学技术大学 Elasticity distribution formula sequence alignment system and method based on Spark and SIMD

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358061A (en) * 2017-07-31 2017-11-17 中国科学技术大学 Elasticity distribution formula sequence alignment system and method based on Spark and SIMD

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JOSÉ M. ABUÍN等: "SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data", PLOS ONE, vol. 11, no. 5, 16 May 2016 (2016-05-16), pages 2 *
SANJAY RATHEE等: "StreamAligner: a streaming based sequence aligner on Apache Spark StreamAligner: Apache Spark", JOURNAL OF BIG DATA, vol. 5, no. 8, 27 February 2018 (2018-02-27), pages 6 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362901A (en) * 2021-05-14 2021-09-07 海南大学 Method and system for rapidly comparing whole genome annotation intervals
CN113362901B (en) * 2021-05-14 2023-09-01 海南大学 Method and system for rapidly comparing whole genome annotation intervals
CN113488106A (en) * 2021-07-02 2021-10-08 苏州赛美科基因科技有限公司 Method for rapidly acquiring comparison result data of target genome region
CN113555061B (en) * 2021-07-23 2023-03-14 哈尔滨因极科技有限公司 Data workflow processing method for variation detection without reference genome
CN113555061A (en) * 2021-07-23 2021-10-26 哈尔滨因极科技有限公司 Data workflow processing method for variation detection without reference genome
CN113903411A (en) * 2021-08-11 2022-01-07 东北林业大学 Genome assembly preprocessing method based on suffix array and monotonic stack
CN113903411B (en) * 2021-08-11 2024-06-28 东北林业大学 Genome assembly pretreatment method based on suffix array and monotone stack
CN114564306A (en) * 2022-02-28 2022-05-31 桂林电子科技大学 Third-generation sequencing RNA-seq comparison method based on GPU parallel computation
CN114550820A (en) * 2022-02-28 2022-05-27 桂林电子科技大学 WFA algorithm-based third-generation sequencing RNA-seq comparison method
CN114550820B (en) * 2022-02-28 2024-05-03 桂林电子科技大学 WFA algorithm-based third-generation sequencing RNA-seq comparison method
CN114564306B (en) * 2022-02-28 2024-05-03 桂林电子科技大学 Third generation sequencing RNA-seq comparison method based on GPU parallel computing
CN115602246A (en) * 2022-10-31 2023-01-13 哈尔滨工业大学(Cn) Sequence comparison method based on group genome
CN115862740A (en) * 2022-12-06 2023-03-28 中国人民解放军军事科学院军事医学研究院 Rapid distributed multi-sequence comparison method for large-scale virus genome data
CN115862740B (en) * 2022-12-06 2023-09-12 中国人民解放军军事科学院军事医学研究院 Rapid distributed multi-sequence comparison method for large-scale virus genome data

Similar Documents

Publication Publication Date Title
CN112735528A (en) Gene sequence comparison method and system
CN108985008B (en) Method and system for rapidly comparing gene data
Liu et al. CUDA–MEME: accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units
Houtgast et al. An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm
Carr et al. Parallel peak pruning for scalable SMP contour tree computation
US20160019339A1 (en) Bioinformatics tools, systems and methods for sequence assembly
Flick et al. A parallel connectivity algorithm for de Bruijn graphs in metagenomic applications
US20150142334A1 (en) System, method and computer-accessible medium for genetic base calling and mapping
Wang et al. A branch and bound irredundant graph algorithm for large-scale MLCS problems
Xu et al. SLPal: Accelerating long sequence alignment on many-core and multi-core architectures
CN103699819A (en) Peak expanding method for multistep bidirectional De Bruijn image-based elongating kmer inquiry
Vasimuddin et al. Identification of significant computational building blocks through comprehensive investigation of NGS secondary analysis methods
US7043371B2 (en) Method for search based character optimization
Flick et al. Reprint of “a parallel connectivity algorithm for de Bruijn graphs in metagenomic applications”
Angeleri et al. DNA fragment assembly using neural prediction techniques
Yin et al. DGCF: A Distributed Greedy Clustering Framework for Large-scale Genomic Sequences
Jain et al. GAMS: genome assembly on Multi-GPU using string graph
Roddy et al. nail: software for high-speed, high-sensitivity protein sequence annotation
Myoupo et al. Coarse-grained multicomputer based-parallel algorithms for the longest common subsequence problem with a string-exclusion constraint
Hou et al. Long read error correction algorithm based on the de bruijn graph for the third-generation sequencing
Rengasamy et al. Parallel Read Partitioning for Concurrent Assembly of Metagenomic Data
CN103699813A (en) Method for identifying and removing repeated bidirectional edges of bidirectional multistep De Bruijn graph
Vasimuddin et al. Identification of significant computational building blocks through comprehensive deep dive of ngs secondary analysis methods
Warnke-Sommer et al. Parallel NGS assembly using distributed assembly graphs enriched with biological knowledge
Parsaeian et al. Local Sequence Alignment with a Parallel Hash based Model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination