CN115602246B

CN115602246B - Sequence alignment method based on group genome

Info

Publication number: CN115602246B
Application number: CN202211366121.9A
Authority: CN
Inventors: 国宏哲; 李高阳; 刘博�; 王亚东
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2022-10-31
Filing date: 2022-10-31
Publication date: 2023-06-20
Anticipated expiration: 2042-10-31
Also published as: CN115602246A

Abstract

A sequence comparison method based on a group genome, in particular to a human DNA sequence comparison method based on the group genome, which aims to solve the problems that the sequence comparison method based on a single reference genome has precision deviation, so that sequencing data mutation detection fails or an error mutation site is obtained, firstly, a group genome index based on a seed sequence is constructed, the seed sequence of the sequencing sequence is extracted, and the position of the seed sequence in the genome is obtained by utilizing the index; secondly, obtaining an optimal and suboptimal path set of the seed sequence on a genome by using a sparse dynamic programming method according to the position; and finally, obtaining a sequence corresponding to the sequencing sequence on the genome according to the optimal and suboptimal path sets, comparing the sequencing sequence with the corresponding sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score. Belongs to the field of sequence alignment.

Description

Sequence alignment method based on group genome

Technical Field

The invention relates to a sequence alignment method, in particular to a human DNA sequence alignment method based on a group genome, and belongs to the field of sequence alignment.

Background

Genome sequencing data analysis has become an important research direction and development focus in the fields of life sciences, accurate health, personalized medicine and the like. A number of large-scale human genome sequencing programs including the international thousand-person genome project (1 KG), the UK ten-thousand-person genome project (UK 10K), the american precision medical project (All of Us), the singapore ten-thousand-person genome project (SG 10K), and the chinese hundred-thousand-person genome project have been developed internationally. The size of sequencing data generated by genome project will reach PB-ZB grade, and genome data analysis demand is explosively increased.

Genomic sequence alignment is an important basic step of genomic sequence analysis, and is also a bottleneck step consuming a large amount of computation time and resources, so that problems are more obvious when the data volume reaches PB magnitude. The existing second generation sequencing technology cannot directly generate complete genome, so the aim of genome sequence comparison is to locate a large number of unordered sequencing fragments on a reference genome, so that variation information of each site is obtained, and the existing common sequence comparison method taking a single reference genome as a standard has the problem of systematic deviation of precision. The genome sequence comparison is carried out on the input of the upstream receiving sequencing data, and the downstream continuous mutation analysis and the disease association analysis, so that the quality of the sequence comparison has great influence on the downstream analysis, and inaccurate sequence comparison can lead to mutation detection failure or give out wrong mutation sites, thereby directly influencing the correctness and credibility of the downstream analysis.

Disclosure of Invention

The invention provides a sequence comparison method based on a group genome, which aims to solve the problems that the sequence comparison method based on a single reference genome has precision deviation, so that the detection of sequencing data variation fails or an error variation site is obtained.

The technical scheme adopted by the invention is as follows:

it comprises the following steps:

s1, constructing a seed sequence-based group genome index, wherein the seed sequence-based group genome index comprises a graph model index of a reference genome based on a seed sequence and a local variation sequence index of variation data based on the seed sequence;

s2, extracting a seed sequence of the sequencing sequence, and obtaining a variation sequence of the seed sequence on variation data by utilizing a local variation sequence index to obtain the position of the variation sequence on a reference genome, namely the position of the seed sequence on the reference genome;

s3, carrying out path search on the reference genome by using a sparse dynamic programming method according to the position of the seed sequence on the reference genome to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome;

s4, obtaining a genome sequence corresponding to the sequencing sequence of S2 on a reference genome according to the optimal path and the suboptimal path set, comparing the sequencing sequence with the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score;

S5, acquiring a sequence to be sequenced, and executing S1-S4 to compare the sequence to be sequenced to obtain a result comparison file.

Further, in S1, constructing a Minimizer-based group genome index, wherein the Minimizer-based group genome index comprises a graph model index based on a reference genome and a local mutation sequence index based on mutation data, and the specific process is as follows:

s11, acquiring human reference genome data and corresponding variation data containing known variations of different types, and preprocessing the two types of data by normalized operations such as removing redundant data;

s12, constructing a de Bruijn graph model index representation of a reference genome to obtain a de Bruijn graph model index based on the reference genome;

s13, constructing a local variation sequence index representation of variation data to obtain a local variation sequence index of the variation data;

s14, constructing a group genome index representation based on Minimizer according to the unique path and the local variation sequence index of the de Bruijn graph model index.

Further, extracting a seed sequence of the sequencing sequence in S2, and obtaining a variation sequence of the seed sequence on variation data by utilizing a local variation sequence index to obtain a position of the variation sequence on a reference genome, namely the position of the seed sequence on the reference genome, wherein the specific process is as follows:

For each pair of double-ended sequencing sequences in the sequencing sequences, the following operations are performed on each end sequence:

1) Defining a sliding window, and extracting a seed sequence Minimizer:

dividing one end sequence into a plurality of intervals with fixed length by utilizing sliding windows, wherein each interval comprises a plurality of k-mer sequences, adjacent sliding windows on one end sequence are defined to be covered with each other, and the initial positions of the adjacent sliding windows differ by 1bp;

traversing one end sequence from the beginning, and selecting a k-mer sequence with the smallest ASC2 value or hash value in each sliding window as a seed sequence Minimizer according to the defined sliding window size to obtain all seed sequence minimzers on one end sequence;

2) Obtaining the position of the seed sequence Minimizer on the reference genome:

if the total number of seed sequence minimzers found on two opposite end sequences of a pair of double-ended sequencing sequences is smaller than a preset threshold, wherein the preset threshold is 1/3 of the total number of all seed sequence minimzers of the sequencing sequences, searching a variant sequence corresponding to the seed sequence minimzers of the pair of double-ended sequencing sequences on variant data by utilizing a local variant sequence index, if the corresponding variant sequence is found on the variant data, namely the variant sequence is consistent with the seed sequence minimzers, recording a quadruple { Unipath } of the variant sequence at the corresponding position on a reference genome _start ，Unipath _end ，Read _start ，Read _end Sets of (Unipath) _start Representing the starting position of the variant sequence on the unique path of the de Bruijn map model of the reference genome, unipath _end Representing the termination position of the variant sequence on the unique path of the de Bruijn map model of the reference genome, read _start Indicating the starting position of the variant sequence on the sequenced sequence, read _end Indicating the termination position of the variant sequence on the sequenced sequence; if no corresponding variant sequence is found on the variant sequence, executing S3 according to the quadruple set corresponding to the obtained seed sequence Minimizer on the reference genome position;

and if the total number of the seed sequence minimzers found on the two end sequences with opposite directions of a certain pair of double-end sequencing sequences is greater than or equal to a preset threshold value, obtaining a quadruplet set corresponding to the seed sequence minimzers on a reference genome, and executing S3.

Further, in S3, according to the position of the seed sequence on the reference genome, a path search is performed on the reference genome by using a sparse dynamic programming method to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome, and the specific process is as follows:

s31, judging the direction of each end sequencing sequence in the double-end sequencing sequence:

defining the paired-end reads of the double-ended sequencing sequence as L according to the corresponding tetrad set of the seed sequence Minimizer on the reference genome _1f 、L _1r Or L _2f 、L _2r ，L _1f Read1 representing a pair-end readsThe forward sequence coverage length of L _1r Read1, which represents the paired-end reads, is the reverse complement overlap length, L _2f Read2, representing the paired-end reads, is the forward sequence coverage length, L _2r Read2, which represents the paired-end reads, is the reverse complement sequence coverage length;

if L _1f +L _2r ＞L _1r +L _2f An +offset value, which is 1/10 of the length of the sequencing sequence, assuming that the correct direction is read1 as a forward sequence, derived from the forward strand, read2 as a reverse complement sequence, derived from the reverse strand, and returning the set of all Minimizer contained in the corresponding direction;

if L _1r +L _2f ＞L _1f +L _2r The +offset value, consider that the correct direction is read2 as the forward sequence, derived from the forward strand, read1 as the reverse complement sequence, derived from the reverse strand, and return to the set of all Minimizer contained in the corresponding direction; otherwise, the correct direction cannot be distinguished, and a set containing all Minimizer corresponding to the two directions is returned;

if the total coverage length of the two-end sequencing sequences is smaller than a specified threshold, wherein the specified threshold is 1/3 of the total length of the two-end sequencing sequences, performing Minimizer comparison by using the local variation sequences to obtain the directions of the two-end sequencing sequences, and returning all Minimizer sets contained in the corresponding directions;

S32, extending and combining seed sequence Minimizer according to the direction of each end sequencing sequence:

a. extension seed sequence Minimizer:

acquiring the starting position and the ending position of the current seed sequence Minimizer on a sequencing sequence and the starting position and the ending position of the current seed sequence Minimizer on a unique path in a de Bruijn graph model, extending the corresponding positions of the seed sequence Minimizer on the sequencing sequence and the unique path forwards and backwards until a base matching error on the sequencing sequence is met or the last base position of the sequencing sequence is reached or the last base position of the unique path is reached, recording the front-back extending position of the seed sequence Minimizer on the sequencing sequence and the unique path and the covering length of the seed sequence Minimizer on the sequencing sequence after the extension, and traversing all the seed sequence minimzers on the sequencing sequence according to the method;

if, on the traversal, the termination position of the current seed-sequence Minimizer on the sequence is less than the backward extension position of the previous seed-sequence Minimizer, indicating that the current seed-sequence Minimizer is contained within the extension coverage area of the previous seed-sequence Minimizer without extension; otherwise, performing the above-mentioned extension;

Obtaining a candidate seed sequence after completing the extension of a seed sequence Minimizer, and recording the unique path number ID corresponding to the candidate seed sequence _Unipath Offset position on the sequencing, position on the reference genome, length of coverage on the sequencing, candidate seed sequence number ID _seed Obtaining all candidate seed sequences;

b. merging candidate seed sequences:

merging IDs having the same unique path number _Unipath Obtaining a set of candidate seed sequences, and merging coverage areas of all candidate seed sequences in each set to obtain a new coverage length;

splitting the candidate seed sequences in the set into a plurality of six-tuples (candidate seed sequence number ID) _seed The start and end positions of the coverage region on the sequencing sequence, the end number ID of the double ended sequencing sequence _end Starting position and ending position on the reference genome), according to the above steps, a six-tuple set of six-tuples of all candidate seed sequences is obtained;

s33, path searching based on a dynamic programming method:

sorting the six-tuple set according to the initial position of each six-tuple on the reference genome to obtain an ordered set, defining each six-tuple in the ordered set as a node, traversing the ordered set from the beginning, extracting adjacent nodes in a designated number after the current node, and establishing a connecting edge if the current node and a certain downstream node meet the condition at the same time; otherwise, the connecting edge is not established;

II, taking the negative value of the connecting edge as the weight of the edge, and the weight calculating method is as follows:

weight _edge ＝cover _match -score _panalty

diff _distance ＝|(P1 _read -P1 _read )-(P1 _ref -P2 _ref )

wherein weight is _edge Representing the weight of the connecting edge; cover _match A coverage length of the sequencing sequence representing the current node; score _panalty Is a connection penalty; diff (diff) _distance Representing the deviation distance between the reference genome position of the current node and a certain downstream node and the sequencing sequence, and calculating according to the initial position; p1 _read Representing the offset position of the current node on the sequencing sequence read; p1 _ref Representing the location of the current node on the reference genome; p2 _read Representing the offset position of a downstream node on the sequencing sequence read; p2 _ref Representing the position of a downstream node on a reference genome;

if the end number ID of the double-ended sequencing sequence of the current node and a certain downstream node _end If the deviation distance is different, taking the logarithm of the deviation distance, otherwise, taking the deviation distance value; obtaining a plurality of connecting edges according to the method;

thirdly, traversing the ordered set from the beginning according to the dynamic programming thought of the dynamic programming method, selecting a precursor node with the largest edge weight for each node to construct an effective connection edge between two nodes, and adding a successor node number connected with the precursor node to obtain a path in the graph model;

Traversing the ordered set from tail to front, traversing the precursor nodes forward according to the connection relation of the edges if the current node has no precursor node and the current node is a termination node of a certain path, recording the node numbers until a certain node has no precursor node, indicating that a certain node is a start node of the current path, obtaining a path, and recording all nodes on the path; according to the method, all path sets are obtained;

IV, synchronously accumulating and calculating the weight of each connecting edge in the path construction process to obtain a corresponding path score _path Score based on path _path And (3) sequencing all paths from high to low by using a rapid sequencing method to obtain an ordered path set, and traversing the set according to specified conditions to obtain an optimal path and a suboptimal path set.

Further, in the step S31, when judging the direction of each end sequencing sequence in the double-ended sequencing sequence, if two adjacent seed sequence minimzers on each end sequence overlap, the coverage length is the union length of the coverage areas of the two seed sequence minimzers on the sequencing sequence.

Further, in S31, the Minimizer alignment is performed by using the local variant sequence, which specifically includes:

Searching the corresponding variation sequence on the variation data by utilizing the local variation sequence index according to the seed sequence Minimizer of the sequence sequenced at both ends, and returning the quadruple { ALT } corresponding to the variation sequence on the variation data _start ，ALT _end ，Read′ _start ，Read′ _end }，ALT _start ，ALT _end ，Read′ _start ，Read′ _end Respectively representing the initial position and the final position of the mutation sequence on mutation data, the initial position and the final position on sequencing, updating L according to the coverage length corresponding to the quadruple _1f 、L _1r 、L _2f 、L _2r And judging the direction of each sequencing sequence according to the step S31, and returning all Minimizer sets contained in the corresponding direction.

Further, the conditions in S33I are:

1) Seed number ID of two nodes _seed Different;

2) The chromosomes corresponding to the reference genome positions of the two nodes are the same;

3) The difference between the positions of the reference genomes of the two nodes is smaller than a specified threshold value;

4) The reference genome position of the two nodes is consistent with the position size relationship on the sequencing.

Further, the specified conditions in iv in S33 are:

1) If the overall coverage length of each node on the sequencing sequence of the first path in the path set is smaller than a specified threshold, the specified threshold is 2/3 of the sequencing sequence length, and the seed sequence Minimizer is not queried on the variation sequence, comparing the Minimizer by using the local variation sequence index, if a new seed sequence Minimizer is obtained after the comparison, adding a six-tuple set corresponding to the new seed sequence Minimizer into the six-tuple set of S32b, and re-executing S33 to obtain an optimal path and a suboptimal path set; otherwise, executing the step 2);

2) And in the ordered path set, taking the first path as an optimal path, taking paths which have the same score as the first path and are in the designated number as suboptimal paths, selecting suboptimal paths according to the designated number if the number of paths in the ordered path set is larger than the designated number, and taking all paths except the first path as suboptimal paths if the number of paths in the ordered path set is smaller than or equal to the designated number to obtain the optimal path and the suboptimal path set.

Further, in the step S4, a genome sequence corresponding to the sequencing sequence of the step S2 on a reference genome is obtained according to the optimal path and the suboptimal path set, the sequencing sequence and the corresponding genome sequence are compared by adopting a KSW2 algorithm to obtain a comparison score, and a result comparison file is obtained according to the comparison score, wherein the specific process is as follows:

and (3) calculating the initial position of the sequencing sequence of S2 on the reference genome according to the initial node information of the optimal path and all paths in the suboptimal path set generated in the step (S33), extracting a genome sequence corresponding to the sequencing sequence from the reference genome, and performing end-to-end global sequence comparison on the sequencing sequence and the genome sequence corresponding to the sequencing sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score.

Further, if the comparison result of S4 includes a known mutation sequence, the comparison result format indicates mutation information according to the following method:

a start identification character 'V' and an end representation character 'E' are respectively added before and after the mutation number of the mutation data acquired in the S1; the binary three-level index structure is built again to represent the comparison result, and the construction method of each level index structure is as follows:

1) The first-stage index records the comparison result number of each sequencing sequence, namely the offset position on the second-stage index;

2) The second level indexes each data unit to record the CIGAR sequence length of each sequencing sequence comparison result, namely the offset position on the third level index;

3) The third level index records the CIGAR sequences of each sequencing sequence alignment.

The beneficial effects are that:

comparing sequencing sequences according to indexes of group genome based on seed sequences, wherein the group genome indexes based on the seed sequences comprise a graph model index of a reference genome based on the seed sequences and a local variation sequence index of variation data based on the seed sequences; extracting a seed sequence on the sequencing sequence, and obtaining a variation sequence of the seed sequence on variation data by utilizing a local variation sequence index to obtain the position of the variation sequence on a reference genome, namely the position of the seed sequence on the reference genome; judging the direction of each end sequencing sequence in the double-end sequencing sequence according to the position, and extending and combining the seed sequences according to the direction to obtain six-tuple of the candidate seed sequence (candidate seed sequence number ID) _seed The start and end positions of the coverage region on the sequencing sequence, the end number ID of the double ended sequencing sequence _end Starting position and ending position on a reference genome), establishing a connecting edge between six tuples, performing path search by combining the connecting edge by using a sparse dynamic programming method to obtain an optimal path and a suboptimal path set of a sequencing sequence on a reference genome graph model, obtaining a genome sequence corresponding to the sequencing sequence on the reference genome according to the optimal path and the suboptimal path set, and adopting a KSW2 algorithm to perform path search on the sequencing sequence and the corresponding genesAnd comparing the group sequences to obtain a comparison score, and obtaining a result comparison file according to the comparison score, so that sequence comparison of the group genome is realized.

The invention provides a sequence comparison method based on a group genome by utilizing a group genome index structure formed by a group reference genome and a corresponding known common variation sequence and a sparse dynamic programming method, which greatly improves the sequence comparison speed, precision and sensitivity of the genome, and solves the problems of systematic precision deviation, low sequence comparison quality and incorrect comparison position of the sequence comparison formed by taking a single reference genome as a reference.

Drawings

FIG. 1 is a schematic flow chart of the present invention;

Detailed Description

The first embodiment is as follows: the present embodiment is described with reference to fig. 1, which is a sequence alignment method based on a population genome, comprising the steps of:

s1, constructing a group genome index based on a seed sequence Minimizer, wherein the group genome index based on the seed sequence Minimizer comprises a graph model index based on a reference genome of the seed sequence Minimizer and a local variation sequence index based on variation data of the seed sequence Minimizer, and the specific process is as follows:

s11, acquiring human reference genome data and corresponding mutation data containing known different types of mutation, and preprocessing the two types of data by normalized operations such as removing redundant data. The mutation data is obtained from a database.

S12, constructing a de Bruijn graph model index representation of a reference genome, wherein the specific process is as follows:

traversing a complete reference genome, extracting all k-mer sequences, wherein the k-mer sequences have short sequence fragments with the length of k, recording tetrads (k-mer sequences, incoming base, outgoing base and offset position) corresponding to each k-mer sequence, extracting association relations between different k-mer sequences of input data and k-mer sequences based on the tetrads corresponding to each k-mer sequence to construct a de Bruijn graph model, wherein de Bruin graph G= < V, E > is a directed graph model, nodes V in the graph are k-mer sequences with different lengths of k, and extracting node sets V= { V1, V2, …, vM } formed by all different k-mer sequences on the reference genome; for two nodes vi and vj, if vi overlaps with vj with the sequence of (k-1) -mer, then there is a directed edge vi→vj.

Each k-mer sequence represents a node of the de Bruijn graph model, a unique path unipath is determined according to the number of the incoming edges and the outgoing edges of the node, the incoming degree of the node is defined as the incoming edges of the node, the outgoing degree is defined as the number of the outgoing edges, and one or more incoming edges and one or more outgoing edges exist in each node; and if the ingress degree of the initial node of the path is greater than 1 and the egress degree of the path is 1, the egress degree of the end node is greater than 1 and the ingress degree of the intermediate node is 1 and the egress degree of the intermediate node is 1, the path is a unique path unipath, and all unique paths are obtained.

For each unique path, the nodes thereon are associated with a set of k-mer sequence positions as described above; traversing each node from the beginning to the back aiming at a certain path, acquiring two position sets of a current node and a subsequent node, and finding out similar position elements of the two sets; because each position set is an ordered set, traversing certain set elements, adopting a binary search method, finding out all similar positions and forming a new position set; according to the method, each node on the path is traversed backwards in turn, the position searching and the set merging operation are executed after each new node is traversed until the end node of the path is encountered, and the finally formed position set is the position set corresponding to the current unique path.

According to the method, a position set corresponding to all the unique paths is generated and expressed as a binary index file, and the position set is expressed as F-pos.

When all k-mer sequences are extracted by traversing the complete reference genome, the k-mer sequences need to be sequenced and de-duplicated, and the specific process comprises the following steps:

according to the sequence content of the first f bp of the k-mer as a file name, wherein bp represents the total quantity of bases, writing the quadruple corresponding to the k-mer sequence into a corresponding temporary file; importing all k-mer sequences in a file into a memory, performing quick sequencing on k-mers in the file in the current memory to order four tuples corresponding to the k-mer sequences in the file, and outputting the four tuples to a hard disk to form a corresponding ordered file; quick sequencing: traversing the k-mer set in the current file from the beginning, merging adjacent identical k-mer sequences, and merging offset position combinations corresponding to the k-mers; and sequentially importing all files into a memory and rapidly sequencing the files to generate all different k-mer sequences and ensure that the position sets corresponding to the k-mers are orderly.

The process of determining the unique path unipath according to the number of the incoming edges and the outgoing edges of the nodes comprises the following steps:

firstly, according to the quantity of the incoming edges and the outgoing edges of the nodes, the nodes are divided into the following types:

1) 'X' -type node: a plurality of edges and a plurality of edges;

2) 'FY' node: a plurality of edges and an edge;

3) 'RY' -type node: one in side and a plurality of out sides;

4) 'L' -shaped node: one in side and one out side;

the way in which the start and end nodes of a unique path are formed is several possibilities:

(1) The start and end nodes are the same 'X' -type node;

(2) The start node is a 'FY' type node and the end node is a 'RY' type node;

(3) The starting node is a successor node of an 'X' type node or a successor node of an 'FY' type node and the ending node is a predecessor node of an 'X' type node or a predecessor node of an 'RY' type node;

then traversing all nodes in the graph, and if the current node is the initial node of the unique path, executing backward extension operation; backward extending operation: calculating to obtain a rear-drive node according to the edge information of the current node, if the rear-drive node is an L-shaped node, continuing to extend backwards, and sequentially cycling according to the method until encountering an end node; all unique paths of the graph model can be generated by the method, and each node in the paths is ensured not to be repeated;

all the unique path index files are generated through the method, and each unique path index file is correspondingly used as a unipath.

S13, constructing local variation sequence index representation of variation data, wherein the specific process is as follows:

overlapping and dividing the reference genome according to fixed-length intervals, wherein each divided fixed-length interval is used as a local area, each two adjacent local areas are partially overlapped, and the length of the overlapped area is half of the length of the divided area.

For a haplotype of an individual, collecting the variation in each local region and forming a local individual genome sequence with known variation, namely generating a local variation sequence; the length of the overlap region is half the length of the division region.

Through the method, the variant sequence index file is generated, and each local variant sequence index file is correspondingly used as an alt string.

The reference genome is divided into a plurality of intervals with fixed length (300 bp as default) and coverage (150 bp as default) from beginning to end, local sequences containing known mutation in each interval are extracted, and if a plurality of mutation exists in each interval, the combination among the mutation is constructed according to real haplotype information, so that a local mutation sequence index is created.

S14, constructing a group genome index representation based on the seed sequence Minimizer according to the unique path unipath and the local variation sequence index alt string of the de Bruijn graph model index, namely, the group genome index based on the seed sequence Minimizer comprises a graph model index of a reference genome based on the seed sequence Minimizer and a local variation sequence index of variation data based on the seed sequence Minimizer:

Before indexes are built for 'unipath' and 'alt string', respectively converting a 'unipath' diagram and an 'alt string' list into a character string list; aiming at each character string in the character string list, selecting a k-mer with the smallest hash value in a sliding window as a local representative k-mer according to a seed sequence Minimizer selection criterion; these local minimum k-mers, known as seed sequence minimzers, will be stored in a unified list.

The specific process for selecting the k-mer sequence with the smallest hash value in the sliding window comprises the following steps:

the Minimizer is selected with two key parameters, namely window size and k-mer sequence length; assuming that the sequencing sequence of the index to be constructed is L, and the length of the K-mer is L_K, when the sliding window is selected, (L-L_K+1) K-mers are selected; assuming that the windows size is W, selecting one min_k-mer1 with the smallest hash value from the 5 k-mers with the numbers of 1-5; then, selecting the smallest min_k-mer2 from the 5 of 2 to 6; then 3-7 and so on; the min_k-mers selected by adjacent different Minimizer selection windows may be identical, and each repetition is recorded only once; the min_k-mer is called seed sequence Minimizer.

The Minimizer lists corresponding to all the character strings are combined into a complete list, and the lists are ordered according to the hash value of the Minimizer; and constructing a secondary index of the Minimizer according to the high K bit of the hash value of each Minimizer, wherein K is a secondary index parameter.

The secondary index divides the overall index of the Minimizer into b=2 ^{^K} Each cell, when assuming that the total number of Minimizer is M, each cell contains M/B Minimizer on average; each constructed Minimizer occupies 96 bits of storage space, wherein the last Q bits of the hash value of the Minimizer occupy 30 bits of storage space; the Minimizer's position on the original "unipath" or "alt string" list occupies 65 bits of space.

S2, extracting a seed sequence of the sequencing sequence, and obtaining a variation sequence of the seed sequence on variation data by utilizing a local variation sequence index to obtain the position of the variation sequence on a reference genome, namely the position of the seed sequence on the reference genome, wherein the specific process is as follows:

for each pair of double-ended sequences (sequence a and sequence b) in the sequencing sequence, the following is performed for each end sequence (sequence a or sequence b) of each pair of double-ended sequences:

1) Defining a sliding window, and extracting a seed sequence:

the sliding window is utilized to divide one end sequence (sequence a or sequence b) into a plurality of intervals with fixed length, the intervals with fixed length are generally default to set the window size to 10bp, the specific window size can be adjusted according to actual application scenes, and each interval comprises a plurality of k-mer sequences. Defining that adjacent sliding windows on one end sequence are mutually covered, and the initial positions of the adjacent sliding windows are different by 1bp; traversing one end sequence from the beginning, selecting a k-mer sequence with the smallest ASC2 value or hash value in each sliding window as a seed sequence Minimizer according to the defined sliding window size, and obtaining all seed sequence minimzers on one end sequence (sequence a or sequence b).

2) Obtaining the position of the seed sequence on the reference genome:

if the total number of seed sequence minimzers found on the two opposite end sequences (sequence a and sequence b) of a pair of paired end sequencing sequences is less than a preset threshold, wherein the preset threshold is 1/3 of the total number of all seed sequence minimzers of all sequencing sequences, searching the mutation sequence (data) corresponding to the seed sequence minimzers of a pair of paired end sequencing sequences on known mutation data by utilizing local mutation sequence indexes in group genome indexes based on the Minimizer, if the corresponding mutation sequence is found on the mutation data, namely the mutation sequence is consistent with the seed sequence Minimizer, recording the tetramic match { Unipath of the corresponding position of the mutation sequence on a reference genome _start ，Unipath _end ，Read _start ，Read _end Sets of (Unipath) _start Representing the starting position of the variant sequence (seed sequence Minimizer) on the unique path of the de Bruijn map model of the reference genome, unipath _end Representing the termination position of the variant sequence (seed sequence Minimizer) on the unique path of the de Bruijn map model of the reference genome, read _start Representing the starting position of the variant sequence (seed sequence Minimizer) on the sequenced sequence, read _end Representing the termination position of the variant sequence (seed sequence Minimizer) on the sequenced sequence; if no corresponding variant sequence is found on the variant sequence, the sequence is located at the reference genome position according to the obtained seed sequence MinimizerS3, executing corresponding four-element group sets;

if the total number of the seed sequence minimzers found on the two end sequences opposite to each other in the paired end sequence pair-end direction is larger than or equal to a preset threshold, the number of the found seed sequence minimzers is enough, the seed sequence searching of the local variation sequence index is not carried out, a four-element set corresponding to the seed sequence minimzers on a reference genome is obtained, and S3 is executed.

The invention utilizes the de Bruijn graph model index representation and the local variation sequence index representation of the reference genome to search the quadruple { Unipath } corresponding to all candidate positions of the seed sequence Minimizer on the reference genome _start ，Unipath _end ，Read _start ，Read _end A collection of sequence seed sequence Minimizer locations on unique path unipath.

S3, carrying out path search on the reference genome by using a sparse dynamic programming method according to the position of the seed sequence on the reference genome to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome, wherein the specific process is as follows:

defining the paired-enddata of the double-ended sequencing sequence as L according to the corresponding tetrad set of the seed sequence Minimizer on the reference genome _1f 、L _1r Or L _2f 、L _2r ，L _1f The forward sequence coverage length, L, of read1, representing the paired-endareas _1r Read1, representing the paired-endreads, is the reverse complement cover length, L _2f Read2, representing the paired-end reads, is the forward sequence coverage length, L _2r Read2, which represents the paired-endareas, is the reverse complement sequence coverage length.

L _1f Is referred to as L _1forward Refers to the forward sequence of read1 of the paired-endareas, L _1r The reverse complement of read1, referring to the paired-endareas; l (L) _2f And L _2r And the same is true. Since the principle of second generation sequencing is to sequence the double helix structure (double strand) of DNA simultaneously from 5 'end to 3' end simultaneously, the productThe sequencing sequences generated are in pairs, namely, paired-end reads, wherein one read is derived from a positive strand of a genome, one read is derived from a negative strand of the genome, and the sequencing direction is opposite to that of the positive strand, but the sequence of the pair of reads in the next data is unknown from the positive strand, and the sequence of the pair of reads is unknown from the negative strand, so that the coverage length of the seed sequences in the two combination cases needs to be calculated as a judgment basis for correct inversion.

Judging the real direction of each end sequence according to the overall coverage length formed by all seed sequence minimzers on each end sequence: if L _1f +L _2r ＞L _1r +L _2f The +offset value, which is an empirical value, is related to the length of the sequencing sequence, typically 1/10 of the length of the sequencing sequence, then the correct direction is considered to be read1 as the forward sequence, derived from the forward strand, read2 as the reverse complement sequence, derived from the reverse strand, and returned to the set of all Minimizer contained in the corresponding direction, e.g., read1 as the forward set Minimizer, read2 as the reverse set Minimizer; if L _1r +L _2f ＞L _1f +L _2r The +offset value, consider that the correct direction is read2 as the forward sequence, derived from the forward strand, read1 as the reverse complement sequence, derived from the reverse strand, and return to the set of all Minimizer contained in the corresponding direction; otherwise, the correct direction cannot be distinguished, and the set containing all Minimizer corresponding to the two possible directions is returned.

If two seed sequence minimzers (c and d) adjacent to each other on each end sequence overlap, the coverage length is the union length of the coverage areas of the two seed sequence minimzers (c and d) on the sequencing sequence; according to this principle, the coverage length of the two-terminal sequencing sequences in two possible directions is calculated separately.

If the total coverage length of the two-end sequencing sequences is smaller than a specified threshold, wherein the specified threshold is 1/3 of the total length of the two-end sequencing sequences, comparing Minimizer by using the local variation sequences to obtain the directions of the two-end sequencing sequences, and returning all Minimizer sets contained in the corresponding directions.

The Minimizer comparison method based on the local variant sequence comprises the following specific steps:

sequencing according to two-terminal sequencingThe seed sequence Minimizer of the column is used for searching the corresponding variation sequence (seed sequence Minimizer) on the variation data by using the local variation sequence index representation, and returning the quadruple { ALT } corresponding to the variation sequence on the variation data _start ，ALT _end ，Read′ _start ，Read′ _end Each element (ALT) in the quadruple _start ，ALT _end ，Read′ _start ，Read′ _end ) Respectively representing the starting position and the ending position of a variant sequence (seed sequence Minimizer) on variant data (ALT), the starting position and the ending position on a sequencing sequence, and updating L according to the coverage length corresponding to the quadruple _1f ，L _1r ，L _2f ，L _2r And judging the direction of each sequencing sequence according to the method, and returning all Minimizer sets contained in the corresponding direction.

S32, extending and combining seed sequences according to the direction of each end sequencing sequence:

a. extending the seed sequence:

acquiring the starting position and the ending position of the current seed sequence Minimizer on a sequencing sequence and the starting position and the ending position of the current seed sequence Minimizer on a unique path in a de Bruijn graph model, carrying out forward and backward accurate matching extension on the corresponding positions of the seed sequence Minimizer on the sequencing sequence and the unique path until a base matching error on the sequencing sequence is met or the last base position of the sequencing sequence is reached or the last base position of the unique path is reached, recording the forward and backward extension positions of the seed sequence Minimizer on the sequencing sequence and the unique path and the coverage length of the seed sequence Minimizer on the sequencing sequence after extension, and traversing all the seed sequence minimzers on the sequencing sequence according to the method.

If the end position of the current seed sequence Minimizer on the sequence is smaller than the backward extension position of the previous seed sequence Minimizer, the current seed sequence Minimizer is included in the extension coverage area of the previous seed sequence Minimizer, and then the precise matching extension is not needed; otherwise, the exact match extension described above is performed.

Each time it is finishedObtaining a candidate seed sequence seed by extending the exact match of a Minimizer, and recording the unique path number ID corresponding to the candidate seed sequence seed _Unipath Offset position on the sequencing, position on the reference genome, length of coverage on the sequencing, candidate seed sequence number ID _seed All candidate seed sequences seed are obtained.

b. Merging candidate seed sequences:

since different candidate seed sequences seed may fall on the same unique path unipath, it is necessary to merge with the same ID _Unipath Candidate seed sequences seed of (a) to obtain a seed sequence having the same unique path number ID _Unipath The coverage areas of all candidate seed sequences seed in each set are combined to obtain a recalculated coverage length. At the same time, since each unique path unipath corresponds to multiple reference genome positions, the same ID will be used _Unipath The candidate seed sequence seed of (a) is split into a plurality of six-tuple according to the corresponding reference genome position, wherein the values of the first four elements in the six-tuple from the same candidate seed sequence are the same, and the values of the starting position and the ending position on the reference genome are different. A set of six-tuples of different six-tuples is used to support the following steps.

S33, path searching based on a dynamic programming method:

first, associations between part of the six-tuples within a set of six-tuples are established according to specified conditions, similar to establishing connection edges for nodes in a graph according to specified conditions. Sequencing the six-tuple SET according to the initial position of each six-tuple on the reference genome to obtain an ordered SET SET-match, and defining each six-tuple in the ordered SET SET-match as a node; traversing the ordered SET SET-match from the beginning, extracting the adjacent nodes in the appointed number (which can be a variable or an appointed number and the default value is 10) after the current node for the current six-tuple (node), and establishing a connecting edge if the current node and a certain downstream node meet the following conditions at the same time; otherwise, the connecting edge is not established.

1) Seed number ID of two nodes _seed Different;

3) The difference between the reference genome positions of the two nodes is less than a specified threshold (default 500 bp);

4) The positions of the reference genome of the two nodes are consistent with the position and size relationship on the sequencing;

downstream refers to the physical positional relationship between six tuples (nodes) on the reference genome, e.g., node 1 is located at 100bp on the reference genome and node 2 is located at 200bp on the reference genome, then node 2 is downstream of node 1.

The negative value of the connecting edge is taken as the weight of the edge, and the weight calculation method of the edge is as follows:

taking node 1 (the current node) and node 2 (some downstream node) as an example to establish a connection edge for explanation;

weight _edge ＝cover _match -score _panalty

diff _distance ＝|(P1 _read -P1 _read )-(P1 _ref -P2 _ref )

wherein weight is _edge Representation ofThe weight of the connecting edge; cover _match The length of coverage of the sequencing sequence representing node 1 on the junction edge; score _panalty Is a connection penalty (known); diff (diff) _distance Representing the reference genome position and the offset distance on the sequencing of two nodes (node 1 and node 2) on the junction side, calculated as the starting position; p1 _read Representing the offset position of node 1 on the sequencing sequence read; p1 _ref Representing the position of node 1 on the reference genome; p2 _read Representing the offset position of node 2 on the sequencing sequence read; p2 _ref The position of node 2 on the reference genome is indicated. Meanwhile, in order to eliminate the influence of the insertion distance between both ends of the paired-end double-ended sequence on the penalty calculation, if the end numbers ID of the double-end sequenced sequences of the node 1 and the node 2 are _end If the two sequences are different, the deviation distance takes logarithm, so that the connection can be established when the nodes from the two ends meet the connection condition, the integral calculation of the nodes of the two-end sequences is realized, and the accuracy of path search is improved; otherwise, taking the deviation distance value. And establishing the connecting edges meeting the connecting conditions according to the method to obtain a plurality of connecting edges.

Secondly, searching and obtaining an optimal path and a suboptimal path in the graph model by using a dynamic programming method:

a certain node may have a plurality of precursor nodes connected to it, and according to the dynamic programming concept, each node needs to be connected to the precursor node with the largest edge weight, so that the ordered SET-match needs to be traversed from the head. For each node, selecting a precursor node with the largest edge weight to construct an effective connection edge between two nodes, adding a precursor node number connected with the precursor node, and indicating the precursor node number connected with the precursor node. And obtaining the path in the graph model according to the operation.

Constructing a path set: traversing the ordered SET SET-match from tail to front, if the current node has no back-drive node and has a front-drive node, indicating that the current node is a termination node (six-tuple) of a certain path, traversing the front-drive node forward according to the connection relation of edges, traversing the front-drive node of the current node circularly according to the method, and recording the node number until a certain node is reachedAnd if the point has no precursor node, indicating that a certain node is the initial node of the current path, obtaining a new path, recording all nodes on the path, and obtaining all path sets according to the method. Simultaneously, synchronously accumulating and calculating the weight of each connecting edge in the path construction process to obtain a corresponding path score _path . Since the same node may be on multiple paths, each path and corresponding path score may be obtained in this way. Score based on path _path The value adopts a rapid ordering method to order all paths from high to low to obtain an ordered path set, and the ordered path set is traversed according to the following rules:

rule 1), if the overall coverage length of each node on the sequencing sequence of the first path in the ordered path set is smaller than a specified threshold, the specified threshold is 2/3 of the length of the sequencing sequence, and the seed sequence Minimizer is not queried on the variant sequence, comparing the Minimizer by using the local variant sequence index, if the new seed sequence Minimizer is obtained after the comparison, adding the six-tuple set corresponding to the new seed sequence Minimizer into the six-tuple set of S32b, and re-executing S33 to obtain an optimal path and a suboptimal path set; otherwise, rule 2) is executed, otherwise all reverse possibilities of the conditions described in rule 1) are included.

Rule 2), in the ordered path set, taking the first path as an optimal path, taking paths which have the same score as the first path and are in a specified number (default value is 100) as suboptimal paths, if the number of paths in the ordered path set is greater than the specified number, selecting suboptimal paths according to the specified number, and if the number of paths in the ordered path set is less than or equal to the specified number, taking all paths except the first path as suboptimal paths to obtain the optimal path and the suboptimal path set from the ordered path set.

S4, obtaining a genome sequence corresponding to the sequencing sequence of S2 on a reference genome according to the optimal path and the suboptimal path set, and comparing the sequencing sequence with the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score, wherein the specific process is as follows:

performing local sequence comparison on all paths in the optimal path and suboptimal path set generated in the step S33 and the sequence to be sequenced to obtain local sequence comparison scores, and obtaining a result comparison file according to the local sequence comparison scores, wherein the initial position of the sequence to be sequenced on the reference genome is calculated according to the initial node information (the position of the reference genome and the offset position on the sequencing sequence) of the path, the corresponding genome sequence is extracted from the reference genome, and the end-to-end global sequence comparison is performed on the sequence to be sequenced by adopting a KSW2 algorithm;

If the comparison result contains a known variation sequence, the format of the comparison result represents the known variation information according to the following method:

and respectively adding a start identification character V and an end representation character E before and after the mutation number of the mutation data acquired in S1 for custom representing the known mutation in the CIGAR information. Meanwhile, in order to realize the compressed representation and storage of the sequence comparison result and connect with the subsequent research of the mutation detection method based on the group genome, the binary three-level index structure is designed to represent the sequence comparison result, and the design method of each level index structure is as follows:

CIGAR information: the cigs character string is a sequence consisting of a base length and related operations, representing an alignment between two sequences. Cigs ar strings generally indicate which bases on a sequence match, do not match, delete or insert with a reference genomic sequence at which position.

Claims

1. A method of sequence alignment based on a population genome, characterized by: it comprises the following steps:

1) Defining a sliding window, and extracting a seed sequence Minimizer:

if the total number of seed sequence minimzers found on two opposite end sequences of a pair of double-ended sequencing sequences is smaller than a preset threshold, wherein the preset threshold is 1/3 of the total number of all seed sequence minimzers of the sequencing sequences, searching a variant sequence corresponding to the seed sequence minimzers of the pair of double-ended sequencing sequences on variant data by utilizing a local variant sequence index, if the corresponding variant sequence is found on the variant data, namely the variant sequence is consistent with the seed sequence minimzers, recording a quadruple { Unipath } of the variant sequence at the corresponding position on a reference genome _start ，Unipath _end ，Read _start ，Read _end Sets of (Unipath) _start Representing the starting position of the variant sequence on the unique path of the de Bruijn map model of the reference genome, unipath _end Representing the termination position of the variant sequence on the unique path of the de Bruijn map model of the reference genome, read _start Indicating the starting position of the variant sequence on the sequenced sequence, read _end Indicating the termination position of the variant sequence on the sequenced sequence; if no corresponding variation sequence is found on the variation data, executing S3 according to the quadruple set corresponding to the obtained seed sequence Minimizer on the reference genome position;

If the total number of the seed sequence minimzers found on the two end sequences with opposite directions of a certain pair of double-end sequencing sequences is larger than or equal to a preset threshold value, a four-element set corresponding to the seed sequence minimzers on a reference genome is obtained, and S3 is executed;

2. A method of genome-based sequence alignment according to claim 1, wherein: s1, constructing a Minimizer-based group genome index, wherein the Minimizer-based group genome index comprises a reference genome-based graph model index and a mutation data-based local mutation sequence index, and the specific process is as follows:

S11, acquiring human reference genome data and corresponding variation data containing known variations of different types, and performing redundant data removal normalization operation pretreatment on the two types of data;

3. A method of genome-based sequence alignment according to claim 2, wherein: and S3, carrying out path search on the reference genome by using a sparse dynamic programming method according to the position of the seed sequence on the reference genome to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome, wherein the specific process is as follows:

defining the paired-end reads of the double-ended sequencing sequence as L according to the corresponding tetrad set of the seed sequence Minimizer on the reference genome _1f 、L _1r Or L _2f 、L _2r ，L _1f The forward sequence coverage length, L, of read1, representing the paired-end reads _1r Read1, which represents the paired-end reads, is the reverse complement overlap length, L _2f Read2, representing the paired-end reads, is the forward sequence coverage length, L _2r Read2, which represents the paired-end reads, is the reverse complement sequence coverage length;

if L _1r +L _2f ＞L _1f +L _2r The +offset value is considered to be the correct direction as read2 being the forward sequence, from the forward chain, read1 being the reverse directionThe complement sequence is derived from the reverse chain and returns the collection of all Minimizer contained in the corresponding direction; otherwise, the correct direction cannot be distinguished, and a set containing all Minimizer corresponding to the two directions is returned;

a. extension seed sequence Minimizer:

acquiring the starting position and the ending position of the current seed sequence Minimizer on a sequencing sequence and the starting position and the ending position of the current seed sequence Minimizer on a unique path in a de Bruijn graph model, extending the corresponding positions of the seed sequence Minimizer on the sequencing sequence and the unique path forwards and backwards until a base matching error on the sequencing sequence is met or the last base position of the sequencing sequence is reached or the last base position of the unique path is reached, recording the front-back extending position of the seed sequence Minimizer on the sequencing sequence and the unique path and the covering length of the seed sequence Minimizer on the sequencing sequence after the extending, and traversing all the seed sequence minimzers on the sequencing sequence according to the method;

Obtaining a candidate seed sequence after completing the extension of a seed sequence Minimizer, and recording the unique path number ID corresponding to the candidate seed sequence _Unipath Offset position on the sequencing, position on the reference genome, length of coverage on the sequencing, candidate seed sequence number ID _seed Obtaining all candidate seed sequencesA column;

b. merging candidate seed sequences:

splitting the candidate seed sequences in the set into a plurality of six-tuple according to the corresponding reference genome positions, wherein the six-tuple comprises the candidate seed sequence number ID _seed The start and end positions of the coverage region on the sequencing sequence, the end number ID of the double ended sequencing sequence _end Obtaining a six-tuple set consisting of six-tuples of all candidate seed sequences according to the steps above at the starting position and the ending position on the reference genome;

s33, path searching based on a dynamic programming method:

weight _edge ＝cover _match -score _panalty

diff _distance ＝|(P1 _read -P2 _read )-(P1 _ref -P2 _ref )

wherein weight is _edge Representing the weight of the connecting edge; cover _match A coverage length of the sequencing sequence representing the current node; score _panalty Is a connection penalty; diff (diff) _distance Representing the current node and a certain point downThe deviation distance between the reference genome position of the upstream node and the sequencing sequence is calculated according to the initial position; p1 _read Representing the offset position of the current node on the sequencing sequence read; p1 _ref Representing the location of the current node on the reference genome; p2 _read Representing the offset position of a downstream node on the sequencing sequence read; p2 _ref Representing the position of a downstream node on a reference genome;

4. A method of group genome-based sequence alignment according to claim 3, wherein: s31, when judging the direction of each end sequencing sequence in the double-end sequencing sequence, if two seed sequence minimzers adjacent to each end sequence overlap, the coverage length is the union length of the coverage areas of the two seed sequence minimzers on the sequencing sequence.

5. The method for genome-based sequence alignment of claim 4, wherein: s31, performing Minimizer comparison by using a local variation sequence, wherein the specific process is as follows:

6. The method for genome-based sequence alignment of claim 5, wherein: the conditions in S33I are:

1) Seed number ID of two nodes _seed Different;

7. The method of sequence alignment based on genome of a population according to claim 6, wherein: the specified conditions in IV in S33 are:

8. A method of genome-based sequence alignment according to claim 7, wherein: and S4, obtaining a genome sequence corresponding to the sequencing sequence of S2 on a reference genome according to the optimal path and the suboptimal path set, and comparing the sequencing sequence with the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score, wherein the specific process is as follows:

9. A method of genome-based sequence alignment according to claim 8, wherein: if the comparison result of S4 contains a known variation sequence, the comparison result format represents variation information according to the following method: