CN115602246A - Sequence comparison method based on group genome - Google Patents

Sequence comparison method based on group genome Download PDF

Info

Publication number
CN115602246A
CN115602246A CN202211366121.9A CN202211366121A CN115602246A CN 115602246 A CN115602246 A CN 115602246A CN 202211366121 A CN202211366121 A CN 202211366121A CN 115602246 A CN115602246 A CN 115602246A
Authority
CN
China
Prior art keywords
sequence
seed
sequencing
path
genome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211366121.9A
Other languages
Chinese (zh)
Other versions
CN115602246B (en
Inventor
国宏哲
李高阳
刘博�
王亚东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202211366121.9A priority Critical patent/CN115602246B/en
Publication of CN115602246A publication Critical patent/CN115602246A/en
Application granted granted Critical
Publication of CN115602246B publication Critical patent/CN115602246B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A sequence comparison method based on group genome, relate to a human DNA sequence comparison method based on group genome specifically, in order to solve and have the precision deviation in the sequence comparison method taking single reference genome as benchmark, cause the question of the detection failure of variation of sequencing data or obtain the wrong variation site, construct the group genome index based on seed sequence at first, extract the seed sequence of the sequencing sequence, utilize the position of the seed sequence in the genome of acquisition of the index; secondly, obtaining optimal and suboptimal path sets of the seed sequences on the genome by using a sparse dynamic programming method according to the positions; and finally, obtaining a sequence of the sequencing sequence corresponding to the genome according to the optimal path set and the suboptimal path set, comparing the sequencing sequence with the corresponding sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score. Belongs to the field of sequence comparison.

Description

Sequence comparison method based on group genome
Technical Field
The invention relates to a sequence comparison method, in particular to a human DNA sequence comparison method based on a group genome, and belongs to the field of sequence comparison.
Background
Genome sequencing data analysis has become an important research direction and development focus in the fields of life science, accurate health, personalized medicine and the like. A number of large-scale human genome sequencing projects including the international thousand-man genome project (1 KG), the UK thousand-man genome project (UK 10K), the american precision medicine project (All of Us), the singapore thousand-man genome project (SG 10K) and the chinese hundred thousand-man genome project have been internationally developed. The genome project generated sequencing data scale to PB-ZB level, and the demand of genome data analysis increased explosively.
Genome sequence alignment is an important basic step of genome sequence analysis, and is also a bottleneck step consuming a large amount of computing time and resources, and the problem is more obvious when the data volume reaches PB level. The existing second generation sequencing technology cannot directly generate a complete genome, so the aim of genome sequence alignment is to locate a large number of disordered sequencing fragments on a reference genome so as to obtain the variation information of each site, and the existing commonly used sequence alignment method taking a single reference genome as a benchmark has the problem of systematic deviation of precision. The upstream of genome sequence alignment accepts the input of sequencing data, and the downstream continuous variation analysis and disease correlation analysis, so that the quality of sequence alignment has great influence on downstream analysis, and inaccurate sequence alignment can cause failure of variation detection or give wrong variation sites, and directly influence the correctness and credibility of downstream analysis.
Disclosure of Invention
The invention aims to solve the problem that sequencing data variation detection fails or wrong variation sites are obtained due to the fact that precision deviation exists in a sequence comparison method taking a single reference genome as a benchmark, and further provides a group genome-based sequence comparison method.
The technical scheme adopted by the invention is as follows:
it comprises the following steps:
s1, constructing a seed sequence-based population genome index, wherein the seed sequence-based population genome index comprises a seed sequence-based map model index of a reference genome and a seed sequence-based local variation sequence index of variation data;
s2, extracting a seed sequence of the sequencing sequence, and obtaining a variant sequence of the seed sequence on variant data by using a local variant sequence index to obtain the position of the variant sequence on a reference genome, namely the position of the seed sequence on the reference genome;
s3, according to the position of the seed sequence on the reference genome, performing path search on the reference genome by using a sparse dynamic programming method to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome;
s4, obtaining a genome sequence corresponding to the sequencing sequence of the S2 on a reference genome according to the optimal path and the suboptimal path set, comparing the sequencing sequence with the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score;
s5, obtaining a sequence to be sequenced, and executing S1-S4 to compare the sequence to be sequenced to obtain a result comparison file.
Further, a minizer-based population genome index is constructed in the step S1, and the minizer-based population genome index comprises a reference genome-based map model index and a variation data-based local variation sequence index, and the specific process comprises the following steps of:
s11, acquiring human reference genome data and corresponding variant data containing known different types of variants, and performing standardized operation pretreatment such as redundant data removal on the two data;
s12, constructing de Bruijn graph model index representation of the reference genome to obtain de Bruijn graph model index based on the reference genome;
s13, constructing local variation sequence index representation of variation data to obtain the local variation sequence index of the variation data;
s14, constructing a minisizer-based population genome index representation according to the unique path and local variation sequence index of the de Bruijn graph model index.
Further, a seed sequence of the sequencing sequence is extracted in S2, a variation sequence of the seed sequence on the variation data is obtained by using the local variation sequence index, and a position of the variation sequence on the reference genome, that is, a position of the seed sequence on the reference genome, is obtained, and the specific process is as follows:
for each pair of paired-end sequencing sequences in the sequencing sequence, the following operations are performed for each end sequence:
1) Defining a sliding window and extracting a seed sequence Minimizer:
dividing one end sequence into a plurality of intervals with fixed lengths by using sliding windows, wherein each interval comprises a plurality of k-mer sequences, and defining that adjacent sliding windows on one end sequence are mutually covered and the difference between the initial positions of the adjacent sliding windows is 1bp;
traversing a terminal sequence from the beginning, selecting a k-mer sequence with the minimum ASC2 value or hash value in each sliding window as a seed sequence Minimizer according to the defined size of the sliding window, obtaining all seed sequences Minimizer on one end sequence;
2) Obtaining the position of the seed sequence Minimizer on the reference genome:
if the total number of the minimizers of the seed sequences found on the two end sequences opposite to each other in the direction of a pair of double-end sequencing sequences is smaller than a preset threshold which is 1/3 of the total number of the minimizers of all the seed sequences of the sequencing sequences, searching a variant sequence corresponding to the Minimizer of the seed sequences of the pair of double-end sequencing sequences on variant data by using a local variant sequence index, if the corresponding variant sequence is found on the variant data, indicating that the variant sequence is consistent with the Minimizer of the seed sequence, and recording a quadruplet { Unipath } of the corresponding position of the variant sequence on a reference genome start ,Unipath end ,Read start ,Read end Set of { Unipath }, where start Representing the starting position of the variant sequence on the unique path of the de Bruijn graph model of the reference genome, unipath end Representing the termination position of the variant sequence on the unique path of the de Bruijn graph model of the reference genome, read start Indicating the start position of the variant sequence on the sequenced sequence, read end Indicates the termination position of the variant sequence on the sequenced sequence; if no correspondence is found in the variant sequencePerforming S3 on the basis of the corresponding quadruplet set of the obtained seed sequence Minimizer at the reference genome position;
and if the total number of the seed sequences minimizers found on the two end sequences opposite to the direction of a certain pair of double-end sequencing sequences is greater than or equal to a preset threshold value, obtaining a corresponding quadruplet set of the seed sequences minimizers on the reference genome, and executing S3.
Further, in S3, according to the position of the seed sequence on the reference genome, a sparse dynamic programming method is used to perform path search on the reference genome to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome, and the specific process is as follows:
s31, judging the direction of each end sequencing sequence in the double-end sequencing sequence:
defining the paired-end reads of the double-ended sequencing sequence as L according to the corresponding quadruplet set of the seed sequence minizer on the reference genome 1f 、L 1r Or L 2f 、L 2r ,L 1f Forward sequence coverage length, L, of read1 representing paired-end reads 1r Read1 representing paired-end reads is the reverse complement covered length, L 2f Read2, representing paired-end reads, is the forward sequence coverage length, L 2r Read2 representing paired-end reads is the reverse complement sequence coverage length;
if L is 1f +L 2r >L 1r +L 2f + an offset value, wherein the offset value is 1/10 of the length of the sequencing sequence, if the correct direction is considered that read1 is a forward sequence and comes from a forward strand, read2 is a reverse complementary sequence and comes from a reverse strand, and a set of all minimizers contained in the corresponding direction is returned;
if L is 1r +L 2f >L 1f +L 2r + offset value, regarding read2 as forward sequence from the forward strand, read1 as reverse complementary sequence from the reverse strand, and returning all minimizers contained in the corresponding direction; otherwise, the correct direction is considered to be indistinguishable, and the sets which correspond to the two directions and contain all minimizers are returned;
if the total coverage length of the sequencing sequences at the two ends is less than a specified threshold which is 1/3 of the total length of the sequencing sequences at the two ends, carrying out Minimizer comparison by using the local variant sequences to obtain the directions of the sequencing sequences at the two ends, and returning all Minimizer sets contained in the corresponding directions;
s32, extending and merging the seed sequence Minimizer according to the direction of the sequencing sequence at each end:
a. extension of seed sequence Minimizer:
acquiring the starting position and the ending position of a current seed sequence Minimizer on a sequencing sequence, and the starting position and the ending position of the current seed sequence Minimizer on a unique path in a de Bruijn graph model, extending the corresponding positions of the seed sequence Minimizer on the sequencing sequence and the unique path forwards and backwards until a certain base matching error on the sequencing sequence is met or the last base position of the sequencing sequence is reached or the last base position of the unique path is reached, recording the front-back extending position and the covering length of the seed sequence Minimizer on the sequencing sequence after the front-back extending position and the extending position of the seed sequence Minimizer on the sequencing sequence and the unique path are recorded, and traversing all the seed sequences minimizers on the sequencing sequence according to the method;
during traversal, if the termination position of the current seed sequence Minimizer on the sequencing sequence is smaller than the backward extension position of the previous seed sequence Minimizer, the current seed sequence Minimizer is contained in the extension coverage area of the previous seed sequence Minimizer without extension; otherwise, executing the extension;
obtaining a candidate seed sequence after each time of extension of a seed sequence Minimizer, and recording a unique path number ID corresponding to the candidate seed sequence Unipath Offset position on sequencing sequence, position on reference genome, coverage length on sequencing sequence, candidate seed sequence number ID seed Obtaining all candidate seed sequences;
b. merging candidate seed sequences:
merging IDs having the same unique path number Unipath Obtaining a set of candidate seed sequences, merging each setCombining the coverage areas of all candidate seed sequences to obtain a new coverage length;
splitting the candidate seed sequences in the set into a plurality of six-tuple (candidate seed sequence number ID) according to the corresponding reference genome positions seed The start and end positions of the region covered on the sequencing sequence, the end number ID of the paired-end sequencing sequence end The initial position and the termination position on the reference genome) to obtain a six-element group set consisting of six-element groups of all candidate seed sequences according to the steps;
s33, path search based on the dynamic planning method:
the method comprises the steps that I, six-tuple sets are sequenced according to the initial positions of all six-tuple sets on a reference genome to obtain an ordered set, each six-tuple in the ordered set is defined as a node, the ordered set is traversed from the beginning, adjacent nodes in a specified number behind the current node are extracted, and if the current node and a certain downstream node meet the condition at the same time, a connecting edge is established; otherwise, no connecting edge is established;
and II, taking the negative value of the connecting edge as the weight of the edge, wherein the weight calculation method comprises the following steps:
weight edge =cover match -score panalty
Figure BDA0003919106520000051
diff distance =|(P1 read -P1 read )-(P1 ref -P2 ref )
wherein, weight edge Representing the weight of the connecting edge; cover match Representing the coverage length of the sequencing sequence of the current node; score panalty Is a connection penalty; diff (diff) distance Representing the reference genome positions of the current node and a certain downstream node and the deviation distance on the sequencing sequence, and calculating according to the initial position; p1 read Representing the offset position of the current node on the sequencing sequence read; p1 ref Representing the position of the current node on the reference genome; p is2 read Indicating the offset position of a certain downstream node on the sequencing sequence read; p2 ref Represents the position of a downstream node on the reference genome;
if the end number ID of the double-end sequencing sequence of the current node and a certain downstream node end If the difference is different, taking the logarithm of the deviation distance, otherwise, taking the value of the deviation distance; obtaining a plurality of connecting edges according to the method;
III, traversing the ordered set from the beginning according to the dynamic planning idea of the dynamic planning method, selecting a precursor node with the maximum edge weight to construct an effective connecting edge between two nodes aiming at each node, and adding a back-driving node number connected with the precursor node to obtain a path in the graph model;
the ordered set is traversed from tail to front, if the current node has no back-drive node and has a front-drive node, the current node is a termination node of a certain path, the front-drive node is traversed forward according to the connection relation of edges, the front-drive node of the current node is traversed circularly according to the method, the node number is recorded until a certain node has no front-drive node, namely, the certain node is an initial node of the current path, a path is obtained, and all nodes on the path are recorded; obtaining all path sets according to the method;
and IV, synchronously accumulating and calculating the weight of each connecting edge in the path construction process to obtain a corresponding path score path According to the path score path And sequencing all paths from high to low by adopting a quick sequencing method to obtain an ordered path set, and traversing the set according to specified conditions to obtain an optimal path set and a suboptimal path set.
Further, in the step S31, when determining the direction of each end of the paired end sequencing sequence, if two adjacent seed sequences minimizers on each end of the sequencing sequence overlap, the coverage length is the union length of the coverage areas of the two seed sequences minimizers on the sequencing sequence.
Further, in S31, minimizer alignment is performed using the local variant sequences, which comprises the following steps:
seed sequence Mi from two-terminal sequencing sequencesnimizer, using local variation sequence index to search variation sequence corresponding to variation data, and returning quadruple { ALT ] corresponding to variation sequence on variation data start ,ALT end ,Read′ start ,Read′ end },ALT start ,ALT end ,Read′ start ,Read′ end Respectively representing the initial position and the end position of the variant sequence on variant data and the initial position and the end position of the variant sequence on a sequencing sequence, and updating L according to the coverage length corresponding to the quadruple 1f 、L 1r 、L 2f 、L 2r And judging the direction of the sequencing sequence at each end according to S31, and returning all Minimizer sets contained in the corresponding direction.
Further, the conditions in i in S33 are:
1) Seed number ID of two nodes seed Different;
2) Chromosomes corresponding to the reference genome positions of the two nodes are the same;
3) The difference of the reference genome positions of the two nodes is smaller than a specified threshold value;
4) And the reference genome positions of the two nodes are consistent with the position size relation on the sequencing sequence.
Further, the specified conditions within iv in S33 are:
1) If the overall coverage length of each node on the sequencing sequence on the first path in the path set is smaller than a specified threshold, the specified threshold is 2/3 of the length of the sequencing sequence, and the seed sequence minizer is not inquired on the variant sequence, minizer comparison is carried out by using a local variant sequence index, if a new seed sequence minizer is obtained after comparison, the hexahydric set corresponding to the new seed sequence minizer is added to the hexahydric set of the S32b, and the S33 is executed again to obtain an optimal path and a suboptimal path set; otherwise, execute 2);
2) In the ordered path set, a first path is used as an optimal path, paths which have the same score as the first path and are within the specified number are used as suboptimal paths, if the number of the paths in the ordered path set is greater than the specified number, the suboptimal paths are selected according to the specified number, and if the number of the paths in the ordered path set is less than or equal to the specified number, all the paths except the first path are used as suboptimal paths to obtain an optimal path and suboptimal path set.
Further, in S4, a genome sequence corresponding to the sequencing sequence of S2 on the reference genome is obtained according to the optimal path and the suboptimal path set, the sequencing sequence is compared with the corresponding genome sequence by using a KSW2 algorithm to obtain a comparison score, and a result comparison file is obtained according to the comparison score, which comprises the following specific processes:
and calculating the initial position of the sequencing sequence of the S2 on the reference genome according to the initial node information of all the paths in the optimal path set and the suboptimal path set generated in the step S33, extracting the genome sequence corresponding to the sequencing sequence from the reference genome, performing end-to-end global sequence comparison on the sequencing sequence and the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score.
Further, if the alignment result of S4 includes a known variant sequence, the format of the alignment result represents variant information according to the following method:
respectively adding a starting identification character 'V' and an ending representation character 'E' before and after the variation number of the variation data acquired in S1; then, a binary three-level index structure is constructed to represent a comparison result, and the construction method of each level of index structure is as follows:
1) Recording the number of comparison results of each sequencing sequence, namely the offset position on the second-level index, in each data unit of the first-level index;
2) Recording the length of a CIGAR sequence of each sequencing sequence alignment result, namely the offset position on the third-level index, in each data unit of the second-level index;
3) And the third-level index records the CIGAR sequence of each sequencing sequence alignment result.
Has the advantages that:
the invention aligns the sequencing sequences according to an index of a seed sequence-based population genome, the seed sequence-based population genome index comprising a map of a seed sequence-based reference genomeA model index and a local variant sequence index based on variant data of the seed sequence; extracting a seed sequence on the sequencing sequence, and acquiring a variant sequence of the seed sequence on variant data by using a local variant sequence index to obtain the position of the variant sequence on a reference genome, namely the position of the seed sequence on the reference genome; judging the direction of each end sequencing sequence in the double-end sequencing sequence according to the position, extending and combining the seed sequences according to the direction to obtain a six-element group (candidate seed sequence number ID) of the candidate seed sequences seed The start and end positions of the region covered on the sequencing sequence, the end number ID of the paired-end sequencing sequence end Initial position and termination position on a reference genome), establishing a connection edge between six tuples, performing path search by combining a sparse dynamic programming method with the connection edge to obtain an optimal path and a suboptimal path set of a sequencing sequence on a reference genome map model, obtaining a genome sequence of the sequencing sequence on the reference genome according to the optimal path and the suboptimal path set, comparing the sequencing sequence with the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score, thereby realizing the sequence comparison of the genome groups.
The invention provides a group genome index structure formed by utilizing a group reference genome and a corresponding known common variant sequence and a sparse dynamic programming method, and provides a group genome-based sequence comparison method, thereby greatly improving the comparison speed, precision and sensitivity of genome sequences and solving the problems of systematic precision deviation, low sequence comparison quality and incorrect comparison position of sequence comparison formed by taking a single reference genome as a reference.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
Detailed Description
The first embodiment is as follows: referring to fig. 1, the present embodiment is described as a method for aligning sequences based on genome of a population, which comprises the following steps:
s1, constructing a population genome index based on a seed sequence minizer, wherein the population genome index based on the seed sequence minizer comprises a map model index based on a reference genome of the seed sequence minizer and a local variation sequence index based on variation data of the seed sequence minizer, and the specific process comprises the following steps of:
s11, acquiring human reference genome data and corresponding variant data containing known different types of variants, and carrying out normalization operation preprocessing such as redundant data removal on the two data. Variant data is obtained from a database.
S12, constructing de Bruijn graph model index representation of the reference genome, and specifically comprising the following steps:
traversing a complete reference genome, extracting all k-mer sequences, wherein the length of the k-mer sequence is a short sequence fragment of k, simultaneously recording quadruplets (k-mer sequences, edge entering bases, edge exiting bases and offset positions) corresponding to each k-mer sequence, extracting correlation relations between different k-mer sequences and the k-mer sequences of input data to construct a de Bruin graph model based on the quadruplets corresponding to each k-mer sequence, wherein a de Bruin graph G = < V, E > is a directed graph model, nodes V in the graph are k-mer sequences with different lengths of k, and a node set V = { V1, V2, …, vM } consisting of all different k-mer sequences on the reference genome is extracted; for two nodes vi and vj, if vi overlaps vj with a sequence of (k-1) -mers, there is a directed edge vi → vj.
Each k-mer sequence represents one de Bruijn graph model node, a unique path unipath is determined according to the number of incoming edges and outgoing edges of the node, the incoming degree of the node is defined as the incoming edge of the node, the outgoing degree is defined as the number of the outgoing edges, and each node has one or more incoming edges and outgoing edges; if the initial node in-degree of the path is greater than 1 and the out-degree is 1, the end node out-degree is greater than 1 and the in-degree is 1, the intermediate node in-degree is 1 and the out-degree is 1, the path is a unique path unipath, and all unique paths are obtained.
For each unique path, the nodes on the unique path are associated with the k-mer sequence position set; for a certain path, traversing each node from the head to the back, acquiring two position sets of a current node and a subsequent node thereof, and finding out the similar position elements of the two sets; because each position set is an ordered set, traversing a certain set element and adopting a binary search method to find out all similar positions and form a new position set; according to the method, each node on the path is traversed backwards in sequence, the position searching and set combining operation is executed every time a new node is traversed until the end node of the path is encountered, and the finally formed position set is the position set corresponding to the current unique path.
And generating a position set corresponding to all the unique paths according to the method, and representing the position set as a binary index file which is represented as F-pos.
When traversing a complete reference genome to extract all k-mer sequences, the k-mer sequences need to be sequenced and deduplicated, and the specific process comprises the following steps:
writing the quadruple corresponding to the k-mer sequence into a corresponding temporary file according to the sequence content of the first f bp of the k-mer as a file name, wherein bp represents the total amount of bases; importing all k-mer sequences in a file into a memory, performing rapid sequencing on the k-mers in the file in the current memory to order quadruples corresponding to the k-mer sequences in the file, and outputting the quadruples to a hard disk to form a corresponding ordered file; and (3) fast sequencing: traversing a k-mer set in the current file from the head, merging adjacent same k-mer sequences, and merging offset position combinations corresponding to the k-mers; and sequentially importing all files into a memory and performing rapid sequencing to generate all different k-mer sequences and ensure that the position sets corresponding to the k-mers are ordered.
The process of determining the unique path unipath according to the number of the incoming edges and the outgoing edges of the nodes comprises the following steps:
firstly, according to the number of the incoming edges and the outgoing edges of the nodes, the nodes are divided into the following types:
1) Node of type 'X': a plurality of incoming edges and a plurality of outgoing edges;
2) Node of the 'FY' type: a plurality of in edges and one out edge;
3) Node of the 'RY' type: one entrance edge and a plurality of exit edges;
4) Node of the 'L' type: one in and one out;
the starting and ending nodes of a unique path may be constructed in several ways:
(1) The start and end nodes are the same 'X' type node;
(2) The start node is a 'FY' type node and the end node is a 'RY' type node;
(3) The start node is a back-drive node of the 'X' type node or a back-drive node of the 'FY' type node and the end node is a front-drive node of the 'X' type node or a front-drive node of the 'RY' type node;
then, traversing all nodes in the graph, and if the current node is the initial node of the unique path, executing backward extension operation; backward extension operation: calculating the back-driving node of the current node according to the edge-out information of the current node, if the back-driving node is an 'L' -type node, continuing to extend backwards, and sequentially circulating according to the method until an end node is encountered; all unique paths of the graph model can be generated by the method, and all nodes in the paths are ensured not to be repeated;
all the unique path index files are generated through the method, and each unique path index file corresponds to one unipath.
S13, constructing local variation sequence index representation of variation data, and specifically comprising the following steps:
the reference genome is divided into overlapping regions according to fixed-length intervals, each divided fixed-length interval is used as a local region, every two adjacent local regions are partially overlapped, and the length of the overlapping region is half of the length of the divided regions.
Aiming at the haplotype of an individual, collecting the variation in each local region, and forming a local individual genome sequence with known variation, namely generating a local variation sequence; the length of the overlapping area is half the length of the dividing area.
And generating variant sequence index files by the method, wherein each local variant sequence index file is correspondingly used as an alt string.
The reference genome is divided into a plurality of intervals with fixed length (the default value is 300 bp) and coverage (the default value is 150 bp) from beginning to end, local sequences containing known variation in each interval are extracted, and if a plurality of variations exist in the intervals, the combination of the variations is constructed according to the real haplotype information, so that local variation sequence indexes are created.
S14, constructing a population genome index representation based on the seed sequence minizer according to the unique path unipath and the local variation sequence index alt string of the de Bruijn graph model index, namely the population genome index based on the seed sequence minizer comprises a graph model index based on a reference genome of the seed sequence minizer and a local variation sequence index based on variation data of the seed sequence minizer:
before indexes are constructed for the unipath and the alt string, respectively converting a unipath graph and an alt string list into character string lists; aiming at each character string in the character string list, according to a seed sequence Minimizer selection criterion, selecting a k-mer with the minimum hash value in a sliding window as a local representative k-mer; these local minimum k-mers, called seed sequences Minimizer, will be stored in a unified list.
The specific process for selecting the k-mer sequence with the minimum hash value in the sliding window comprises the following steps:
the selection of the Minimizer has two key parameters, namely windows size and k-mer sequence length; assuming that the sequencing sequence of the index to be constructed is L and the length of the K-mer is L _ K, when the sliding window is selected, (L-L _ K + 1) K-mers are selected; if the windows size is W, selecting a min _ k-mer1 with the minimum hash value from the 5 k-mers with the numbers of 1-5; then, selecting the minimum min _ k-mer2 from the 5 of 2 to 6; then 3-7 and so on; the min _ k-mers selected from different adjacent minimizers by selecting the window may be the same, and each repetition is only recorded once; the min _ k-mer is called the seed sequence minizer.
The Minimizer lists corresponding to all the character strings are combined into a complete list, and the list is sorted according to the size of the hash value of the Minimizer; and constructing a secondary index of the mimizer according to the high K bits of the hash value of each miniizer, wherein K is a secondary index parameter.
The secondary index splits the minisizer's total index into B =2 ^K The number of the minimizers is M, and each cell contains M/B minimizers on average; each constructed Minimizer occupies a 96-bit storage space, wherein the last Q bit of the hash value of the Minimizer occupies a 30-bit storage space; the position of the minisizer on the original "unipath" or "alt string" list takes up 65 bits of space.
S2, extracting a seed sequence of the sequencing sequence, acquiring a variant sequence of the seed sequence on variant data by using a local variant sequence index, and obtaining a position of the variant sequence on a reference genome, namely the position of the seed sequence on the reference genome, wherein the specific process comprises the following steps:
for each pair of paired-end sequencing sequences paired-end (sequence a and sequence b) in the sequencing sequence, the following operations are performed for each end sequence (sequence a or sequence b) of each pair of paired-end sequencing sequences:
1) Defining a sliding window, and extracting a seed sequence:
one end sequence (sequence a or sequence b) is divided into a plurality of intervals with fixed lengths by using a sliding window, the window size of the interval with the fixed length is generally set to be 10bp by default, the specific window size can be adjusted according to an actual application scene, and each interval comprises a plurality of k-mer sequences. Defining the mutual coverage between adjacent sliding windows on one end sequence, wherein the difference between the initial positions of the adjacent sliding windows is 1bp; and traversing the one-end sequence from the beginning, selecting a k-mer sequence with the minimum ASC2 value or hash value in each sliding window according to the defined size of the sliding window as a seed sequence Minimizer, and obtaining all the seed sequences minimizers on the one-end sequence (sequence a or sequence b).
2) Obtaining the position of the seed sequence on the reference genome:
if the total number of the seed sequences minimizers found on the two end sequences (sequence a and sequence b) opposite to each other in the paired double-end sequencing sequences is less than a preset threshold value, wherein the preset threshold value is 1/3 of the total number of all the seed sequences minimizers of all the sequencing sequences, local parts in the genome index of the Minimizer-based population are utilizedThe variation sequence index searches a variation sequence (data) corresponding to a seed sequence minizer of a certain pair of double-end sequencing sequences on known variation data, if the corresponding variation sequence is found on the variation data, the variation sequence is consistent with the seed sequence minizer, and a quadruplet match { Unipath } of the corresponding position of the variation sequence on a reference genome is recorded start ,Unipath end ,Read start ,Read end Set of { Unipath }, where start Representing the starting position of the variant sequence (seed sequence minizer) on the unique path of the de Bruijn map model of the reference genome, unipath end Representing the termination position of the variant sequence (seed sequence minizer) on the unique path of the de Bruijn map model of the reference genome, read start Indicates the starting position of the variant sequence (seed sequence minizer) on the sequenced sequence, read end Indicates the termination position of the variant sequence (seed sequence minizer) on the sequenced sequence; if no corresponding variant sequence is found on the variant sequence, performing S3 according to the corresponding quadruplet set of the obtained seed sequence Minimizer on the reference genome position;
if the total number of the seed sequences minimizers found on the two end sequences opposite to the paired-end direction of a certain pair of double-end sequencing sequences is greater than or equal to a preset threshold, the number of the found seed sequences minimizers is enough, the seed sequences indexed by the local variation sequences are not searched, the quadruple set corresponding to the seed sequences minimizers on the reference genome is obtained, and S3 is executed.
The method utilizes de Bruijn graph model index representation and local variation sequence index representation of a reference genome to search for the quadruplet { Unipath } corresponding to all candidate positions of the seed sequence Minimizer on the reference genome start ,Unipath end ,Read start ,Read end The set of sequences, the seed sequence minisizer of the sequenced sequence is unique in position on the unique path unipath.
S3, according to the position of the seed sequence on the reference genome, performing path search on the reference genome by using a sparse dynamic programming method to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome, wherein the specific process comprises the following steps:
s31, judging the direction of each end sequencing sequence in the double-end sequencing sequence:
defining the paired-endreads of the paired-end sequencing sequence as L according to the corresponding quadruplet set of the seed sequence minizer on the reference genome 1f 、L 1r Or L 2f 、L 2r ,L 1f Forward sequence coverage length, L, of read1 representing paired-endiads 1r Read1 representing paired-endiads is the reverse complement sequence cover length, L 2f Read2, representing paired-end reads, is the forward sequence coverage length, L 2r Read2, representing paired-endiads, is the reverse complement sequence coverage length.
L 1f Is totally called L 1forward "refers to the forward sequence of read1 of paired-endiads, L 1r The reverse complementary sequence of read1 referring to paired-endiads; l is a radical of an alcohol 2f And L 2r The same is true. The principle of the second-generation sequencing is that a double-helix structure (double-strand) of DNA is sequenced from 5 'end to 3' end at the same time, and the generated sequencing sequences are generated in pairs and are called paired-end reads, wherein one read is from a positive strand of a genome, the other read is from a reverse strand of the genome, and the sequencing direction is opposite to that of the positive strand.
And (3) judging the real direction of each end sequence according to the integral coverage length formed by all the seed sequences minimizers on each end sequence: if L is 1f +L 2r >L 1r +L 2f The bias value is an empirical value, the bias value is related to the length of the sequencing sequence, and is generally 1/10 of the length of the sequencing sequence, then the correct direction is considered that read1 is a forward sequence, is derived from a forward strand, and read2 is a reverse complementary sequence, is derived from a reverse strand, and returns a set of all minimizers contained in the corresponding direction, for example, read1 is set according to the forward direction and read2 is set according to the reverse direction; if L is 1r +L 2f >L 1f +L 2r + offset value, then considered positiveThe definite direction is that read2 is a forward sequence and comes from a positive strand, read1 is a reverse complementary sequence and comes from a reverse strand, and the set of all minimizers contained in the corresponding direction is returned; otherwise, the correct direction is not considered to be distinguished, and the set which comprises all minimizers and corresponds to the two possible directions is returned.
If two adjacent seed sequences minisizer (c and d) on each end sequence overlap, the coverage length is the union length of the coverage areas of the two seed sequences minisizer (c and d) on the sequencing; according to this principle, the length of coverage of the two end sequencing sequences in the two possible orientations was calculated separately.
If the total coverage length of the sequencing sequences at the two ends is less than the designated threshold which is 1/3 of the total length of the sequencing sequences at the two ends, the local variant sequences are used for comparing minimizers to obtain the directions of the sequencing sequences at the two ends, and all Minimizer sets contained in the corresponding directions are returned.
The Minimizer comparison method based on the local variant sequence comprises the following specific processes:
according to the seed sequence Minimizer of the two-end sequencing sequence, the local variation sequence index representation is used for searching the variation sequence (seed sequence Minimizer) corresponding to the variation data, and the quadruple { ALT (alternating sequence) corresponding to the variation sequence on the variation data is returned start ,ALT end ,Read′ start ,Read′ end }, each element in the quadruple (ALT) start ,ALT end ,Read′ start ,Read′ end ) Respectively representing the initial position and the end position of the variant sequence (seed sequence Minimizer) on the variant data (ALT) and the initial position and the end position on the sequencing sequence, and updating L according to the coverage length corresponding to the quadruple 1f ,L 1r ,L 2f ,L 2r And judging the direction of the sequencing sequence at each end according to the method, and returning all Minimizer sets contained in the corresponding direction.
S32, extending and merging seed sequences according to the direction of the sequencing sequence at each end:
a. extension of the seed sequence:
acquiring the starting position and the ending position of the Minimizer of the current seed sequence on a sequencing sequence, and the starting position and the ending position of the Minimizer of the current seed sequence on a unique path in a de Bruijn graph model, carrying out forward and backward accurate matching extension on the corresponding positions of the Minimizer of the seed sequence on the sequencing sequence and the unique path until a certain base matching error on the sequencing sequence is met or the last base position of the sequencing sequence is reached or the last base position of the unique path is reached, recording the front and back extension positions of the Minimizer of the seed sequence on the sequencing sequence and the unique path and the coverage length of the Minimizer of the seed sequence on the sequencing sequence after extension, and traversing all the minimizers of the seed sequence on the sequencing sequence according to the method.
During traversal, if the termination position of the current seed sequence Minimizer on the sequencing sequence is smaller than the backward extension position of the previous seed sequence Minimizer, the current seed sequence Minimizer is contained in the extension coverage area of the previous seed sequence Minimizer, and then accurate matching extension is not needed; otherwise, the exact match extension described above is performed.
Obtaining a candidate seed sequence seed after each time of completing the precise matching extension of a Minimizer, and recording the unique path number ID corresponding to the candidate seed sequence seed Unipath Offset position on sequencing sequence, position on reference genome, coverage length on sequencing sequence, candidate seed sequence number ID seed And obtaining all candidate seed sequences seed.
b. Merging candidate seed sequences:
merging IDs having the same unique path number Unipath Obtaining a set of candidate seed sequences, and combining coverage areas of all candidate seed sequences in each set to obtain a new coverage length;
splitting the candidate seed sequences in the set into a plurality of six-tuple (candidate seed sequence number ID) according to the corresponding reference genome positions seed The start and end positions of the region covered on the sequencing sequence, the end number ID of the paired-end sequencing sequence end The start and end positions on the reference genome) were obtained according to the above procedureA six-member group set composed of six-member groups of the candidate seed sequences;
since different candidate seed sequences seed may fall on the same unique path unipath, merging of seed sequences with the same ID is required Unipath To obtain the candidate seed sequence seed with the same unique path number ID Unipath The coverage areas of all the candidate seed sequences seed in each set are merged to obtain the coverage length which is recalculated. Meanwhile, since each unique path unipath corresponds to a plurality of reference genome positions, the same ID will be used Unipath The candidate seed sequence seed is divided into a plurality of six-tuple according to the corresponding reference genome position, wherein the first four elements in the six-tuple from the same candidate seed sequence have the same numerical value, and the starting position and the ending position on the reference genome have different numerical values. A different set of six-tuple comprising six-tuple is used to support the following steps.
S33, path searching based on the dynamic planning method:
first, the association between partial six-tuples in the six-tuple set is established according to the specified conditions, similar to the establishment of connecting edges for nodes in a certain graph according to the specified conditions. Sequencing the six-tuple SET according to the initial position of each six-tuple on the reference genome to obtain an ordered SET SET-match, and defining each six-tuple in the ordered SET SET-match as a node; traversing the ordered SET SET-match from the head, extracting the adjacent nodes in the specified number (which can be a variable, can be the specified number, and has a default value of 10) after the current node for the current six-tuple (node), and if the current node and a certain downstream node simultaneously satisfy the following conditions, establishing a connecting edge; otherwise, no connecting edge is established.
1) Seed number ID of two nodes seed Different;
2) Chromosomes corresponding to the reference genome positions of the two nodes are the same;
3) The difference of the reference genome positions of the two nodes is less than a specified threshold (default 500 bp);
4) The reference genome positions of the two nodes are consistent with the position size relation on the sequencing sequence;
downstream refers to the physical position relationship between six-element groups (nodes) on the reference genome, for example, if the position of node 1 on the reference genome is the 100bp, and the position of node 2 on the reference genome is the 200bp, then node 2 is downstream of node 1.
Taking the negative value of the connecting edge as the weight of the edge, and the weight calculation method of the edge is as follows:
taking the example of establishing a connection edge between a node 1 (current node) and a node 2 (a certain downstream node) as an example for explanation;
weight edge =cover match -score panalty
Figure BDA0003919106520000151
diff distance =|(P1 read -P1 read )-(P1 ref -P2 ref )
wherein, weight edge Representing the weight of the connecting edge; cover match Represents the coverage length of the sequencing sequence of node 1 on the connecting edge; score panalty Is a connection penalty (known); diff is a unit of a product distance Representing the reference genome position of two nodes (node 1 and node 2) on the connecting edge and the deviation distance on the sequencing sequence, and calculating according to the initial position; p1 read Represents the offset position of node 1 on the sequencing sequence read; p1 ref Represents the position of node 1 on the reference genome; p2 read Represents the offset position of node 2 on the sequencing sequence read; p2 ref Indicating the position of node 2 on the reference genome. Meanwhile, in order to eliminate the influence of the insertion distance between the two ends of the paired-end sequencing sequence paired-end on penalty calculation, if the end numbers ID of the paired-end sequencing sequences of the node 1 and the node 2 are numbered end If the difference is different, the logarithm of the deviation distance is taken, and the nodes from the two ends can be ensured to be connected when the connection conditions are met, so that the integral calculation of the nodes of the sequences at the two ends is realized, and the accuracy of path search is improved; otherwise, taking the deviation distance value. Establishing a satisfied link according to the above methodAnd connecting the connecting edges of the conditions to obtain a plurality of connecting edges.
Secondly, searching and obtaining an optimal path and a suboptimal path in the graph model by using a dynamic programming method:
according to the dynamic planning idea, each node needs to be connected with the precursor node with the largest edge weight, so that the ordered SET SET-match needs to be traversed from the beginning. And aiming at each node, selecting a precursor node with the maximum edge weight to construct an effective connecting edge between two nodes, adding a back-drive node number connected with the precursor node to the precursor node, and indicating the back-drive node number connected with the precursor node. And obtaining the path in the graph model according to the operation.
Constructing a path set: and traversing the ordered SET SET-match from the tail to the front, if the current node has no back-drive node and has a front-drive node, indicating that the current node is a termination node (six-element group) of a certain path, traversing the front-drive node forwards according to the connection relation of edges, circularly traversing the front-drive node of the current node according to the method, and recording the node number until a certain node has no front-drive node, namely indicating that a certain node is an initial node of the current path, obtaining a new path, recording all nodes on the path, and obtaining all path SETs according to the method. Meanwhile, the weights of all the connecting edges are synchronously accumulated and calculated in the path construction process to obtain the corresponding path score path . Since the same node may be on multiple paths, each path and corresponding path score may be obtained in this way. According to the path score path Sorting all paths from high to low by adopting a quick sorting method to obtain an ordered path set, and traversing the ordered path set according to the following rules:
rule 1), if the overall coverage length of each node on the sequencing sequence on the first path in the ordered path set is smaller than a specified threshold, the specified threshold is 2/3 of the length of the sequencing sequence, and the seed sequence minisizer is not queried on the variant sequence, carrying out minisizer comparison by using a local variant sequence index, if a new seed sequence minisizer is obtained after comparison, adding a hexahydric set corresponding to the new seed sequence minisizer into the hexahydric set of the S32b, and executing S33 again to obtain an optimal path and a suboptimum path set; otherwise, rule 2) is executed, otherwise all the inverse possibilities of the conditions described in rule 1) are included.
Rule 2), in the ordered path set, taking the first path as the optimal path, taking the paths with the same score as the first path and within the specified number (the default value is 100) as the suboptimal paths, if the number of the paths in the ordered path set is greater than the specified number, selecting the suboptimal paths according to the specified number, and if the number of the paths in the ordered path set is less than or equal to the specified number, taking all the paths except the first path as the suboptimal paths to obtain the optimal paths and the suboptimal paths in the ordered path set.
S4, obtaining a genome sequence corresponding to the sequencing sequence of the S2 on the reference genome according to the optimal path and the suboptimal path set, comparing the sequencing sequence with the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score, wherein the specific process comprises the following steps:
comparing all paths in the optimal path and suboptimal path set generated in the step S33 with a sequence to be sequenced to obtain a local sequence comparison score, and obtaining a result comparison file according to the local sequence comparison score, wherein the initial position of the sequence to be sequenced on a reference genome is calculated according to the initial node information (the reference genome position and the offset position on the sequencing sequence) of the path, a corresponding genome sequence is extracted from the reference genome, and the KSW2 algorithm is adopted to carry out end-to-end global sequence comparison with the sequence to be sequenced;
if the comparison result contains the known variant sequence, the format of the comparison result represents the known variant information according to the following method:
before and after the mutation number of the mutation data acquired in S1, a start identification character 'V' and an end representation character 'E' are respectively added for customizing the known mutation in the CIGAR information. Meanwhile, in order to realize compressed representation and storage of the sequence comparison result and connect the subsequent research of the variation detection method based on the group genome, a binary three-level index structure is designed to represent the sequence comparison result, and the design method of the index structure at each level is as follows:
1) Recording the number of comparison results of each sequencing sequence, namely the offset position on the second-level index, in each data unit of the first-level index;
2) Recording the length of the CIGAR sequence of each sequencing sequence comparison result, namely the offset position on the third-level index, by each data unit of the second-level index;
3) And the third-level index records the CIGAR sequence of each sequencing sequence alignment result.
CIGAR information: the CIGAR character string is a sequence consisting of base length and related operations, and represents an alignment relationship between two sequences. The CIGAR string generally indicates which bases on the sequence match, do not match, are deleted or inserted at which position from the reference genomic sequence.
S5, obtaining a sequence to be sequenced, and executing S1-S4 to compare the sequence to be sequenced to obtain a result comparison file.

Claims (10)

1. A method of population genome-based sequence alignment, comprising: the method comprises the following steps:
s1, constructing a seed sequence-based population genome index, wherein the seed sequence-based population genome index comprises a seed sequence-based map model index of a reference genome and a seed sequence-based local variation sequence index of variation data;
s2, extracting a seed sequence of the sequencing sequence, and obtaining a variant sequence of the seed sequence on variant data by using a local variant sequence index to obtain the position of the variant sequence on a reference genome, namely the position of the seed sequence on the reference genome;
s3, according to the position of the seed sequence on the reference genome, performing path search on the reference genome by using a sparse dynamic programming method to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome;
s4, obtaining a genome sequence corresponding to the sequencing sequence of the S2 on a reference genome according to the optimal path and the suboptimal path set, comparing the sequencing sequence with the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score;
s5, obtaining a sequence to be sequenced, and executing S1-S4 to compare the sequence to be sequenced to obtain a result comparison file.
2. A method for population genome-based sequence alignment according to claim 1, wherein: the method comprises the following steps of S1, constructing a minizer-based population genome index, wherein the minizer-based population genome index comprises a reference genome-based map model index and a variation data-based local variation sequence index, and the specific process comprises the following steps of:
s11, acquiring human reference genome data and corresponding variant data containing known different types of variants, and performing standardized operation pretreatment such as redundant data removal on the two data;
s12, building de Bruijn graph model index representation of the reference genome to obtain de Bruijn graph model index based on the reference genome;
s13, constructing local variation sequence index representation of variation data to obtain a local variation sequence index of the variation data;
s14, constructing a Minimizer-based population genome index representation according to the unique path and local variant sequence index of the de Bruijn graph model index.
3. A method for genome-wide sequence alignment according to claim 2, wherein: s2, extracting a seed sequence of the sequencing sequence, acquiring a variant sequence of the seed sequence on variant data by using a local variant sequence index, and obtaining the position of the variant sequence on a reference genome, namely the position of the seed sequence on the reference genome, wherein the specific process comprises the following steps:
for each pair of paired-end sequencing sequences in the sequencing sequence, performing the following operations for each end sequence:
1) Defining a sliding window and extracting a seed sequence Minimizer:
dividing one end sequence into a plurality of intervals with fixed lengths by using sliding windows, wherein each interval comprises a plurality of k-mer sequences, and defining that adjacent sliding windows on one end sequence are mutually covered and the difference between the initial positions of the adjacent sliding windows is 1bp;
traversing one end sequence from the beginning, selecting a k-mer sequence with the minimum ASC2 value or Hash value in each sliding window as a seed sequence Minimizer according to the defined size of the sliding window, and obtaining all seed sequences minimizers on the one end sequence;
2) Obtaining the position of the seed sequence Minimizer on the reference genome:
if the total number of the minimizers of the seed sequences found on the two end sequences opposite to each other in the direction of a pair of double-end sequencing sequences is smaller than a preset threshold which is 1/3 of the total number of the minimizers of all the seed sequences of the sequencing sequences, searching a variant sequence corresponding to the Minimizer of the seed sequences of the pair of double-end sequencing sequences on variant data by using a local variant sequence index, if the corresponding variant sequence is found on the variant data, indicating that the variant sequence is consistent with the Minimizer of the seed sequence, and recording a quadruplet { Unipath } of the corresponding position of the variant sequence on a reference genome start ,Unipath end ,Read start ,Read end Set of { Unipath }, where start Indicates the starting position, unipath, of the variant sequence on a unique path of the de Bruijn graph model of the reference genome end Representing the termination position of the variant sequence on the unique path of the de Bruijn graph model of the reference genome, read start Indicating the start position of the variant sequence on the sequenced sequence, read end Indicates the termination position of the variant sequence on the sequenced sequence; if no corresponding variant sequence is found on the variant sequence, performing S3 according to the corresponding quadruplet set of the obtained seed sequence Minimizer on the reference genome position;
and if the total number of the seed sequences minimizers found on the two end sequences opposite to the direction of a certain pair of double-end sequencing sequences is greater than or equal to a preset threshold value, obtaining a corresponding quadruplet set of the seed sequences minimizers on the reference genome, and executing S3.
4. A method for genome-wide sequence alignment according to claim 3, wherein: s3, according to the position of the seed sequence on the reference genome, performing path search on the reference genome by using a sparse dynamic programming method to obtain an optimal path and a suboptimal path set of the seed sequence on the reference genome, wherein the specific process comprises the following steps:
s31, judging the direction of each end sequencing sequence in the double-end sequencing sequence:
defining the paired-end reads of the double-ended sequencing sequence as L according to the corresponding quadruplet set of the seed sequence minizer on the reference genome 1f 、L 1r Or L 2f 、L 2r ,L 1f Forward sequence coverage length, L, of read1 representing paired-end reads 1r Read1 representing paired-end reads is the reverse complement covered length, L 2f Read2, representing paired-end reads, is the forward sequence coverage length, L 2r Read2 representing paired-end reads is the reverse complement sequence coverage length;
if L is 1f +L 2r >L 1r +L 2f + an offset value, wherein the offset value is 1/10 of the length of the sequencing sequence, if the correct direction is considered that read1 is a forward sequence and comes from a forward strand, read2 is a reverse complementary sequence and comes from a reverse strand, and a set of all minimizers contained in the corresponding direction is returned;
if L is 1r +L 2f >L 1f +L 2r + offset value, regarding read2 as forward sequence from the forward strand, read1 as reverse complementary sequence from the reverse strand, and returning all minimizers contained in the corresponding direction; otherwise, the correct direction is considered to be indistinguishable, and the sets which correspond to the two directions and contain all minimizers are returned;
if the total coverage length of the sequencing sequences at the two ends is less than a specified threshold which is 1/3 of the total length of the sequencing sequences at the two ends, carrying out Minimizer comparison by using the local variant sequences to obtain the directions of the sequencing sequences at the two ends, and returning all Minimizer sets contained in the corresponding directions;
s32, extending and merging the seed sequence Minimizer according to the direction of the sequencing sequence at each end:
a. extension of seed sequence Minimizer:
acquiring the starting position and the ending position of a current seed sequence Minimizer on a sequencing sequence, and the starting position and the ending position of the current seed sequence Minimizer on a unique path in a de Bruijn graph model, extending the corresponding positions of the seed sequence Minimizer on the sequencing sequence and the unique path forwards and backwards until encountering a certain base matching error on the sequencing sequence or reaching the last base position of the unique path, recording the front-back extending position and the covering length of the seed sequence Minimizer on the sequencing sequence after extending on the sequencing sequence and the unique path, and traversing all the seed sequences minimizers on the sequencing sequence according to the method;
during traversal, if the terminating position of the current seed sequence minizer on the sequencing sequence is less than the backward extension position of the previous seed sequence minizer, it indicates that the current seed sequence minizer is contained within the extension coverage area of the previous seed sequence minizer and no extension is required; otherwise, executing the extension;
obtaining a candidate seed sequence after each time of extension of a seed sequence Minimizer, and recording a unique path number ID corresponding to the candidate seed sequence Unipath Offset position on sequencing sequence, position on reference genome, coverage length on sequencing sequence, candidate seed sequence number ID seed Obtaining all candidate seed sequences;
b. merging candidate seed sequences:
merging IDs having the same unique path number Unipath Obtaining a set of candidate seed sequences, and combining coverage areas of all candidate seed sequences in each set to obtain a new coverage length;
splitting the candidate seed sequences in the set into a plurality of six-tuple (candidate seed sequence number ID) according to the corresponding reference genome positions seed The start and end positions of the region covered on the sequencing sequence, the end number ID of the paired-end sequencing sequence end On a reference genomeStarting position and ending position), obtaining a six-element group set consisting of six-element groups of all candidate seed sequences according to the steps;
s33, path searching based on the dynamic planning method:
i, ordering the six-tuple set according to the initial position of each six-tuple on a reference genome to obtain an ordered set, defining each six-tuple in the ordered set as a node, traversing the ordered set from the beginning, extracting adjacent nodes in a specified number behind a current node, and establishing a connecting edge if the current node and a certain downstream node simultaneously meet a condition; otherwise, no connecting edge is established;
and II, taking the negative value of the connecting edge as the weight of the edge, wherein the weight calculation method comprises the following steps:
weight edge =cover match -score panalty
Figure FDA0003919106510000041
diff distance =|(P1 read -P1 read )-(P1 ref -P2 ref )|
wherein, weight edge Representing the weight of the connecting edge; cover match Representing the coverage length of the sequencing sequence of the current node; score panalty Is a connection penalty; diff (diff) distance Representing the reference genome positions of the current node and a certain downstream node and the deviation distance on the sequencing sequence, and calculating according to the initial position; p1 read Representing the offset position of the current node on the sequencing sequence read; p1 ref Representing the position of the current node on the reference genome; p2 read Indicating the offset position of a certain downstream node on the sequencing sequence read; p2 ref Represents the position of a downstream node on the reference genome;
if the end number ID of the double-end sequencing sequence of the current node and a certain downstream node end If the difference is different, taking the logarithm of the deviation distance, otherwise, taking the value of the deviation distance; obtaining a plurality of connecting edges according to the above method;
III, traversing the ordered set from the beginning according to a dynamic planning idea of a dynamic planning method, selecting a precursor node with the maximum edge weight for each node to construct an effective connecting edge between the two nodes, and adding a back-driving node number connected with the precursor node to obtain a path in the graph model;
the ordered set is traversed from the tail to the front, if the current node has no back drive node and has a front drive node, the current node is the termination node of a certain path, the front drive node is traversed forwards according to the connection relation of edges, the front drive node of the current node is traversed circularly according to the method, the node number is recorded until a certain node has no front drive node, namely, the certain node is the initial node of the current path, a path is obtained, and all nodes on the path are recorded; obtaining all path sets according to the method;
and IV, synchronously accumulating and calculating the weight of each connecting edge in the path construction process to obtain a corresponding path score path According to the path score path And sorting all paths from high to low by adopting a quick sorting method to obtain an ordered path set, and traversing the set according to specified conditions to obtain an optimal path set and a suboptimal path set.
5. A method of population genome-based sequence alignment according to claim 4, wherein: s31, when judging the direction of each end sequencing sequence in the double-end sequencing sequence, if two adjacent seed sequences minimizers on each end sequence are overlapped, the coverage length is the union length of the coverage areas of the two seed sequences minimizers on the sequencing sequences.
6. A method of population genome-based sequence alignment according to claim 5, wherein: in S31, a Minimizer comparison is carried out by using the local variant sequence, and the specific process is as follows:
according to the seed sequence Minimizer of the two-end sequencing sequence, the corresponding variant sequence on the variant data is searched by using the local variant sequence index, and the variant sequence is returned on the variant dataCorresponding quadruplet ALT start ,ALT end ,Read′ start ,Read′ end },ALT start ,ALT end ,Read′ start ,Read′ end Respectively representing the initial position and the end position of the variant sequence on variant data and the initial position and the end position of the variant sequence on a sequencing sequence, and updating L according to the coverage length corresponding to the quadruple 1f 、L 1r 、L 2f 、L 2r And judging the direction of the sequencing sequence at each end according to S31, and returning all Minimizer sets contained in the corresponding direction.
7. A method of population genome-based sequence alignment according to claim 6, wherein: the conditions in I in S33 are:
1) Seed number ID of two nodes seed Different;
2) Chromosomes corresponding to the reference genome positions of the two nodes are the same;
3) The difference of the reference genome positions of the two nodes is smaller than a specified threshold value;
4) And the reference genome positions of the two nodes are consistent with the position size relation on the sequencing sequence.
8. A method of population genome-based sequence alignment as claimed in claim 7, wherein: the conditions specified in iv in S33 are:
1) If the overall coverage length of each node on the sequencing sequence on the first path in the path set is smaller than a specified threshold, the specified threshold is 2/3 of the length of the sequencing sequence, and the seed sequence minizer is not inquired on the variant sequence, minizer comparison is carried out by using a local variant sequence index, if a new seed sequence minizer is obtained after comparison, the hexahydric set corresponding to the new seed sequence minizer is added to the hexahydric set of the S32b, and the S33 is executed again to obtain an optimal path and a suboptimal path set; otherwise, execute 2);
2) In the ordered path set, a first path is used as an optimal path, paths which have the same score as the first path and are within the specified number are used as suboptimal paths, if the number of the paths in the ordered path set is greater than the specified number, the suboptimal paths are selected according to the specified number, and if the number of the paths in the ordered path set is less than or equal to the specified number, all the paths except the first path are used as suboptimal paths to obtain an optimal path and suboptimal path set.
9. A method for genome-wide sequence alignment according to claim 8, wherein: in S4, a genome sequence corresponding to the sequencing sequence of S2 on a reference genome is obtained according to the optimal path and the suboptimal path set, the sequencing sequence is compared with the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, a result comparison file is obtained according to the comparison score, and the specific process is as follows:
and calculating the initial position of the sequencing sequence of the S2 on the reference genome according to the initial node information of all paths in the optimal path and suboptimal path set generated in the step S33, extracting a genome sequence corresponding to the sequencing sequence from the reference genome, performing end-to-end global sequence comparison on the sequencing sequence and the corresponding genome sequence by adopting a KSW2 algorithm to obtain a comparison score, and obtaining a result comparison file according to the comparison score.
10. A method for genome-wide sequence alignment according to claim 9, wherein: if the comparison result of S4 contains a known variant sequence, the format of the comparison result represents variant information according to the following method:
respectively adding a starting identification character 'V' and an ending representation character 'E' before and after the variation number of the variation data acquired in S1; then, a binary three-level index structure is constructed to represent a comparison result, and the construction method of each level of index structure is as follows:
1) Recording the number of comparison results of each sequencing sequence, namely the offset position on the second-level index, in each data unit of the first-level index;
2) Recording the length of the CIGAR sequence of each sequencing sequence comparison result, namely the offset position on the third-level index, by each data unit of the second-level index;
3) And the third-level index records the CIGAR sequence of each sequencing sequence alignment result.
CN202211366121.9A 2022-10-31 2022-10-31 Sequence alignment method based on group genome Active CN115602246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211366121.9A CN115602246B (en) 2022-10-31 2022-10-31 Sequence alignment method based on group genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211366121.9A CN115602246B (en) 2022-10-31 2022-10-31 Sequence alignment method based on group genome

Publications (2)

Publication Number Publication Date
CN115602246A true CN115602246A (en) 2023-01-13
CN115602246B CN115602246B (en) 2023-06-20

Family

ID=84851619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211366121.9A Active CN115602246B (en) 2022-10-31 2022-10-31 Sequence alignment method based on group genome

Country Status (1)

Country Link
CN (1) CN115602246B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160259880A1 (en) * 2015-03-05 2016-09-08 Seven Bridges Genomics Inc. Systems and methods for genomic pattern analysis
CN106682393A (en) * 2016-11-29 2017-05-17 北京荣之联科技股份有限公司 Genomic sequence alignment method and genomic sequence alignment device
CN107480466A (en) * 2017-07-06 2017-12-15 北京荣之联科技股份有限公司 Genomic data storage method and electronic equipment
US20200265923A1 (en) * 2019-01-22 2020-08-20 The Regents Of The University Of Michigan Efficient Seeding For Read Alignment
CN112489727A (en) * 2020-12-24 2021-03-12 厦门基源医疗科技有限公司 Method and system for rapidly acquiring pathogenic site of rare disease
CN112735528A (en) * 2021-01-08 2021-04-30 华中农业大学 Gene sequence comparison method and system
CN114999573A (en) * 2022-04-14 2022-09-02 哈尔滨因极科技有限公司 Genome variation detection method and detection system
CN115206434A (en) * 2022-06-23 2022-10-18 南京邮电大学 De Bruijn graph-based multi-sequence comparison method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160259880A1 (en) * 2015-03-05 2016-09-08 Seven Bridges Genomics Inc. Systems and methods for genomic pattern analysis
CN106682393A (en) * 2016-11-29 2017-05-17 北京荣之联科技股份有限公司 Genomic sequence alignment method and genomic sequence alignment device
CN107480466A (en) * 2017-07-06 2017-12-15 北京荣之联科技股份有限公司 Genomic data storage method and electronic equipment
US20200265923A1 (en) * 2019-01-22 2020-08-20 The Regents Of The University Of Michigan Efficient Seeding For Read Alignment
CN112489727A (en) * 2020-12-24 2021-03-12 厦门基源医疗科技有限公司 Method and system for rapidly acquiring pathogenic site of rare disease
CN112735528A (en) * 2021-01-08 2021-04-30 华中农业大学 Gene sequence comparison method and system
CN114999573A (en) * 2022-04-14 2022-09-02 哈尔滨因极科技有限公司 Genome variation detection method and detection system
CN115206434A (en) * 2022-06-23 2022-10-18 南京邮电大学 De Bruijn graph-based multi-sequence comparison method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BO LIU等: "deBGA: read alignment with de Bruijn graph-based seed and extension", 《BIOINFORMATICS》 *
MARIA CHATZOU等: "Multiple sequence alignment modeling: methods and applications", 《BRIEFINGS IN BIOINFORMATICS》 *
国宏哲: "基于de Bruijn图模型的基因组序列映射算法研究", 《中国博士学位论文全文数据库 基础科学辑》 *
权威: "基于高通量测序数据的基因组序列比对方法研究", 《中国博士学位论文全文数据库 基础科学辑》 *

Also Published As

Publication number Publication date
CN115602246B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN107403075B (en) Comparison method, device and system
CN110010193B (en) Complex structure variation detection method based on hybrid strategy
US6714874B1 (en) Method and system for the assembly of a whole genome using a shot-gun data set
CA2424031C (en) System and process for validating, aligning and reordering genetic sequence maps using ordered restriction map
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
CN107133493B (en) Method for assembling genome sequence, method for detecting structural variation and corresponding system
KR101313087B1 (en) Method and Apparatus for rearrangement of sequence in Next Generation Sequencing
CN115631789B (en) Group joint variation detection method based on pan genome
CN113362889A (en) Genome structure variation annotation method
CN107229842A (en) A kind of three generations&#39;s sequencing sequence bearing calibration based on Local map
He et al. De novo assembly methods for next generation sequencing data
CN111008625B (en) Address correction method, device, equipment and storage medium
Thachuk Indexing hypertext
CN112397148B (en) Sequence comparison method, sequence correction method and device thereof
CN115602246A (en) Sequence comparison method based on group genome
US11250931B2 (en) Systems and methods for detecting recombination
CN115602244B (en) Genome variation detection method based on sequence alignment skeleton
EP3663890B1 (en) Alignment method, device and system
CN115662523B (en) Group-oriented genome index representation and construction method and equipment
CN108491687B (en) Scafffolding method based on contig quality evaluation classification and graph optimization
CN113380326B (en) Gene expression data analysis method based on PAM clustering algorithm
Vezzi Next generation sequencing revolution challenges: Search, assemble, and validate genomes
CN109935274A (en) A kind of long reading overlay region detection method based on k-mer distribution characteristics
CN108830047A (en) A kind of scaffolding method based on long reading and contig classification
CN115641910B (en) Combined detection method for structural variation of third generation group genome

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant