WO2021169387A1 - Sequence alignment method, apparatus and device, and medium - Google Patents

Sequence alignment method, apparatus and device, and medium

Info

Publication number
WO2021169387A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
sequence
reference sub
segment
score
Prior art date
Application number
PCT/CN2020/126350
Other languages
French (fr)
Chinese (zh)
Inventor
尹云峰
任智新
金良
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2021169387A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Definitions

  • The present invention relates to the technical field of gene detection, and in particular to a sequence alignment method, apparatus, device, and medium.
  • Sequence alignment mainly includes two stages: seed finding and extension.
  • To improve the accuracy of sequence alignment, it is necessary to find as many as possible of the positions at which the seeds of the read to be aligned occur in the reference sequence.
  • In the prior art, a large number of invalid positions are aligned, which wastes a large amount of computing resources and greatly reduces the performance of the whole sequence alignment.
  • The present invention proposes a sequence alignment method, apparatus, device, and medium to overcome the above-mentioned problems or at least partially solve them.
  • A sequence alignment method, including:
  • determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
  • Comparing all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score includes:
  • comparing all the target seeds in the read to be aligned with each of the reference sub-segments according to the preset rule to obtain the target score.
  • Comparing all the target seeds in the read to be aligned with any one reference sub-segment according to the preset rule to obtain a target score includes:
  • subtracting the corresponding penalty points from the target score to obtain the target score corresponding to that reference sub-segment.
  • Judging whether a target situation occurs during the comparison includes:
  • Comparing all the target seeds in the read to be aligned with any one reference sub-segment according to the preset rule to obtain a target score includes:
  • adding the corresponding bonus points to the target score to obtain the target score corresponding to that reference sub-segment.
  • Screening out the target reference sub-segment from the reference sub-segments according to the target score includes:
  • determining the reference sub-segment corresponding to the target score as the target reference sub-segment.
  • Screening out the target reference sub-segment from the reference sub-segments according to the target score includes:
  • determining the reference sub-segment corresponding to the normalized target score as the target reference sub-segment.
  • A sequence alignment apparatus, including:
  • a first segmentation module, used to segment the read to be aligned to obtain the target seeds corresponding to the read;
  • a second segmentation module, used to segment the target reference sequence obtained in advance to obtain reference sub-segments;
  • a comparison module, used to compare all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score;
  • a segment screening module, used to screen out the target reference sub-segment from the reference sub-segments according to the target score;
  • a position determination module, used to determine the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
  • A sequence alignment device, including:
  • a memory, used to store a computer program;
  • a processor, used to execute the computer program to implement the sequence alignment method disclosed above.
  • A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the sequence alignment method disclosed above.
  • The sequence alignment method, apparatus, device, and medium provided by the present application first segment the read to be aligned to obtain the target seeds corresponding to the read, and segment the target reference sequence obtained in advance to obtain reference sub-segments. All the target seeds in the read are then compared with the reference sub-segments according to a preset rule to obtain a target score, the target reference sub-segment is screened out from the reference sub-segments according to the target score, and the position of the target reference sub-segment in the target reference sequence is determined as the exact matching position of the read in the target reference sequence.
  • Because the read and the target reference sequence are segmented separately and only reference sub-segments whose target score passes the screening are kept, invalid matching positions can be filtered out, computing resources are saved, the performance of sequence alignment is improved, and the workload of subsequent sequence extension is reduced.
  • Figure 1 is a flowchart of a sequence alignment method disclosed in this application.
  • Figure 2 is a flowchart of a specific sequence alignment method disclosed in this application.
  • Figure 3 is a process diagram of the precise screening of matching positions in a sequence alignment disclosed in this application.
  • Figure 4 is a schematic structural diagram of a sequence alignment apparatus disclosed in this application.
  • Figure 5 is a structural diagram of a sequence alignment device disclosed in this application.
  • This application proposes a sequence alignment method that can filter out invalid matching positions, save computing resources, improve the performance of sequence alignment, and reduce the workload of subsequent sequence extension.
  • The embodiment of the present application discloses a sequence alignment method, which includes:
  • Step S11: segment the read to be aligned to obtain the target seeds corresponding to the read.
  • A seed is a subsequence that has an exact match and is waiting to be extended during sequence alignment.
  • For example, the read to be aligned contains 100 bases, and each run of 20 consecutive bases is taken as one segment to obtain a target seed: the first target seed is bases 0 to 19, the second target seed is bases 1 to 20, the third target seed is bases 2 to 21, and so on for the subsequent target seeds.
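The seeding example in step S11 can be sketched as a sliding window. This is only a minimal illustration under stated assumptions, not the patented implementation: the function name `extract_seeds` and its parameters `seed_len` and `step` are hypothetical, and the step of one base is inferred from the example above (seeds starting at bases 0, 1, 2, and so on).

```python
from typing import List, Tuple


def extract_seeds(read: str, seed_len: int = 20, step: int = 1) -> List[Tuple[int, str]]:
    """Return (offset, seed) pairs for every window of seed_len bases in the read.

    Hypothetical helper: seed_len=20 and step=1 mirror the 100-base example
    in the text (seeds covering bases 0-19, 1-20, 2-21, ...).
    """
    if seed_len <= 0 or step <= 0:
        raise ValueError("seed_len and step must be positive")
    last_start = len(read) - seed_len  # last offset that still fits a full seed
    # An empty list is returned when the read is shorter than one seed.
    return [(i, read[i:i + seed_len]) for i in range(0, last_start + 1, step)]


read = "ACGT" * 25  # a made-up 100-base read, for illustration only
seeds = extract_seeds(read)
print(len(seeds), seeds[1])
```

With a 100-base read, 20-base seeds, and a step of one base, this produces 100 - 20 + 1 = 81 overlapping seeds.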
  • Step S12: segment the target reference sequence obtained in advance to obtain reference sub-segments.
  • Since the target reference sequence is generally relatively long, it also needs to be segmented to obtain reference sub-segments for comparison with the target seeds. The length of a reference sub-segment is generally greater than or equal to the length of the read to be aligned.
  • Step S13: compare all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score.
  • Comparing all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score includes: comparing all the target seeds in the read with each of the reference sub-segments according to the preset rule to obtain the target score.
  • For example, the first target seed and the second target seed are first compared with the first reference sub-segment according to the preset rule to obtain the corresponding target score; the first target seed and the second target seed are then compared with the second reference sub-segment to obtain the corresponding target score.
  • Step S14: screen out the target reference sub-segment from the reference sub-segments according to the target score.
  • The target reference sub-segment needs to be screened out from the reference sub-segments according to the target score, so as to determine the exact matching position of the read to be aligned in the target reference sequence.
  • In one implementation, screening out the target reference sub-segment from the reference sub-segments according to the target score includes: judging whether the target score is greater than or equal to a preset score threshold; and, if so, determining the reference sub-segment corresponding to the target score as the target reference sub-segment.
  • In another implementation, screening out the target reference sub-segment from the reference sub-segments according to the target score includes: normalizing the target score; judging whether the normalized target score is greater than or equal to a preset score threshold; and, if so, determining the reference sub-segment corresponding to the normalized target score as the target reference sub-segment.
  • Step S15: determine the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
  • Once the target reference sub-segment has been screened out, its position in the target reference sequence can be determined as the exact matching position of the read in the target reference sequence.
  • It can be seen that this application first segments the read to be aligned to obtain the target seeds corresponding to the read, segments the target reference sequence obtained in advance to obtain reference sub-segments, compares all the target seeds in the read with the reference sub-segments according to a preset rule to obtain a target score, screens out the target reference sub-segment from the reference sub-segments according to the target score, and determines the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read in the target reference sequence. This filters out invalid matching positions, saves computing resources, improves the performance of sequence alignment, and reduces the workload of subsequent sequence extension.
  • The embodiment of the present application discloses a specific sequence alignment method, which includes:
  • Step S21: segment the read to be aligned to obtain the target seeds corresponding to the read.
  • Step S22: segment the target reference sequence obtained in advance to obtain reference sub-segments.
  • Step S23: compare all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score.
  • Comparing the target seeds with the reference sub-segments according to a preset rule to obtain a target score includes: comparing all the target seeds in the read to be aligned with each of the reference sub-segments according to the preset rule to obtain the target score.
  • In a first implementation, comparing all the target seeds in the read to be aligned with any one reference sub-segment according to the preset rule to obtain a target score includes: initializing the target score; comparing all the target seeds in the read with the reference sub-segment and judging whether a target situation occurs during the comparison; and, if a target situation occurs, subtracting the corresponding penalty points from the target score to obtain the target score corresponding to that reference sub-segment.
  • Judging whether a target situation occurs during the comparison includes: judging whether all the target seeds in the read to be aligned hit the reference sub-segment but their hit positions on the reference sub-segment are not continuous; or judging whether a first target seed among all the target seeds in the read hits the reference sub-segment while a second target seed among them does not.
  • The first target seed and the second target seed together constitute all the target seeds in the read to be aligned. The situation in which the first target seed hits the reference sub-segment and the second target seed does not includes the case in which the first target seed hits the reference sub-segment at continuous positions while the second target seed does not hit the reference sub-segment.
  • The target score is initialized first, that is, the same initial target score is assigned on the assumption that all the target seeds hit the corresponding reference sub-segment and that their hit positions on that reference sub-segment are consecutive. All the target seeds in the read are then compared with the reference sub-segment.
  • If all the target seeds hit the reference sub-segment at consecutive positions, the initialized target score is used as the final target score. If all the target seeds hit the reference sub-segment but the hit positions are not continuous, or if some of the target seeds hit the reference sub-segment and others do not, the corresponding penalty points are subtracted. For example, when all the target seeds hit the reference sub-segment but the hit positions are not continuous, 7 points are deducted from the initialized target score for each extra position that appears between the hit positions, giving the final target score.
  • For example, as shown in Figure 3, the read to be aligned (read) is segmented to obtain target seeds 1 to 8, and the target reference sequence (Reference) is segmented to obtain reference sub-segments R_segment 1 to R_segment n. Suppose the initial target score, corresponding to all the target seeds in the read hitting a reference sub-segment consecutively, is set to 100 points. All the target seeds hit R_segment 1 at consecutive positions, so its target score is 100 points. All the target seeds hit R_segment 2, but two extra positions appear among the hit positions, so two penalties of 7 points (14 points in total) are deducted and its target score is 86 points. One target seed misses R_segment 3, so its target score is 93 points. And so on: R_segment n-1 gets a target score of 65 points and R_segment n gets a target score of 58 points.
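The penalty arithmetic in the worked example above can be reproduced with a short sketch. It is an illustration under assumptions, not the patent's implementation: `penalty_score` and its parameters are hypothetical names, hit positions are passed in directly rather than searched for, an "extra position" is taken to mean any widening of the gap between two neighbouring hits relative to their spacing in the read, and a missed seed is assumed to cost the same 7 points (which is what the 93-point score for R_segment 3 suggests).

```python
from typing import Optional, Sequence


def penalty_score(
    seed_offsets: Sequence[int],
    hit_positions: Sequence[Optional[int]],
    initial_score: int = 100,
    penalty: int = 7,
) -> int:
    """Start from initial_score and deduct `penalty` per miss and per extra position.

    seed_offsets[i]  : start of seed i in the read.
    hit_positions[i] : start of its hit in the reference sub-segment, or None for a miss.
    """
    if len(seed_offsets) != len(hit_positions):
        raise ValueError("one hit position (or None) is required per seed")
    misses = sum(1 for pos in hit_positions if pos is None)
    hits = [(off, pos) for off, pos in zip(seed_offsets, hit_positions) if pos is not None]
    extra = 0
    for (off1, pos1), (off2, pos2) in zip(hits, hits[1:]):
        # Positions inserted between two neighbouring hits (assumed definition).
        extra += max(0, (pos2 - pos1) - (off2 - off1))
    return initial_score - penalty * (misses + extra)


offsets = [0, 20, 40, 60, 80, 100, 120, 140]          # 8 seeds, made-up offsets
contiguous = list(offsets)                             # like R_segment 1
two_extra = [0, 20, 40, 62, 82, 102, 122, 142]         # like R_segment 2
one_miss = [0, 20, None, 60, 80, 100, 120, 140]        # like R_segment 3
print(penalty_score(offsets, contiguous),
      penalty_score(offsets, two_extra),
      penalty_score(offsets, one_miss))
```

Under these assumptions the three calls give 100, 86, and 93, matching the scores stated for R_segment 1, R_segment 2, and R_segment 3.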
  • In a second implementation, comparing all the target seeds in the read to be aligned with any one reference sub-segment according to the preset rule to obtain a target score includes: initializing the target score; comparing all the target seeds in the read with the reference sub-segment; and, whenever a target seed hits the reference sub-segment, adding the corresponding bonus points to the target score, to obtain the target score corresponding to that reference sub-segment.
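The bonus-based variant can be sketched in the same spirit. The text does not give the initial score or the size of the bonus, so the values below (`initial_score=0`, `bonus=10`) and the name `bonus_score` are assumptions for illustration; a "hit" is modelled as an exact substring match.

```python
from typing import Iterable


def bonus_score(seeds: Iterable[str], ref_segment: str,
                initial_score: int = 0, bonus: int = 10) -> int:
    """Add `bonus` points for every seed that occurs exactly in ref_segment.

    initial_score and bonus are hypothetical values; the source only says that
    the score is initialized and that bonus points are added on a hit.
    """
    score = initial_score
    for seed in seeds:
        if seed in ref_segment:  # exact-match hit
            score += bonus
    return score


print(bonus_score(["ACG", "TTT", "CGT"], "ACGTAC"))
```

Here two of the three seeds occur in the segment, so the score is 20 with the assumed bonus of 10.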
  • Step S24: determine whether the target score is greater than or equal to a preset score threshold.
  • After the target score is obtained, the target reference sub-segment still needs to be determined from the reference sub-segments according to the target score. Specifically, it is determined whether the target score is greater than or equal to the preset score threshold; if it is, the reference sub-segment corresponding to that target score is determined as the target reference sub-segment.
  • Step S25: if the target score is greater than or equal to the preset score threshold, determine the reference sub-segment corresponding to the target score as the target reference sub-segment.
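The threshold screening in steps S24 and S25 can be sketched as a simple filter. The threshold values and the normalization by a maximum score are assumptions (the text does not say how scores are normalized or what threshold is used), and `screen_segments` is a hypothetical name; the scores are the ones quoted in the Figure 3 example.

```python
from typing import Dict, List, Optional


def screen_segments(scores: Dict[str, float], threshold: float,
                    max_score: Optional[float] = None) -> List[str]:
    """Keep the reference sub-segments whose score is >= threshold.

    If max_score is given, each score is first divided by it (an assumed
    normalization) and the threshold is interpreted on that 0-1 scale.
    """
    selected = []
    for name, score in scores.items():
        value = score / max_score if max_score else score
        if value >= threshold:
            selected.append(name)
    return selected


scores = {"R_segment 1": 100, "R_segment 2": 86, "R_segment 3": 93,
          "R_segment n-1": 65, "R_segment n": 58}
print(screen_segments(scores, threshold=80))                  # raw scores
print(screen_segments(scores, threshold=0.9, max_score=100))  # normalized scores
```

With the assumed raw threshold of 80, R_segment 1, 2, and 3 are kept; with the assumed normalized threshold of 0.9, only R_segment 1 and R_segment 3 remain.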
  • Step S26: determine the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
  • The embodiment of the present application discloses a sequence alignment apparatus, including:
  • a first segmentation module 11, used to segment the read to be aligned to obtain the target seeds corresponding to the read;
  • a second segmentation module 12, used to segment the target reference sequence obtained in advance to obtain reference sub-segments;
  • a comparison module 13, used to compare all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score;
  • a segment screening module 14, used to screen out the target reference sub-segment from the reference sub-segments according to the target score;
  • a position determining module 15, used to determine the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
  • An embodiment of the present application also discloses a sequence alignment device, which includes a processor 21 and a memory 22.
  • The memory 22 is used to store a computer program; the processor 21 is used to execute the computer program to implement the sequence alignment method disclosed in the foregoing embodiment.
  • The embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the following steps: segmenting the read to be aligned to obtain the target seeds corresponding to the read; segmenting the target reference sequence obtained in advance to obtain reference sub-segments; comparing all the target seeds in the read with the reference sub-segments according to a preset rule to obtain a target score; screening out the target reference sub-segment from the reference sub-segments according to the target score; and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read in the target reference sequence. This filters out invalid matching positions, saves computing resources, improves the performance of sequence alignment, and reduces the workload of subsequent sequence extension.
  • When executed by the processor, the computer program can specifically implement: comparing all the target seeds in the read to be aligned with each of the reference sub-segments according to a preset rule to obtain the target score.
  • When executed by the processor, the computer program can specifically implement: initializing the target score; comparing all the target seeds in the read to be aligned with the reference sub-segment and judging whether a target situation occurs during the comparison; and, if a target situation occurs, subtracting the corresponding penalty points from the target score to obtain the target score corresponding to the reference sub-segment.
  • When executed by the processor, the computer program can specifically implement: judging whether all the target seeds in the read to be aligned hit the reference sub-segment but their hit positions on the reference sub-segment are not continuous; or judging whether a first target seed among all the target seeds in the read hits the reference sub-segment while a second target seed among them does not.
  • When executed by the processor, the computer program can specifically implement: initializing the target score; comparing all the target seeds in the read to be aligned with the reference sub-segment; and, if a target seed hits the reference sub-segment, adding the corresponding bonus points to the target score to obtain the target score corresponding to the reference sub-segment.
  • When executed by the processor, the computer program can specifically implement: judging whether the target score is greater than or equal to the preset score threshold; and, if so, determining the reference sub-segment corresponding to the target score as the target reference sub-segment.
  • When executed by the processor, the computer program can specifically implement: normalizing the target score; judging whether the normalized target score is greater than or equal to the preset score threshold; and, if so, determining the reference sub-segment corresponding to the normalized target score as the target reference sub-segment.
  • The embodiments of this application can be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device. The instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing. The instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology.
  • The information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, and any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A sequence alignment method, apparatus and device, and a medium. The method comprises: segmenting a read to be aligned to obtain target seeds corresponding to the read (S11); segmenting a pre-obtained target reference sequence to obtain reference sub-segments (S12); comparing all the target seeds of the read with the reference sub-segments according to a preset rule to obtain a target score (S13); screening out a target reference sub-segment from the reference sub-segments according to the target score (S14); and determining the position of the target reference sub-segment in the target reference sequence as an exact matching position of the read in the target reference sequence (S15). In this way, invalid matching positions can be filtered out, computing resources are saved, the performance of sequence alignment is improved, and the workload of subsequent sequence extension is reduced.

Description

A sequence alignment method, apparatus, device, and medium
This application claims priority to Chinese patent application No. 202010130211.2, filed with the China Patent Office on February 28, 2020 and entitled "A sequence alignment method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the technical field of gene detection, and in particular to a sequence alignment method, apparatus, device, and medium.
Background
With the rapid development of genetic testing technology, techniques for extracting an individual's genes, aligning the gene sequences, predicting the likelihood of various diseases, pinpointing disease-causing genes, and enabling early prevention and treatment have become increasingly mature. The human genome contains roughly 3 billion base pairs, and completing the sequence alignment for one person on a general-purpose computer software platform takes several days. Sequence alignment mainly comprises two stages: seed finding and extension. To improve alignment accuracy, as many as possible of the positions at which the seeds of the read to be aligned occur in the reference sequence need to be found. In the prior art, a large number of invalid positions are aligned, which wastes considerable computing resources and greatly reduces the performance of the whole alignment.
Summary of the invention
In view of the above problems, the present invention proposes a sequence alignment method, apparatus, device, and medium that overcome the above problems or at least partially solve them.
A sequence alignment method, comprising:
segmenting a read to be aligned to obtain target seeds corresponding to the read;
segmenting a pre-obtained target reference sequence to obtain reference sub-segments;
comparing all the target seeds of the read with the reference sub-segments according to a preset rule to obtain a target score;
screening out a target reference sub-segment from the reference sub-segments according to the target score;
determining the position of the target reference sub-segment in the target reference sequence as an exact matching position of the read in the target reference sequence.
Optionally, comparing all the target seeds of the read with the reference sub-segments according to the preset rule to obtain the target score includes:
comparing all the target seeds of the read with each of the reference sub-segments according to the preset rule to obtain a target score.
Optionally, comparing all the target seeds of the read with any one reference sub-segment according to the preset rule to obtain the target score includes:
initializing the target score;
comparing all the target seeds of the read with that reference sub-segment, and determining whether a target condition occurs during the comparison;
if a target condition occurs during the comparison, subtracting the corresponding penalty from the target score to obtain the target score of that reference sub-segment.
Optionally, determining whether a target condition occurs during the comparison includes:
determining whether all the target seeds of the read hit the reference sub-segment but the hit positions on the reference sub-segment are not consecutive;
or, determining whether first target seeds among the target seeds of the read hit the reference sub-segment while second target seeds among the target seeds of the read do not hit the reference sub-segment.
Optionally, comparing all the target seeds of the read with any one reference sub-segment according to the preset rule to obtain the target score includes:
initializing the target score;
comparing all the target seeds of the read with that reference sub-segment;
if a target seed hits the reference sub-segment, adding the corresponding reward to the target score to obtain the target score of that reference sub-segment.
Optionally, screening out the target reference sub-segment from the reference sub-segments according to the target score includes:
determining whether the target score is greater than or equal to a preset score threshold;
if the target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the target score as the target reference sub-segment.
Optionally, screening out the target reference sub-segment from the reference sub-segments according to the target score includes:
normalizing the target score;
determining whether the normalized target score is greater than or equal to a preset score threshold;
if the normalized target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the normalized target score as the target reference sub-segment.
A sequence alignment apparatus, comprising:
a first segmentation module, configured to segment a read to be aligned to obtain target seeds corresponding to the read;
a second segmentation module, configured to segment a pre-obtained target reference sequence to obtain reference sub-segments;
a comparison module, configured to compare all the target seeds of the read with the reference sub-segments according to a preset rule to obtain a target score;
a segment screening module, configured to screen out a target reference sub-segment from the reference sub-segments according to the target score;
a position determination module, configured to determine the position of the target reference sub-segment in the target reference sequence as an exact matching position of the read in the target reference sequence.
A sequence alignment device, comprising:
a memory and a processor;
wherein the memory is configured to store a computer program;
the processor is configured to execute the computer program to implement the sequence alignment method disclosed above.
A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the sequence alignment method disclosed above.
With the above technical solution, the sequence alignment method, apparatus, device, and medium provided by the present invention first segment the read to be aligned to obtain the target seeds corresponding to the read, and segment the pre-obtained target reference sequence to obtain reference sub-segments. All the target seeds of the read are then compared with the reference sub-segments according to a preset rule to obtain a target score, a target reference sub-segment is screened out from the reference sub-segments according to the target score, and the position of the target reference sub-segment in the target reference sequence is determined as the exact matching position of the read in the target reference sequence. In this way, invalid matching positions are filtered out, computing resources are saved, the performance of sequence alignment is improved, and the workload of subsequent sequence extension is reduced.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Figure 1 is a flowchart of a sequence alignment method disclosed in this application;
Figure 2 is a flowchart of a specific sequence alignment method disclosed in this application;
Figure 3 is a diagram of the process of precisely screening matching positions in sequence alignment disclosed in this application;
Figure 4 is a schematic structural diagram of a sequence alignment apparatus disclosed in this application;
Figure 5 is a structural diagram of a sequence alignment device disclosed in this application.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
At present, gene sequence alignment involves a large number of comparisons at invalid positions, which wastes considerable computing resources and greatly reduces the performance of the whole alignment. In view of this, this application proposes a sequence alignment method that filters out invalid matching positions, saves computing resources, improves the performance of sequence alignment, and reduces the workload of subsequent sequence extension.
Referring to Figure 1, an embodiment of the present application discloses a sequence alignment method, which includes:
Step S11: segment the read to be aligned to obtain the target seeds corresponding to the read.
In a specific implementation, to allow precise comparison, the read to be aligned is first segmented to obtain its target seeds. A seed is a subsequence that has an exact match and is waiting to be extended during sequence alignment. For example, if the read to be aligned contains 100 bases and every 20 bases form one segment, the first target seed is bases 0 to 19, the second target seed is bases 1 to 20, the third target seed is bases 2 to 21, and so on for the subsequent target seeds.
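The sliding-window segmentation in this example can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `make_seeds` and the `step` parameter are assumptions introduced here, and the source only gives the 20-base seed length and the one-base offset between consecutive seeds.

```python
from typing import List


def make_seeds(read: str, seed_len: int = 20, step: int = 1) -> List[str]:
    """Cut a read into overlapping seeds of seed_len bases, step bases apart."""
    if seed_len <= 0 or step <= 0:
        raise ValueError("seed_len and step must be positive")
    last_start = len(read) - seed_len
    # A read shorter than seed_len yields no seeds (the range is empty).
    return [read[i:i + seed_len] for i in range(0, last_start + 1, step)]


read = "ACGT" * 25                 # hypothetical 100-base read
seeds = make_seeds(read)
print(len(seeds))                  # 81 seeds, with start offsets 0..80
print(seeds[1] == read[1:21])      # True: the second seed covers bases 1 to 20
```

With a 100-base read and 20-base seeds offset by one base, this yields 81 seeds.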
Step S12: segment the pre-obtained target reference sequence to obtain reference sub-segments.
Once the read to be aligned has been segmented, the target reference sequence must likewise be segmented into reference sub-segments so that they can be compared with the target seeds. Because the target reference sequence is generally long, it needs to be segmented to make the comparison practical, and the length of a reference sub-segment is generally greater than or equal to the length of the read to be aligned.
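One possible way to segment the reference is sketched below. The source does not say how long the sub-segments are or whether they overlap, so `segment_len`, `step`, and the function name `split_reference` are all assumptions made for illustration. Each sub-segment is returned together with its start offset, because a later step needs its position in the reference.

```python
from typing import List, Tuple


def split_reference(reference: str, segment_len: int, step: int) -> List[Tuple[int, str]]:
    """Return (start_offset, sub_segment) pairs covering the reference.

    step < segment_len gives overlapping sub-segments; the final ones
    may be shorter than segment_len.
    """
    if segment_len <= 0 or step <= 0:
        raise ValueError("segment_len and step must be positive")
    return [
        (start, reference[start:start + segment_len])
        for start in range(0, len(reference), step)
    ]


reference = "ACGTACGTAC"           # hypothetical, far shorter than a real reference
sub_segments = split_reference(reference, segment_len=4, step=4)
print(sub_segments)                # [(0, 'ACGT'), (4, 'ACGT'), (8, 'AC')]
```

In practice `segment_len` would be chosen to be at least the read length, as the text above notes.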
Step S13: compare all the target seeds of the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score.
After the reference sub-segments are obtained, all the target seeds of the read are compared with the reference sub-segments according to the preset rule to obtain a target score. This comparison includes comparing all the target seeds of the read with each of the reference sub-segments according to the preset rule. For example, if the read yields two target seeds and the target reference sequence yields two reference sub-segments, the first target seed is compared with the first reference sub-segment and then the second target seed is compared with the first reference sub-segment, giving the target score of the first sub-segment; next, the first target seed is compared with the second reference sub-segment and then the second target seed is compared with the second reference sub-segment, giving the target score of the second sub-segment.
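The comparison order described here (for each reference sub-segment, try every target seed) can be sketched as below. The sketch assumes that a "hit" means an exact substring match and records only the first occurrence of each seed; both are simplifying assumptions, and `hit_positions` is a hypothetical helper name, not one from the source.

```python
from typing import List, Optional


def hit_positions(seeds: List[str], segment: str) -> List[Optional[int]]:
    """For each seed, return its first exact-match offset in segment, or None."""
    positions: List[Optional[int]] = []
    for seed in seeds:
        pos = segment.find(seed)          # str.find returns -1 when absent
        positions.append(pos if pos >= 0 else None)
    return positions


seeds = ["ACG", "CGT", "GGG"]             # hypothetical short seeds
segments = ["TTACGT", "CGTGGG"]           # hypothetical reference sub-segments
for segment in segments:                  # outer loop: one score per sub-segment
    print(hit_positions(seeds, segment))
# [2, 3, None]
# [None, 0, 3]
```

The per-seed hit positions collected for one sub-segment are the input from which that sub-segment's target score would then be computed.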
Step S14: screen out a target reference sub-segment from the reference sub-segments according to the target score.
In a specific implementation, after the target score is obtained, the target reference sub-segment is screened out from the reference sub-segments according to the target score, so that the exact match of the read to be aligned in the target reference sequence can be determined.
In a first specific implementation, screening out the target reference sub-segment according to the target score includes: determining whether the target score is greater than or equal to a preset score threshold; and, if it is, determining the reference sub-segment corresponding to that target score as the target reference sub-segment.
In a second specific implementation, screening out the target reference sub-segment according to the target score includes: normalizing the target score; determining whether the normalized target score is greater than or equal to a preset score threshold; and, if it is, determining the reference sub-segment corresponding to the normalized target score as the target reference sub-segment.
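Both screening variants can be sketched with one small function. The source does not say how the score is normalized, so dividing by a maximum possible score is an assumption made here; the function name `select_segments`, the start offsets, the score values, and the thresholds are likewise illustrative only.

```python
from typing import Dict, List, Optional


def select_segments(
    scores: Dict[int, float],
    threshold: float,
    max_score: Optional[float] = None,
) -> List[int]:
    """Return start offsets of sub-segments whose score reaches the threshold.

    If max_score is given, each score is first normalized to score / max_score
    (an assumed normalization; the source does not define one).
    """
    selected = []
    for start, score in scores.items():
        value = score / max_score if max_score else score
        if value >= threshold:            # "greater than or equal to" per the text
            selected.append(start)
    return sorted(selected)


# Hypothetical scores keyed by sub-segment start offset in the reference.
scores = {0: 100, 50: 86, 100: 93, 150: 65, 200: 58}
print(select_segments(scores, threshold=80))                  # [0, 50, 100]
print(select_segments(scores, threshold=0.9, max_score=100))  # [0, 100]
```

Returning start offsets rather than the sub-segments themselves keeps the result directly usable as matching positions in the reference.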
Step S15: determine the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
Once the target reference sub-segment has been determined, its position in the target reference sequence can be taken as the exact matching position of the read to be aligned in the target reference sequence.
In summary, this application first segments the read to be aligned to obtain its target seeds and segments the pre-obtained target reference sequence to obtain reference sub-segments. It then compares all the target seeds of the read with the reference sub-segments according to a preset rule to obtain a target score, screens out a target reference sub-segment according to the target score, and determines the position of that sub-segment in the target reference sequence as the exact matching position of the read. In this way, invalid matching positions are filtered out, computing resources are saved, the performance of sequence alignment is improved, and the workload of subsequent sequence extension is reduced.
Referring to Figure 2, an embodiment of the present application discloses a specific sequence alignment method, which includes:
Step S21: segment the read to be aligned to obtain the target seeds corresponding to the read.
Step S22: segment the pre-obtained target reference sequence to obtain reference sub-segments.
Step S23: compare all the target seeds of the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score.
In a specific implementation, comparing the target seeds with the reference sub-segments according to the preset rule to obtain the target score includes: comparing all the target seeds of the read with each of the reference sub-segments according to the preset rule to obtain a target score.
In a first specific implementation, comparing all the target seeds of the read to be aligned with any one reference sub-segment according to the preset rule to obtain the target score includes: initializing the target score; comparing all the target seeds of the read with that reference sub-segment, and determining whether a target condition occurs during the comparison; and, if a target condition occurs, subtracting the corresponding penalty from the target score to obtain the target score of that reference sub-segment.
Determining whether a target condition occurs includes: determining whether all the target seeds of the read hit the reference sub-segment but at non-consecutive positions; or determining whether first target seeds among the target seeds hit the reference sub-segment while second target seeds among them do not. Here the first target seeds and the second target seeds together make up all the target seeds of the read. The second condition covers two cases: the first target seeds hit the sub-segment at consecutive positions while the second target seeds miss; and the first target seeds hit the sub-segment at non-consecutive positions while the second target seeds miss.
Specifically, the target score is first initialized, that is, every target seed is first given the same target score, on the assumption that all target seeds hit the corresponding reference sub-segment and that the hit positions within one sub-segment are consecutive. All the target seeds of the read are then compared with the reference sub-segment. If all target seeds hit the sub-segment consecutively, the initialized target score is taken as the final target score. If all target seeds hit but the hit positions are not consecutive, or if some target seeds hit the sub-segment and others do not, the corresponding penalty is subtracted. For example, when all target seeds hit but at non-consecutive positions, 7 points are subtracted from the initialized target score for each extra position that appears between the hit positions. When some target seeds hit and others miss, 7 points are subtracted from the initialized target score for each target seed that misses.
Figure 3 shows the process of precisely screening matching positions in sequence alignment. The read to be aligned (read) is segmented into target seeds, shown as 1 to 8 in the figure, and the target reference sequence (Reference) is segmented into reference sub-segments R_segment 1 to R_segment n. On the assumption that all target seeds of the read hit a reference sub-segment consecutively, the initialized target score is set to 100. Target seeds 0 to 8 all hit R_segment 1 at consecutive positions, giving a target score of 100. All target seeds also hit R_segment 2, but two extra positions appear among the hit positions, so two 7-point penalties (14 points) are deducted, giving a target score of 86. R_segment 3 has one target seed that misses, giving a target score of 93. By the same rule, R_segment n-1 receives a target score of 65 and R_segment n receives a target score of 58.
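The penalty arithmetic of the Figure 3 example (start at 100, deduct 7 per extra position and 7 per missed seed) can be reproduced with the sketch below. It assumes that consecutive seeds are offset by one base, so that seed j is expected one position after seed j-1, and it ignores hits that land earlier than expected; the input format (one hit position or `None` per seed) and the name `penalty_score` are illustrative choices, not taken from the source.

```python
from typing import Optional, Sequence

INITIAL_SCORE = 100   # initial score used in the Figure 3 example
PENALTY = 7           # deduction per extra position or missed seed in that example


def penalty_score(hit_positions: Sequence[Optional[int]]) -> int:
    """Score one reference sub-segment from per-seed hit positions (None = miss)."""
    score = INITIAL_SCORE
    prev_index: Optional[int] = None
    prev_pos: Optional[int] = None
    for index, pos in enumerate(hit_positions):
        if pos is None:
            score -= PENALTY                      # one penalty per missed seed
            continue
        if prev_index is not None and prev_pos is not None:
            expected = prev_pos + (index - prev_index)
            extra = pos - expected                # extra positions between hits
            if extra > 0:
                score -= PENALTY * extra
        prev_index, prev_pos = index, pos
    return score


print(penalty_score([0, 1, 2, 3, 4, 5, 6, 7, 8]))      # 100: all hits consecutive
print(penalty_score([0, 1, 2, 3, 6, 7, 8, 9, 10]))     # 86: two extra positions
print(penalty_score([0, 1, 2, None, 4, 5, 6, 7, 8]))   # 93: one seed misses
```

The three calls mirror R_segment 1, R_segment 2, and R_segment 3 in the example above.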
In a second specific implementation, comparing all the target seeds of the read to be aligned with any one reference sub-segment according to the preset rule to obtain the target score includes: initializing the target score; comparing all the target seeds of the read with that reference sub-segment; and, if a target seed hits the reference sub-segment, adding the corresponding reward to the target score to obtain the target score of that reference sub-segment.
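A reward-based variant might look like the sketch below. The source gives neither the initial value nor the size of the reward, so starting from 0 and adding 1 point per hit are assumptions chosen only to make the example concrete, and a hit is again taken to mean an exact substring match.

```python
from typing import List


def reward_score(seeds: List[str], segment: str, reward: int = 1, initial: int = 0) -> int:
    """Add `reward` to the score for every seed found in the sub-segment."""
    score = initial
    for seed in seeds:
        if seed in segment:        # exact-match hit (assumed definition of a hit)
            score += reward
    return score


seeds = ["ACG", "CGT", "GTT"]      # hypothetical short seeds
print(reward_score(seeds, "ACGTA"))  # 2: "ACG" and "CGT" hit, "GTT" does not
```

Under this scheme a higher score means more seeds hit, so the same threshold comparison can be applied afterwards.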
Step S24: Determine whether the target score is greater than or equal to a preset score threshold.
After the target score is obtained, a target reference sub-segment still needs to be determined from the reference sub-segments according to the target score. Specifically, it may be determined whether the target score is greater than or equal to the preset score threshold; if so, the reference sub-segment corresponding to that target score is determined as the target reference sub-segment.
Step S25: If the target score is greater than or equal to the preset score threshold, determine the reference sub-segment corresponding to the target score as the target reference sub-segment.
Step S26: Determine the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
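Steps S24 to S26 amount to a threshold filter over the scored reference sub-segments. The sketch below assumes each sub-segment is represented by its start offset in the target reference sequence paired with its target score; the function name, the offsets, and the threshold of 85 are hypothetical, while the scores are those of the Figure 3 example.

```python
from typing import List, Tuple


def exact_match_positions(scored: List[Tuple[int, int]],
                          threshold: int) -> List[int]:
    """Return the start offsets of sub-segments whose score passes.

    scored holds (start offset in the reference, target score) pairs.
    """
    return [start for start, score in scored if score >= threshold]


# Hypothetical offsets paired with the scores from the example.
scored = [(0, 100), (150, 86), (300, 93), (450, 65), (600, 58)]
print(exact_match_positions(scored, threshold=85))
```

Only the positions that survive this filter would be handed to the later, more expensive extension stage.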
Referring to Figure 4, an embodiment of the present application discloses a sequence alignment apparatus, including:
a first segmentation module 11, configured to segment the read to be aligned to obtain the target seeds corresponding to the read to be aligned;
a second segmentation module 12, configured to segment a pre-obtained target reference sequence to obtain reference sub-segments;
a comparison module 13, configured to compare all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score;
a segment screening module 14, configured to screen out a target reference sub-segment from the reference sub-segments according to the target score;
a position determination module 15, configured to determine the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
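The five modules can be read as one small pipeline. The Python sketch below is an illustration under several assumptions rather than the apparatus itself: seeds and reference sub-segments are cut at fixed, non-overlapping lengths, a hit is an exact substring match, only missed seeds are penalized, and all function names and parameter values are hypothetical.

```python
from typing import List, Tuple


def segment_read(read: str, seed_len: int) -> List[str]:
    """Module 11: cut the read into non-overlapping seeds (a trailing
    remainder shorter than seed_len is dropped in this sketch)."""
    return [read[i:i + seed_len]
            for i in range(0, len(read) - seed_len + 1, seed_len)]


def segment_reference(ref: str, seg_len: int) -> List[Tuple[int, str]]:
    """Module 12: cut the reference into sub-segments, keeping offsets."""
    return [(i, ref[i:i + seg_len]) for i in range(0, len(ref), seg_len)]


def compare(seeds: List[str], segment: str,
            initial: int = 100, penalty: int = 7) -> int:
    """Module 13: subtract one penalty per seed missing from the segment."""
    misses = sum(1 for seed in seeds if seed not in segment)
    return initial - penalty * misses


def align(read: str, ref: str, seed_len: int, seg_len: int,
          threshold: int) -> List[int]:
    """Modules 14 and 15: keep sub-segments whose score passes the
    threshold and report their offsets as exact matching positions."""
    seeds = segment_read(read, seed_len)
    positions = []
    for start, segment in segment_reference(ref, seg_len):
        if compare(seeds, segment) >= threshold:
            positions.append(start)
    return positions


ref = "A" * 12 + "ACGTTGCAGGCC" + "T" * 12
print(align("ACGTTGCAGGCC", ref, seed_len=4, seg_len=12, threshold=90))
```

Here only the middle sub-segment contains all three seeds, so its offset (12) is the single position reported.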
It can be seen that the present application first segments the read to be aligned to obtain the corresponding target seeds and segments the pre-obtained target reference sequence to obtain reference sub-segments. It then compares all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score, screens out a target reference sub-segment from the reference sub-segments according to the target score, and determines the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence. In this way, invalid matching positions are filtered out, computing resources are saved, the performance of sequence alignment is improved, and the workload of subsequent sequence extension is reduced.
Further, referring to Figure 5, an embodiment of the present application also discloses a sequence alignment device, including a processor 21 and a memory 22.
The memory 22 is configured to store a computer program; the processor 21 is configured to execute the computer program to implement the sequence alignment method disclosed in the foregoing embodiments.
For the specific process of the above sequence alignment method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
Further, an embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
segmenting the read to be aligned to obtain the target seeds corresponding to the read to be aligned; segmenting a pre-obtained target reference sequence to obtain reference sub-segments; comparing all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score; screening out a target reference sub-segment from the reference sub-segments according to the target score; and determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
It can be seen that, by segmenting the read to be aligned and the pre-obtained target reference sequence, scoring all the target seeds against the reference sub-segments according to a preset rule, screening out the target reference sub-segment according to the target score, and taking its position in the target reference sequence as the exact matching position of the read to be aligned, these steps filter out invalid matching positions, save computing resources, improve the performance of sequence alignment, and reduce the workload of subsequent sequence extension.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following step may be specifically implemented: comparing all the target seeds in the read to be aligned with each of the reference sub-segments according to a preset rule to obtain a target score.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: initializing the target score; comparing all the target seeds in the read to be aligned with the reference sub-segment and determining whether a target situation occurs during the comparison; and, if a target situation occurs during the comparison, subtracting the corresponding penalty points from the target score to obtain the target score corresponding to that reference sub-segment.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: determining whether, during the comparison, all the target seeds in the read to be aligned hit the reference sub-segment but the hit positions on the reference sub-segment are not consecutive; or determining whether, during the comparison, a first target seed among all the target seeds in the read to be aligned hits the reference sub-segment while a second target seed among them does not hit the reference sub-segment.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: initializing the target score; comparing all the target seeds in the read to be aligned with the reference sub-segment; and, if a target seed hits the reference sub-segment, adding the corresponding reward points to the target score to obtain the target score corresponding to that reference sub-segment.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: determining whether the target score is greater than or equal to a preset score threshold; and, if the target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the target score as the target reference sub-segment.
In this embodiment, when the computer subprogram stored in the computer-readable storage medium is executed by the processor, the following steps may be specifically implemented: normalizing the target score; determining whether the normalized target score is greater than or equal to a preset score threshold; and, if the normalized target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the normalized target score as the target reference sub-segment.
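The normalization step can be illustrated with a short Python sketch. The passage above does not say how the target score is normalized, so dividing by the maximum attainable score is purely an assumption made here for illustration; the function name and the threshold of 0.9 are likewise hypothetical.

```python
from typing import List


def normalized_hits(scores: List[int],
                    max_score: int,
                    threshold: float) -> List[int]:
    """Return the indices of sub-segments whose normalized score passes.

    Normalization here is score / max_score (an assumed scheme), which
    makes one threshold usable for different initial scores.
    """
    normalized = [score / max_score for score in scores]
    return [i for i, value in enumerate(normalized) if value >= threshold]


# Scores of the five sub-segments from the earlier example.
print(normalized_hits([100, 86, 93, 65, 58], max_score=100, threshold=0.9))
```

One reason to normalize before thresholding is that penalty-based and reward-based scores live on different scales; mapping both to a common range lets a single preset threshold apply.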
Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are performed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory in computer-readable media, in forms such as random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

  1. A sequence alignment method, characterized by comprising:
    segmenting a read to be aligned to obtain target seeds corresponding to the read to be aligned;
    segmenting a pre-obtained target reference sequence to obtain reference sub-segments;
    comparing all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score;
    screening out a target reference sub-segment from the reference sub-segments according to the target score;
    determining the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
  2. The sequence alignment method according to claim 1, characterized in that comparing all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score comprises:
    comparing all the target seeds in the read to be aligned with each of the reference sub-segments according to the preset rule to obtain a target score.
  3. The sequence alignment method according to claim 2, characterized in that comparing all the target seeds in the read to be aligned with any one reference sub-segment according to the preset rule to obtain a target score comprises:
    initializing the target score;
    comparing all the target seeds in the read to be aligned with the reference sub-segment, and determining whether a target situation occurs during the comparison;
    if a target situation occurs during the comparison, subtracting the corresponding penalty points from the target score to obtain the target score corresponding to the reference sub-segment.
  4. The sequence alignment method according to claim 3, characterized in that determining whether a target situation occurs during the comparison comprises:
    determining whether, during the comparison, all the target seeds in the read to be aligned hit the reference sub-segment but the hit positions on the reference sub-segment are not consecutive;
    or, determining whether, during the comparison, a first target seed among all the target seeds in the read to be aligned hits the reference sub-segment while a second target seed among all the target seeds in the read to be aligned does not hit the reference sub-segment.
  5. The sequence alignment method according to claim 2, characterized in that comparing all the target seeds in the read to be aligned with any one reference sub-segment according to the preset rule to obtain a target score comprises:
    initializing the target score;
    comparing all the target seeds in the read to be aligned with the reference sub-segment;
    if a target seed hits the reference sub-segment, adding the corresponding reward points to the target score to obtain the target score corresponding to the reference sub-segment.
  6. The sequence alignment method according to claim 1, characterized in that screening out a target reference sub-segment from the reference sub-segments according to the target score comprises:
    determining whether the target score is greater than or equal to a preset score threshold;
    if the target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the target score as the target reference sub-segment.
  7. The sequence alignment method according to claim 1, characterized in that screening out a target reference sub-segment from the reference sub-segments according to the target score comprises:
    normalizing the target score;
    determining whether the normalized target score is greater than or equal to a preset score threshold;
    if the normalized target score is greater than or equal to the preset score threshold, determining the reference sub-segment corresponding to the normalized target score as the target reference sub-segment.
  8. A sequence alignment apparatus, characterized by comprising:
    a first segmentation module, configured to segment a read to be aligned to obtain target seeds corresponding to the read to be aligned;
    a second segmentation module, configured to segment a pre-obtained target reference sequence to obtain reference sub-segments;
    a comparison module, configured to compare all the target seeds in the read to be aligned with the reference sub-segments according to a preset rule to obtain a target score;
    a segment screening module, configured to screen out a target reference sub-segment from the reference sub-segments according to the target score;
    a position determination module, configured to determine the position of the target reference sub-segment in the target reference sequence as the exact matching position of the read to be aligned in the target reference sequence.
  9. A sequence alignment device, characterized by comprising:
    a memory and a processor;
    wherein the memory is configured to store a computer program;
    the processor is configured to execute the computer program to implement the sequence alignment method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that it is used to store a computer program, wherein the computer program, when executed by a processor, implements the sequence alignment method according to any one of claims 1 to 7.
PCT/CN2020/126350 2020-02-28 2020-11-04 Sequence alignment method, apparatus and device, and medium WO2021169387A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010130211.2A CN111402956A (en) 2020-02-28 2020-02-28 Sequence comparison method, device, equipment and medium
CN202010130211.2 2020-02-28

Publications (1)

Publication Number Publication Date
WO2021169387A1 true WO2021169387A1 (en) 2021-09-02

Family

ID=71430385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/126350 WO2021169387A1 (en) 2020-02-28 2020-11-04 Sequence alignment method, apparatus and device, and medium

Country Status (2)

Country Link
CN (1) CN111402956A (en)
WO (1) WO2021169387A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402956A (en) * 2020-02-28 2020-07-10 苏州浪潮智能科技有限公司 Sequence comparison method, device, equipment and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2725509A1 (en) * 2012-10-29 2014-04-30 Samsung SDS Co. Ltd. System and method for aligning genome sequence
CN106682393A (en) * 2016-11-29 2017-05-17 北京荣之联科技股份有限公司 Genomic sequence alignment method and genomic sequence alignment device
CN108985008A (en) * 2018-06-29 2018-12-11 郑州云海信息技术有限公司 A kind of method and Compare System of quick comparison gene data
CN109887547A (en) * 2019-03-06 2019-06-14 苏州浪潮智能科技有限公司 A kind of gene order compares filtering accelerated processing method, system and device
CN110379461A (en) * 2019-06-28 2019-10-25 苏州浪潮智能科技有限公司 A kind of gene data comparison method, device, equipment and medium
CN110517728A (en) * 2019-08-29 2019-11-29 苏州浪潮智能科技有限公司 A kind of gene order comparison method and device
CN110797085A (en) * 2019-10-25 2020-02-14 浪潮(北京)电子信息产业有限公司 Method, system, equipment and storage medium for inquiring gene data
CN111402956A (en) * 2020-02-28 2020-07-10 苏州浪潮智能科技有限公司 Sequence comparison method, device, equipment and medium

Also Published As

Publication number Publication date
CN111402956A (en) 2020-07-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921484

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921484

Country of ref document: EP

Kind code of ref document: A1