CN114540471B - Method and system for performing comparison by using missing nucleic acid sequencing information - Google Patents

Info

Publication number
CN114540471B
Authority
CN
China
Prior art keywords
sequencing
nucleic acid
sequence
signal
alignment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210104037.3A
Other languages
Chinese (zh)
Other versions
CN114540471A (en
Inventor
周文雄
吴思彧
张春艳
李昂
乔朔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202210104037.3A
Publication of CN114540471A
Application granted
Publication of CN114540471B
Legal status: Active
Anticipated expiration

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Zoology (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Wood Science & Technology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for performing alignment using nucleic acid sequencing information with missing data. During sequencing, sequencing signals are acquired only in a preselected subset of the sequencing reaction cycles and are not acquired in the remaining cycles. The acquired sequencing signals are encoded into a sequence to obtain a missing nucleic acid sequence; the reference sequence is encoded in the same way to obtain a missing reference sequence; and the missing nucleic acid sequence is then aligned to the missing reference sequence. Because sequencing signals are acquired in only some of the sequencing cycles, the method further reduces run time relative to existing high-throughput sequencing technology without affecting subsequent bioinformatics analysis and interpretation.

Description

Method and system for performing comparison by using missing nucleic acid sequencing information
Technical Field
The invention relates to a method and a system for performing alignment using missing nucleic acid sequencing information, and belongs to the field of gene sequencing.
Background
High-throughput sequencing technology has been widely used in biological research and clinical diagnostics. Most high-throughput sequencing techniques proceed in multiple cycles, each cycle comprising two sequential sub-steps: a chemical reaction and signal acquisition. Signal acquisition tends to be very time-consuming, and the higher the throughput of the sequencer, the larger the chip area that must be scanned and the longer signal acquisition takes. Sequencing duration has become one of the major factors limiting further throughput improvement in high-throughput sequencers.
The invention provides a method for nucleic acid sequencing with missing information and for aligning the resulting sequences. The method reduces the signal acquisition time in high-throughput sequencing and thus provides a means of rapid sequencing; it also provides a corresponding sequence alignment method, so that conventional bioinformatics analysis can still be carried out.
Disclosure of Invention
The invention discloses a method for performing alignment using missing nucleic acid sequencing information, characterized by comprising the following steps:
carrying out a plurality of sequencing chemical reaction cycles on the nucleic acid molecules to be tested using a gene sequencing chip, wherein at least one cycle is preselected and signal acquisition is performed only in the at least one preselected cycle, while in the other cycles only the sequencing chemical reaction is carried out and no signal acquisition is performed; encoding the acquired signals into a sequence to obtain a missing nucleic acid sequence; encoding a reference sequence corresponding to the nucleic acid molecule to be tested in the same encoding manner to obtain a missing reference sequence; and aligning the missing nucleic acid sequence with the missing reference sequence.
The sequencing chemical reaction cycle comprises providing a sequencing reagent, incorporating a nucleotide monomer carrying a detectable label from the sequencing reagent into the nucleic acid molecule to be tested to carry out the sequencing chemical reaction, and either performing or not performing signal acquisition on the sequencing signal generated by the detectable label.
According to a preferred embodiment, the sequencing chemistry comprises pyrosequencing, semiconductor sequencing, fluorogenic sequencing, and cyclic reversible termination sequencing.
According to a preferred embodiment, the fluorogenic sequencing is a sequencing reaction in which the 3' end is not blocked and which uses two different sequencing reagents, a first sequencing reagent and a second sequencing reagent, added alternately in cycles; the first sequencing reagent comprises two different nucleotide monomers having a detectable label; the second sequencing reagent comprises two nucleotide monomers having a detectable label, both different from the nucleotide monomers present in the first sequencing reagent; the second sequencing reagent is provided after the first sequencing reagent; and the label generates a detectable signal upon incorporation of the nucleotide monomers into the nucleic acid sequence to be tested.
According to a preferred embodiment, the fluorogenic sequencing is a sequencing reaction in which the 3' end is not blocked and which uses two different sequencing reagents, a first sequencing reagent and a second sequencing reagent, added alternately in cycles; the first sequencing reagent comprises three different nucleotide monomers having a detectable label; the second sequencing reagent comprises one nucleotide monomer having a detectable label, different from the nucleotide monomers present in the first sequencing reagent; the second sequencing reagent is provided after the first sequencing reagent; and the label generates a detectable signal upon incorporation of the nucleotide monomer into the nucleic acid sequence to be tested.
According to a preferred embodiment, the detectable signal is acquired after the first sequencing reagent is provided and is not acquired after the second sequencing reagent is provided.
According to a preferred embodiment, the detectable signal is acquired after the second sequencing reagent is provided and is not acquired after the first sequencing reagent is provided.
According to a preferred embodiment, an oil seal is required in the sequencing chemical reaction cycles in which signal acquisition is performed, whereas in the cycles without signal acquisition the oil seal may be performed or omitted. To form the oil seal, an aqueous-phase fluid is first introduced into the fluid chamber of the chip through a fluid inlet, and an oil-phase fluid is then introduced to displace the aqueous-phase fluid from the fluid chamber, while part of the aqueous-phase fluid remains sealed in the micro-reaction chambers on the chip surface, forming reaction units that are isolated from one another.
According to a preferred embodiment, encoding the acquired signals into a sequence means that, for a sequencing cycle in which signal acquisition is performed, the sequence is written with the base symbol corresponding to the signal, the number of base symbols corresponding to the signal intensity; for a cycle in which no signal acquisition is performed, a single placeholder is written at the corresponding position in the sequence.
According to a preferred embodiment, the base symbol and the placeholder are each preferably one or more of A, G, C, T/U, and the base symbol and the placeholder are different from each other.
According to a preferred embodiment, the reference sequence is a reference genome, or a reference transcriptome, or a subset of a reference genome, or a subset of a reference transcriptome.
According to a preferred embodiment, the missing reference sequence is one set of sequences or a plurality of sets of sequences. When there are a plurality of sets, the missing nucleic acid sequence is aligned against each set separately and the better alignment result is selected from among them. The "better alignment" may be the one with the higher alignment quality among the multiple results, the one with the longer aligned portion of the sequence, the one with fewer alignment errors, or the one that aligns to a specific region of the reference sequence.
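The selection among several alignment results can be sketched as follows. This is only an illustration under stated assumptions: the `Alignment` record, its field names, and the tie-breaking order (quality, then aligned length, then errors) are hypothetical and are not prescribed by the text, which lists the criteria as alternatives.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one alignment result (not a real aligner's output)."""
    reference_set: str   # name of the encoded reference set that was searched
    quality: int         # alignment quality reported by the aligner (higher is better)
    aligned_length: int  # number of read positions covered by the alignment
    errors: int          # mismatches plus gaps in the alignment


def pick_better(alignments: Iterable[Alignment]) -> Optional[Alignment]:
    """Return the preferred alignment, or None when there are no candidates.

    Assumed ordering: higher quality first, then longer aligned portion,
    then fewer errors.
    """
    candidates = list(alignments)
    if not candidates:
        return None
    return max(candidates, key=lambda a: (a.quality, a.aligned_length, -a.errors))


hits = [
    Alignment("forward", quality=30, aligned_length=80, errors=3),
    Alignment("reverse_complement", quality=42, aligned_length=95, errors=1),
]
best = pick_better(hits)
print(best.reference_set)
```

A criterion such as "aligns to a specific region of the reference" would be added as a filter before the `max` call.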
According to a preferred embodiment, the missing nucleic acid sequences are aligned to the missing reference sequence to obtain a sequence alignment, using software or algorithms including, but not limited to, the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, Bowtie, Bowtie2, BWA, SOAP, BLAST, ELAND, TMAP, MAQ, minimap2, and SHRiMP.
According to a preferred embodiment, the method further comprises performing a bioinformatics analysis of the sequence alignment; the bioinformatics analysis includes detecting genetic variation, detecting gene expression levels, detecting the alternative splicing state of RNA, detecting the gene modification state, identifying the species or individual from which the nucleic acid is derived, detecting the three-dimensional structure of the genome, detecting nucleic acid-nucleic acid interactions, detecting nucleic acid-protein interactions, detecting chromatin accessibility, analyzing RNA structure, and the like.
The invention also discloses a system for performing alignment using missing nucleic acid sequencing information, comprising a processor, a storage medium, and a computer program, the system being used to implement the above method for performing alignment using missing nucleic acid sequencing information.
Advantages of the invention
Compared with the methods mentioned in the Background, the method has the following advantages:
1. Sequencing signals are acquired in only some of the sequencing cycles, which further reduces run time relative to existing high-throughput sequencing techniques without affecting subsequent bioinformatics analysis and interpretation and/or clinical diagnosis.
2. Sequences containing missing information can be aligned and analyzed. The term "missing information" as used herein means that part of the sequence information in the resulting sequence is absent.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the detailed description are briefly described below. The drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 shows a schematic diagram of the missing-information sequencing workflow, in which 101 denotes a carrier (e.g., a microsphere) for immobilizing the nucleic acid fragment to be tested on the chip, 102 denotes a sequencing primer, 103 denotes the nucleic acid molecule to be tested, and 104 denotes a signal (e.g., a fluorescent signal) released by the sequencing reaction.
Detailed Description
High-throughput sequencing technology is now mature, and the demands on sequencing speed keep rising: in both basic research and clinical diagnosis and treatment, sequencing results are expected in the shortest possible time. In existing second-generation sequencing technology, a time-consuming step is the acquisition of the sequencing signals, because every sequencing chemical reaction cycle must be followed by a separate signal acquisition process; hundreds of sequencing cycles therefore require hundreds of signal acquisitions, which occupy nearly half of the sequencing time. If the signals of only some sequencing cycles could be acquired without affecting the final sequence alignment and bioinformatics analysis, the time required for sequencing would be greatly shortened and the needs of basic research and clinical diagnosis and treatment would be better met.
In view of this, the invention discloses a method for performing alignment using missing nucleic acid sequencing information, characterized by comprising the following steps:
carrying out a plurality of sequencing chemical reaction cycles on the nucleic acid molecules to be tested using a gene sequencing chip, wherein at least one cycle is preselected and signal acquisition is performed only in the at least one preselected cycle, while in the other cycles only the sequencing chemical reaction is carried out and no signal acquisition is performed; encoding the acquired signals into a sequence to obtain a missing nucleic acid sequence; encoding a reference sequence corresponding to the nucleic acid molecule to be tested in the same encoding manner to obtain a missing reference sequence; and aligning the missing nucleic acid sequence with the missing reference sequence.
The sequencing chemical reaction cycle comprises providing a sequencing reagent, incorporating a nucleotide monomer carrying a detectable label from the sequencing reagent into the nucleic acid molecule to be tested to carry out the sequencing chemical reaction, and either performing or not performing signal acquisition on the sequencing signal generated by the detectable label.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. To better disclose the method and content of the present invention, the key terms are explained in detail below.
Nucleic acid molecule: the template polynucleotide obtained by fragmenting the nucleic acid to be tested into fragments of 10 bp to 10 kb and constructing a library. The nucleic acid molecule may be derived from a sample of biological fluid, cells, tissues, organs, or organisms. Such samples include, but are not limited to, blood, sputum, amniotic fluid, biopsy samples (e.g., surgical biopsies, fine needle biopsies, etc.), urine, peritoneal fluid, pleural fluid, tissue explants, organ cultures, secretions, and any other tissue or cell preparations, or fractions or derivatives thereof, or substances isolated therefrom.
Missing sequence: a sequence containing missing information, where "missing information" means that part of the sequence information in the resulting sequence is absent. For example, the sequence ATTCGNNTTT is a missing sequence, in which N indicates that the sequence information at that position is unknown, i.e., missing. The missing sequences of the invention comprise the missing nucleic acid sequence and the missing reference sequence, where the missing reference sequence is a fully known sequence rewritten, using the same encoding as the missing nucleic acid sequence, so that the same part of its sequence information is absent.
Fluorogenic sequencing: a sequencing reaction carried out on a nucleic acid substrate using fluorogenic nucleotides, a nucleic acid polymerase (DNA polymerase), and a phosphatase. The DNA polymerase first incorporates the fluorogenic nucleotide into the nucleic acid substrate, releasing a phosphorylated fluorogenic fluorophore; the phosphatase then hydrolyzes the phosphate groups, releasing the fluorophore in a changed fluorescence state. By detecting the change in fluorescence (intensity and spectrum) of the fluorogenic fluorophore, information about the nucleotide incorporated in the extension reaction can be obtained. The fluorogenic fluorophore is a detectable label on a nucleotide monomer.
2+2 sequencing: a new form of sequencing-by-synthesis technology, in which the 3' end is not blocked and two different sequencing reagents, a first sequencing reagent and a second sequencing reagent, are added alternately in cycles. The first sequencing reagent comprises two different nucleotide monomers having a detectable label; the second sequencing reagent comprises two nucleotide monomers having a detectable label, both different from the nucleotide monomers present in the first sequencing reagent; the second sequencing reagent is provided after the first sequencing reagent; and the label generates a detectable signal upon incorporation of the nucleotide monomers into the nucleic acid sequence to be tested. Except possibly for the first round after sample loading, one or more bases are extended in every round, so there are no empty rounds of sequencing; an empty round is a round in which no base extension occurs. The nucleotide monomers of the first and second sequencing reagents can be combined in 3 possible ways, namely AC/GT, AG/CT, or AT/CG, written as M/K, R/Y, or W/S in the standard degenerate base notation; for example, M represents A and/or C, and K represents G and/or T. See Table 1 for details.
Table 1. Letters representing degenerate bases
Letter    Bases represented
M         A/C
K         G/T
R         A/G
Y         C/T
W         A/T
S         C/G
B         C/G/T
D         A/G/T
H         A/C/T
V         A/C/G
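The letters in Table 1 can be used to enumerate the reagent combinations mechanically. The following is a minimal sketch; the function name `complementary_reagent_pairs` is hypothetical, and the code only checks that two reagents together cover A, C, G, T without overlap.

```python
# Degenerate-base letters from Table 1, mapped to the bases they represent.
DEGENERATE = {
    "M": "AC", "K": "GT",
    "R": "AG", "Y": "CT",
    "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def complementary_reagent_pairs(first_size):
    """List (first, second) letter pairs that split A, C, G, T between two
    reagents, with `first_size` bases in the first reagent."""
    symbols = {**DEGENERATE, "A": "A", "C": "C", "G": "G", "T": "T"}
    all_bases = set("ACGT")
    pairs = []
    for first, first_bases in symbols.items():
        if len(first_bases) != first_size:
            continue
        for second, second_bases in symbols.items():
            covers_all = set(first_bases) | set(second_bases) == all_bases
            disjoint = not (set(first_bases) & set(second_bases))
            if covers_all and disjoint:
                pairs.append((first, second))
    return pairs


print(complementary_reagent_pairs(2))  # 2+2 combinations, in both reagent orders
print(complementary_reagent_pairs(3))  # three-base reagent paired with the remaining base
```

With `first_size=2` the result contains M/K, R/Y and W/S in both orders, matching the three 2+2 combinations described above.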
1+3 sequencing: similar to 2+2 sequencing, this is also a sequencing reaction in which the 3' end is not blocked and two different sequencing reagents, a first sequencing reagent and a second sequencing reagent, are added alternately in cycles. The first sequencing reagent comprises three different nucleotide monomers having a detectable label; the second sequencing reagent comprises one nucleotide monomer having a detectable label, different from the nucleotide monomers present in the first sequencing reagent; the second sequencing reagent is provided after the first sequencing reagent; and the label generates a detectable signal upon incorporation of the nucleotide monomer into the nucleic acid sequence to be tested. For brevity, a cycle in which only 1 nucleotide monomer is supplied is called the 1-substrate stage, and a cycle in which 3 nucleotide monomers are supplied is called the 3-substrate stage. In theory, 1+3 sequencing reactions are less balanced than 2+2 sequencing reactions: a cycle that adds only one nucleotide monomer can react to completion quickly, whereas a cycle that adds three nucleotide monomers may extend much further, so the reaction substrate in a small reaction volume may be insufficient for complete reaction, causing dephasing that affects subsequent reactions. For this reason, 2+2 sequencing is currently used more often than 1+3 sequencing.
Alignment (align or alignment): "alignment" is a common concept in bioinformatics, where it is often used to compare similarities between different nucleic acids or between different proteins. In the present invention, alignment refers specifically to the process of comparing a missing nucleic acid sequence with a missing reference sequence to determine whether the reference sequence contains the encoded missing nucleic acid sequence. Common sequence alignment algorithms and software include, but are not limited to, the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, Bowtie, Bowtie2, BWA, SOAP, BLAST, ELAND, TMAP, MAQ, minimap2, SHRiMP, and the like.
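As an illustration of the first algorithm in this list, a minimal Smith-Waterman local alignment score can be sketched as below. The scoring values (match 2, mismatch -1, linear gap -2) and the example strings are assumptions chosen for the illustration; production aligners such as BWA or Bowtie2 use indexing and heuristics that are not shown here.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Standard Smith-Waterman recurrence with a linear gap penalty,
    keeping only two rows of the dynamic-programming matrix.
    """
    previous = [0] * (len(target) + 1)
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            up = previous[j] + gap          # gap in the target
            left = current[j - 1] + gap     # gap in the query
            cell = max(0, diagonal, up, left)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# An encoded read that occurs exactly inside an encoded reference
# scores match * len(read).
print(smith_waterman_score("ATAAT", "TTATAATAT"))
```

Because both the read and the reference are ordinary strings over base symbols after encoding, an unmodified aligner of this kind can be applied to them.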
Encoding: encoding is the process of converting information from one form or format to another. The invention encodes the acquired signals into a sequence: for a sequencing cycle in which signal acquisition is performed, the sequence is written with the base symbol corresponding to the signal, the number of base symbols corresponding to the signal intensity; for a cycle in which no signal acquisition is performed, a single placeholder is written at the corresponding position in the sequence. The reference sequence must be encoded in the same way as the sequenced nucleic acid. The base symbol and the placeholder are each preferably one or more of A, G, C, T/U, and the base symbol and the placeholder are different from each other.
Oil seal: when the signal released by the sequencing reaction is in a free state, it must be confined to the place where the sequencing reaction occurs so that the sequencing signal can be recorded accurately. Because the sequencing reaction solution is an aqueous-phase liquid, the reaction chambers can be sealed with an oil-phase liquid that is immiscible with water. Specifically, the aqueous-phase fluid is first introduced into the fluid chamber through a fluid inlet, and the oil-phase fluid is then introduced to displace the aqueous-phase fluid from the fluid chamber, while part of the aqueous-phase fluid remains sealed in the micro-reaction chambers on the chip surface, forming reaction units that are isolated from one another; this process is called oil sealing. The oil-sealing agent may be selected from various electronic fluorinated liquids from 3M, such as Novec™ electronic fluorinated liquid 71DA, Novec™ electronic fluorinated liquid 71IPA, Novec™ electronic fluorinated liquid, Novec™ 7300 electronic fluorinated liquid, Novec™ 7200 electronic fluorinated liquid, Novec™ 7000 electronic fluorinated liquid, Novec™ 7500 electronic fluorinated liquid, FC-3284, FC-72, FC-3283, FC-40, and the like.
Sequencing chemical reaction cycle (cycle): also called a sequencing round. A sequencing chemical reaction cycle comprises providing a sequencing reagent, incorporating a nucleotide monomer carrying a detectable label from the sequencing reagent into the nucleic acid molecule to be tested to carry out the sequencing chemical reaction, and either performing or not performing signal acquisition on the sequencing signal generated by the detectable label. Traditionally, a cycle is a complete sequencing reaction process that includes both introducing the sequencing reactants to carry out the biochemical reaction and collecting the sequencing signals. In the present invention, some sequencing cycles are not complete cycles in this sense, because they include only the sequencing chemistry and no signal acquisition.
According to a preferred embodiment, the sequencing chemistry comprises pyrosequencing, semiconductor sequencing, fluorogenic sequencing, cyclic reversible termination sequencing, and the like. The method of the invention places no special requirement on the type of sequencing reaction: whether it is the cyclic reversible termination sequencing of Illumina, the semiconductor sequencing of Ion Torrent, or the single-molecule sequencing technology of Helicos, the missing-information sequencing and alignment method can be used as long as the sequencing reaction and the acquisition of the sequencing signal do not take place concurrently.
According to a preferred embodiment, in a conventional fluorogenic sequencing reaction an oil seal is required in every sequencing cycle after the sequencing reaction solution is introduced, so that the fluorescence signal released by the sequencing reaction is confined to the reaction chamber. When missing-information sequencing is performed with the method of the invention, for example in a 2+2 sequencing reaction, sequencing signals can be acquired in all odd-numbered rounds, which require an oil seal, and not acquired in the even-numbered rounds, in which the oil seal may be performed or omitted. Conversely, sequencing signals can be acquired, with an oil seal, in the even-numbered rounds and not acquired in the odd-numbered rounds, in which the oil seal may be performed or omitted. Omitting the oil seal in the reaction cycles without signal acquisition has several advantages. First, it saves further reaction time: adding and washing out the oil-sealing liquid both take time, so over hundreds of sequencing cycles omitting the oil seal saves a large amount of time. Second, nucleotide extension is not confined to the reaction chamber, which favors full extension of the nucleic acid strand and reduces lagging dephasing, so that the effective extension length of the nucleic acid strand is longer, which is helpful for finding genetic variation. Third, about half of the oil-sealing reagent and washing liquid is saved.
In the 3-substrate stage of a 1+3 sequencing reaction, 3 sequencing substrates are introduced at once and the extension may be long. If the reaction is confined to a tiny reaction chamber, the reaction substrates may be insufficient, so that the amplified clusters are not fully extended and dephasing occurs; as the reaction proceeds, the dephasing accumulates and eventually the effective sequencing read length is greatly reduced. Omitting the oil seal in a 3-substrate stage in which no sequencing signals are acquired is therefore even more advantageous: it effectively reduces the incidence of dephasing and increases the read length. Preferably, natural nucleotide substrates (nucleotides without detectable labels) are used in the reaction cycles in which no sequencing signals are acquired, which reduces the cost of the reaction.
According to a preferred embodiment, at least one cycle is preselected and signal acquisition is performed only in the preselected cycles. The cycles are not selected at random from all cycles but according to a rule. For example, for sequencing methods with single nucleotide addition in sequential T-C-A-G cycles, signal acquisition is performed only in the cycles in which a specific base type is added, for example only in the cycles in which A is added, and in the other cycles only the sequencing chemical reaction is performed without signal acquisition; alternatively, signal acquisition is performed only in the cycles in which T or C is added, or only in the cycles in which T, C, or A is added.
According to a preferred embodiment, in the 2+2 sequencing method, signal acquisition is performed in all odd-numbered rounds and not in the even-numbered rounds, or signal acquisition is performed in all even-numbered rounds and not in the odd-numbered rounds. Taking sequential M-K cycles as an example, as shown in FIG. 1, signal acquisition and oil sealing are performed only in the cycles in which M (A and C) is added; in the sequencing reactions in which K (G and T) is added, no signal is acquired and no oil seal is applied.
According to a preferred embodiment, in the 1+3 sequencing method, signal acquisition is performed when the nucleotide in the introduced sequencing reaction solution is complementary to only one base on the nucleic acid sequence to be tested; when the nucleotides in the introduced sequencing reaction solution can be complementary to the other three bases on the nucleic acid sequence to be tested, signal acquisition is not performed.
According to a preferred embodiment, for cyclic reversible termination sequencing, all 4 nucleotides are added in every cycle; signal acquisition is performed only in the odd-numbered rounds and no sequencing signal is acquired in the even-numbered rounds, or signal acquisition is performed only in the even-numbered rounds and no sequencing signal is acquired in the odd-numbered rounds.
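The rule-based cycle selection described in the preceding embodiments can be written down as a simple schedule. The sketch below is illustrative only; the function name `acquisition_schedule` and the reagent labels are hypothetical.

```python
from itertools import cycle, islice


def acquisition_schedule(flow_order, acquire, n_cycles):
    """Return a (reagent, acquire_signal) pair for each of the first n_cycles.

    flow_order: reagents added in repeating order, e.g. ["T", "C", "A", "G"].
    acquire:    the set of reagents whose cycles are imaged.
    """
    return [(reagent, reagent in acquire)
            for reagent in islice(cycle(flow_order), n_cycles)]


# Single nucleotide addition in T-C-A-G order, acquiring only in A cycles.
single_base = acquisition_schedule(["T", "C", "A", "G"], {"A"}, 8)

# 2+2 sequencing in M-K order, acquiring only in M cycles (odd-numbered rounds).
two_plus_two = acquisition_schedule(["M", "K"], {"M"}, 6)

for name, schedule in [("single base", single_base), ("2+2", two_plus_two)]:
    imaged = sum(flag for _, flag in schedule)
    print(name, f"{imaged}/{len(schedule)} cycles imaged")
```

In these two examples a quarter and a half of the cycles are imaged, respectively; the actual time saved depends on how long acquisition takes relative to the chemistry, which this sketch does not model.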
According to a preferred embodiment, after sequencing ends, the acquired signals are encoded into a sequence to obtain a missing nucleic acid sequence. The encoding is as follows: for a sequencing cycle in which signal acquisition is performed, the sequence is written with the base symbol corresponding to the signal, the number of base symbols corresponding to the signal intensity; for a cycle in which no signal acquisition is performed, a single placeholder is written at the corresponding position in the sequence; and for several consecutive cycles without signal acquisition, only one placeholder character is written. The base symbol and the placeholder are each preferably one or more of A, G, C, T/U, and the base symbol and the placeholder are different from each other. The placeholder may also be selected from M, K, R, Y, W, S, B, D, H, V, N; other characters (e.g., X or punctuation marks) may in theory be used, but their effect in alignment is the same as that of N. For example, for a sequencing method with single nucleotide addition in which signal acquisition is performed only in the cycles in which A is added, a detected signal corresponding to 2 A bases is encoded as AA, and for the cycles without signal acquisition a single placeholder character T (or C, or G) is written in the sequence. The reference sequence corresponding to the nucleic acid molecule to be tested is encoded in the same way: every A in the reference sequence is kept unchanged, and each run of 1 or more consecutive non-A characters is changed to a single T (or C, or G).
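The reference encoding for the single-nucleotide-addition case can be sketched with a regular expression. The function name `encode_reference` and its parameters are hypothetical; the defaults follow the example in the text (A acquired, T as placeholder), and the second call illustrates the variant in which T and C cycles are acquired, with A chosen here as the placeholder by assumption.

```python
import re


def encode_reference(reference, kept="A", placeholder="T"):
    """Keep the bases in `kept`; collapse each run of other characters
    into a single placeholder."""
    if not kept:
        raise ValueError("at least one base must be kept")
    if placeholder in kept:
        raise ValueError("placeholder must differ from the kept bases")
    pattern = f"[^{re.escape(kept)}]+"   # one or more consecutive non-kept characters
    return re.sub(pattern, placeholder, reference.upper())


print(encode_reference("ATTCGAAG"))                              # ATAAT
print(encode_reference("ATTCGAAG", kept="TC", placeholder="A"))  # ATTCA
```

In the first call the runs TTCG and G each collapse to one T, while the A homopolymer lengths are preserved.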
For 2+2 sequencing in which M and K are added in alternating cycles and signal acquisition is performed only in the M cycles, a detected signal of 2 M units is encoded as AA, and for each K cycle a single placeholder character T (or G) is written; to encode the reference sequence, all A's are kept unchanged, all C's are changed to A, and every run of 1 or more consecutive K bases (G or T) is changed to 1 T (or G).
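As a hedged illustration of the 2+2 reference encoding just described, the following Python sketch collapses each run of K bases (G or T) to one placeholder and rewrites C as A. The function name `encode_reference_2plus2` is hypothetical, and leaving characters other than A, C, G and T (such as N) untouched is an assumption.

```python
import re


def encode_reference_2plus2(ref: str, placeholder: str = "T") -> str:
    """Encode a reference for 2+2 (M-K) sequencing with acquisition in M cycles only."""
    if placeholder not in ("G", "T"):
        raise ValueError("placeholder is expected to be T or G")
    ref = ref.upper()
    # Every run of one or more K bases (G or T) becomes a single placeholder.
    collapsed = re.sub(r"[GT]+", placeholder, ref)
    # Both M bases (A and C) are written as A; A itself is left unchanged.
    return collapsed.replace("C", "A")


print(encode_reference_2plus2("ACGGTTCAAG"))  # AATAAAT
```

Collapsing the K runs before rewriting C keeps the two substitutions independent, since the placeholder is never C.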
For 1+3 sequencing in which A and B are added in alternating cycles and signal acquisition is performed only in the A cycles, the measured signal is written as A's, and for each B cycle a single placeholder character T (or C, or G) is written; to encode the reference sequence, all A's are kept unchanged, and every run of 1 or more consecutive B bases (C, G or T) is changed to 1 T (or C, or G).
According to a preferred embodiment, for cyclic reversible termination sequencing, all 4 nucleotides are added in each cycle and signal acquisition is performed only in the odd-numbered cycles; the base detected in each odd-numbered cycle is written, and for each even-numbered cycle a single placeholder N is written. The reference sequence is encoded into 2 sets of sequences: the first set keeps the odd-numbered bases unchanged and changes the even-numbered bases to N; the second set keeps the even-numbered bases unchanged and changes the odd-numbered bases to N.
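A minimal sketch of the two-set reference encoding for cyclic reversible termination sequencing is given below. It assumes that the two sets are the two possible phases of the masking (odd positions kept, or even positions kept), so that a read can match the reference whichever position its first cycle falls on; the function names are hypothetical.

```python
from typing import Tuple


def mask_reference(ref: str, keep: str = "odd") -> str:
    """Keep the bases at odd (or even) 1-based positions and write N elsewhere."""
    if keep not in ("odd", "even"):
        raise ValueError("keep must be 'odd' or 'even'")
    keep_remainder = 1 if keep == "odd" else 0
    return "".join(
        base if pos % 2 == keep_remainder else "N"
        for pos, base in enumerate(ref.upper(), start=1)
    )


def encode_reference_crt(ref: str) -> Tuple[str, str]:
    """Return both masked sets (assumed here to be the two phases)."""
    return mask_reference(ref, "odd"), mask_reference(ref, "even")


print(encode_reference_crt("ACGTAC"))  # ('ANGNAN', 'NCNTNC')
```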
According to a preferred embodiment, the reference sequence may be a reference genome, or a reference transcriptome, or a subset of a reference genome, or a subset of a reference transcriptome. For example, when targeted sequencing is used to detect known and novel variations in a particular genome or genomic region, the entire genome is not required as a reference sequence, but only a genomic sequence of the region of interest, i.e., a subset of the reference genome.
According to a preferred embodiment, the missing reference sequence is one set of sequences or a plurality of sets of sequences. For example, in addition to encoding a reference genome, its reverse complement must also be encoded, so the encoding yields more than one set; likewise, when identifying the microbial species from which a nucleic acid is derived, the reference genomes of a plurality of microorganisms must be encoded to form a plurality of sets of reference sequences, in order to observe to which microorganism's reference genome the sequences to be tested align.
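The construction of several encoded reference sets can be sketched as follows. The example uses the 1+3 AB encoding (A kept, runs of non-A bases written as one T) for concreteness; the genome names, the `(name, strand)` keys and the function names are hypothetical and serve only to show that each genome contributes a forward set and a reverse-complement set.

```python
import re
from typing import Dict, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence; non-ACGT characters are left as-is."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def encode_ab(seq: str) -> str:
    """1+3 AB encoding: keep A, collapse each run of C/G/T into a single T."""
    return re.sub(r"[CGT]+", "T", seq.upper())


def build_reference_sets(genomes: Dict[str, str]) -> Dict[Tuple[str, str], str]:
    """Encode every genome and its reverse complement as separate reference sets."""
    sets = {}
    for name, seq in genomes.items():
        sets[(name, "+")] = encode_ab(seq)
        sets[(name, "-")] = encode_ab(reverse_complement(seq))
    return sets


reference_sets = build_reference_sets({"species_1": "GATTACA", "species_2": "CCAGT"})
for key, encoded in reference_sets.items():
    print(key, encoded)
```

The reverse complement is taken before encoding, because encoding discards the identity of the non-A bases and the complement could not be recovered afterwards.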
According to a preferred embodiment, when the encoding yields a plurality of sets of missing reference sequences, the missing nucleic acid sequence is aligned to each set separately, and the better alignment result is then selected. The "better alignment result" may be the one with higher alignment quality among the multiple results, the one with a longer aligned portion of the sequence, the one with fewer alignment errors, or the one that aligns to a specific region of the reference sequence.
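One way to select the better alignment result is sketched below. The text lists several acceptable criteria without ranking them, so the priority order used here (alignment quality first, then aligned length, then number of errors) is an assumption, as are the `AlignmentResult` fields and the example values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AlignmentResult:
    reference_set: str   # which encoded reference set the read was aligned to
    mapq: int            # alignment quality reported by the aligner
    aligned_length: int  # length of the aligned portion of the read
    errors: int          # mismatches plus gaps in the alignment


def select_better(results: Iterable[AlignmentResult]) -> AlignmentResult:
    """Pick one result; the priority order is an illustrative assumption."""
    return max(results, key=lambda r: (r.mapq, r.aligned_length, -r.errors))


candidates = [
    AlignmentResult("genome_forward", mapq=30, aligned_length=90, errors=2),
    AlignmentResult("genome_reverse", mapq=30, aligned_length=95, errors=3),
    AlignmentResult("other_species", mapq=12, aligned_length=100, errors=0),
]
print(select_better(candidates).reference_set)  # genome_reverse
```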
According to a preferred embodiment, the missing nucleic acid sequence is aligned to the missing reference sequence to obtain an alignment result. Sequence alignment is a common concept in bioinformatics, and the alignment algorithms and software usable in the present invention include, but are not limited to: the Smith-Waterman algorithm, Bowtie, BWA, SOAP, the Needleman-Wunsch algorithm, Bowtie2, BLAST, ELAND, TMAP, MAQ, minimap2, SHRiMP, etc. The Smith-Waterman algorithm is a classical local alignment algorithm; Bowtie is suited to aligning short sequences to large genomes, and therefore to the short reads obtained by second-generation sequencing; BWA, i.e., the Burrows-Wheeler Alignment Tool, is a software package that aligns sequences with small differences to a larger reference genome and includes three different algorithms, BWA-backtrack, BWA-SW and BWA-MEM; SOAP, i.e., the Short Oligonucleotide Analysis Package, is a sequence alignment package developed by BGI (Huada). It will be appreciated that the method disclosed in the invention places no special requirement on the alignment algorithm or software, and that such algorithms and software are not unique to the invention, so the specific alignment algorithm or software used should not limit the scope of protection of the invention.
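To make the alignment step concrete, here is a small Smith-Waterman local alignment scorer applied to an encoded read and an encoded reference. It is a textbook sketch with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions, and it does not reflect how Bowtie, BWA or the other listed tools work internally.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


encoded_read = "ATAAT"          # e.g. an AB-encoded read
encoded_reference = "TATATAAT"  # e.g. an AB-encoded reference fragment
print(smith_waterman_score(encoded_read, encoded_reference))  # 10
```

Because the read and the reference are encoded with the same rule, an ordinary aligner can compare them directly without knowing that information is missing.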
In the invention, bioinformatics analysis is performed on the sequence alignment result; the bioinformatics analysis includes, but is not limited to, detecting genetic variation, detecting gene expression levels, detecting RNA alternative splicing states, detecting gene modification states, identifying the species or individual from which the nucleic acid is derived, detecting the three-dimensional structure of the genome, detecting nucleic acid-nucleic acid interactions, detecting nucleic acid-protein interactions, detecting chromatin accessibility, resolving RNA structure, and the like. In particular, the method of the invention has a clear advantage in identifying large-scale structural variation among genetic variations. In 1+3 sequencing, no sequencing signal is acquired during the 3-substrate stage and no oil seal is applied; the substrates available to the nucleic acid to be tested in each reaction chamber are therefore relatively sufficient, dephasing is less likely to occur, and in particular the incidence of lagging reactions is greatly reduced. The effective read length obtained is thus increased by about 33% compared with 2+2 sequencing, which creates favorable conditions for discovering large-scale structural variation.
The invention also provides a system for performing comparison by using the missing nucleic acid sequencing information, which comprises a processor, a storage medium and a computer program, wherein the system is used for implementing the method for performing comparison by using the missing nucleic acid sequencing information.
Example 1
Human genomic DNA was sequenced by fluorescence generation sequencing. The following four 1+3 sequencing runs were performed separately: AB, CD, GH and TV. Signals were acquired only when A, C, G or T was added; when B, D, H or V was added, only the chemical reaction was performed and no signal was acquired. When a signal was acquired, the corresponding base was written in the sequence as many times as the number of signal units detected (e.g., if a signal of 3 units was acquired when A was added, AAA was written in the sequence). For the B, D, H and V cycles, in which no signal was acquired, 1 T, 1 G, 1 C and 1 A were used as placeholder characters, respectively. The human reference genome GRCh38 and its reverse complement were each encoded in 4 ways:
AB: all A's are kept unchanged, and every run of 1 or more consecutive B's is changed to 1 T.
CD: all C's are kept unchanged, and every run of 1 or more consecutive D's is changed to 1 G.
GH: all G's are kept unchanged, and every run of 1 or more consecutive H's is changed to 1 C.
TV: all T's are kept unchanged, and every run of 1 or more consecutive V's is changed to 1 A.
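The four reference encodings listed above follow one pattern, which the following Python sketch makes explicit. The kept base and placeholder for each mode are taken from the list; the table name `MODES`, the function name and the short test sequence are hypothetical, and characters other than A, C, G and T are assumed to pass through unchanged.

```python
import re

# Mode -> (base that is kept, placeholder written for each run of the other three).
MODES = {
    "AB": ("A", "T"),
    "CD": ("C", "G"),
    "GH": ("G", "C"),
    "TV": ("T", "A"),
}


def encode_reference_1plus3(ref: str, mode: str) -> str:
    """Encode a reference sequence for one of the four 1+3 sequencing modes."""
    keep, placeholder = MODES[mode]
    others = "".join(base for base in "ACGT" if base != keep)
    # Every run of one or more non-kept bases collapses to a single placeholder.
    return re.sub(f"[{others}]+", placeholder, ref.upper())


for mode in MODES:
    print(mode, encode_reference_1plus3("GATTACAAGC", mode))
```

For the sequence `GATTACAAGC` this prints `TATATAAT`, `GCGC`, `GCGC` and `ATTA` for AB, CD, GH and TV respectively, which shows how much shorter the encoded reference can become when the kept base is rare in a region.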
Taking AB sequencing as an example, the specific sequencing procedure is described as follows:
A sequencing reaction mother solution (mother solution for short) was prepared, containing:
20 mM Tris-HCl, pH 8.8
10 mM (NH4)2SO4
50 mM KCl
2 mM MgSO4
0.1% Tween 20
8000 unit/mL Bst polymerase
100 unit/mL CIP
Two sequencing reaction solutions were prepared, as follows:
A. Mother solution + 20 μM dA4P-TG
B. Mother solution + 20 μM dG4P-TG + 20 μM dT4P-TG + 20 μM dC4P-TG
The prepared reaction solutions and the mother solution were kept in a refrigerator at 4 ℃ or on ice until use.
Hybridization of the sequencing primer:
The sequencing chip was filled with sequencing primer solution (10 μM, dissolved in 1× SSC buffer), heated to 90 ℃, and then cooled to 40 ℃ at a rate of 5 ℃ per minute. The sequencing primer solution was then rinsed away with washing liquid.
Sequencing:
The sequencing chip was placed on the sequencer, and sequencing proceeded according to the following procedure.
1. Introduce 10 mL of washing liquid to wash the chip;
2. Cool the chip to 4 ℃;
3. Introduce 100 μL of reaction solution A and apply the oil seal;
4. Heat the chip to 65 ℃;
5. Wait 1 min;
6. Signal acquisition: excite with a 473 nm laser and capture a fluorescence image;
7. Introduce 10 mL of washing liquid to wash the chip;
8. Cool the chip to 4 ℃;
9. Introduce 100 μL of reaction solution B;
10. Heat the chip to 65 ℃;
11. Wait 1 min.
Steps 1-11 were repeated several times.
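The repeated procedure above can be modelled as a simple schedule in which only one of the eleven steps acquires a signal. The sketch below is purely illustrative: the `Step` type and function names are hypothetical, the step descriptions paraphrase the list, and no instrument control interface is implied.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    action: str
    acquire_signal: bool


def ab_round() -> List[Step]:
    """One pass through steps 1-11: solution A with imaging, then solution B without."""
    half_a = [
        Step("wash chip with 10 mL washing liquid", False),
        Step("cool chip to 4 C", False),
        Step("introduce 100 uL reaction solution A and apply oil seal", False),
        Step("heat chip to 65 C", False),
        Step("wait 1 min", False),
        Step("excite at 473 nm and capture fluorescence image", True),
    ]
    half_b = [
        Step("wash chip with 10 mL washing liquid", False),
        Step("cool chip to 4 C", False),
        Step("introduce 100 uL reaction solution B (no oil seal, no imaging)", False),
        Step("heat chip to 65 C", False),
        Step("wait 1 min", False),
    ]
    return half_a + half_b


def schedule(rounds: int) -> List[Step]:
    """Repeat the 11-step round the requested number of times."""
    return [step for _ in range(rounds) for step in ab_round()]


plan = schedule(3)
print(len(plan), sum(step.acquire_signal for step in plan))  # 33 3
```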
Each of the 4 sequencing runs yielded 1×10^6 sequences, which were aligned to the corresponding 4 encoded reference genomes; the total alignment rate, unique alignment rate, Q20 alignment rate and Q30 alignment rate were then computed, with the following results.
As can be seen from Table 2, when the number of sequencing cycles is 30 or more, the total alignment rate of all 4 kinds of 1+3 sequencing reaches 100%, which shows that the alignment method using missing nucleic acid sequencing information according to the invention performs well. For fewer than 30 sequencing cycles the total alignment rate is low, but in practical sequencing applications the number of sequencing cycles is at least several tens to several hundreds.
TABLE 2 Total alignment rate
Number of cycles AB CD GH TV
5 0.03% 0.00% 0.00% 0.03%
10 0.18% 0.00% 0.00% 0.18%
15 1.19% 0.09% 0.09% 1.20%
20 11.30% 2.49% 2.50% 11.62%
25 61.90% 40.53% 40.95% 62.41%
30 100.00% 100.00% 100.00% 100.00%
35 100.00% 100.00% 100.00% 100.00%
40 100.00% 100.00% 100.00% 100.00%
45 100.00% 100.00% 100.00% 100.00%
50 100.00% 100.00% 100.00% 100.00%
55 100.00% 100.00% 100.00% 100.00%
60 100.00% 100.00% 100.00% 100.00%
65 100.00% 100.00% 100.00% 100.00%
70 100.00% 100.00% 100.00% 100.00%
75 100.00% 100.00% 100.00% 100.00%
80 100.00% 100.00% 100.00% 100.00%
85 100.00% 100.00% 100.00% 100.00%
90 100.00% 100.00% 100.00% 100.00%
95 100.00% 100.00% 100.00% 100.00%
100 100.00% 100.00% 100.00% 100.00%
125 100.00% 100.00% 100.00% 100.00%
150 100.00% 100.00% 100.00% 100.00%
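The four statistics reported for this example could be computed from per-read alignment results along the following lines. The patent does not define the statistics precisely, so the definitions used here are assumptions: a read counts as uniquely aligned when it has exactly one alignment location, and as Q20 or Q30 aligned when its mapping quality is at least 20 or 30. The `ReadResult` fields and the example values are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ReadResult(NamedTuple):
    aligned: bool     # whether the read aligned to the encoded reference at all
    n_locations: int  # number of alignment locations reported for the read
    mapq: int         # mapping quality reported by the aligner


def alignment_rates(reads: List[ReadResult]) -> Dict[str, float]:
    """Fractions of all reads that are aligned, uniquely aligned, Q20 and Q30."""
    total = len(reads)
    if total == 0:
        raise ValueError("at least one read is required")
    aligned = [r for r in reads if r.aligned]
    return {
        "total": len(aligned) / total,
        "unique": sum(r.n_locations == 1 for r in aligned) / total,
        "q20": sum(r.mapq >= 20 for r in aligned) / total,
        "q30": sum(r.mapq >= 30 for r in aligned) / total,
    }


reads = [
    ReadResult(True, 1, 42),
    ReadResult(True, 3, 0),
    ReadResult(True, 1, 25),
    ReadResult(False, 0, 0),
]
print(alignment_rates(reads))
```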
As can be seen from Table 3, when the number of sequencing cycles is 70 or more, the unique alignment rate of all 4 kinds of 1+3 sequencing reaches 70% or more, and the unique alignment rates of AB and TV sequencing reach 85% or more, which shows that the alignment method using missing nucleic acid sequencing information according to the invention performs well.
TABLE 3 unique alignment rate
As can be seen from Tables 4 and 5, when the number of sequencing cycles is 90 or more, the Q20 and Q30 alignment rates of all 4 kinds of 1+3 sequencing reach 70% or more, and the Q20 and Q30 alignment rates of AB and TV sequencing reach 85% or more, which shows that the alignment method using missing nucleic acid sequencing information according to the invention performs well.
TABLE 4 Q20 alignment rate
TABLE 5 Q30 alignment rate
The foregoing description of the preferred embodiments is not intended to limit the invention; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope of protection.

Claims (11)

1. A method for alignment using missing nucleic acid sequencing information, comprising the steps of:
performing a plurality of sequencing chemical reaction cycles on the nucleic acid molecule to be tested using a gene sequencing chip; wherein at least one cycle is preselected, and signal acquisition is performed only in the at least one cycle; in the other cycles, only the sequencing chemical reaction is performed and no signal acquisition is performed; encoding the acquired signals into a sequence to obtain a missing nucleic acid sequence; encoding a reference sequence corresponding to the nucleic acid molecule to be tested in the same encoding manner to obtain a missing reference sequence; aligning the missing nucleic acid sequence to the missing reference sequence; wherein a sequencing reagent is provided, a nucleotide monomer with a detectable label in the sequencing reagent is incorporated into the nucleic acid molecule to be tested to carry out the sequencing chemical reaction, and the sequencing signal generated by the detectable label is either acquired or not acquired;
wherein the sequencing chemical reaction is a fluorescence generation sequencing reaction selected from one of the following two sequencing methods:
(1) The fluorescence generation sequencing is a sequencing reaction in which the 3' end is not blocked, and the sequencing reaction includes two different sequencing reagents: a first sequencing reagent and a second sequencing reagent; the two sequencing reagents are added cyclically; wherein the first sequencing reagent comprises two different nucleotide monomers having a detectable label; the second sequencing reagent comprises two nucleotide monomers having a detectable label, and the two nucleotide monomers are different from the nucleotide monomers present in the first sequencing reagent, and wherein the second sequencing reagent is provided after the first sequencing reagent is provided, the label generating a detectable signal upon incorporation of the nucleotide monomers into the nucleic acid sequence to be tested;
(2) The fluorescence generation sequencing is a sequencing reaction in which the 3' end is not blocked, and the sequencing reaction includes two different sequencing reagents: a first sequencing reagent and a second sequencing reagent; the two sequencing reagents are added cyclically; wherein the first sequencing reagent comprises three different nucleotide monomers having a detectable label; the second sequencing reagent comprises one nucleotide monomer having a detectable label, and the nucleotide monomer is different from the nucleotide monomers present in the first sequencing reagent, and wherein the second sequencing reagent is provided after the first sequencing reagent is provided, the label generating a detectable signal upon incorporation of the nucleotide monomer into the nucleic acid sequence to be tested.
2. The method of claim 1, wherein signal acquisition of the detectable signal is performed after the first sequencing reagent is provided, and no signal acquisition is performed after the second sequencing reagent is provided.
3. The method of claim 1, wherein signal acquisition of the detectable signal is performed after the second sequencing reagent is provided, and no signal acquisition is performed after the first sequencing reagent is provided.
4. The method according to claim 2 or 3, wherein an oil seal is required in each sequencing chemical reaction cycle in which signal acquisition is performed; an oil seal is optionally applied in the sequencing chemical reaction cycles without signal acquisition; the oil seal means that an aqueous-phase fluid is first introduced into the fluid chamber of the chip through a fluid inlet, and an oil-phase fluid is then introduced to displace the aqueous-phase fluid from the fluid chamber while part of the aqueous-phase fluid is sealed in the micro-reaction chambers on the chip surface, forming mutually isolated reaction units.
5. The method of claim 1, wherein encoding the acquired signals into a sequence means that, for a sequencing cycle in which signal acquisition is performed, the sequence is represented by the base symbols corresponding to the signals, the number of base symbols corresponding to the intensity of the signals; and for cycles in which no signal acquisition is performed, a single placeholder is written at the corresponding position in the sequence.
6. The method of claim 5, wherein the base and the placeholder are one or more of A, G, C, T/U and the base and the placeholder are different.
7. The method of claim 1, wherein the reference sequence is a reference genome, or a reference transcriptome, or a subset of a reference genome, or a subset of a reference transcriptome.
8. The method of claim 1, wherein the missing reference sequence is one or more sets of sequences; when the missing reference sequence is a plurality of sets of sequences, the missing nucleic acid sequence is aligned to each set of missing reference sequences separately, and the better alignment result is selected from among them; the "better alignment result" may be the one with higher alignment quality among the multiple results, the one with a longer aligned portion of the sequence, the one with fewer alignment errors, or the one that aligns to a specific region of the reference sequence.
9. The method of claim 8, wherein the missing nucleic acid sequence is aligned to the missing reference sequence using software or algorithms including, but not limited to, the Smith-Waterman algorithm, Bowtie, BWA, SOAP, the Needleman-Wunsch algorithm, Bowtie2, BLAST, ELAND, TMAP, MAQ, minimap2 and SHRiMP, to obtain a sequence alignment result.
10. The method of claim 9, further comprising performing a bioinformatic analysis on the sequence alignment; the bioinformatics analysis comprises detecting genetic variation, detecting gene expression quantity, detecting RNA alternative splicing state, detecting gene modification state, identifying species or individuals from which nucleic acid is derived, detecting three-dimensional structure of genome, detecting interaction between nucleic acid and nucleic acid, detecting interaction between nucleic acid and protein, detecting accessibility of chromatin and analyzing RNA structure.
11. A system for alignment using missing nucleic acid sequencing information, characterized in that: comprising a processor, a storage medium, a computer program, the system being used to implement the method of alignment using missing nucleic acid sequencing information as claimed in any of claims 1 to 10.
CN202210104037.3A 2022-01-28 2022-01-28 Method and system for performing comparison by using missing nucleic acid sequencing information Active CN114540471B (en)

Publications (2)

Publication Number Publication Date
CN114540471A CN114540471A (en) 2022-05-27
CN114540471B true CN114540471B (en) 2024-05-14

Family

ID=81674038

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102203292A (en) * 2008-10-29 2011-09-28 南克森制药公司 Sequencing of nucleic acid molecules by mass spectrometry
WO2018089567A1 (en) * 2016-11-10 2018-05-17 Life Technologies Corporation Methods, systems and computer readable media to correct base calls in repeat regions of nucleic acid sequence reads
CN108165616A (en) * 2016-12-01 2018-06-15 北京大学 A kind of method and system for the identification that is compared and makes a variation using fuzzy nucleic acid sequencing information
CN111667882A (en) * 2016-12-01 2020-09-15 赛纳生物科技(北京)有限公司 Method for comparing sequencing fuzzy sequence information
CN113281324A (en) * 2021-06-29 2021-08-20 江苏大学 Characteristic information extraction method of small molecule volatile matter and portable detection system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
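Csuroes, M. et al. Fast Mapping and Precise Alignment of AB SOLiD Color Reads to Reference DNA. Algorithms in Bioinformatics. 2010, Vol. 6293, 176-187. *
Maximum likelihood imputation of nucleotides at missing sites in homologous DNA sequence alignments; Pan Kemai et al.; Journal of Fujian Agriculture and Forestry University (Natural Science Edition); Vol. 50 (No. 4); 570-576 *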

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant