CN113186255A - Method and device for detecting nucleotide variation based on single molecule sequencing - Google Patents

Method and device for detecting nucleotide variation based on single molecule sequencing

Publication number
CN113186255A
Authority
CN
China
Prior art keywords
single nucleotide
nucleotide variation
sequencing reads
sequencing
reads
Legal status
Pending
Application number
CN202110517844.3A
Other languages
Chinese (zh)
Inventor
李世勇
茅矛
Current Assignee
Shenzhen Siqin Medical Technology Co ltd
Original Assignee
Shenzhen Siqin Medical Technology Co ltd
Application filed by Shenzhen Siqin Medical Technology Co ltd
Priority to CN202110517844.3A
Publication of CN113186255A

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6827 Hybridisation assays for detection of mutation or polymorphism
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Hospice & Palliative Care (AREA)
  • Oncology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for determining single nucleotide variation in a nucleic acid sample. The method comprises the following steps: performing bidirectional low-depth sequencing on the nucleic acid sample to obtain sequencing data of the nucleic acid sample, wherein the sequencing data comprises forward sequencing reads and reverse sequencing reads of the nucleic acid sample; based on the sequencing data, performing alignment processing on the forward sequencing reads and the reverse sequencing reads to obtain overlapping regions of the sequencing reads; obtaining the single nucleotide variation based on the overlapping region. The method can be used for detecting and screening single nucleotide variation in a population based on large sample data, and has the advantages of low cost and high efficiency.

Description

Method and device for detecting nucleotide variation based on single molecule sequencing
Technical Field
The invention relates to the field of gene sequencing, in particular to a method and a device for detecting nucleotide variation based on single-molecule sequencing.
Background
SNP generally refers to single nucleotide polymorphism, that is, DNA sequence polymorphism at the genomic level caused by variation of a single nucleotide. It is the most common type of heritable human variation, accounting for over 90% of all known polymorphisms. SNPs are widespread in the human genome, occurring on average once in every 300 base pairs, with an estimated total of 3 million. These SNPs are closely tied to human life: different genotypes at certain sites of alcohol dehydrogenase and acetaldehyde dehydrogenase determine whether a person can tolerate alcohol and is prone to alcoholic liver disease; different genotypes also affect the underlying probability of disease, for example certain genotypes of the BRCA gene increase the probability of breast and ovarian cancer in women several-fold to tens-fold. Genotypes can also be used for paternity testing. The same holds for other species, for example tracing the origin of pet dogs. SNVs (single nucleotide variations) are mutations that arise in somatic cells; they are acquired rather than inherited, and many diseases arise and progress because of SNV mutations, such as the L858R mutation of the EGFR gene in lung cancer. At present, SNP/SNV detection relies on high-depth sequencing platforms and complex mutation-calling algorithms for nucleic acid sequencing data, which commonly suffer from excessive cost and are unsuitable for large-sample screening of SNVs/SNPs and their disease associations.
Therefore, there is a need for a method for determining single nucleotide variations in a nucleic acid sample with low cost and high accuracy.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
The present inventors have found in their research that the methods currently used to determine nucleotide variations are costly and algorithmically complex, and are unsuitable either for screening large sample sets for new single nucleotide variations or for analyzing, across large sample sets, the association between single nucleotide variations and various diseases, such as the association between single nucleotide variations and disease in large breast cancer or ovarian cancer populations.
To this end, the inventors provide a method of determining single nucleotide variations in a nucleic acid sample, an apparatus for determining single nucleotide variations in a nucleic acid sample, a computer readable medium and an electronic device. The method sequences the nucleic acid sample at low depth and analyzes the resulting data with a simple algorithm. This greatly reduces the detection cost and simplifies the algorithm while preserving the accuracy of the result: single nucleotide variations obtained with the method can reach an accuracy of more than eighty percent.
The invention provides a method for determining single nucleotide variation in a nucleic acid sample and a device for determining the single nucleotide variation in the nucleic acid sample.
Specifically, the invention provides the following technical scheme:
In a first aspect of the invention, a method of determining a single nucleotide variation in a nucleic acid sample is provided. According to an embodiment of the invention, the method comprises: performing bidirectional low-depth sequencing on the nucleic acid sample to obtain sequencing data of the nucleic acid sample, wherein the sequencing data comprises forward sequencing reads and reverse sequencing reads of the nucleic acid sample; based on the sequencing data, performing a first alignment processing of each of the forward sequencing reads and the reverse sequencing reads independently against a reference sequence to obtain the position of each sequencing read on the reference sequence and the overlapping region of the sequencing reads; and obtaining the single nucleotide variation based on the overlapping region. According to the embodiment of the invention, bidirectional sequencing data are obtained by low-depth sequencing and compared with the reference sequence, so that single nucleotide variations can be obtained with an accuracy of more than 80 percent. This greatly reduces sequencing and analysis costs: a simplified analysis algorithm combined with low-cost, low-depth sequencing achieves high-precision detection of single nucleotide variations.
According to an embodiment of the present invention, the method described above may further include the following technical features:
According to an embodiment of the invention, the overlapping region is not less than 10 bp in length, measured on the forward sequencing read or the reverse sequencing read. According to an embodiment of the invention, the forward sequencing reads and the reverse sequencing reads are paired, and the overlapping region is the region over which the two reads of a pair overlap on the reference chromosome. Within the overlapping region, the two sequencing reads may carry different nucleotide types at the same site, or they may carry the same nucleotide at every site. An overlapping region of not less than 10 bp can be regarded as a target overlapping region.
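The overlap criterion can be illustrated with a minimal sketch. It assumes that each aligned read is represented as a half-open (start, end) interval on the same reference chromosome; this representation and the function name `overlap_region` are illustrative assumptions, not part of the disclosed method.

```python
from typing import Optional, Tuple

MIN_OVERLAP_BP = 10  # minimum overlap length stated in the text

Interval = Tuple[int, int]  # half-open [start, end) on the reference (assumed)


def overlap_region(fwd: Interval, rev: Interval) -> Optional[Interval]:
    """Return the target overlapping region of a read pair, or None.

    The overlap is the intersection of the two aligned intervals; it is
    kept only when it is at least MIN_OVERLAP_BP long.
    """
    start = max(fwd[0], rev[0])
    end = min(fwd[1], rev[1])
    if end - start >= MIN_OVERLAP_BP:
        return (start, end)
    return None
```

For example, a forward read aligned at [100, 150) and a reverse read aligned at [130, 180) share a 20 bp target overlapping region, whereas reads at [100, 150) and [145, 200) overlap by only 5 bp and are not retained.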
According to an embodiment of the invention, the nucleotides of the forward sequencing read and the reverse sequencing read are identical at each site of the overlapping region. According to embodiments of the invention, only overlapping regions, and the sequencing reads containing them, in which the forward and reverse reads carry identical nucleotides at every site are retained for single nucleotide variation analysis. The inventors have found through extensive research that, in large-sample single nucleotide variation analysis based on low-depth sequencing data, a site in the overlapping region whose nucleotide type differs between the forward and reverse sequencing reads does not need to be processed with a more complex algorithm, such as the Bayesian model used in conventional analysis. The inventors found that such sites are unlikely to carry a true single nucleotide variation and can simply be discarded, which simplifies the procedure without greatly affecting the reliability of single nucleotide variation detection.
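The concordance filter described here can be sketched as follows. The sketch assumes that the bases of each read have already been restricted to the overlapping region and that the reverse read has been converted to the forward-strand orientation; the function names are hypothetical.

```python
from typing import List, Tuple


def reads_agree_in_overlap(fwd_bases: str, rev_bases: str) -> bool:
    """True when both reads carry the same base at every overlap site.

    Both strings cover the same overlapping region in the same
    orientation (an assumption of this sketch).
    """
    if len(fwd_bases) != len(rev_bases):
        raise ValueError("overlap slices must have equal length")
    return fwd_bases.upper() == rev_bases.upper()


def keep_concordant_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Discard read pairs that disagree anywhere in the overlap.

    No probabilistic model is applied: discordant pairs are simply dropped.
    """
    return [p for p in pairs if reads_agree_in_overlap(p[0], p[1])]
```

The design choice mirrored here is that disagreement is treated as grounds for discarding the pair rather than as evidence to be weighed.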
According to an embodiment of the invention, the method further comprises: aligning at least a portion of the overlapping region of the forward sequencing reads and the reverse sequencing reads with the reference sequence to obtain the single nucleotide variation. When the forward and reverse sequencing reads are each aligned to the reference sequence, the portions of the two reads that align to the same stretch of the reference form the overlapping region. The nucleotides in the overlapping region of the forward and reverse sequencing reads, either some or all of them, are compared with the reference sequence; when a sequencing read and the reference sequence carry different nucleotides at the same site, that site may harbor a single nucleotide variation and requires further screening and judgment.
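A minimal sketch of this comparison step is shown below. It assumes the overlap has already passed the concordance check, so a single string (`overlap_bases`, a hypothetical name) holds the bases shared by both reads, and that the reference bases over the same interval are available as a string.

```python
from typing import List, Tuple


def candidate_sites(overlap_start: int, overlap_bases: str,
                    reference_bases: str) -> List[Tuple[int, str, str]]:
    """List (position, reference base, read base) for every mismatch.

    overlap_start is the reference coordinate of the first overlap base.
    Each returned site only may carry a single nucleotide variation and
    still needs further screening.
    """
    if len(overlap_bases) != len(reference_bases):
        raise ValueError("read and reference slices must have equal length")
    sites = []
    for i, (ref, alt) in enumerate(zip(reference_bases.upper(),
                                       overlap_bases.upper())):
        if ref != alt:
            sites.append((overlap_start + i, ref, alt))
    return sites
```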
According to an embodiment of the invention, the alignment is performed by:
determining a pending single nucleotide variation on at least a portion of the overlapping region of the forward sequencing reads and the reverse sequencing reads, the pending single nucleotide variation being a nucleotide on at least a portion of the overlapping region that differs from the nucleotide on the reference sequence; the pending single nucleotide variation not being at the same site as a predicted (known) single nucleotide variation is an indication that the pending single nucleotide variation is a target single nucleotide variation. According to an embodiment of the present invention, the overlapping region is aligned with the reference sequence; a site at which the reads of the overlapping region carry a nucleotide type different from the reference sequence is a possible single nucleotide variation site, and the nucleotide type carried by the reads at that site is the variant nucleotide type. The site is then queried for known nucleotide variation types. If the site has a known nucleotide variation type that is the same as the nucleotide type in the overlapping region, the site and its nucleotide variation type are judged to be a known single nucleotide variation; if the site has a known nucleotide variation type that differs from the nucleotide type in the overlapping region, the nucleotide type at that site in the overlapping region is judged not to be a single nucleotide variation. The inventors have found through extensive study that, where a site already has a known nucleotide variation, a new nucleotide variation type detected in the overlapping region can be directly judged as a non-single nucleotide variation without further analysis by a complicated algorithm. Such a call is very likely not a true single nucleotide variation, so judging it directly does not greatly affect detection accuracy but greatly reduces algorithmic complexity.
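The decision rule for a pending variation can be sketched as a simple lookup. The dictionary `known_variants`, mapping a reference position to the set of variant bases already recorded there, is a hypothetical stand-in for whatever variant database is queried; the source does not name one.

```python
from typing import Dict, Set


def classify_pending(position: int, alt_base: str,
                     known_variants: Dict[int, Set[str]]) -> str:
    """Classify a pending single nucleotide variation.

    Returns one of:
      "target"   - no known variation at this site (candidate new SNV)
      "known"    - the site has a known variation with the same base
      "rejected" - the site has a known variation with a different base
    """
    known_alts = known_variants.get(position)
    if known_alts is None:
        return "target"
    if alt_base.upper() in known_alts:
        return "known"
    return "rejected"
```

Note that the third branch rejects the call outright instead of estimating its probability, which is the simplification the text describes.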
According to an embodiment of the invention, the pending single nucleotide variation is not less than 9 bp from the ends of the forward sequencing read and the reverse sequencing read. The inventors found that when the pending nucleotide variation site lies less than 9 bp from either end (upstream or downstream) of the forward sequencing read, it is not judged to be a single nucleotide variation. Similarly, when the site lies less than 9 bp from either end (upstream or downstream) of the reverse sequencing read, it is not judged to be a single nucleotide variation. Single nucleotide variations determined in this way have high accuracy, and the judgment is direct and simple.
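A minimal sketch of the read-end filter follows. How the distance is counted is not specified in the text; this sketch assumes a 0-based offset of the site within the read and measures the distance as the number of positions separating the site from the terminal base. The function names are hypothetical.

```python
MIN_END_DISTANCE_BP = 9  # threshold stated in the text


def distance_to_nearest_end(offset: int, read_length: int) -> int:
    """Distance from a 0-based read offset to the closer read end."""
    if not 0 <= offset < read_length:
        raise ValueError("offset must lie within the read")
    return min(offset, read_length - 1 - offset)


def passes_end_filter(fwd_offset: int, fwd_length: int,
                      rev_offset: int, rev_length: int,
                      min_distance: int = MIN_END_DISTANCE_BP) -> bool:
    """True only if the site is far enough from both ends of both reads."""
    return (distance_to_nearest_end(fwd_offset, fwd_length) >= min_distance
            and distance_to_nearest_end(rev_offset, rev_length) >= min_distance)
```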
According to an embodiment of the invention, the method further comprises: performing a second alignment processing of the forward sequencing read and the reverse sequencing read carrying the pending nucleotide variation, each independently, against the reference sequence to obtain an overlapping region of the first sequencing read and the second sequencing read with the reference sequence, wherein the alignment software used for the second alignment processing differs from that used for the first alignment processing, and the overlapping region from the second alignment processing being the same as the overlapping region from the first alignment processing is an indication that the pending single nucleotide variation is a target single nucleotide variation. According to an embodiment of the present invention, if, for example, the first alignment is performed with BWA, the second alignment may be performed with any one of SOAP2, Bowtie, GATK index reading. In addition, a third alignment, a fourth alignment, and so on can be performed, each using different alignment software. The alignment software is not particularly limited, provided it can perform sequence alignment, and the order in which the alignment software is used is not particularly limited either.
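The cross-aligner check can be sketched without invoking any aligner. The sketch assumes each aligner's result has already been reduced to the overlapping region of the read pair, expressed as a (start, end) interval or None when no overlap was found; the function name `confirmed_by_realignment` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[int, int]  # overlap interval on the reference (assumed)


def confirmed_by_realignment(overlaps: Sequence[Optional[Interval]]) -> bool:
    """True when every later alignment reproduces the first overlap.

    overlaps[0] comes from the first alignment; the remaining entries
    come from the second, third, ... alignments made with other software.
    """
    if len(overlaps) < 2:
        raise ValueError("need results from at least two aligners")
    first = overlaps[0]
    return first is not None and all(o == first for o in overlaps[1:])
```

Because only equality of the intervals is tested, the order in which the aligners are run does not affect the outcome.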
According to an embodiment of the invention, when the forward sequencing read and the reverse sequencing read carry different nucleotide types at the same site in the overlapping region, this is an indication that the sequencing data is not processed further. According to an embodiment of the present invention, not processing further means discarding the forward sequencing read and the reverse sequencing read and performing no subsequent single nucleotide variation analysis on the sequencing results of the two reads.
In a second aspect, the present invention provides an apparatus for determining single nucleotide variations in a nucleic acid sample. According to an embodiment of the invention, the apparatus comprises: a sequencing data module configured to perform bidirectional low-depth sequencing on the nucleic acid sample to obtain sequencing data of the nucleic acid sample, the sequencing data comprising forward sequencing reads and reverse sequencing reads of the nucleic acid sample; an overlap region acquisition module configured to perform, based on the sequencing data, a first alignment processing of each of the forward sequencing reads and the reverse sequencing reads independently against a reference sequence to obtain the position of the sequencing reads on the reference sequence and the overlapping region of the sequencing reads; and a single nucleotide variation acquisition module configured to obtain the single nucleotide variation based on the overlapping region. According to the embodiment of the invention, each part of the apparatus is configured according to the method described above, and the apparatus can be used to determine single nucleotide variations in a nucleic acid sample rapidly and accurately.
According to an embodiment of the present invention, the apparatus may further include the following technical features:
According to an embodiment of the invention, the overlapping region is not less than 10 bp in length, measured on the forward sequencing read or the reverse sequencing read.
According to an embodiment of the invention, the nucleotides of the forward sequencing read and the reverse sequencing read are identical at each site of the overlapping region.
According to an embodiment of the invention, the apparatus further comprises: a reference sequence alignment unit configured to align at least a portion of the overlapping region with the reference sequence based on the forward sequencing reads and the reverse sequencing reads to obtain the single nucleotide variation.
According to an embodiment of the invention, the apparatus further comprises: a pending single nucleotide variation determination unit configured to determine a pending single nucleotide variation on at least part of the overlapping region of the forward sequencing read and the reverse sequencing read, the pending single nucleotide variation being a nucleotide on at least part of the overlapping region that differs from the nucleotide on the reference sequence; the pending single nucleotide variation not being at the same site as a predicted (known) single nucleotide variation is an indication that the pending single nucleotide variation is a target single nucleotide variation.
According to an embodiment of the invention, the pending single nucleotide variation is not less than 9 bp from the ends of the forward sequencing read and the reverse sequencing read.
According to an embodiment of the invention, the apparatus further comprises: a second alignment processing unit configured to perform a second alignment processing on the forward sequencing read and the reverse sequencing read, where the pending nucleotide variation is located, and the reference sequence independently from each other so as to obtain an overlapping region of the first sequencing read and the second sequencing read with the reference sequence, wherein an alignment software used for the second alignment processing is different from an alignment software used for the first alignment processing, and the overlapping region of the second alignment processing is the same as the overlapping region of the first alignment processing, which is an indication that the pending single nucleotide variation is a target single nucleotide variation.
According to an embodiment of the invention, when the forward sequencing read and the reverse sequencing read carry different nucleotide types at the same site in the overlapping region, this is an indication that the sequencing data is not processed further.
In a third aspect of the invention, a method of determining the genotype of a test sample is provided. According to an embodiment of the invention, the method comprises: obtaining the single nucleotide variation of the test sample, wherein the single nucleotide variation is obtained by the method or the apparatus described above; and obtaining the genotype of the test sample using the single nucleotide variation. According to the embodiment of the invention, the genotype of the test sample can be determined from its single nucleotide variations, and the genotype so determined can be used for: 1) genome-wide association studies that look for new genotypes and sites associated with certain diseases and phenotypes (such as increase); 2) disease risk and phenotype prediction based on those findings, for example: different genotypes at certain sites of alcohol dehydrogenase and acetaldehyde dehydrogenase determine whether a person can tolerate alcohol and is prone to alcoholic liver disease, and certain genotypes of the BRCA gene increase the probability of breast and ovarian cancer in women several-fold to tens-fold; 3) paternity testing and the like; 4) the same applications in other species, such as lineage tracing of pet dogs.
In a fourth aspect of the invention, an apparatus for determining the genotype of a test sample is provided. According to an embodiment of the invention, the apparatus comprises: a single nucleotide variation acquisition apparatus comprising the device set forth in the second aspect of the present invention; and a genotype acquisition device configured to acquire a genotype of the test sample using the obtained single nucleotide variation.
In a fifth aspect of the invention, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first or third aspect of the invention.
In a sixth aspect of the present invention, the present invention provides an electronic apparatus comprising: a computer-readable storage medium according to a fifth aspect of the invention; and one or more processors for executing the program in the computer-readable storage medium.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram of a method for determining single nucleotide variations in a nucleic acid sample according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an apparatus for determining single nucleotide variations in a nucleic acid sample according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the different scenarios that arise when the sense strand and the antisense strand of a nucleic acid are sequenced according to an embodiment of the invention, wherein the antisense read is in fact the reverse complement of the sense strand and, for clarity of the figure, is shown after reverse-complement conversion so that it reads in the same orientation as the forward strand sequence.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
It is noted that herein, "low depth sequencing" means sequencing at less than 5X depth.
It is noted that herein, "high depth sequencing" refers to sequencing depths greater than 30X.
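The two depth definitions can be illustrated with a short sketch. It uses the standard estimate of mean sequencing depth (number of reads times read length divided by genome size); the function names and the label "intermediate" for depths between the two thresholds are assumptions of this sketch, not terms from the source.

```python
LOW_DEPTH_MAX = 5.0    # "low depth": less than 5X, as defined in the text
HIGH_DEPTH_MIN = 30.0  # "high depth": greater than 30X, as defined in the text


def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Estimate mean depth as total sequenced bases over genome size."""
    return n_reads * read_length / genome_size


def depth_class(depth: float) -> str:
    """Label a depth using the two thresholds defined in the text."""
    if depth < LOW_DEPTH_MAX:
        return "low"
    if depth > HIGH_DEPTH_MIN:
        return "high"
    return "intermediate"  # label assumed; the text does not name this range
```

As a hypothetical example, 30 million reads of 100 bp over a 3 Gb genome give a mean depth of 1X, which falls in the low-depth range.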
It is noted that, herein, the nucleic acid sample may be taken from tissues, cells, organs, blood, body fluids, and the like.
In order to reduce the cost of detecting single nucleotide variations, simplify the detection algorithm, and make it possible to detect and screen single nucleotide variations and analyze their association with disease across large populations, the inventors use low-depth sequencing to detect and determine single nucleotide variations, which facilitates the discovery and use of new single nucleotide variations.
Thus, in one aspect of the invention, a method of determining single nucleotide variations in a nucleic acid sample is provided. According to an embodiment of the invention, with reference to FIG. 1, the method comprises: S100, performing bidirectional low-depth sequencing on the nucleic acid sample to obtain sequencing data of the nucleic acid sample, wherein the sequencing data comprises a forward sequencing read and a reverse sequencing read of the nucleic acid sample; S200, based on the sequencing data, performing a first alignment processing of each of the forward sequencing reads and the reverse sequencing reads independently against a reference sequence to obtain the overlapping region of the sequencing reads with the reference sequence; S300, obtaining the single nucleotide variation based on the overlapping region. According to the embodiment of the invention, bidirectional sequencing data are obtained by low-depth sequencing and compared with the reference sequence, so that single nucleotide variations can be obtained with an accuracy of more than 80%. This greatly reduces sequencing and analysis costs: a simplified analysis algorithm combined with low-cost, low-depth sequencing achieves high-accuracy detection of single nucleotide variations.
According to an embodiment of the invention, the overlapping region is not less than 10 bp in length, measured on the forward sequencing read or the reverse sequencing read. According to an embodiment of the invention, the forward sequencing reads and the reverse sequencing reads are paired, and the overlapping region is the region over which the two reads of a pair overlap on the reference chromosome. Within the overlapping region, the two sequencing reads may carry different nucleotide types at the same site, or they may carry the same nucleotide at every site. An overlapping region of not less than 10 bp can be regarded as a target overlapping region.
According to a specific embodiment of the invention, suppose an SNV is present at a site where the reference base is A and the known mutation is C. Depending on the base type at the site in the forward and reverse sequencing reads, the read pairs can be divided into five categories:
a) Reads 1 and 2 from one fragment are identical and match the reference base, both being A.
b) Reads 1 and 2 from one fragment are identical and match the known mutation, both being C.
c) Reads 1 and 2 are identical but match neither the reference base nor the known mutation, both being T.
d) Reads 1 and 2 have different bases at the same site, one being T and the other A.
e) Reads 1 and 2 cover the same site, one matching the reference base (A) and the other not (G).
Of the five categories, categories c, d and e are not processed further. In particular, within the overlapping region, the forward sequencing read and the reverse sequencing read having different nucleotide types at the same site is an indication that the sequencing data are not to be processed further. According to an embodiment of the present invention, not processing further means discarding the forward sequencing read and the reverse sequencing read, so that no subsequent single nucleotide variation analysis is performed on the sequencing results of these two reads.
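The categorization above can be sketched as follows. This is a simplified illustration, not the implementation of the disclosure: categories d and e are merged into a single "discordant" label because both are discarded, and the function and label names are assumptions introduced here.

```python
def classify_pair(base_read1, base_read2, ref_base, known_alt):
    """Classify the bases observed by read 1 and read 2 at one site."""
    if base_read1 != base_read2:
        return "discordant"   # categories d and e: the two reads disagree
    if base_read1 == ref_base:
        return "ref"          # category a: both reads match the reference
    if base_read1 == known_alt:
        return "alt"          # category b: both reads match the known mutation
    return "other"            # category c: concordant, but a third base


def keep_for_analysis(category):
    """Only categories a and b are carried forward; c, d and e are dropped."""
    return category in {"ref", "alt"}
```

With reference base A and known mutation C, the pair (C, C) is kept, while (T, T), (T, A) and (A, G) are all discarded.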
According to an embodiment of the invention, within the overlapping region the nucleotides in the forward sequencing read and the reverse sequencing read are the same at each site. According to embodiments of the invention, only overlapping regions, and the corresponding sequencing reads, in which the nucleotides of the forward and reverse sequencing reads are identical at each site are retained for single nucleotide variation analysis. The inventors have found through extensive research that, in large-sample single nucleotide variation analysis based on low-depth sequencing data, sites in the overlapping region whose nucleotide type differs between the forward and reverse sequencing reads do not need to be processed with a more complex algorithm, such as the Bayesian model used in conventional analysis. The inventors have found that such sites are unlikely to carry a true single nucleotide variation and can be discarded directly, which simplifies the operation without greatly affecting the reliability of single nucleotide variation detection.
According to an embodiment of the invention, the method further comprises:
aligning at least a portion of the overlapping region with the reference sequence based on the forward sequencing reads and the reverse sequencing reads to obtain the single nucleotide variation.
According to an embodiment of the invention, the alignment is performed by:
determining a pending single nucleotide variation on at least a portion of the overlapping region of the forward sequencing reads and the reverse sequencing reads, the pending single nucleotide variation being a nucleotide on at least a portion of the overlapping region that differs from the nucleotide on the reference sequence; the pending single nucleotide variation not being at the same site as a predicted single nucleotide variation is an indication that the pending single nucleotide variation is a target single nucleotide variation. According to an embodiment of the present invention, the overlapping region is aligned with the reference sequence; a site at which the reads of the overlapping region carry a nucleotide type different from the reference sequence is a possible single nucleotide variation site, and the nucleotide type at that site on the reads is the variant nucleotide type. The site is then queried for a known nucleotide variation type. If the site has a known nucleotide variation type that is the same as the nucleotide type on the overlapping region, the site and the nucleotide variation type are judged to be a known single nucleotide variation; if the site has a known nucleotide variation type that differs from the nucleotide type on the overlapping region, the nucleotide type at the site on the overlapping region is judged to be a non-single nucleotide variation. The inventors have found through extensive studies that, where a site already has a known nucleotide variation, a new nucleotide variation type detected in the overlapping region can be judged directly to be a non-single nucleotide variation without further analysis by a complicated algorithm. Such a new type is very likely not a true single nucleotide variation, so judging it directly as a non-single nucleotide variation does not greatly affect detection accuracy while greatly reducing algorithm complexity.
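The decision rule described above can be sketched as a small lookup. The sketch assumes that the two reads already agree on the base at the site and that known variation types are supplied as a set of bases per site; the function name, the labels and the data layout are illustrative assumptions.

```python
def judge_site(pair_base, ref_base, known_alts):
    """Judge one site of the overlapping region.

    pair_base  -- base shared by the forward and reverse reads at the site
    ref_base   -- base of the reference sequence at the site
    known_alts -- set of known variant bases at the site (empty if none)
    """
    if pair_base == ref_base:
        return "reference"    # no difference from the reference
    if not known_alts:
        return "target_snv"   # pending variation at a site with no known variation
    if pair_base in known_alts:
        return "known_snv"    # matches the known variation type
    return "non_snv"          # new type at a site that already has a known variation
```

The last branch is the shortcut emphasized in the text: a new base at a site with a known variation is rejected directly instead of being passed to a more complex model.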
According to an embodiment of the invention, the pending single nucleotide variation is located not less than 9 bp from the ends of the forward sequencing read and the reverse sequencing read. The inventors have found that when the pending nucleotide variation site lies less than 9 bp from either end (upstream or downstream) of the forward sequencing read, it is not judged to be a single nucleotide variation. Similarly, when the pending nucleotide variation site lies less than 9 bp from either end (upstream or downstream) of the reverse sequencing read, it is not judged to be a single nucleotide variation. Single nucleotide variations determined in this way have high precision, and the judgment is direct and simple.
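A minimal sketch of the end-distance rule follows. It assumes that the site is given as a 0-based offset within each read and that "distance from an end" means the number of bases between the site and that end; both conventions, and the function names, are assumptions made for illustration.

```python
MIN_END_DISTANCE_BP = 9  # threshold stated in the text


def far_enough_from_ends(offset, read_length, min_distance=MIN_END_DISTANCE_BP):
    """True when the site is at least min_distance bases from both read ends."""
    upstream = offset                      # bases before the site
    downstream = read_length - 1 - offset  # bases after the site
    return min(upstream, downstream) >= min_distance


def passes_end_filter(fwd_offset, fwd_length, rev_offset, rev_length):
    """The site must be far enough from the ends of both reads of the pair."""
    return (far_enough_from_ends(fwd_offset, fwd_length)
            and far_enough_from_ends(rev_offset, rev_length))
```

Because the two reads of a pair cover the site at different offsets, a site that is central in one read can still fail the filter because it sits near the end of the other.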
According to an embodiment of the present invention, after single nucleotide variations are obtained by the above method, the read pairs in which both reads support a given single nucleotide variation are obtained. The sequences of these supporting reads are re-extracted and converted into fastq format. They are re-aligned to the human reference genome using another alignment software (bowtie) and the GATK re-alignment software, and single nucleotide variations whose supporting reads have different alignment start positions under the different alignment software are filtered out. At the same time, single nucleotide variations are further filtered out in the following cases:
1) the single nucleotide variation is located at either end of a read (within 9 bp);
2) a known mutation and a new mutation exist at the same site;
4) the base quality at the single nucleotide variation position is less than 30 in either read;
5) the mutation is located in a simple repeat region or a black (blacklisted) region of the genome.
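The re-alignment check and the remaining filters can be combined in a short sketch. It assumes that the alignment start positions from the two aligners, the base qualities and the region annotation have already been collected for each candidate; the record layout and all names are hypothetical, and the end-distance and known-mutation filters are assumed to be applied separately.

```python
from dataclasses import dataclass


@dataclass
class CandidateSupport:
    start_first_aligner: int      # alignment start from the first alignment
    start_second_aligner: int     # alignment start after re-alignment
    base_quality_read1: int       # base quality at the site in read 1
    base_quality_read2: int       # base quality at the site in read 2
    in_repeat_or_blacklist: bool  # site lies in a simple repeat or blacklisted region


def passes_refinement(c, min_base_quality=30):
    """Apply the aligner-concordance, base-quality and region filters."""
    if c.start_first_aligner != c.start_second_aligner:
        return False  # the two aligners place the read at different positions
    if min(c.base_quality_read1, c.base_quality_read2) < min_base_quality:
        return False  # the base quality of either read is below 30
    if c.in_repeat_or_blacklist:
        return False  # simple repeat or blacklisted region
    return True
```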
According to an embodiment of the present invention, before the sequencing data are analyzed, quality control is performed on the obtained sequencing data, and reads are not retained in the following cases:
a) the alignment quality value of either read is <30;
b) duplicate reads and supplementary alignment reads;
c) the alignment result contains insertions or deletions (InDel), clipping at the 5' or 3' end of the read (soft/hard clipping), or a gap in the middle of the read (skipped region from the reference), or the number of base mismatches in the alignment result is >2.
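The quality-control conditions map naturally onto fields of a SAM alignment record. The sketch below assumes that the mapping quality, FLAG, CIGAR string and mismatch count of each read have already been parsed; the duplicate (0x400) and supplementary (0x800) flag bits and the CIGAR operation letters follow the SAM format, while the function names and the reading of item c) as "any one of these conditions" are assumptions.

```python
import re

FLAG_DUPLICATE = 0x400      # SAM FLAG bit: PCR or optical duplicate
FLAG_SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REJECTED_OPS = set("IDNSH")  # insertion, deletion, skipped region, soft/hard clip


def read_passes_qc(mapq, flag, cigar, mismatches,
                   min_mapq=30, max_mismatches=2):
    """Return True when a single aligned read satisfies conditions a) to c)."""
    if mapq < min_mapq:
        return False
    if flag & (FLAG_DUPLICATE | FLAG_SUPPLEMENTARY):
        return False
    operations = {op for _, op in CIGAR_OP.findall(cigar)}
    if operations & REJECTED_OPS:
        return False
    return mismatches <= max_mismatches


def pair_passes_qc(read1, read2):
    """A pair is kept only when both reads pass; each read is a dict of fields."""
    return read_passes_qc(**read1) and read_passes_qc(**read2)
```

A read with CIGAR `150M`, mapping quality 60 and one mismatch passes, whereas `5S145M` or `100M2D50M` is rejected regardless of mapping quality.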
In a second aspect, the present invention provides an apparatus for determining single nucleotide variations in a nucleic acid sample. According to an embodiment of the invention, with reference to FIG. 2, the apparatus comprises: a sequencing data module 100 configured to perform bidirectional low-depth sequencing on the nucleic acid sample to obtain sequencing data of the nucleic acid sample, the sequencing data comprising forward sequencing reads and reverse sequencing reads of the nucleic acid sample; an overlap region acquisition module 200 connected to the sequencing data module 100 and configured to subject the forward sequencing reads and the reverse sequencing reads, each independently, to a first alignment processing with a reference sequence, based on the sequencing data, so as to obtain an overlapping region of the sequencing reads on the reference sequence; and a single nucleotide variation acquisition module 300 connected to the overlap region acquisition module 200 and configured to acquire the single nucleotide variation based on the overlapping region. According to the embodiment of the invention, the parts of the apparatus are configured in accordance with the method described above, and the apparatus can be used to determine single nucleotide variations in a nucleic acid sample quickly and accurately.
According to an embodiment of the invention, the length of the overlapping region is not less than 10 bp, based on the length of the forward sequencing read or the reverse sequencing read.
According to an embodiment of the invention, within the overlapping region the nucleotides in the forward sequencing read and the reverse sequencing read are the same at each site.
According to an embodiment of the invention, the apparatus further comprises: a reference sequence alignment unit configured to align at least a portion of the overlapping region with the reference sequence based on the forward sequencing reads and the reverse sequencing reads to obtain the single nucleotide variation.
According to an embodiment of the invention, the apparatus further comprises:
a pending single nucleotide variation determination unit configured to determine a pending single nucleotide variation on at least part of an overlap region of the forward sequencing read and the reverse sequencing read, the pending single nucleotide variation being a nucleotide on at least part of the overlap region that is different from a nucleotide on the reference sequence;
the pending single nucleotide variation not being at the same site as a predicted single nucleotide variation is an indication that the pending single nucleotide variation is a target single nucleotide variation. According to an embodiment of the invention, the pending single nucleotide variation is located not less than 9 bp from the ends of the forward sequencing read and the reverse sequencing read.
According to an embodiment of the invention, the apparatus further comprises: a second alignment processing unit configured to subject the forward sequencing read and the reverse sequencing read in which the pending single nucleotide variation is located, each independently, to a second alignment processing with the reference sequence so as to obtain an overlapping region of the forward sequencing read and the reverse sequencing read on the reference sequence, wherein the alignment software used for the second alignment processing is different from the alignment software used for the first alignment processing, and the overlapping region of the second alignment processing being the same as the overlapping region of the first alignment processing is an indication that the pending single nucleotide variation is a target single nucleotide variation.
According to an embodiment of the invention, in the overlapping region, the forward sequencing read and the reverse sequencing read having different nucleotide types at the same site is an indication that the sequencing data are not to be processed further.
In a third aspect of the invention, a method of determining the genotype of a test sample is provided. According to an embodiment of the invention, the method comprises: obtaining single nucleotide variations of the test sample, wherein the single nucleotide variations are obtained by the method or the apparatus described above; and obtaining the genotype of the test sample using the single nucleotide variations. According to the embodiment of the invention, the genotype of the test sample can be determined from its single nucleotide variations, and the genotype determined by the method can be used for: 1) genome-wide association studies to find new genotypes and sites associated with certain diseases and phenotypes (such as increase); 2) estimating disease risk and phenotype based on existing findings, for example, certain genotypes of the BRCA gene increase the probability of breast and ovarian cancer in women by several-fold to tens-fold; 3) paternity testing and the like; 4) analogous applications in other species, such as tracing the lineage of pet dogs.
In a fourth aspect of the invention, an apparatus for determining the genotype of a test sample is provided. According to an embodiment of the invention, the apparatus comprises: a single nucleotide variation acquisition apparatus comprising the device set forth in the second aspect of the present invention; and a genotype acquisition device configured to acquire a genotype of the test sample using the obtained single nucleotide variation.
In a fifth aspect of the invention, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first or third aspect of the invention.
In a sixth aspect of the present invention, the present invention provides an electronic apparatus comprising: a computer-readable storage medium according to a fifth aspect of the invention; and one or more processors for executing the program in the computer-readable storage medium.
The invention will now be described with reference to specific examples, which are intended to be illustrative only and not to be limiting in any way.
Example 1:
Sample processing, library construction and sequencing. The sample can be any biological sample, such as tissue, plasma or saliva, and is subjected to DNA extraction, library construction and sequencing using conventional kits. The following is only one example, based on a blood sample, of extraction, library construction and sequencing.
1. Plasma separation
a) Prepare the instruments, reagents and consumables required for the experiment, and pre-cool the high-speed refrigerated centrifuge to 4 ℃ in advance.
b) If the peripheral blood sample is collected in an EDTA anticoagulant tube, place it in a 4 ℃ refrigerator immediately after blood collection and separate the plasma within 2 hours. If the peripheral blood sample is collected in a cell-free nucleic acid preservation tube such as a Streck tube, it can be kept at room temperature and the plasma separated within the time specified in the instructions for the blood collection tube.
c) Record the sample information and balance the blood collection tubes. Fit the high-speed refrigerated centrifuge with a horizontal rotor and set the parameters: temperature 4 ℃, centrifugal force 1600 g, time 10 min. Place the balanced blood collection tubes in the centrifuge and centrifuge.
d) After centrifugation, place the blood collection tubes on a centrifuge tube rack in a biosafety cabinet. Transfer the supernatant from each centrifuged blood collection tube into a new 15 mL centrifuge tube, and mark the sample number and operation time on the tube wall. Collect the supernatant carefully to avoid aspirating leukocytes.
e) Fit the high-speed refrigerated centrifuge with a fixed-angle rotor and set the parameters: temperature 4 ℃, centrifugal force 16000 g, time 10 min.
Balance the 15 mL centrifuge tubes containing the supernatant, place them in the centrifuge, and centrifuge.
f) After centrifugation, place the 15 mL centrifuge tubes containing the supernatant on the centrifuge tube rack in the biosafety cabinet. Transfer the supernatant from each centrifuged tube into a new 15 mL tube. Take 500 μl of the supernatant and store it in a 1.5 mL centrifuge tube for subsequent tumor marker detection. Collect the supernatant carefully to avoid aspirating the pellet.
The purpose of this step is to remove impurities such as cellular debris from the plasma.
g) Store the plasma and blood cells in a -80 ℃ refrigerator until use.
h) After the experiment, return all articles, clean the bench top, turn on the ultraviolet lamp of the biosafety cabinet, and turn it off after 30 min of irradiation. Keep a detailed experimental record.
2. cfDNA extraction
i) Prepare the instruments, reagents and consumables required for the experiment. Turn on the water bath and set the temperature to 60 ℃. Turn on the metal bath and set the temperature to 56 ℃. Confirm that the kit is within its validity period, that the appropriate amount of isopropanol has been added to Buffer ACB, and that the appropriate amount of absolute ethanol has been added to Buffer ACW1 and Buffer ACW2.
j) Record the sample number and other information.
k) If the plasma is freshly separated, proceed directly to cfDNA extraction. If the plasma has been stored frozen at -80 ℃, thaw the plasma sample and centrifuge at 16,000 x g (fixed-angle rotor) at 4 ℃ for 5 min to remove cryoprecipitate.
l) Prepare the required amount of ACL mixture according to Table 1.
Table 1: Volumes of Buffer ACL and carrier RNA (dissolved in Buffer AVE) required to process 4 ml samples
Figure BDA0003062453200000111
Figure BDA0003062453200000121
m) Transfer 400 μl of Proteinase K to a 50 ml centrifuge tube containing 4 ml of plasma. Vortex intermittently for 30 s to mix well.
n) Add 3.2 ml of Buffer ACL (containing 1.0 μg of carrier RNA). Vortex vigorously for 15 s. Make sure that vigorous vortexing thoroughly mixes the sample and Buffer ACL so that lysis is efficient.
o) Note: do not interrupt the procedure after this step; proceed immediately to the lysis incubation step.
p) Incubate the centrifuge tube in a water bath at 60 ℃ for 30 minutes.
q) Add 7.2 ml of Buffer ACB to the above reaction mixture. Close the tube cap and vortex intermittently for 15 s to mix well.
r) Incubate the lysate containing Buffer ACB on ice or in a refrigerator for 5 min.
s) Assemble the vacuum filtration device: insert a VacValve into the 24-well base, insert a VacConnector into the VacValve, attach a QIAamp Mini silica gel membrane column to the VacConnector, and finally insert a 20 ml tube extender into the silica gel membrane column. Make sure the tube extender is inserted firmly to prevent sample leakage. Note: keep the 2 ml collection tube for the later dry spin. Mark the sample number on the silica gel membrane column. The VacValve regulates the flow rate, the VacConnector prevents contamination, the QIAamp Mini silica gel membrane column adsorbs the DNA, and the tube extender holds the large plasma volume.
t) Transfer the incubated mixture into the tube extender and turn on the vacuum pump. When the lysate has been completely drawn through the spin column, turn off the vacuum pump and open the vent valve on the side of the 24-well base to release the pressure to 0 MPa. Carefully remove and discard the tube extender.
u) Add 600 μl of Buffer ACW1 to the QIAamp Mini silica gel membrane column, close the vent valve, and turn on the vacuum pump to draw the liquid through. When the Buffer ACW1 has been completely drawn through the spin column, turn off the vacuum pump and open the vent valve on the side of the 24-well base to release the pressure to 0 MPa.
v) Add 750 μl of Buffer ACW2 to the QIAamp Mini silica gel membrane column, close the vent valve, and turn on the vacuum pump to draw the liquid through. When the Buffer ACW2 has been completely drawn through the spin column, turn off the vacuum pump and open the vent valve on the side of the 24-well base to release the pressure to 0 MPa.
w) Add 750 μl of absolute ethanol to the QIAamp Mini silica gel membrane column, close the vent valve, and turn on the vacuum pump to draw the liquid through. When the absolute ethanol has been completely drawn through the spin column, turn off the vacuum pump and open the vent valve on the side of the 24-well base to release the pressure to 0 MPa. Switch off the power supply of the vacuum pump.
x) Close the lid of the QIAamp Mini silica gel membrane column, remove it from the vacuum manifold, place it in a clean 2 ml collection tube, and discard the VacConnector. Centrifuge at full speed (20,000 x g; 14,000 rpm) for 3 min.
y) Place the QIAamp Mini silica gel membrane column in a new 2 ml collection tube, open the lid, and dry on the metal bath at 56 ℃ for 10 min until the silica gel membrane is completely dry.
z) Remove the QIAamp Mini silica gel membrane column and place it in a clean 1.5 ml elution tube (supplied with the kit); discard the used 2 ml collection tube.
aa) Carefully add Nuclease-free water (elution volume: 20-60 μl) to the center of the silica gel membrane in the QIAamp Mini silica gel membrane column. Close the lid and incubate at room temperature for 3 min.
bb) Centrifuge the elution tube in a microcentrifuge at full speed (20,000 x g; 14,000 rpm) for 1 min to elute the cfDNA.
cc) Quality standards and assessment
Qubit HS quantification: quantify 1 μL of cfDNA using a Qubit 4.0 (Thermo Fisher Scientific, Q33226) together with the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Q32854), and record the concentration in ng/μL.
Agilent 2100 detection: take 1 μL of cfDNA and determine the cfDNA fragment distribution from the peak profile obtained with an Agilent 2100 Bioanalyzer (Agilent, G29939BA) together with the Agilent High Sensitivity DNA Kit (Agilent, 5067-4626).
dd) After the experiment, return all articles, clean the bench top, turn on the ultraviolet lamp of the biosafety cabinet, and turn it off after 30 min of irradiation. Keep a detailed experimental record.
3. cfDNA library construction
a) Preparation before library construction
i. Remove the magnetic beads for DNA purification (AMPure XP beads, Beckman) from the 4 ℃ refrigerator and equilibrate at room temperature for 30 min before use.
ii. Take the End Repair & A-Tailing Buffer and the End Repair & A-Tailing Enzyme Mix out of the -20 ℃ refrigerator and thaw them on an ice box for later use.
iii. Record the name, sampling date and DNA concentration of each cfDNA sample to be used for library construction in the laboratory notebook, and assign serial numbers to facilitate later operations.
iv. Take the corresponding number of 200 μL PCR tubes and number them (on both the tube lid and the tube wall).
v. Calculate the volume of DNA solution required for each cfDNA sample, given that the input amount for the cfDNA library must be at least 10 ng and at most 100 ng, record the volume in the laboratory notebook, and transfer the corresponding volume to the corresponding 200 μL PCR tube.
vi. Add the appropriate amount of Nuclease-Free water to each 200 μL PCR tube to bring the final volume to 50 μL.
vii. Note: the following rule applies when preparing all reaction systems during library construction. If there are fewer than four samples, no master mix needs to be prepared, and each component solution of the reaction system is added to each sample individually. If there are more than four samples, prepare a master mix containing 105% of the required amount of each component solution of the reaction system, and then add the master mix to each sample one by one.
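The arithmetic behind steps v to vii can be sketched as follows. The component names and per-sample volumes used in the usage note are hypothetical placeholders (the actual values are given in the tables of the disclosure), and treating exactly four samples as a master-mix case is an assumption, since the text only covers fewer than four and more than four samples.

```python
def input_volumes(concentration_ng_per_ul, target_mass_ng, final_volume_ul=50.0):
    """Return (DNA volume, water volume) in microliters for one sample."""
    if not 10 <= target_mass_ng <= 100:
        raise ValueError("library input must be between 10 ng and 100 ng")
    dna_ul = target_mass_ng / concentration_ng_per_ul
    if dna_ul > final_volume_ul:
        raise ValueError("sample too dilute to reach the target mass in 50 uL")
    return dna_ul, final_volume_ul - dna_ul


def master_mix(per_sample_ul, n_samples, overage=1.05, min_samples_for_mix=4):
    """Return total component volumes for a master mix, or None.

    None means each component is pipetted into each sample individually.
    The boundary at exactly four samples is an assumption of this sketch.
    """
    if n_samples < min_samples_for_mix:
        return None
    return {name: volume * n_samples * overage
            for name, volume in per_sample_ul.items()}
```

For instance, a sample at 2 ng/μL with a 50 ng target needs 25 μL of DNA and 25 μL of water, and a hypothetical system of 7 μL buffer plus 3 μL enzyme mix per sample scales to 58.8 μL and 25.2 μL for eight samples.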
b) End repair & A-tailing
i. Prepare the end repair & A-tailing reaction system as shown in Table 2.
Table 2:
Figure BDA0003062453200000141
ii. Add 10 μL of the above end repair & A-tailing reaction system to each 200 μL PCR tube, mix well, centrifuge at low speed, place the tubes in the PCR instrument, and run the program shown in Table 3 below.
Table 3:
Figure BDA0003062453200000142
iii. Take the reaction system out of the PCR instrument, place it on the small yellow plate, and proceed to the adapter ligation reaction.
c) Adapter ligation reaction
i. Prepare the adapter ligation reaction system as shown in Table 4.
Table 4:
Figure BDA0003062453200000143
Figure BDA0003062453200000151
ii. Add 45 μL of the reaction system to each reaction tube, mix gently, and centrifuge at low speed.
iii. Add the appropriate amount of adapter according to the amount of input DNA, as shown in Table 5 below, adding 5 μL of adapter to each reaction tube. In addition, according to the sequencing requirements, add a different adapter to each sample so that no two samples in the same lane use the same adapter, and record the adapter used for each sample.
Table 5:
Figure BDA0003062453200000152
iv. Mix well, place in the PCR instrument, set the temperature to 20 ℃, and react for 15 min.
d) DNA purification
i. Prepare 80% ethanol (for example, 50 mL of 80% ethanol: 40 mL of absolute ethanol + 10 mL of Nuclease-Free Water); the 80% ethanol should be prepared fresh just before use.
ii. Prepare a corresponding number of 1.5 mL sample tubes and label them accordingly.
iii. Mix the room-temperature-equilibrated magnetic beads thoroughly by shaking and dispense 88 μL into each tube.
iv. Mix the adapter-ligated DNA with the magnetic beads and let stand at room temperature for 10 min.
v. Place the 1.5 mL sample tubes on a magnetic rack and allow the beads to be captured until the solution is clear.
vi. Carefully remove the supernatant, add 200 μL of 80% ethanol, rotate the sample tube horizontally through 360 degrees, let stand for 30 s, and discard the supernatant. (Keep the tube on the magnetic rack throughout this step.)
vii. Repeat the above step once.
viii. Remove all remaining ethanol. Open the tube lid and dry the magnetic beads at room temperature so that the ethanol evaporates; excess ethanol would impair the enzymes in the subsequent reaction system. Note: do not over-dry the beads, otherwise the DNA will not elute easily from the beads and yield will be lost. Drying is complete when the bead surface is no longer glossy.
ix. Add 21 μL of Nuclease-Free water to each sample tube, resuspend the beads, mix well, and let stand at room temperature for 5 min.
x. Prepare a new batch of 200 μL PCR tubes and label the tube lids with the corresponding sample numbers.
xi. Place the sample tubes on the magnetic rack, allow the beads to be captured until the solution is clear, and transfer the supernatant to the PCR tube with the corresponding number for use as the template for the PCR.
e) Library amplification
i. Library amplification reaction systems were prepared as shown in Table 6.
Table 6:
Figure BDA0003062453200000161
ii. Add 30 μL of the Pre-PCR amplification reaction system to each 0.2 mL sample tube, mix gently, centrifuge at low speed, and place in the PCR instrument for reaction.
iii. Program the PCR instrument as shown in Table 7, adjusting the number of PCR cycles according to the amount of input DNA.
Table 7:
Figure BDA0003062453200000162
For selecting the number of cycles, refer to Table 8.
Table 8:
Amount of input DNA, X (ng)    Number of PCR cycles
X > 50                         4
25 < X ≤ 50                    5
10 < X ≤ 25                    6
X ≤ 10                         7
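The lookup in Table 8 can be written as a small function; the function name and the rejection of non-positive inputs are assumptions added for illustration.

```python
def pcr_cycles(input_dna_ng):
    """Return the number of PCR cycles for a given input DNA amount (Table 8)."""
    if input_dna_ng <= 0:
        raise ValueError("input DNA amount must be positive")
    if input_dna_ng > 50:
        return 4
    if input_dna_ng > 25:
        return 5
    if input_dna_ng > 10:
        return 6
    return 7
```

The boundaries follow the table exactly: 50 ng gives 5 cycles, 25 ng gives 6 cycles, and 10 ng gives 7 cycles.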
After the end of the Pre-PCR reaction, library purification was started.
f) Library purification
i. Prepare a corresponding number of 1.5 mL sample tubes and label them accordingly.
ii. Mix the room-temperature-equilibrated magnetic beads thoroughly by shaking and dispense 50 μL into each tube.
iii. Mix the adapter-ligated DNA with the magnetic beads and let stand at room temperature for 10 min.
iv. Place the 1.5 mL sample tubes on a magnetic rack and allow the beads to be captured until the solution is clear.
v. Carefully remove the supernatant, add 200 μL of 80% ethanol, rotate the sample tube horizontally through 360 degrees, let stand for 30 s, and discard the supernatant. (Keep the tube on the magnetic rack throughout this step.)
vi. Repeat the above step once.
vii. Remove all remaining ethanol. Open the tube lid and dry the magnetic beads at room temperature so that the ethanol evaporates; excess ethanol would impair the enzymes in the subsequent reaction system. Note: do not over-dry the beads, otherwise the DNA will not elute easily from the beads and yield will be lost. Drying is complete when the bead surface is no longer glossy.
viii. Add 35 μL of Nuclease-Free water to each sample tube, resuspend the beads, mix well, and let stand at room temperature for 5 min.
ix. Prepare a batch of new centrifuge tubes; label the tube lids with the project, sampling date and sample name, and the tube walls with the adapter information, library construction date and concentration.
x. Place the 1.5 mL sample tubes on the magnetic rack, allow the beads to be captured until the solution is clear, and transfer the supernatant to the corresponding new 1.5 mL centrifuge tube labeled with the sample information.
xi. Take 1 μl of library for quantification with the Qubit and 1 μl of library for fragment-size measurement with the Agilent 2100, and record the corresponding information.
xii. Place the samples in the cryostorage box of the corresponding project and store at -20 ℃.
xiii. After the experiment, return all articles, clean the bench top, turn on the ultraviolet lamp of the clean bench, and turn it off after 30 min of irradiation. Record the detailed experimental information.
4. Library pooling
g) Prepare the instruments, reagents and consumables required for the experiment.
h) Calculate the pooling volume required for each sample from the measured concentration and the amount of data to be generated.
i) Take a new 1.5 ml centrifuge tube and label it. Pool the samples into the same 1.5 ml centrifuge tube according to the calculated pooling volumes.
j) After mixing well, measure the concentration and record the information.
k) After the experiment, return all articles and clean the bench top.
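The text does not give the formula for the pooling volume, so the following is only one plausible sketch: it assumes that each library contributes a share of the pooled mass proportional to the amount of data requested for it, and that the total mass of the pool is chosen by the operator. The function name, the parameters and the mass-proportional rule are all assumptions; a molarity-based calculation that accounts for fragment size would be an alternative.

```python
def pooling_volumes(libraries, total_pool_ng):
    """Return the volume (uL) of each library to add to the pool.

    libraries     -- dict: sample name -> (concentration in ng/uL, requested data in Gb)
    total_pool_ng -- total DNA mass of the pool, chosen by the operator
    """
    total_data = sum(data for _, data in libraries.values())
    volumes = {}
    for name, (concentration, data) in libraries.items():
        mass_ng = total_pool_ng * data / total_data  # share proportional to data
        volumes[name] = mass_ng / concentration      # volume = mass / concentration
    return volumes
```

With a 100 ng pool, a library at 10 ng/μL requesting 5 Gb and a library at 20 ng/μL requesting 15 Gb would contribute 2.5 μL and 3.75 μL respectively under these assumptions.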
5. Sequencing on machine
The pooled library was diluted and denatured with Tris-HCl and NaOH, and then sequenced on the instrument.
Example 2:
(1) After low-quality reads were filtered out, the sequencing reads were aligned to the human reference genome (hg19) using alignment software (bwa).
(2) The alignment results were filtered: properly paired aligned reads were selected first, and the following were then filtered out:
a) pairs in which the alignment quality value of either read is <30;
b) duplicate reads and supplementary alignment reads;
c) reads whose alignment result contains insertions or deletions (InDel), clipping at the 5' or 3' end of the read (soft/hard clipping), or a gap in the middle of the read (skipped region from the reference), or in which the number of base mismatches is >2.
(3) The alignment results of both reads of each pair were taken together. For the overlap region of paired reads, such as the overlap of readsA_1 and readsA_2 in FIG. 1, single nucleotide variations were searched for by counting as follows.
Assume there is an SNV at a site, where the reference base is A and the mutant base is C. Read pairs from different fragments can be divided into five categories (FIG. 3):
a) Reads 1 and 2 from one fragment are identical to each other and to the reference sequence ("Ref_base_PE").
b) Reads 1 and 2 from one fragment are identical to each other and to the mutation ("Alt_base_PE").
c) Reads 1 and 2 are identical to each other but match neither the reference base nor the known mutation ("Other_PE").
d) Reads 1 and 2 have different nucleotides at the same site ("Diff_base_PE").
e) Reads 1 and 2 cover the same site, with one matching the reference and the other not ("Diff_base_SE").
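The five categories can be expressed as a small classification function applied to the two bases a read pair shows at one overlapping site. This is a hedged sketch: the function names are hypothetical, and because categories d) and e) both describe disagreeing reads, the code assumes that e) applies when exactly one read shows the reference base and d) applies to all other disagreements.

```python
from collections import Counter


def classify_pair(base1, base2, ref, alt):
    """Classify the bases of read 1 and read 2 at one overlapping site."""
    if base1 == base2:
        if base1 == ref:
            return "Ref_base_PE"
        if base1 == alt:
            return "Alt_base_PE"
        return "Other_PE"
    # The reads disagree. Assumption: one read matching the reference
    # is counted as Diff_base_SE, anything else as Diff_base_PE.
    if ref in (base1, base2):
        return "Diff_base_SE"
    return "Diff_base_PE"


def tally_site(pairs, ref, alt):
    """Count the categories over all read pairs covering one site."""
    return Counter(classify_pair(b1, b2, ref, alt) for b1, b2 in pairs)


# Reference base A, mutant base C, as in the text above.
counts = tally_site([("A", "A"), ("C", "C"), ("G", "G"), ("A", "C"), ("C", "G")], "A", "C")
```

Only pairs counted as `Alt_base_PE` have both reads supporting the mutation.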
(4) Based on the above algorithm, single nucleotide variations supported by both reads of a pair were obtained. The reads supporting each single nucleotide variation were then re-extracted and converted to fastq format, and re-aligned to the human reference genome using a different aligner (bowtie) and GATK realignment software. Single nucleotide variations whose supporting reads have different alignment start positions under the different alignment software were filtered out. Single nucleotide variations meeting any of the following conditions were also filtered out:
1) the single nucleotide variation is located at either end of the read (within 9 bp);
2) Alt_base_PE and Other_PE mutations are present at the same site;
4) the base quality at the position of the single nucleotide variation is below 30 in either read;
5) the mutation is located in a simple repeat region or a blacklisted region of the genome.
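The per-variant filters listed above can be sketched as a single predicate. This is an illustration only: the `CandidateSNV` record and its field names are hypothetical, offsets are assumed to be 0-based positions within each read, and "within 9 bp" is interpreted as fewer than 9 bases from either read end.

```python
from dataclasses import dataclass


@dataclass
class CandidateSNV:
    offset_read1: int             # 0-based position of the variant in read 1
    len_read1: int
    offset_read2: int             # 0-based position of the variant in read 2
    len_read2: int
    qual_read1: int               # base quality at the variant in read 1
    qual_read2: int               # base quality at the variant in read 2
    other_pe_at_site: bool        # an Other_PE call exists at the same site
    in_repeat_or_blacklist: bool  # site lies in a simple repeat or blacklisted region


def near_read_end(offset, read_len, margin=9):
    """True if the position is fewer than `margin` bases from either end."""
    return offset < margin or offset >= read_len - margin


def keep_snv(snv, margin=9, min_base_qual=30):
    """Return True if the candidate survives all of the listed filters."""
    if near_read_end(snv.offset_read1, snv.len_read1, margin):
        return False
    if near_read_end(snv.offset_read2, snv.len_read2, margin):
        return False
    if snv.other_pe_at_site:
        return False
    if snv.qual_read1 < min_base_qual or snv.qual_read2 < min_base_qual:
        return False
    return not snv.in_repeat_or_blacklist
```

The check on alignment start positions between the two aligners is not included here, since it requires the alignments themselves.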
(3) The amount of sequencing data remaining after filtering the alignment results and the total amount of sequencing data in the overlap between read 1 and read 2 were counted; the results for example sample 1 are shown in Table 9.
Table 9:
TotalPEReads: 35650782
OverlapingBase: 871729732
Diff_base_PE: 42950
Diff_base_SE: 1913190
ALTcount: 2459496
(4) Example sample 1 was also subjected to high-depth sequencing (50X). Using the variants detected by VarScan2 as the reference set, the single nucleotide variation detection accuracy was calculated to be 86.12%.
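The text does not state the formula behind the accuracy figure. One plausible reading, shown here purely as an assumption, is the fraction of detected single nucleotide variations that also appear in the reference set. The function name and the toy variants below are hypothetical and are not data from the example.

```python
def detection_accuracy(called, reference):
    """Fraction of called variants that are also in the reference set.

    Variants are (chromosome, position, ref_base, alt_base) tuples.
    Assumption: "accuracy" means concordance of calls with the reference set.
    """
    called = set(called)
    if not called:
        raise ValueError("no called variants")
    return len(called & set(reference)) / len(called)


# Toy input for illustration only.
called = [("chr1", 100, "A", "C"), ("chr1", 200, "G", "T"),
          ("chr2", 50, "C", "T"), ("chr3", 10, "T", "G")]
reference = [("chr1", 100, "A", "C"), ("chr1", 200, "G", "T"),
             ("chr2", 50, "C", "T"), ("chr5", 7, "A", "G")]
toy_accuracy = detection_accuracy(called, reference)
```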
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (20)

1. A method for determining single nucleotide variations in a nucleic acid sample, comprising:
performing bidirectional low-depth sequencing on the nucleic acid sample to obtain sequencing data of the nucleic acid sample, wherein the sequencing data comprises forward sequencing reads and reverse sequencing reads of the nucleic acid sample;
based on the sequencing data, performing a first alignment process on each of the forward sequencing reads and the reverse sequencing reads, independently, against a reference sequence to obtain the positions of the sequencing reads on the reference sequence and an overlapping region of the sequencing reads;
obtaining the single nucleotide variation based on the overlapping region.
2. The method of claim 1, wherein the length of the overlapping region is no less than 10 bp, based on the length of the forward sequencing reads or the reverse sequencing reads.
3. The method of claim 1, wherein, within the overlap region, the nucleotides of the forward sequencing reads and the reverse sequencing reads are the same.
4. The method of claim 3, further comprising:
aligning at least a portion of the overlapping region with the reference sequence based on the forward sequencing reads and the reverse sequencing reads to obtain the single nucleotide variation.
5. The method of claim 4, wherein the alignment is performed by:
determining a pending single nucleotide variation on at least a portion of the region of overlap of the forward sequencing reads and the reverse sequencing reads, the pending single nucleotide variation being a nucleotide on at least a portion of the region of overlap that is different from a nucleotide on the reference sequence;
wherein the pending single nucleotide variation not being at the same site as a predicted single nucleotide variation is indicative of the pending single nucleotide variation being a target single nucleotide variation.
6. The method of claim 5, wherein the pending single nucleotide variation is not less than 9 bp from the ends of the forward sequencing read and the reverse sequencing read.
7. The method of claim 5, further comprising:
performing, each independently, a second alignment process against the reference sequence on the forward sequencing reads and the reverse sequencing reads on which the pending single nucleotide variation is located, so as to obtain an overlapping region of the first sequencing reads and the second sequencing reads on the reference sequence, the second alignment process using different alignment software than the first alignment process,
wherein the overlap region of the second alignment process being the same as the overlap region of the first alignment process is indicative of the pending single nucleotide variation being a target single nucleotide variation.
8. The method of claim 1, wherein the forward sequencing reads and the reverse sequencing reads having different nucleotide types at the same site in the overlap region is indicative of the sequencing data not being subsequently processed.
9. An apparatus for determining single nucleotide variations in a nucleic acid sample, comprising:
a sequencing data module configured to perform bidirectional low-depth sequencing on the nucleic acid sample to obtain sequencing data of the nucleic acid sample, the sequencing data comprising forward sequencing reads and reverse sequencing reads of the nucleic acid sample;
an overlap region acquisition module configured to perform, based on the sequencing data, a first alignment process on each of the forward sequencing reads and the reverse sequencing reads, independently, against a reference sequence to obtain the positions of the sequencing reads on the reference sequence and an overlap region of the sequencing reads;
a single nucleotide variation acquisition module configured to obtain the single nucleotide variation based on the overlapping region.
10. The apparatus of claim 9, wherein the length of the overlapping region is no less than 10 bp, based on the length of the forward sequencing reads or the reverse sequencing reads.
11. The apparatus of claim 9, wherein, within the overlap region, the nucleotides of the forward sequencing reads and the reverse sequencing reads are the same.
12. The apparatus of claim 9, further comprising:
a reference sequence alignment unit configured to align at least a portion of the overlapping region with the reference sequence based on the forward sequencing reads and the reverse sequencing reads to obtain the single nucleotide variation.
13. The apparatus of claim 12, further comprising:
a pending single nucleotide variation determination unit configured to determine a pending single nucleotide variation on at least part of an overlap region of the forward sequencing read and the reverse sequencing read, the pending single nucleotide variation being a nucleotide on at least part of the overlap region that is different from a nucleotide on the reference sequence;
wherein the pending single nucleotide variation not being at the same site as a predicted single nucleotide variation is indicative of the pending single nucleotide variation being a target single nucleotide variation.
14. The apparatus of claim 13, wherein the pending single nucleotide variation is not less than 9 bp from the ends of the forward sequencing reads and the reverse sequencing reads.
15. The apparatus of claim 12, further comprising:
a second alignment processing unit configured to perform, each independently, a second alignment process against the reference sequence on the forward sequencing reads and the reverse sequencing reads on which the pending single nucleotide variation is located, so as to obtain an overlapping region of the first sequencing reads and the second sequencing reads on the reference sequence, the second alignment process using different alignment software than the first alignment process,
wherein the overlap region of the second alignment process being the same as the overlap region of the first alignment process is indicative of the pending single nucleotide variation being a target single nucleotide variation.
16. The apparatus of claim 9, wherein the forward sequencing reads and the reverse sequencing reads having different nucleotide types at the same site in the overlap region is indicative of the sequencing data not being subsequently processed.
17. A method for determining the genotype of a test sample, comprising:
obtaining a single nucleotide variation of the test sample, the single nucleotide variation being obtained by the method of any one of claims 1 to 8 or the apparatus of any one of claims 9 to 16;
acquiring the genotype of the test sample by using the single nucleotide variation.
18. An apparatus for determining the genotype of a test sample, comprising:
a single nucleotide variation acquisition apparatus comprising the device of any one of claims 9 to 16; and
a genotype acquisition device configured to acquire a genotype of the test sample using the obtained single nucleotide variation.
19. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 8 and 17.
20. An electronic device, comprising:
the computer-readable storage medium recited in claim 19; and
one or more processors to execute the program in the computer-readable storage medium.
CN202110517844.3A 2021-05-12 2021-05-12 Method and device for detecting nucleotide variation based on single molecule sequencing Pending CN113186255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110517844.3A CN113186255A (en) 2021-05-12 2021-05-12 Method and device for detecting nucleotide variation based on single molecule sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110517844.3A CN113186255A (en) 2021-05-12 2021-05-12 Method and device for detecting nucleotide variation based on single molecule sequencing

Publications (1)

Publication Number Publication Date
CN113186255A true CN113186255A (en) 2021-07-30

Family

ID=76981258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110517844.3A Pending CN113186255A (en) 2021-05-12 2021-05-12 Method and device for detecting nucleotide variation based on single molecule sequencing

Country Status (1)

Country Link
CN (1) CN113186255A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination