WO2019001168A1 - Method and device for analyzing sequencing data results, and method for constructing and sequencing a sequencing library - Google Patents

Method and device for analyzing sequencing data results, and method for constructing and sequencing a sequencing library

Info

Publication number
WO2019001168A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
sequence
sample
tag
sequenced
Prior art date
Application number
PCT/CN2018/087509
Other languages
English (en)
French (fr)
Inventor
王克剑
刘庆
王春
Original Assignee
中国水稻研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国水稻研究所
Priority to US16/476,079 (US20200111542A1)
Publication of WO2019001168A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1093 General methods of preparing gene libraries, not provided for in other subgroups
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C12Q1/6874 Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing

Definitions

  • The invention relates to the field of gene sequencing, and in particular to a method and a device for analyzing sequencing data results, a method for constructing a sequencing library, and a sequencing method.
  • Sequence mutations fall into two types: single nucleotide polymorphisms (SNPs) and insertion/deletion mutations (indels). The two types also differ in how they are detected.
  • Existing methods for identifying SNP mutations include the TaqMan probe method, the SNaPshot method, the Mass Array method, the Illumina BeadXpress method, Sanger direct sequencing, high-resolution melting (HRM), and enzyme digestion. The method can identify both SNP mutations and insertion/deletion mutations. Several methods for identifying SNP mutations are described in detail below.
  • The TaqMan probe method is shown in Figure 1.
  • In the TaqMan probe method, PCR primers and TaqMan probes are designed for the different SNP sites on the chromosome, and real-time PCR amplification is performed.
  • A reporter fluorophore and a quencher fluorophore are attached to the 5' and 3' ends of the probe, respectively.
  • When the probe anneals to the template, it forms a substrate for the exonuclease activity, which cleaves the fluorescent molecule attached to the 5' end of the probe. This abolishes the FRET between the two fluorescent molecules, so fluorescence is emitted. The method is usually used to analyze a small number of SNP sites.
  • The SNaPshot method is shown in Figure 2.
  • The SNaPshot method is a typing technique based on single-base extension with fluorescent labeling, also known as minisequencing, and is mainly used for medium-throughput SNP typing projects.
  • The reaction system contains a sequencing enzyme, four fluorescently labeled ddNTPs, extension primers of different lengths that anneal immediately 5' of the polymorphic sites, and a PCR product template. Each primer is extended by a single base and terminated, and the products are detected on an ABI sequencer.
  • The SNP site corresponding to each extension product is determined from the position of its peak, and the incorporated base is identified from the color of the peak, which gives the genotype of the sample.
  • The PCR product templates can be obtained with a multiplex PCR reaction system. The method is usually used to analyze 10-30 SNP sites.
  • The HRM method is shown in Figure 3.
  • The HRM method is a SNP research tool developed in recent years. It monitors in real time the binding of a double-stranded DNA fluorescent dye to the PCR amplification product in order to determine whether a SNP is present. Different SNP loci, and whether a locus is heterozygous, affect the peak shape of the melting curve, so HRM analysis can effectively distinguish different SNP loci and different genotypes.
  • This detection method is not limited by the location or type of the mutated base. No sequence-specific probe is needed: high-resolution melting is performed directly after PCR to genotype the samples. The method therefore requires no probe design, is simple, fast, inexpensive, and accurate, and allows true closed-tube operation.
  • HRM is a new molecular diagnostic technique for detecting gene mutations and genotyping that combines saturating fluorescent dyes, unlabeled probes, and real-time PCR.
  • The temperature at which half of the double-stranded DNA has melted is called the melting temperature (Tm).
  • DNA molecules with different sequences have different Tm values.
  • Non-specific dyes such as SYBR Green insert directly into double-stranded DNA fragments and fluoresce. The denaturation and renaturation of DNA can therefore be followed through the change in fluorescence intensity over a given temperature range.
  • The curve of the fluorescence signal against temperature is the melting curve.
  • Every DNA molecule has its own melting-curve shape and position when heated and denatured, mainly because nucleic acid molecules differ in length, GC content, GC distribution, and so on.
  • For an ordinary melting curve, the temperature is raised slowly at 0.5 °C per cycle while the PCR amplification product denatures and the fluorescence signal is detected in real time. Different products give different characteristic melting peaks, and ordinary real-time PCR uses the specificity of these peaks to judge the specificity of the amplification product.
  • The Mass Array method (also known as Mass Array molecular array technology) is a genetic analysis tool that combines primer extension or cleavage reactions with sensitive and reliable MALDI-TOF-MS technology for genotyping.
  • The iPLEX GOLD technology based on the Mass Array platform supports PCR reactions and genotype detection at up to 40-plex, with flexible experimental design and highly accurate typing results.
  • Mass Array offers the best price/performance ratio for tens to hundreds of SNP sites. It is especially suitable for verifying the results of genome-wide studies, or for cases in which a limited number of research sites have already been identified.
  • The Illumina BeadXpress method uses the BeadXpress system for batch SNP site detection. It can detect 1-384 SNP sites simultaneously, is often used to confirm genomic chip results, and is suitable for high-throughput detection.
  • The microbead chip has high density, high reproducibility, high sensitivity, low sample loading, flexible customization, and high integration density, which gives very high detection and screening speed and can significantly reduce costs in high-throughput screening.
  • Mutation identification based on the above methods has several problems: throughput is low, and some methods can only identify and analyze a single sample at a time, at high cost; the detection rate for low-frequency mutation types is low; the operation steps are cumbersome; and once the sequencing data come off the machine, a bioinformatics background is needed to analyze them.
  • The embodiments of the invention provide a method and a device for analyzing sequencing data results, and a method for constructing and sequencing a sequencing library, which at least solve the technical problem in the related art that sequencing results require technicians with a specialist background to manually identify the samples, resulting in low efficiency and high cost.
  • A method for analyzing sequencing data results comprises: obtaining a sequencing data result of a sequencing library, wherein the sequencing library includes a plurality of mixed samples, each sample corresponds to one tag sequence combination, and different samples correspond to different tag sequence combinations, wherein each tag sequence combination includes a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed plurality of samples, and the set includes a plurality of sequencing fragments; determining the tag sequence combination of each sequencing fragment; and determining the sample corresponding to each sequencing fragment according to its tag sequence combination.
  • The plurality of sequencing fragments includes a first sequencing fragment, and determining the tag sequence combination of the first sequencing fragment comprises: extracting all tag sequences in the first sequencing fragment; comparing each extracted tag sequence with a plurality of reference tag sequences whose numbers are known, to determine the number corresponding to each tag sequence in the first sequencing fragment; and taking the combination of the numbers of all tag sequences in the first sequencing fragment as the number of its tag sequence combination.
  • The method further includes: acquiring a plurality of pre-stored reference tag sequences whose numbers are known.
  • Each sequencing fragment includes a forward read sequence and a reverse read sequence, and extracting all tag sequences in the first sequencing fragment includes: extracting a tag sequence from each of the forward read sequence and the reverse read sequence of the first sequencing fragment, wherein the tag sequence combination of the first sequencing fragment includes the tag sequence extracted from the forward read sequence and the tag sequence extracted from the reverse read sequence.
  • The method further comprises: obtaining a reference sequence for each sample; extracting the sample sequence in each sequencing fragment; and aligning each sample sequence with the reference sequence of the corresponding sample to determine the mutation information of each sample.
  • Obtaining the reference sequence of each sample includes: receiving the reference sequence of each sample uploaded by a client terminal through a control. After the mutation information of each sample is determined, the method further includes: feeding the mutation information of each sample back to the client terminal.
  • Obtaining the sequencing data result of the sequencing library comprises: receiving the sequencing data result uploaded by the client terminal through a control. After the sample corresponding to each sequencing fragment is determined according to its tag sequence combination, the method further comprises: feeding the correspondence between the sequencing fragments and the samples back to the client terminal.
  • A method for constructing and sequencing a sequencing library comprises: performing a first round of PCR on a target gene fragment with a first pair of primers to obtain a first-round PCR product; performing a second round of PCR on the first-round PCR product with a second pair of primers to obtain a sample, wherein the second pair of primers includes a plurality of tag sequences; and performing the first and second rounds of PCR separately on different target gene fragments.
  • The plurality of samples included in the sequencing library are mixed in equal amounts.
  • The PCR plate used for the second round of PCR has a plurality of wells, each well holds one sample, and each well is numbered with the number of the tag sequence combination used for the sample it holds.
  • A kit comprises: a plurality of reagent wells, wherein each reagent well is provided with a corresponding label, and the label of each reagent well indicates the tag sequence added to the reagent placed in that well.
  • The kit includes a label plate configured to carry a plurality of labels, and the labels on the label plate correspond one-to-one with the positions of the reagent wells.
  • A sequencing data result analyzing apparatus comprises: an obtaining unit configured to obtain a sequencing data result of a sequencing library, wherein the sequencing library includes a plurality of mixed samples, each sample corresponds to one tag sequence combination, and different samples correspond to different tag sequence combinations, wherein each tag sequence combination includes a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed plurality of samples, and the set includes a plurality of unordered sequencing fragments; a first determining unit configured to determine the tag sequence combination of each sequencing fragment; and a second determining unit configured to determine the sample corresponding to each sequencing fragment according to its tag sequence combination.
  • A storage medium includes a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute the sequencing data result analysis method of the present invention.
  • A processor is configured to run a program, wherein the program, when run, executes the sequencing data result analysis method of the present invention.
  • In the embodiments of the invention, the sequencing data result of the sequencing library is obtained, wherein the sequencing library includes a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination includes a plurality of tag sequences, and the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed plurality of samples; the tag sequence combination of each sequencing fragment is determined; and the sample corresponding to each sequencing fragment is determined according to its tag sequence combination. This solves the technical problem in the related art that sequencing results require technicians with a specialist background to manually identify the samples, resulting in low efficiency and high cost, and achieves the technical effect of directly determining the sample to which each piece of data belongs in raw sequencer output obtained from a mixture of samples.
  • FIG. 1 is a schematic diagram of the principle of detecting a SNP by the TaqMan probe method in the prior art.
  • FIG. 2 is a schematic diagram of the principle of detecting a SNP by the SNaPshot method in the prior art.
  • FIG. 3 is a schematic diagram of the principle of detecting a SNP by the HRM technique in the prior art.
  • FIG. 4 is a flow chart of an alternative method for analyzing sequencing data results in accordance with an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of an optional barcode board in accordance with an embodiment of the present invention.
  • FIG. 6 is a flow chart of an alternative sequencing library construction and sequencing method in accordance with an embodiment of the present invention.
  • Figure 7a is a schematic illustration of an alternative first round of amplification principle, in accordance with an embodiment of the present invention.
  • Figure 7b is a schematic illustration of an alternative first round of amplification product, in accordance with an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of an alternative sequencing data result analysis device in accordance with an embodiment of the present invention.
  • The present application provides an embodiment of a method for analyzing sequencing data results.
  • FIG. 4 is a flow chart of an optional method for analyzing sequencing data results according to an embodiment of the present invention. As shown in FIG. 4, the method includes the following steps:
  • Step S101: obtaining a sequencing data result of the sequencing library;
  • Step S102: determining the tag sequence combination of each sequencing fragment;
  • Step S103: determining the sample corresponding to each sequencing fragment according to the tag sequence combination of each sequencing fragment.
  • The sequencing library is a pre-built gene library that includes a plurality of mixed samples. Each sample is obtained by processing a target gene fragment, where a target gene fragment is a gene fragment to be tested (for example, for mutation identification). Because a mixture of multiple gene fragments is sequenced together, the individual target gene fragments cannot be distinguished in the raw sequencing result. Each target gene fragment is therefore processed so that at least tag sequences for labeling are added to it, yielding a sample that can be distinguished from the other samples.
  • Each sample corresponds to one tag sequence combination, and different samples correspond to different tag sequence combinations, where each tag sequence combination includes a plurality of tag sequences. The sequencing data result is the set of sequencing fragments obtained by sequencing the mixed samples, and the set includes a plurality of sequencing fragments.
  • The sequencing library may be constructed with a general library construction method, for example by subjecting each of a plurality of target gene fragments to two rounds of PCR amplification to obtain a plurality of samples (a target gene fragment is one fragment of a gene, and the plurality of target gene fragments may be the same gene segment from different sample subjects). The primers used for PCR amplification include a plurality of tag sequences, which serve as markers for the target gene fragments.
  • Each sample in the sequencing library includes a plurality of tag sequences, and the combination of tag sequences differs from sample to sample.
  • For example, the tag sequences include R1, R2, F1, F2, and so on, and each target gene fragment is labeled with two tag sequences: the first target gene fragment is labeled with R1 and F1, the second with R1 and F2, the third with R2 and F1, and the fourth with R2 and F2, so that each target gene fragment has a different tag sequence combination.
  • This way of labeling different target gene fragments with tag sequence combinations is given for illustration only and does not limit the technical solution of the present application.
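The labeling scheme in this example can be sketched in a few lines of Python. The sketch uses only the tag names R1, R2, F1, and F2 from the text; the fragment names and the `assign_tag_combinations` helper are hypothetical and are not part of the described method.

```python
from itertools import product

def assign_tag_combinations(fragments, r_tags, f_tags):
    """Give each target gene fragment its own (R tag, F tag) pair."""
    combos = list(product(r_tags, f_tags))
    if len(fragments) > len(combos):
        raise ValueError("not enough tag combinations for all fragments")
    return dict(zip(fragments, combos))

# Hypothetical fragment names, for illustration only.
fragments = ["fragment_1", "fragment_2", "fragment_3", "fragment_4"]
assignment = assign_tag_combinations(fragments, ["R1", "R2"], ["F1", "F2"])
for name, combo in assignment.items():
    print(name, combo)
```

With m R tags and n F tags, a scheme of this kind can distinguish up to m × n samples while using only m + n tag sequences.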
  • The sequencing library can be sequenced on a sequencing platform to obtain the sequencing data, that is, the sequencing data result. Because the samples in the sequencing library are mixed at the time of sequencing, the sequencing fragments in the result are unordered: each sequencing fragment corresponds to one sample, but which sample is unknown. Therefore, after the sequencing data result is obtained, the tag sequence combination of each sequencing fragment is determined, and each sequencing fragment is assigned to its sample according to that combination.
  • The data processing method of this embodiment is performed by software, specifically by a program or an application installed on a terminal device.
  • The embodiment may also be executed by a server. The server may receive the sequencing data result uploaded by a client terminal through a control (for example, an input box on a web page). After assigning the sequencing fragments to the samples according to the tag sequence combination of each sequencing fragment, the server feeds the correspondence between the sequencing fragments and the samples back to the client terminal.
  • When the method is performed by a server, the sequencing data result in step S101 may be obtained by the server receiving, over the network, the sequencing data sent by a requesting end (another terminal that requests processing of the sequencing data result). The server can then transmit the correspondence to the requesting end over the network.
  • The server may further obtain, over the network, the reference sequence of each target gene fragment (sample sequence) uploaded by the requesting end, compare the sample sequence in each sequencing fragment with the corresponding reference sequence, and feed the mutation identification result back to the requesting end.
  • The server may receive the data sent by the requesting end through a web page. The server may run Linux with Apache, the database may be a MySQL-type database system (for example, MariaDB), and the web pages may be built with a scripting language such as Perl, PHP, or Python.
  • For example, the server-side program can consist of Perl scripts combined with shell execution scripts.
  • The web analysis interface is built with PHP together with JavaScript.
  • Taking one of the sequencing fragments in the sequencing data result (the first sequencing fragment) as an example, determining its tag sequence combination comprises: extracting all tag sequences in the first sequencing fragment; comparing each extracted tag sequence with a plurality of reference tag sequences whose numbers are known, to determine the number of each tag sequence in the first sequencing fragment; and taking the combination of the numbers of all tag sequences in the first sequencing fragment as the number of its tag sequence combination.
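A minimal sketch of the comparison step, assuming the reference tags are held in a dictionary that maps each number to its sequence. The tag sequences shown are invented placeholders (the text does not list the real ones), and the optional mismatch tolerance is an assumption added for illustration; the text only says that the tags are compared.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def tag_number(tag, reference_tags, max_mismatches=0):
    """Return the number of the reference tag that matches `tag`, or None.

    reference_tags maps a number such as "F1" to its tag sequence.
    A tag that matches more than one reference is treated as unassigned.
    """
    hits = [
        number
        for number, ref in reference_tags.items()
        if len(ref) == len(tag) and hamming(ref, tag) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None

def combination_number(tags, reference_tags, max_mismatches=0):
    """Join the numbers of all tags in one fragment, e.g. "F1R2"."""
    numbers = [tag_number(t, reference_tags, max_mismatches) for t in tags]
    return None if None in numbers else "".join(numbers)

# Hypothetical reference tags; the real tag sequences are not given in the text.
REFERENCE_TAGS = {"F1": "ACGTAC", "F2": "TGCATG", "R1": "GATTCA", "R2": "CTAGGT"}

print(combination_number(["ACGTAC", "CTAGGT"], REFERENCE_TAGS))  # F1R2
```

Returning `None` for an unmatched or ambiguous tag lets the caller set such fragments aside instead of assigning them to the wrong sample.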
  • Each sequencing fragment includes at least a plurality of tag sequences and a sample sequence (the sequencing result of the target gene fragment).
  • The sequencing data result is in the form of a compressed data package; decompressing the package yields the sequencing fragments.
  • Each sequencing fragment may exist as a data packet that contains several pieces of sequence data, each of which may be the sample sequence, a tag sequence, or another sequence included in the sample.
  • The tag sequences extracted from the data packet are compared with the library of tag sequences used in constructing the sequencing library to determine the number of each extracted tag sequence.
  • The reference tag sequences whose numbers are known may be data uploaded by the client or data pre-stored locally on the server.
  • Each sequencing fragment includes a forward read sequence and a reverse read sequence, and extracting all tag sequences in the first sequencing fragment includes: extracting a tag sequence from each of the forward read sequence and the reverse read sequence of the first sequencing fragment, wherein the tag sequence combination of the first sequencing fragment includes the tag sequence extracted from the forward read sequence and the tag sequence extracted from the reverse read sequence.
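As a rough illustration of extracting one tag from each read of a pair, the sketch below assumes that each read begins with its tag and that all tags have the same fixed length. Neither assumption is stated in the text, and the function name and example reads are hypothetical.

```python
def split_read_pair(forward_read, reverse_read, tag_length):
    """Split a read pair into (forward tag, reverse tag, forward insert, reverse insert).

    Assumes each read starts with its tag and all tags share one length.
    """
    if len(forward_read) < tag_length or len(reverse_read) < tag_length:
        raise ValueError("read shorter than the tag length")
    return (
        forward_read[:tag_length],
        reverse_read[:tag_length],
        forward_read[tag_length:],
        reverse_read[tag_length:],
    )

# Invented example reads: a 6-base tag followed by a 10-base sample sequence.
f_tag, r_tag, f_insert, r_insert = split_read_pair(
    "ACGTACGGATCCATTG", "CTAGGTCAATGGATCC", tag_length=6
)
print(f_tag, r_tag)
```

The two tags together form the tag sequence combination of the fragment, and the remaining bases are the sample sequence used for the later comparison with the reference.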
  • Assigning the sequencing fragments to the samples according to the tag sequence combination of each sequencing fragment may include: determining the sample corresponding to the first sequencing fragment according to the numbers of the tag sequences in the first sequencing fragment, and processing every other sequencing fragment in the same way to determine its corresponding sample.
  • The method may further include: obtaining a reference sequence for each sample; extracting the sample sequence in each sequencing fragment; and aligning each sample sequence with the reference sequence of the corresponding sample to determine the mutation information of each sample.
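To make the comparison with the reference sequence concrete, here is a deliberately simplified sketch that reports substitutions only. It assumes the sample sequence and the reference are already aligned and of equal length; detecting insertions and deletions requires a real alignment tool, and the function name and sequences are illustrative.

```python
def find_substitutions(reference, sample):
    """List (1-based position, reference base, sample base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length, so
    only substitutions are reported; indels need a real aligner.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]

print(find_substitutions("GGATCCATTG", "GGATTCATTG"))  # [(5, 'C', 'T')]
```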
  • Acquiring the reference sequence of each sample may include: receiving the reference sequence of each sample uploaded by the client terminal through a control. After the mutation information of each sample is determined, the method further comprises: feeding the mutation information of each sample back to the client terminal.
  • Obtaining the sequencing data result of the sequencing library may include: receiving the sequencing data result uploaded by the client terminal through a control. After the sample corresponding to each sequencing fragment is determined according to its tag sequence combination, the method further includes: feeding the correspondence between the sequencing fragments and the samples back to the client terminal.
  • In this embodiment, the sequencing data result of the sequencing library is obtained, wherein the sequencing library includes a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination includes a plurality of tag sequences, and the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed plurality of samples; the tag sequence combination of each sequencing fragment is determined; and the sample corresponding to each sequencing fragment is determined according to its tag sequence combination. This solves the problem in the related art that sequencing results require technicians with a specialist background to manually identify the samples, resulting in low efficiency and high cost, and achieves the technical effect of directly determining the sample to which each piece of data belongs in raw sequencer output obtained from a mixture of samples.
  • The data processing method of this embodiment improves the efficiency of mutation identification and yields high-throughput analysis results without requiring a bioinformatics background. Furthermore, once the data corresponding to each sample have been identified, the sequencing result of each sample is compared with the reference gene to obtain the mutation identification result. This is a new high-throughput mutation identification method that simplifies the experimental steps.
  • The requesting end uploads the sequencing result data (as a compressed package) to the server (via PHP);
  • The server calls its local decompression software (such as gunzip) to decompress the uploaded data;
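The decompression step could be done without an external tool, as in the sketch below. It assumes the uploaded package is a single gzip-compressed file, which is consistent with the mention of gunzip; the function name and file paths are hypothetical.

```python
import gzip
import shutil

def decompress_gzip(src_path, dest_path):
    """Decompress a .gz file to dest_path, streaming to keep memory use low."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path
```

For example, `decompress_gzip("upload/reads_1.fq.gz", "work/reads_1.fq")` would write the decompressed reads next to the other working files.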
  • the server (using a perl script) extracts a barcode (tag sequence) combination of each pair of pairend sequences (double-ended sequencing sequences);
  • the server (using the perl script) combines the barcode combination to determine the number of each sample in the sequencing library:
  • the sequencing library can be laid out on a test well plate (also called a barcode plate or tag sequence plate) as shown in FIG. 5, where each well corresponds to one tag sequence combination and every tag sequence combination is different;
  • each tag sequence combination corresponds to one well, each well holds one sample, and at sequencing time the test well plate can be placed directly into the sequencing instrument;
  • if the server has the database of all tag sequences and therefore knows the 20 tags F1 to F12 and R1 to R8, the tag sequences in each sequencing fragment of the sequencing data result can be compared with the known tag sequences;
  • the tag sequence combination of each sequencing fragment is thereby determined (for example, F1R1, F2R1, and so on); once it is known, the number of each sample can be determined (the sample number can itself be expressed as a tag sequence combination), as can the correspondence between each sequencing fragment and each well on the test well plate;
  • the database of all tag sequences can be uploaded by the requesting end, or it can be a general tag sequence database stored in advance in the server's database;
  • the server (using a local short-read alignment tool such as BWA) aligns each sample sequence against the reference genome sequence uploaded by the requesting end:
  • before this step the server must obtain the reference genome sequence uploaded by the requesting end; the upload can happen at any earlier point, so long as this step runs only after the server has received the reference genome sequence;
  • the server (using a Perl script) analyzes, collates, and tallies the mutation information of each sample;
  • the requesting end downloads the results of the server's analysis.
  • the present application also provides an embodiment of a storage medium, the storage medium of the embodiment comprising a stored program, wherein the device in which the storage medium is located controls the method of analyzing the result of the sequencing data according to the embodiment of the present invention when the program is running.
  • the present application also provides an embodiment of a processor for running a program, wherein the program runs the sequencing data result analysis method of the embodiment of the present invention.
  • the present application also provides an embodiment of a method of constructing and sequencing a sequencing library.
  • FIG. 6 is a flow chart of an alternative sequencing library construction and sequencing method, as shown in FIG. 6, the method includes the following steps:
  • Step S201 performing a first round of PCR reaction on the target gene fragment by using the first pair of primers to obtain a first round of PCR product:
  • the bridging sequence and the target-fragment-specific amplification primer sequence can both be integrated into the upstream and downstream primers of the first primer pair, and the first primer pair is used to amplify and enrich the target region fragment (target gene fragment) of each sample material.
  • before this step, genomic DNA is extracted from the sample to be tested; DNA extracted by any method may be used, and there is no special requirement on sample amount or concentration.
  • the first pair of primers comprises a first upstream primer sequence and a first downstream primer sequence;
  • the first upstream primer sequence and the first downstream primer sequence are mixed with a solution of the DNA of a single sample, and a PCR reaction is performed to obtain the first PCR product.
  • the first upstream primer sequence or the first downstream primer sequence may include a bridging sequence and a first upstream target fragment amplification specific primer sequence or a first downstream target fragment amplification specific primer sequence from the 5′ end to the 3′ end.
  • the bridging sequence is a sequence complementary to the primer sequence of the second round of PCR, and the first upstream target fragment amplification specific primer sequence and the first downstream target fragment amplification specific primer sequence are complementary to the 3' end of the two single strands of the DNA sequence, respectively.
  • the length of the target-fragment-specific amplification primer sequence can be 15-25 bp;
  • the length of the bridging sequence can be 15-30 bp.
  • a tag sequence may also be added to the first upstream and first downstream primer sequences; its length is arbitrary, for example 1 to 50 bp, and random combination of such tags allows a large number of samples to be distinguished at one time.
  • Step S202 performing a second round of PCR reaction on the first round of PCR products by using the second pair of primers to obtain a sample:
  • the second pair of primers comprises a plurality of tag sequences, and the different target gene segments correspond to different tag sequence combinations, and the tag sequence combination is a combination of a plurality of tag sequences included in the second pair of primers.
  • the second round of PCR products is shown in Figure 7b, with a sequence of tags on each end of each sample.
  • the adapter sequence, the sequencing primer, the tag sequence, and the bridging sequence can be integrated into a universal second pair of primers; the second-round primers are universal primer combinations pre-arranged in a fixed 96-well plate (as shown in FIG. 5) or 384-well plate and supplied as a premixed kit.
  • the second pair of primers includes a second upstream primer sequence and a second downstream primer sequence; the PCR product obtained in step S201 is mixed with solutions of different combinations of the second upstream and second downstream primer sequences, or a ready-made premixed kit (for example, the 96-well or 384-well second-round primer kit described above) is used directly, and a PCR reaction is performed to obtain the second PCR product.
  • from the 5' end to the 3' end, the second upstream primer sequence may comprise, in order, an adapter sequence, a sequencing primer sequence, a tag sequence, and the bridging sequence of the first upstream primer sequence; the second downstream primer sequence may likewise comprise, from 5' to 3', an adapter sequence, a sequencing primer sequence, a tag sequence, and the bridging sequence of the first downstream primer sequence;
  • the tag sequences of each pair of second upstream and second downstream primer sequences give each DNA sequence, after the PCR reaction, tag sequences that differ from those of every other DNA sequence.
  • the length of the tag sequence can be 1-20 bp.
  • 4 to 10 bases can be introduced between the sequencing primer sequence and the tag sequence to improve the accuracy of the sequence of the tag sequence obtained by sequencing.
  • Step S203 performing the first round of PCR reaction and the second round of PCR reaction on different target gene fragments respectively, to obtain a plurality of samples:
  • the different target gene segments correspond to different combinations of tag sequences
  • the tag sequence combination is a combination of multiple tag sequences included in the second pair of primers.
  • the multiple samples in the sequencing library are mixed in equal amounts: two rounds of PCR are performed on the different target samples, and the second-round PCR products, mixed in equal amounts, are purified to obtain the sequencing library, which includes a plurality of mixed samples.
  • the two-round amplification saves primer synthesis cost, and library construction is completed in just two amplification steps, which improves both library quality and library construction efficiency;
  • because the resulting library carries the adapter sequences and sequencing primer sequences used on conventional sequencing platforms, it can be sequenced at high throughput with standard sequencing reagents, without supplying extra sequencing primers or changing the sequencing primers of the MiX library into which it is mixed.
  • Step S204 performing sequencing on the sequencing library, and obtaining the sequencing data result:
  • the sequencing library is a mixture of multiple samples; it can be sequenced on a high-throughput (second-generation) sequencing platform to obtain the raw sequencing output, that is, the sequencing data result, which consists of multiple unordered sequencing fragments.
  • the step of performing quality inspection on the sequencing library is further included.
  • Step S205 performing the sequencing data result analysis method of the present invention on the sequencing data result, and obtaining the analysis result.
  • two rounds of PCR are combined with second-generation sequencing, and the raw sequencing output is passed directly to the data processing method, so that the results of mixed multi-sample sequencing are identified automatically.
  • the number of mixed samples can be adjusted by adjusting the number of combinations of the second round of primer pairs.
  • the mutation of the target gene fragment of each sample can be automatically identified and analyzed.
  • the tag sequences in the primer sequences used in step S202 are used preferentially to distinguish the samples; if that fails, the identification sequences provided by the sequencing company can be used to distinguish DNA sequences from different samples.
  • the PCR plate used in performing the second round of PCR reaction has a plurality of holes, and each hole is correspondingly placed with one sample, and each hole is numbered by the number of the tag sequence combination used for the placed sample.
  • the sequencing method provided by the embodiment identifies mutation information using a library construction method for high-throughput amplicon mutation identification together with matching analysis software; compared with other mutation identification methods, it improves both library construction and analysis, comes with software that automatically decodes reads and identifies mutations, costs less, needs less time for library construction, is simpler to operate, and places no limit on the number of mixed samples that can be distinguished in a single run.
  • the method can automatically split the results of sequencing of mixed samples and automatically identify the type of mutation of a single material.
  • the method of operation is relatively simple, and the identification of a large number of samples can be completed without any bioinformatics background.
  • the present application also provides an embodiment of a kit comprising a plurality of reagent wells, wherein each reagent well has a corresponding label, and the label of each reagent well indicates the tag sequence added to the reagent placed in that well.
  • the labels may be arranged on a label board.
  • the kit may include a label board on which a plurality of labels are provided, for example by sticking or printing, the labels corresponding one-to-one to the positions of the reagent wells.
  • the kit provided in this embodiment may include a barcode plate as shown in FIG. 5; each label on the barcode plate corresponds to one reagent well, and the label of each reagent well indicates the number of that well.
  • the reagent in each reagent well can receive two tag sequences, and the number on each label comprises the numbers of the two tag sequences added. With the 20 differently numbered tag sequences shown in FIG. 5, up to 96 reagents can be labeled.
  • the present application also provides an embodiment of a sequencing data result analysis device.
  • FIG. 8 is a schematic diagram of an optional sequencing data result analyzing apparatus according to an embodiment of the present invention.
  • the apparatus includes an obtaining unit 10, a first determining unit 20, and a second determining unit 30; the obtaining unit is configured to obtain the sequencing data result of the sequencing library, wherein the sequencing library includes a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, and each tag sequence combination includes multiple tag sequences;
  • the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed samples, and the set includes a plurality of sequencing fragments;
  • the first determining unit is configured to determine the tag sequence combination of each sequencing fragment;
  • the second determining unit is configured to determine the sample corresponding to each sequencing fragment according to that fragment's tag sequence combination.
  • with the obtaining unit, the first determining unit, and the second determining unit configured as above, the embodiment solves the problem in the related art that raw sequencing output requires technically trained staff to identify samples manually, resulting in low efficiency and high cost, and achieves the technical effect of directly determining the sample corresponding to each record in the raw output of a mixed multi-sample sequencing run.
  • the multiple sequencing fragments include a first sequencing fragment;
  • the first determining unit includes: an extraction module configured to extract all tag sequences in the first sequencing fragment; and a comparison module configured to compare each extracted tag sequence with a plurality of reference tag sequences of known number, to determine the number of each tag sequence in the first sequencing fragment.
  • the obtaining unit 10, the first determining unit 20, and the second determining unit 30 may run in a computer terminal as part of the apparatus, and the functions of these units may be performed by a processor in the computer terminal.
  • the computer terminal can also be a smartphone (such as an Android phone or an iOS phone), a tablet computer, a handheld computer, a mobile Internet device (MID), a PAD, or similar terminal equipment.
  • the above apparatus may include a processor and a memory; each of the above units may be stored in the memory as a program unit, and the processor executes the program units stored in the memory to implement the corresponding functions.
  • the memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • the division into units may be a division by logical function;
  • in an actual implementation there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit, if implemented as a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium.
  • the software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium that can store program code.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • Biomedical Technology (AREA)
  • Immunology (AREA)
  • Plant Pathology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

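A sequencing data result analysis method and apparatus, and a sequencing library construction and sequencing method. The sequencing data result analysis method comprises: obtaining a sequencing data result of a sequencing library (S101), wherein the sequencing library comprises a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination comprises a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed samples, and the set comprises a plurality of unordered sequencing fragments; determining the tag sequence combination of each sequencing fragment (S102); and determining the sample corresponding to each sequencing fragment according to that fragment's tag sequence combination (S103). The method solves the technical problem in the related art that raw sequencing output requires technically trained personnel to identify samples manually, resulting in low efficiency and high cost.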

Description

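Sequencing data result analysis method and apparatus, and sequencing library construction and sequencing method
Technical Field
The present invention relates to the field of gene sequencing, and in particular to a sequencing data result analysis method and apparatus and a sequencing library construction and sequencing method.
Background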
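As genomics research deepens, demand for identifying mutations in specific sequence regions keeps growing. Sequence mutations fall into two types: single-base substitutions (single nucleotide polymorphisms, SNPs) and insertion/deletion (indel) mutations, and the two types differ in how they are detected. Existing methods for identifying SNP mutations mainly include the TaqMan probe method, the SNaPshot method, the Mass Array method, and the Illumina BeadXpress method; Sanger direct sequencing, high-resolution melting (HRM) analysis, and restriction enzyme digestion can identify both SNP mutations and indel mutations. Several methods capable of identifying SNP mutations are described in detail below.
The TaqMan probe method is shown in FIG. 1. PCR primers and TaqMan probes are designed for each SNP site on the chromosome, and real-time fluorescent PCR amplification is performed. A reporter fluorophore and a quencher fluorophore are attached to the 5' and 3' ends of the probe, respectively. When PCR product is present in solution, the probe anneals to the template and forms a substrate suitable for exonuclease activity; the fluorescent molecule attached to the 5' end of the probe is cleaved off, which disrupts the FRET between the two fluorophores, and fluorescence is emitted. The method is typically used to analyze a small number of SNP sites.
The SNaPshot method is shown in FIG. 2. It is a genotyping technique based on fluorescently labeled single-base extension, also called minisequencing, and is aimed mainly at medium-throughput SNP genotyping projects. In a reaction containing a sequencing enzyme, four fluorescently labeled ddNTPs, extension primers of different lengths lying immediately 5' of the polymorphic site, and PCR product template, each primer is extended by one base and then terminates. After detection on an ABI sequencer, the SNP site corresponding to each extension product is determined from the position of its peak, and the incorporated base is identified from the color of the peak, which gives the genotype of the sample. The PCR product template can be obtained with a multiplex PCR reaction. The method is typically used to analyze 10 to 30 SNP sites.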
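The HRM method is shown in FIG. 3. It is an SNP research tool that has emerged in recent years. It monitors in real time, during heating, the binding of a double-stranded-DNA fluorescent dye to the PCR amplification product in order to determine whether an SNP is present; different SNP sites, and whether the sample is heterozygous, all affect the shape of the melting curve, so HRM analysis can effectively distinguish different SNP sites and different genotypes. The method is not limited by the position or type of the mutated base and needs no sequence-specific probe: high-resolution melting is run directly after PCR to complete genotype analysis of the sample. It requires no probe design, is simple, fast, low-cost, and accurate, and is a true closed-tube operation. HRM is a molecular diagnostic technique for detecting gene mutations and genotyping that combines saturating fluorescent dyes, unlabeled probes, and real-time fluorescent quantitative PCR. The temperature at which half of the DNA duplex has melted is called the melting temperature (Tm), and DNA of different sequence has a different Tm. The higher the GC content of the DNA, the higher the Tm; Tm rises in proportion to GC content. Non-specific cyanine dyes such as SYBR Green intercalate directly into double-stranded DNA fragments and fluoresce on excitation. Within a given temperature range, the change in fluorescence intensity therefore reveals the denaturation and renaturation of the DNA, and the curve of fluorescence signal against temperature is the melting curve. Every DNA molecule has its own melting curve shape and position when heat-denatured, mainly because nucleic acid molecules differ in fragment length, GC content, and GC distribution. In an ordinary melting curve, the temperature is raised slowly at 0.5 °C per cycle, the PCR product is denatured, and the fluorescence signal is recorded in real time; different products give different characteristic peaks, and ordinary real-time PCR judges the specificity of the amplification product from the specificity of its characteristic peak.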
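To make the stated link between GC content and Tm concrete, here is a minimal Python sketch. It is not part of the patent: the Wallace rule used below (about 2 °C per A/T base and 4 °C per G/C base) is a rule of thumb brought in for illustration and is only a rough estimate for short oligonucleotides; the function names and example sequences are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rule-of-thumb Tm in degrees C: 2 per A/T base plus 4 per G/C base.

    Assumption for illustration only; real melting behaviour also depends on
    fragment length, salt concentration and GC distribution.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Two hypothetical 10-mers of equal length but different GC content.
low_gc = "ATATATATAT"
high_gc = "GCGCGCATAT"
for oligo in (low_gc, high_gc):
    print(oligo, gc_fraction(oligo), wallace_tm(oligo))
```

For two oligos of the same length, the one with more G/C bases gets the higher estimate, which is the qualitative behaviour the paragraph describes.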
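The Mass Array method (Mass Array molecular weight array technology) is a gene analysis tool that combines primer extension or cleavage reactions with sensitive, reliable MALDI-TOF-MS to perform genotyping. The iPLEX GOLD technology on the Mass Array platform supports PCR reactions and genotype detection of up to 40-plex, with flexible experimental design and highly accurate genotyping results. When tens to hundreds of SNP sites are to be tested across hundreds to thousands of samples, Mass Array offers the best cost-effectiveness; it is particularly suited to validating findings from whole-genome studies, or to cases where a limited number of study sites has already been fixed.
The Illumina BeadXpress method uses the BeadXpress system for batch SNP detection. It can detect 1 to 384 SNP sites simultaneously, is often used to confirm genome chip results, and is suited to high-throughput detection. Bead chips feature high density, high reproducibility, high sensitivity, low sample input, and flexible customization; their very high integration density gives very high screening speed and can significantly reduce cost in high-throughput screening.
Mutation identification methods built on the above approaches have low throughput, and some can analyze only one sample at a time, so cost is high; detection efficiency for low-frequency mutation types is low; the procedures are cumbersome; and after the raw sequencing data are obtained, a bioinformatics background is still needed to analyze them.
No effective solution has yet been proposed for the technical problem in the related art that raw sequencing output requires technically trained personnel to identify samples manually, resulting in low efficiency and high cost.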
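Summary of the Invention
Embodiments of the present invention provide a sequencing data result analysis method and apparatus and a sequencing library construction and sequencing method, to at least solve the technical problem in the related art that raw sequencing output requires technically trained personnel to identify samples manually, resulting in low efficiency and high cost.
According to one aspect of the embodiments of the present invention, a sequencing data result analysis method is provided, comprising: obtaining a sequencing data result of a sequencing library, wherein the sequencing library comprises a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination comprises a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed samples, and the set comprises a plurality of unordered sequencing fragments; determining the tag sequence combination of each sequencing fragment; and determining the sample corresponding to each sequencing fragment according to that fragment's tag sequence combination.
Further, the plurality of sequencing fragments includes a first sequencing fragment, and determining the tag sequence combination of the first sequencing fragment comprises: extracting all tag sequences from the first sequencing fragment; comparing each extracted tag sequence with a plurality of reference tag sequences of known number, to determine the number of each tag sequence in the first sequencing fragment; and taking the combination of the numbers of all tag sequences in the first sequencing fragment as the number of its tag sequence combination.
Further, before each tag sequence extracted from the first sequencing fragment is compared with the plurality of reference tag sequences of known number, the method further comprises: obtaining the pre-stored reference tag sequences of known number.
Further, where the sequencing data result was obtained by paired-end sequencing, each sequencing fragment comprises a forward read and a reverse read, and extracting all tag sequences from the first sequencing fragment comprises: extracting tag sequences from the forward read and from the reverse read of the first sequencing fragment, wherein the tag sequence combination of the first sequencing fragment comprises the tag sequence extracted from the forward read and the tag sequence extracted from the reverse read.
Further, after the sample corresponding to each sequencing fragment is determined according to its tag sequence combination, the method further comprises: obtaining a reference sequence for each sample; extracting the sample sequence from each sequencing fragment; and comparing each extracted sample sequence with the reference sequence of the corresponding sample, to determine the mutation information of each sample.
Further, obtaining the reference sequence of each sample comprises: receiving the reference sequence of each sample uploaded by a client terminal through a control; and after the mutation information of each sample is determined, the method further comprises: returning the mutation information of each sample to the client terminal.
Further, obtaining the sequencing data result of the sequencing library comprises: receiving the sequencing data result uploaded by a client terminal through a control; and after the sample corresponding to each sequencing fragment is determined according to its tag sequence combination, the method further comprises: returning the correspondence between the sequencing fragments and the samples to the client terminal.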
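According to one aspect of the embodiments of the present invention, a sequencing library construction and sequencing method is provided, comprising: performing a first round of PCR on a target gene fragment with a first pair of primers to obtain a first-round PCR product; performing a second round of PCR on the first-round PCR product with a second pair of primers to obtain a sample, wherein the second pair of primers comprises a plurality of tag sequences; performing the first and second rounds of PCR on different target gene fragments to obtain a plurality of samples, wherein different target gene fragments correspond to different tag sequence combinations, a tag sequence combination being the combination of the tag sequences included in the second pair of primers; sequencing the sequencing library to obtain a sequencing data result, wherein the sequencing library is the mixture of the samples and the sequencing data result is a plurality of unordered sequencing fragments; and applying the sequencing data result analysis method of the present invention to the sequencing data result to obtain an analysis result.
Further, the samples included in the sequencing library are mixed in equal amounts.
Further, the PCR plate used for the second round of PCR has a plurality of wells, each well holds one sample, and the number of each well is the number of the tag sequence combination used for the sample placed in it.
According to one aspect of the embodiments of the present invention, a kit is provided, comprising a plurality of reagent wells, wherein each reagent well has a corresponding label, and the label of each reagent well is configured to indicate the tag sequence added to the reagent placed in that well.
Further, the kit comprises a label board configured to carry a plurality of labels, the labels on the label board corresponding one-to-one to the positions of the reagent wells.
According to one aspect of the embodiments of the present invention, a sequencing data result analysis apparatus is provided, comprising: an obtaining unit configured to obtain a sequencing data result of a sequencing library, wherein the sequencing library comprises a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination comprises a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed samples, and the set comprises a plurality of unordered sequencing fragments; a first determining unit configured to determine the tag sequence combination of each sequencing fragment; and a second determining unit configured to determine the sample corresponding to each sequencing fragment according to that fragment's tag sequence combination.
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium comprises a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the sequencing data result analysis method of the present invention.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, wherein the program, when running, performs the sequencing data result analysis method of the present invention.
In the embodiments of the present invention, the sequencing data result of a sequencing library is obtained, wherein the sequencing library comprises a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination comprises a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed samples, and the set comprises a plurality of unordered sequencing fragments; the tag sequence combination of each sequencing fragment is determined; and the sample corresponding to each sequencing fragment is determined according to that fragment's tag sequence combination. This solves the technical problem in the related art that raw sequencing output requires technically trained personnel to identify samples manually, resulting in low efficiency and high cost, and achieves the technical effect of directly determining the sample corresponding to each record in the raw output of a mixed multi-sample sequencing run.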
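Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a schematic diagram of the principle of SNP detection by a prior-art TaqMan probe method;
FIG. 2 is a schematic diagram of the principle of SNP detection by a prior-art SNaPshot method;
FIG. 3 is a schematic diagram of the principle of SNP detection by a prior-art HRM technique;
FIG. 4 is a flowchart of an optional sequencing data result analysis method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an optional barcode plate according to an embodiment of the present invention;
FIG. 6 is a flowchart of an optional sequencing library construction and sequencing method according to an embodiment of the present invention;
FIG. 7a is a schematic diagram of an optional first-round amplification principle according to an embodiment of the present invention;
FIG. 7b is a schematic diagram of an optional first-round amplification product according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an optional sequencing data result analysis apparatus according to an embodiment of the present invention.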
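Detailed Description
To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort shall fall within the scope of protection of the invention.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings are used to distinguish similar objects and not necessarily to describe a particular order or sequence. Data so used are interchangeable where appropriate, so that the embodiments described here can be practiced in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
This application provides an embodiment of a sequencing data result analysis method.
FIG. 4 is a flowchart of an optional sequencing data result analysis method according to an embodiment of the present invention. As shown in FIG. 4, the method includes the following steps:
Step S101: obtain the sequencing data result of the sequencing library;
Step S102: determine the tag sequence combination of each sequencing fragment;
Step S103: determine the sample corresponding to each sequencing fragment according to that fragment's tag sequence combination.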
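The sequencing library is a gene library constructed in advance and comprises a plurality of mixed samples. Each sample may be obtained by processing a target gene fragment, where a target gene fragment is a gene fragment on which an experiment (for example, a mutation identification experiment) is to be performed. Because sequencing is performed on a mixture of many gene fragments, the individual target gene fragments cannot be told apart in the sequencing result. Each target gene fragment is therefore processed so that at least tag sequences serving as markers are added to it, which yields a sample that can be distinguished from every other sample. Accordingly, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination comprises a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed samples, and the set comprises a plurality of unordered sequencing fragments.
Specifically, the sequencing library may be constructed with a general library construction method. For example, a plurality of samples are obtained by performing two rounds of PCR amplification on each of a plurality of target gene fragments (a target gene fragment is one fragment of a gene; the plurality of target gene fragments may be the same gene region taken from different sample subjects). The primers used for PCR amplification include a plurality of tag sequences, which serve as markers for the target gene fragment. Each sample in the sequencing library carries a plurality of tag sequences, and the combination of tag sequences is different for every sample.
For example, suppose the tag sequences include R1, R2, F1, F2, and so on, and each target gene fragment is marked with two tag sequences: the first target gene fragment with R1 and F1, the second with R1 and F2, the third with R2 and F1, and the fourth with R2 and F2. Each target gene fragment then has a different tag sequence combination. This example of marking different target gene fragments with tag sequence combinations is illustrative only and does not limit the technical solution of this application.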
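The R1/R2 and F1/F2 example above can be sketched in Python as a table of unique tag combinations. This is a minimal illustration, not the patent's software; the function name and the `target_N` labels are hypothetical.

```python
from itertools import product


def assign_combinations(reverse_tags, forward_tags, targets):
    """Give every target gene fragment its own (reverse, forward) tag pair."""
    combos = list(product(reverse_tags, forward_tags))
    if len(targets) > len(combos):
        raise ValueError("not enough tag combinations for the number of targets")
    # zip stops at the shorter sequence, so unused combinations stay free.
    return {combo: target for combo, target in zip(combos, targets)}


assignment = assign_combinations(
    ["R1", "R2"],
    ["F1", "F2"],
    ["target_1", "target_2", "target_3", "target_4"],
)
for combo, target in assignment.items():
    print(combo, "->", target)
```

With two reverse and two forward tags, four fragments can be told apart; the number of distinguishable samples is the product of the two tag counts.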
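After the sequencing library is obtained, it can be sequenced on a sequencing platform to produce the raw sequencing output, that is, the sequencing data result. Because the samples in the library are mixed at sequencing time, the sequencing fragments in the sequencing data result are unordered: each sequencing fragment corresponds to one sample, but which sample it corresponds to is unknown. Therefore, after the sequencing data result is obtained, the tag sequence combination of each sequencing fragment is determined, and the sequencing fragments are matched to the samples according to those tag sequence combinations.
It should be noted that the data processing method of this embodiment is executed by software, specifically by a program or application installed on a terminal device. Optionally, the embodiment can be executed by a server: when obtaining the sequencing data result, the server receives the sequencing data result uploaded by a client terminal through a control (for example, an input box on a web page), and after matching the sequencing fragments to the samples according to their tag sequence combinations, the server returns the correspondence between sequencing fragments and samples to the client terminal.
For example, where the data processing method of this embodiment is executed by a server, step S101 may consist of the server receiving the sequencing data result sent over the network by a requesting end (another terminal that requests processing of the sequencing data result). After the server obtains, in step S103, the correspondence between each sample in the sequencing library and each sequencing fragment in the sequencing data result, it can send that correspondence to the requesting end over the network. Further, the server can also obtain over the network the reference sequence of each target gene fragment (sample sequence) uploaded by the requesting end, compare the sample sequence in each sequencing fragment with the corresponding reference sequence, and return the mutation identification result to the requesting end.
The server may receive the data sent by the requesting end through a web page. The server may run Linux with Apache, the database may be a MySQL-type database system (for example, MariaDB), and the web pages may be built with scripting languages such as Perl, PHP, or Python. For example, the server-side program may consist of Perl scripts combined with shell scripts, and the analysis interface of the website may be built with PHP combined with JavaScript.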
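In an optional implementation, taking one of the sequencing fragments in the sequencing data result (the first sequencing fragment) as an example, determining the tag sequence combination of the first sequencing fragment comprises: extracting all tag sequences from the first sequencing fragment; comparing each extracted tag sequence with a plurality of reference tag sequences of known number, to determine the number of each tag sequence in the first sequencing fragment; and taking the combination of the numbers of all tag sequences in the first sequencing fragment as the number of its tag sequence combination. Each sequencing fragment contains at least a plurality of tag sequences and a sample sequence (the sequencing result of the target gene fragment).
Optionally, the sequencing data result exists as a compressed archive; when it is obtained, the archive is decompressed to yield the sequencing fragments. Optionally, each sequencing fragment may exist as a data packet containing several segments of sequencing data, each segment being the sample sequence, one tag sequence, or one of the other sequences contained in the sample. The tag sequences are extracted from the data packet of the first sequencing fragment and compared with the reference tag sequences of known number. Those reference tag sequences are determined by how the sequencing library was constructed: the tag sequences extracted from the data packet are compared against the library of tag sequences used when the sequencing library was built, to determine the number of each extracted tag sequence.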
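The comparison against numbered reference tag sequences described above could look like the following Python sketch. It assumes, purely for illustration, that the tag sits at a fixed offset from the start of a read and must match exactly; the tag sequences in `REFERENCE_TAGS` are invented placeholders, not sequences from the patent.

```python
# Hypothetical reference tag library: number -> tag sequence.
REFERENCE_TAGS = {
    "F1": "ACGTAC",
    "F2": "TGCATG",
    "R1": "GATCGA",
    "R2": "CTAGCT",
}


def identify_tag(read, reference_tags, offset=0):
    """Return the number of the reference tag found at `offset`, or None.

    `offset` models any bases that precede the tag in the read; exact
    matching is an assumption, and a real tool might tolerate mismatches.
    """
    for number, tag in reference_tags.items():
        if read.startswith(tag, offset):
            return number
    return None


print(identify_tag("ACGTACTTTTGGGG", REFERENCE_TAGS))
print(identify_tag("NNGATCGAAACCCC", REFERENCE_TAGS, offset=2))
print(identify_tag("TTTTTTTTTTTTTT", REFERENCE_TAGS))
```

Returning `None` for an unrecognized tag leaves it to the caller to decide whether such reads are discarded or reported.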
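Before each tag sequence extracted from the first sequencing fragment is compared with the reference tag sequences of known number, the pre-stored reference tag sequences of known number must be obtained. Optionally, they may be data uploaded by the client, or data stored in advance locally on the server.
Optionally, where the sequencing data result was obtained by paired-end sequencing, each sequencing fragment comprises a forward read and a reverse read, and extracting all tag sequences from the first sequencing fragment comprises: extracting tag sequences from the forward read and from the reverse read of the first sequencing fragment, wherein the tag sequence combination of the first sequencing fragment comprises the tag sequence extracted from the forward read and the tag sequence extracted from the reverse read.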
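As a rough sketch of the paired-end case, the Python below reads one tag from the start of each read in a pair and groups the pairs by the resulting combination. The fixed tag length, the exact-match lookup, the `unassigned` bucket, and all tag sequences are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical tag libraries: tag sequence -> number.
FORWARD_TAGS = {"ACGTAC": "F1", "TGCATG": "F2"}
REVERSE_TAGS = {"GATCGA": "R1", "CTAGCT": "R2"}
TAG_LEN = 6  # assumed fixed tag length


def demultiplex(read_pairs):
    """Group (forward_read, reverse_read) pairs by tag combination."""
    by_sample = defaultdict(list)
    for forward_read, reverse_read in read_pairs:
        f_number = FORWARD_TAGS.get(forward_read[:TAG_LEN])
        r_number = REVERSE_TAGS.get(reverse_read[:TAG_LEN])
        if f_number and r_number:
            # Store the reads with their tags trimmed off.
            trimmed = (forward_read[TAG_LEN:], reverse_read[TAG_LEN:])
            by_sample[f_number + r_number].append(trimmed)
        else:
            by_sample["unassigned"].append((forward_read, reverse_read))
    return dict(by_sample)


pairs = [
    ("ACGTACAAAA", "GATCGATTTT"),
    ("TGCATGCCCC", "GATCGAGGGG"),
    ("ACGTACAAAT", "GATCGATTTA"),
    ("NNNNNNAAAA", "GATCGATTTT"),
]
groups = demultiplex(pairs)
for name, reads in groups.items():
    print(name, reads)
```

Keying the groups by a string such as `F1R1` mirrors the way the description numbers a sample by its tag sequence combination.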
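Matching the sequencing fragments to the samples according to their tag sequence combinations may comprise: determining the sample corresponding to the first sequencing fragment from the number of each tag sequence in it, and processing every other sequencing fragment in the same way to determine its sample.
After the sample corresponding to each sequencing fragment is determined according to its tag sequence combination, the method may further comprise: obtaining a reference sequence for each sample; extracting the sample sequence from each sequencing fragment; and comparing each extracted sample sequence with the reference sequence of the corresponding sample, to determine the mutation information of each sample.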
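The comparison of a sample sequence with its reference can be illustrated with a toy Python example. It uses the standard-library `difflib.SequenceMatcher` as a stand-in for a real aligner, which is a substitution made here for brevity (the application scenario in the description names BWA); the sequences and the function name are hypothetical.

```python
from difflib import SequenceMatcher


def call_mutations(reference, sample):
    """List differences between a reference and a sample sequence.

    Each entry is (kind, reference_position, reference_bases, sample_bases),
    where kind is 'replace', 'delete' or 'insert' and positions are 0-based.
    """
    matcher = SequenceMatcher(None, reference, sample, autojunk=False)
    mutations = []
    for kind, r_start, r_end, s_start, s_end in matcher.get_opcodes():
        if kind != "equal":
            mutations.append(
                (kind, r_start, reference[r_start:r_end], sample[s_start:s_end])
            )
    return mutations


reference = "GATTACAGCT"
print(call_mutations(reference, "GATTGCAGCT"))  # single-base substitution
print(call_mutations(reference, "GATTAGCT"))    # two-base deletion
print(call_mutations(reference, reference))     # identical sequences
```

A generic diff has no notion of base quality or repeats, so on repetitive sequence it may place an indel differently from a dedicated aligner.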
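Further, obtaining the reference sequence of each sample may comprise: receiving the reference sequence of each sample uploaded by the client terminal through a control; and after the mutation information of each sample is determined, the method further comprises: returning the mutation information of each sample to the client terminal.
Similarly, obtaining the sequencing data result of the sequencing library may comprise: receiving the sequencing data result uploaded by the client terminal through a control; and after the sample corresponding to each sequencing fragment is determined according to its tag sequence combination, the method further comprises: returning the correspondence between the sequencing fragments and the samples to the client terminal.
In this embodiment, the sequencing data result of a sequencing library is obtained, wherein the sequencing library comprises a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination comprises a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed samples, and the set comprises a plurality of unordered sequencing fragments; the tag sequence combination of each sequencing fragment is determined; and the sample corresponding to each sequencing fragment is determined according to that fragment's tag sequence combination. This solves the technical problem in the related art that raw sequencing output requires technically trained personnel to identify samples manually, resulting in low efficiency and high cost, and achieves the technical effect of directly determining the sample corresponding to each record in the raw output of a mixed multi-sample sequencing run.
The data processing method of this embodiment improves the efficiency of mutation identification and yields high-throughput analysis results without requiring a bioinformatics background. Further, once the data corresponding to each sample have been identified, the sequencing result of each sample is compared with the reference gene to obtain the mutation identification result. This is a new high-throughput mutation identification method that simplifies the experimental steps.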
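As an optional implementation, the following describes how the data processing method of this embodiment proceeds in one optional application scenario:
Step 1: the requesting end uploads the sequencing result data (as a compressed archive) to the server (via PHP);
Step 2: the server calls its local decompression software (for example, gunzip) to decompress the uploaded data;
Step 3: the server (using a Perl script) extracts the barcode (tag sequence) combination of each pair of paired-end reads;
Step 4: the server (using a Perl script) uses the barcode combinations to determine the number of each sample in the sequencing library:
Specifically, the sequencing library can be laid out on a test well plate (also called a barcode plate or tag sequence plate) as shown in FIG. 5. Each well on the plate corresponds to one tag sequence combination, and every tag sequence combination is different. As shown in FIG. 5, the tag sequences comprise 20 tags in total, F1 to F12 and R1 to R8, which form 12×8=96 tag sequence combinations (F1R1, F2R1, and so on). Each tag sequence combination corresponds to one well on the plate, and each well holds one sample. At sequencing time, the test well plate can be placed directly into the sequencing instrument;
Therefore, if the server has the database of all tag sequences and knows the 20 tags F1 to F12 and R1 to R8, it can compare the tag sequences in each sequencing fragment of the sequencing data result with the known tag sequences and determine the tag sequence combination of each sequencing fragment (for example, F1R1, F2R1, and so on). Once the tag sequence combination of each sequencing fragment is known, the number of each sample can be determined (the sample number can itself be expressed as a tag sequence combination), as can the correspondence between each sequencing fragment in the sequencing data result and each well on the test well plate. Optionally, the database of all tag sequences may be uploaded by the requesting end, or it may be a general tag sequence database stored in advance in the server's database;
Step 5: the server (using a local short-read alignment tool such as BWA) aligns each sample sequence against the reference genome sequence uploaded by the requesting end:
Specifically, before this step the server must obtain the reference genome sequence uploaded by the requesting end. That upload only needs to happen before this step; in other words, this step can be executed only after the server has received the reference genome sequence;
Step 6: the server (using a Perl script) analyzes, collates, and tallies the mutation information of each sample;
Step 7: the requesting end (via PHP) downloads the results of the server's analysis.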
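The 12×8 barcode plate described in step 4 can be sketched in Python as a mapping from wells to tag sequence combinations. The well naming (rows A to H, columns 1 to 12) and the choice of F tags along columns and R tags along rows are assumptions for the example; the description does not state how FIG. 5 is oriented.

```python
def build_plate_map(n_forward=12, n_reverse=8):
    """Map well names such as 'A1' to tag combinations such as 'F1R1'.

    Assumed layout: one forward tag per column, one reverse tag per row.
    """
    plate = {}
    for row_index in range(n_reverse):
        row_letter = chr(ord("A") + row_index)
        for col_index in range(n_forward):
            well = f"{row_letter}{col_index + 1}"
            plate[well] = f"F{col_index + 1}R{row_index + 1}"
    return plate


plate = build_plate_map()
print(len(plate))       # number of wells
print(plate["A1"])      # combination in the first well
print(plate["H12"])     # combination in the last well

# Reverse lookup: from a combination found in a read pair back to its well.
well_by_combo = {combo: well for well, combo in plate.items()}
print(well_by_combo["F2R1"])
```

Twenty tag sequences are enough for 96 wells because each well is identified by a pair of tags, one from each set, rather than by a single tag.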
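It should be noted that although the flowcharts in the drawings show a logical order, in some cases the steps shown or described may be performed in a different order.
This application also provides an embodiment of a storage medium. The storage medium of this embodiment comprises a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the sequencing data result analysis method of the embodiments of the present invention.
This application also provides an embodiment of a processor. The processor of this embodiment is used to run a program, wherein the program, when running, performs the sequencing data result analysis method of the embodiments of the present invention.
This application also provides an embodiment of a sequencing library construction and sequencing method.
FIG. 6 is a flowchart of an optional sequencing library construction and sequencing method according to an embodiment of the present invention. As shown in FIG. 6, the method includes the following steps: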
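Step S201: perform a first round of PCR on the target gene fragment with the first pair of primers to obtain a first-round PCR product:
Optionally, the bridging sequence and the target-fragment-specific amplification primer sequence can both be integrated into the upstream and downstream primers of the first primer pair, and the first primer pair is used to amplify and enrich the target region fragment (target gene fragment) of each sample material. It should be noted that this step is preceded by extraction of genomic DNA from the sample to be tested; DNA extracted by any method may be used, and there is no special requirement on sample amount or concentration.
As shown in FIG. 7a, the first pair of primers comprises a first upstream primer sequence and a first downstream primer sequence. These are mixed with a solution of the DNA of a single sample, and a PCR reaction is performed to obtain the first PCR product. Specifically, from the 5' end to the 3' end, the first upstream primer sequence may comprise a bridging sequence followed by a first upstream target-fragment-specific amplification primer sequence, and the first downstream primer sequence may comprise a bridging sequence followed by a first downstream target-fragment-specific amplification primer sequence. The bridging sequence is a sequence that pairs complementarily with the primer sequence of the second round of PCR. The first upstream and first downstream target-fragment-specific amplification primer sequences are complementary to the 3' ends of the two single strands of the DNA sequence, respectively. The target-fragment-specific amplification primer sequence may be 15 to 25 bp long, and the bridging sequence may be 15 to 30 bp long. Tag sequences may also be added to the first upstream and first downstream primer sequences; their length is arbitrary, for example 1 to 50 bp, and random combination of such tags allows a large number of samples to be distinguished at one time.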
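Step S202: perform a second round of PCR on the first-round PCR product with the second pair of primers to obtain a sample:
The second pair of primers includes a plurality of tag sequences. Different target gene fragments correspond to different tag sequence combinations, a tag sequence combination being the combination of the tag sequences included in the second pair of primers. The second-round PCR product is shown in FIG. 7b, with one tag sequence at each end of each sample.
Optionally, the adapter sequence, the sequencing primer, the tag sequence, and the bridging sequence can be integrated into a universal second pair of primers. The second-round primers are universal primer combinations pre-arranged in a fixed 96-well plate (as shown in FIG. 5) or 384-well plate and supplied as a premixed kit. The second pair of primers comprises a second upstream primer sequence and a second downstream primer sequence. The PCR product obtained in step S201 is mixed with solutions of different combinations of the second upstream and second downstream primer sequences, or a ready-made premixed kit (for example, the 96-well or 384-well second-round primer kit described above) is used directly, and a PCR reaction is performed to obtain the second PCR product. Specifically, from the 5' end to the 3' end, the second upstream primer sequence may comprise, in order, an adapter sequence, a sequencing primer sequence, a tag sequence, and the bridging sequence of the first upstream primer sequence; the second downstream primer sequence may likewise comprise, from 5' to 3', an adapter sequence, a sequencing primer sequence, a tag sequence, and the bridging sequence of the first downstream primer sequence. The tag sequences of each pair of second upstream and second downstream primer sequences give each DNA sequence, after the PCR reaction, tag sequences that differ from those of every other DNA sequence. The tag sequence may be 1 to 20 bp long. With paired-end sequencing and the tag sequences on both sides, a mixture of anywhere from one sample to an in principle unlimited number of samples can be distinguished at once.
Optionally, 4 to 10 bases can be introduced between the sequencing primer sequence and the tag sequence to improve the accuracy of the tag sequence obtained by sequencing.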
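The 5' to 3' layout of a second-round primer described above (adapter, sequencing primer, optional 4 to 10 spacer bases, tag, bridging sequence) can be sketched in Python as a simple concatenation with length checks. The function name and every sequence shown are hypothetical placeholders, not real adapter or primer sequences.

```python
def build_second_round_primer(adapter, seq_primer, tag, bridge, spacer=""):
    """Concatenate primer parts in 5'->3' order, checking the stated lengths.

    The tag must be 1-20 bases; the optional spacer between the sequencing
    primer and the tag must be 4-10 bases when present.
    """
    if not 1 <= len(tag) <= 20:
        raise ValueError("tag sequence must be 1 to 20 bases long")
    if spacer and not 4 <= len(spacer) <= 10:
        raise ValueError("spacer must be 4 to 10 bases long")
    return adapter + seq_primer + spacer + tag + bridge


# Placeholder parts, chosen only so that each part is easy to spot.
primer = build_second_round_primer(
    adapter="AAAAAAAA",
    seq_primer="CCCCCCCC",
    tag="ACGTAC",
    bridge="TTTTTTTTTTTTTTT",
    spacer="GGGG",
)
print(primer)
print(len(primer))
```

Because the tag is read immediately after the sequencing primer and spacer, its position within a read is fixed, which is what makes fixed-offset tag extraction during analysis possible.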
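Step S203: perform the first and second rounds of PCR on different target gene fragments to obtain a plurality of samples:
It should be noted that different target gene fragments correspond to different tag sequence combinations, a tag sequence combination being the combination of the tag sequences included in the second pair of primers.
Optionally, the samples included in the sequencing library are mixed in equal amounts. Two rounds of PCR are performed on the different target samples, and the second-round PCR products, mixed in equal amounts, are purified to obtain the sequencing library, which comprises a plurality of mixed samples.
The two-round amplification saves primer synthesis cost, and library construction is completed in just two amplification steps. This improves both library quality and library construction efficiency. In addition, because the resulting library carries the adapter sequences and sequencing primer sequences used on conventional sequencing platforms, it can be sequenced at high throughput with standard sequencing reagents, without supplying extra sequencing primers or changing the sequencing primers of the MiX library into which it is mixed.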
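Step S204: sequence the sequencing library to obtain the sequencing data result:
Specifically, the sequencing library is a mixture of multiple samples. It can be sequenced by second-generation sequencing on a high-throughput sequencing platform to obtain the raw sequencing output, that is, the sequencing data result, which consists of a plurality of unordered sequencing fragments. Optionally, after the sequencing library is obtained and before high-throughput sequencing, the method further includes a quality check of the sequencing library.
Step S205: apply the sequencing data result analysis method of the present invention to the sequencing data result to obtain the analysis result.
With the sequencing method of this embodiment, two rounds of PCR are combined with second-generation sequencing, and the raw sequencing output is passed directly to the data processing method, so that the results of mixed multi-sample sequencing are identified automatically. It should be noted that the number of mixed samples can be adjusted by adjusting the number of second-round primer pair combinations. Optionally, after the sequencing data result has been identified, the mutations in the target gene fragment of each sample can also be identified and analyzed automatically.
Optionally, the tag sequences in the primer sequences used in step S202 are used preferentially to distinguish the samples; if that fails, the identification sequences provided by the sequencing company can be used to distinguish DNA sequences from different samples.
Optionally, the PCR plate used for the second round of PCR has a plurality of wells, each well holds one sample, and the number of each well is the number of the tag sequence combination used for the sample placed in it.
The sequencing method of this embodiment identifies mutation information using a library construction method for high-throughput amplicon mutation identification together with matching analysis software. Compared with other mutation identification methods, it improves both library construction and analysis and comes with software that automatically decodes reads and identifies mutations; it also costs less, needs less time for library construction, is simpler to operate, and places no limit on the number of mixed samples that can be distinguished in a single run. The method automatically splits the sequencing results of mixed samples and automatically identifies the mutation type of each individual material. Operation is straightforward, and large numbers of samples can be identified without any bioinformatics background.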
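This application also provides an embodiment of a kit. The kit comprises a plurality of reagent wells, wherein each reagent well has a corresponding label, and the label of each reagent well indicates the tag sequence added to the reagent placed in that well. Optionally, the labels may be arranged on a label board. Specifically, the kit may include a label board on which a plurality of labels are provided, for example by sticking or printing, the labels corresponding one-to-one to the positions of the reagent wells.
For example, the kit of this embodiment may include a barcode (tag) plate as shown in FIG. 5. Each label on the barcode plate corresponds to one reagent well, and the label of each reagent well indicates the number of that well. Two tag sequences can be added to the reagent of each reagent well, and the number on each label comprises the numbers of the two tag sequences added. With the 20 differently numbered tag sequences shown in FIG. 5, up to 96 reagents can be labeled.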
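This application also provides an embodiment of a sequencing data result analysis apparatus.
FIG. 8 is a schematic diagram of an optional sequencing data result analysis apparatus according to an embodiment of the present invention. As shown in FIG. 8, the apparatus includes an obtaining unit 10, a first determining unit 20, and a second determining unit 30. The obtaining unit is used to obtain the sequencing data result of the sequencing library, wherein the sequencing library comprises a plurality of mixed samples, each sample corresponds to one tag sequence combination, different samples correspond to different tag sequence combinations, each tag sequence combination comprises a plurality of tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the mixed samples, and the set comprises a plurality of unordered sequencing fragments. The first determining unit is used to determine the tag sequence combination of each sequencing fragment. The second determining unit is used to determine the sample corresponding to each sequencing fragment according to that fragment's tag sequence combination.
In this embodiment, the obtaining unit obtains the sequencing data result of the sequencing library as defined above, the first determining unit determines the tag sequence combination of each sequencing fragment, and the second determining unit determines the sample corresponding to each sequencing fragment according to that fragment's tag sequence combination. This solves the technical problem in the related art that raw sequencing output requires technically trained personnel to identify samples manually, resulting in low efficiency and high cost, and achieves the technical effect of directly determining the sample corresponding to each record in the raw output of a mixed multi-sample sequencing run.
As an optional implementation of the above embodiment, the plurality of sequencing fragments includes a first sequencing fragment, and the first determining unit comprises: an extraction module configured to extract all tag sequences from the first sequencing fragment; and a comparison module configured to compare each extracted tag sequence with a plurality of reference tag sequences of known number, to determine the number of each tag sequence in the first sequencing fragment.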
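One way to picture the three units and the two modules of the first determining unit is as program units in Python, as sketched below. The class and method names, the fixed tag length, the exact-match comparison, and the sample table are all assumptions for illustration; the description leaves the internal implementation open.

```python
class ObtainingUnit:
    """Obtains the sequencing data result as a list of fragments (read pairs)."""

    def obtain(self, records):
        return [(fwd.upper(), rev.upper()) for fwd, rev in records]


class FirstDeterminingUnit:
    """Determines the tag sequence combination of a sequencing fragment."""

    def __init__(self, reference_tags, tag_len):
        self.reference_tags = reference_tags  # tag sequence -> known number
        self.tag_len = tag_len                # assumed fixed tag length

    def extract(self, fragment):
        """Extraction module: take the tag from the start of each read."""
        return [read[: self.tag_len] for read in fragment]

    def compare(self, tags):
        """Comparison module: look each tag up among the reference tags."""
        return [self.reference_tags.get(tag) for tag in tags]

    def determine(self, fragment):
        numbers = self.compare(self.extract(fragment))
        return None if None in numbers else "".join(numbers)


class SecondDeterminingUnit:
    """Determines the sample that corresponds to a tag combination."""

    def __init__(self, sample_by_combination):
        self.sample_by_combination = sample_by_combination

    def determine(self, combination):
        return self.sample_by_combination.get(combination)


def analyse(records, first_unit, second_unit):
    fragments = ObtainingUnit().obtain(records)
    return [second_unit.determine(first_unit.determine(f)) for f in fragments]


first = FirstDeterminingUnit({"ACGTAC": "F1", "GATCGA": "R1", "CTAGCT": "R2"}, 6)
second = SecondDeterminingUnit({"F1R1": "sample_A", "F1R2": "sample_B"})
records = [
    ("acgtacAAAA", "GATCGATTTT"),
    ("ACGTACCC", "CTAGCTGG"),
    ("NNNNNNAA", "GATCGATT"),
]
result = analyse(records, first, second)
print(result)
```

Keeping extraction and comparison as separate methods mirrors the module split in the text and lets the comparison rule be swapped without touching the extraction.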
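It should be noted here that the acquisition unit 10, the first determination unit 20, and the second determination unit 30 may run in a computer terminal as part of the device, and the functions implemented by these units may be executed by a processor in the computer terminal. The computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD.
The device may include a processor and a memory. The above units may be stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.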
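The order of the above embodiments of the present application does not indicate their relative merits.
In the above embodiments of the present application, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments. It should be understood that the technical content disclosed in the embodiments provided in the present application may be implemented in other ways.
The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present application. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principles of the present application, and such improvements and refinements shall also fall within the scope of protection of the present application.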

Claims (15)

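  1. A sequencing data result analysis method, comprising:
    acquiring a sequencing data result of a sequencing library, wherein the sequencing library includes multiple mixed samples, each sample corresponds to one tag sequence combination, and different samples correspond to different tag sequence combinations, wherein the tag sequence combination includes multiple tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the multiple mixed samples, and the set of sequencing fragments includes multiple unordered sequencing fragments;
    determining the tag sequence combination of each sequencing fragment;
    determining the sample corresponding to each sequencing fragment according to the tag sequence combination of that sequencing fragment.
  2. The method according to claim 1, wherein the multiple sequencing fragments include a first sequencing fragment, and determining the tag sequence combination of the first sequencing fragment comprises:
    extracting all tag sequences from the first sequencing fragment;
    comparing each tag sequence extracted from the first sequencing fragment with multiple reference tag sequences whose numbers are known, so as to determine the number corresponding to each tag sequence in the first sequencing fragment;
    determining the combination of the numbers of all tag sequences in the first sequencing fragment as the number of the tag sequence combination of the first sequencing fragment.
  3. The method according to claim 2, wherein, before each tag sequence extracted from the first sequencing fragment is compared with the multiple reference tag sequences whose numbers are known, the method further comprises:
    acquiring the pre-stored multiple reference tag sequences whose numbers are known.
  4. The method according to claim 2, wherein, when the sequencing data result is obtained by a paired-end sequencing method, each sequencing fragment includes a forward read sequence and a reverse read sequence, and extracting all tag sequences from the first sequencing fragment comprises:
    extracting tag sequences from the forward read sequence and from the reverse read sequence of the first sequencing fragment, wherein the tag sequence combination of the first sequencing fragment includes the tag sequence extracted from the forward read sequence and the tag sequence extracted from the reverse read sequence.
  5. The method according to claim 1, wherein, after the sample corresponding to each sequencing fragment is determined according to the tag sequence combination of that sequencing fragment, the method further comprises:
    acquiring a reference sequence of each sample;
    extracting the sequence of the sample from each sequencing fragment;
    comparing each extracted sample sequence with the reference sequence of the corresponding sample, so as to determine mutation information of each sample.
  6. The method according to claim 5, wherein
    acquiring the reference sequence of each sample comprises: receiving the reference sequence of each sample uploaded by a client terminal through a control;
    after the mutation information of each sample is determined, the method further comprises: feeding the mutation information of each sample back to the client terminal.
  7. The method according to claim 1, wherein
    acquiring the sequencing data result of the sequencing library comprises: receiving the sequencing data result uploaded by a client terminal through a control;
    after the sample corresponding to each sequencing fragment is determined according to the tag sequence combination of that sequencing fragment, the method further comprises: feeding the correspondence between the multiple sequencing fragments and the multiple samples back to the client terminal.
  8. A sequencing library construction and sequencing method, comprising:
    performing a first round of PCR on a target gene fragment using a first primer pair to obtain a first-round PCR product;
    performing a second round of PCR on the first-round PCR product using a second primer pair to obtain a sample, wherein the second primer pair includes multiple tag sequences;
    performing the first round of PCR and the second round of PCR separately on different target gene fragments to obtain multiple samples, wherein different target gene fragments correspond to different tag sequence combinations, and the tag sequence combination is the combination of the multiple tag sequences included in the second primer pair;
    sequencing a sequencing library to obtain a sequencing data result, wherein the sequencing library is the mixture of the multiple samples, and the sequencing data result is multiple unordered sequencing fragments;
    applying the sequencing data result analysis method according to any one of claims 1 to 7 to the sequencing data result to obtain an analysis result.
  9. The method according to claim 8, wherein the multiple samples included in the sequencing library are mixed in equal amounts.
  10. The method according to claim 8, wherein the PCR plate used for the second round of PCR has multiple wells, each well holds one sample, and the number of each well is the number of the tag sequence combination used for the sample placed in it.
  11. A kit, comprising:
    multiple reagent wells, wherein each reagent well is provided with a corresponding label, and the label of each reagent well is configured to indicate the tag sequences added to the reagent placed in the corresponding reagent well.
  12. The kit according to claim 11, wherein the kit includes one label plate, the label plate is configured to carry multiple labels, and the multiple labels on the label plate correspond one-to-one with the positions of the multiple reagent wells.
  13. A sequencing data result analysis device, comprising:
    an acquisition unit configured to acquire a sequencing data result of a sequencing library, wherein the sequencing library includes multiple mixed samples, each sample corresponds to one tag sequence combination, and different samples correspond to different tag sequence combinations, wherein the tag sequence combination includes multiple tag sequences, the sequencing data result is a set of sequencing fragments obtained by sequencing the multiple mixed samples, and the set of sequencing fragments includes multiple unordered sequencing fragments;
    a first determination unit configured to determine the tag sequence combination of each sequencing fragment;
    a second determination unit configured to determine the sample corresponding to each sequencing fragment according to the tag sequence combination of that sequencing fragment.
  14. A storage medium, wherein the storage medium includes a stored program, and when the program runs, it controls the device in which the storage medium is located to perform the sequencing data result analysis method according to any one of claims 1 to 7.
  15. A processor, wherein the processor is configured to run a program, and when the program runs, it performs the sequencing data result analysis method according to any one of claims 1 to 7.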
PCT/CN2018/087509 2017-06-27 2018-05-18 Method and Device for Analyzing Sequencing Data Result, and Sequencing Library Construction and Sequencing Method WO2019001168A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/476,079 US20200111542A1 (en) 2017-06-27 2018-05-18 Method and Device for Analyzing Sequencing Data Result, and Sequencing Library Construction and Sequencing Method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710504178.3 2017-06-27
CN201710504178.3A CN107368706A (zh) 2017-06-27 2017-06-27 Method and Device for Analyzing Sequencing Data Result, and Sequencing Library Construction and Sequencing Method

Publications (1)

Publication Number Publication Date
WO2019001168A1 true WO2019001168A1 (zh) 2019-01-03

Family

ID=60305718

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/087509 WO2019001168A1 (zh) 2017-06-27 2018-05-18 Method and Device for Analyzing Sequencing Data Result, and Sequencing Library Construction and Sequencing Method

Country Status (3)

Country Link
US (1) US20200111542A1 (zh)
CN (1) CN107368706A (zh)
WO (1) WO2019001168A1 (zh)

Also Published As

Publication number Publication date
US20200111542A1 (en) 2020-04-09
CN107368706A (zh) 2017-11-21

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18823937

Country of ref document: EP

Kind code of ref document: A1