WO2017139945A1 - Typing method and device - Google Patents

Info

Publication number
WO2017139945A1
Authority
WO
WIPO (PCT)
Prior art keywords
type
candidate
haplotype
types
supporting
Prior art date
Application number
PCT/CN2016/074027
Other languages
French (fr)
Chinese (zh)
Inventor
章元伟
张涛
曹红志
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院
Priority to PCT/CN2016/074027 (WO2017139945A1)
Priority to CN201680067128.7A (CN108350498B)
Publication of WO2017139945A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • the present invention relates to the field of bioinformatics, and in particular to a typing method and apparatus.
  • HLA-DRB3, DRB4 and DRB5 are homologous genes encoding the β chain of HLA (human leukocyte antigen) class II molecules.
  • HLA class II molecules are expressed on the cell membrane of antigen-presenting cells, and can present peptides of exogenous proteins, which play a central role in the immune system.
  • HLA-DRB3 has been reported to be associated with Crohn's disease, Graves' disease, type 1 diabetes, and the like.
  • HLA-DRB4 has been reported to be associated with childhood acute lymphoblastic leukemia, Hashimoto's thyroiditis, allergic granulomatous vasculitis, vitiligo, and the like.
  • HLA-DRB5 has been reported to be associated with keloids, systemic lupus erythematosus, multiple sclerosis, narcolepsy, and the like.
  • HLA-DRB3, DRB4, and DRB5 therefore have important value in medical applications and disease research.
  • the existing HLA-DRB3, 4, 5 typing methods are mainly exon PCR combined with gene sequencing, or long-fragment PCR combined with gene sequencing; they require PCR primer design, involve many experimental steps, and cannot be applied to high-throughput whole-genome sequencing or high-throughput chip-capture sequencing data.
  • the present invention aims to solve at least one of the above problems or to provide at least one commercial alternative.
  • the present invention provides a typing method, the method comprising: acquiring sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region; aligning a reference sequence set to a reference sequence group to obtain a type set, the type set comprising a plurality of types, the reference sequence group comprising one or more reference sequences and being capable of completely covering all gene sequences of the target region, different
  • reference sequences of the group comprising different full gene sequences, the reference sequence set comprising a plurality of reference sequences and being capable of completely covering all exons of the coding region sequence of the target region, different reference sequences of the set comprising different exons;
  • aligning the sequencing data to the reference sequence set to obtain an alignment result; converting the alignment result into an alignment result relative to the reference sequence group to obtain a converted alignment result; assembling, based on the converted alignment result, the reads aligned to the same reference sequence to obtain an assembly result, the
  • a computer readable medium for storing a computer executable program, the program comprising the step of completing the above described aspect of the present invention.
  • the storage medium may include: a read only memory, a random access memory, a magnetic disk or an optical disk, and the like.
  • a typing apparatus comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including a computer executable program; and a processor coupled to the data input unit, the data output unit, and the storage unit for executing the computer executable program, wherein executing the program comprises completing the above typing method.
  • the present invention provides a typing system comprising: a data input module for inputting sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region; a comparison module comprising a first comparison module and a second comparison module, the first comparison module being configured to align the reference sequence set to the reference sequence group to obtain a type set, the type set comprising a plurality of types,
  • the second comparison module being configured to align the sequencing data from the data input module to the reference sequence set to obtain an alignment result,
  • the reference sequence group comprising one or more reference sequences and being capable of completely covering all gene sequences of the target region, different reference sequences of the group comprising different full gene sequences, the reference sequence set comprising a plurality of reference sequences and being capable of completely covering all exons of the coding region sequence of the target region, different reference sequences of the set comprising different exons; a conversion module for converting the alignment result into an alignment result relative to the reference sequence group to obtain a converted alignment result; an assembly module for assembling, based on the converted alignment result, the reads aligned to the same reference sequence to obtain an assembly result, the assembly result including a plurality of haplotypes; a haplotype-supported type determining module for comparing the variations on the haplotypes with the types in the type set to determine the types supported by the haplotypes; a clustering module for, according to the haplo
  • the method of the present invention constructs a type set by constructing a reference sequence group and a reference sequence set of the target region, constructs haplotypes based on the read information, and compares the types with the haplotypes to genotype the target region.
  • the typing method of the present invention is applicable to genotyping of any region and is particularly suitable for typing highly polymorphic regions, for example HLA-DRB3, 4 and/or 5, in particular typing HLA-DRB3, 4 and/or 5 based on sequencing data of any form that contains HLA-DRB3, 4 and/or 5 sequence information.
  • the method of the invention does not require exon PCR or long-fragment PCR of the target gene, which reduces the experimental workload and difficulty and improves the flexibility of application or research design.
  • FIG. 1 is a flow chart of a typing method in an embodiment of the present invention.
  • FIG. 2 is a flow chart of a typing method in an embodiment of the present invention.
  • FIG. 3 is a flow chart of a typing method in an embodiment of the present invention.
  • FIG. 4 is a schematic structural view of a typing apparatus in an embodiment of the present invention.
  • FIG. 5 is a schematic structural view of a typing system in an embodiment of the present invention.
  • FIG. 6 is a schematic structural view of a typing system in an embodiment of the present invention.
  • a typing method according to an embodiment of the present invention includes the following steps:
  • S10: obtaining sequencing data.
  • sequencing data of the sample to be tested is obtained, the sequencing data comprising a plurality of reads from the target region.
  • the so-called sequencing data is obtained by preparing a sequencing library from the nucleic acid of the sample to be tested and sequencing the library.
  • acquiring the sequencing data comprises: acquiring nucleic acid from the sample to be tested, preparing a sequencing library of the nucleic acid, and sequencing the sequencing library.
  • the sequencing library is prepared according to the requirements of the selected sequencing method; the sequencing method may be selected from, but is not limited to, Illumina's HiSeq 2000/2500 sequencing platform, Life Technologies' Ion Torrent platform, BGI's
  • BGISEQ platform, and single-molecule sequencing platforms; depending on the selected sequencing platform, single-end or paired-end sequencing can be chosen.
  • the raw data obtained from the sequencer consists of sequenced fragments called reads.
  • the target region can be any gene region of interest.
  • the target region comprises members of the MHC (major histocompatibility complex) gene family.
  • mammalian MHC genes are highly polymorphic, and the human MHC is commonly referred to as HLA (human leukocyte antigen).
  • the sample to be tested is from a human
  • the target region comprises at least one of HLA-DRB3, HLA-DRB4 and HLA-DRB5.
  • the target regions include HLA-DRB3, HLA-DRB4, and HLA-DRB5.
  • a type set is obtained, the type set containing a plurality of types.
  • the sequencing data is aligned to a reference sequence set to obtain alignment results.
  • the reference sequence group includes one or more reference sequences; a reference sequence is a predetermined sequence containing the full length of a gene and may be a previously obtained reference template of the biological category to which the sample to be tested belongs. For example, if the sample to be tested is derived from a human individual, the reference sequence can be the full gene sequence of the target region contained in HG19 provided by the NCBI database.
  • a reference sequence is the full sequence of one gene in the target region. If a gene has multiple gene sequences, the longest one can be selected as the reference sequence of the gene.
  • the reference sequence group can completely cover all gene sequences in the target region, and different reference sequences contain different gene sequences.
  • the reference sequence set includes a plurality of reference sequences; each is a predetermined sequence containing exons and may be a previously obtained reference template of the biological category to which the sample to be tested belongs. For example, if the sample to be tested is from a human individual,
  • the reference sequence may be a sequence of exons of the target region included in HG19 provided by the NCBI database.
  • a resource library including more reference sequences may also be configured in advance; for example, according to factors such as the state and region of the sample source, a closer sequence may be selected, or determined and assembled, as the reference sequence.
  • the reference sequence set is capable of completely covering all exons of the coding region sequence of the target region, and different reference sequences contain different exons.
  • a type is an allele of a gene, which is essentially a collection of specific variations in a certain genomic region.
  • some international organizations, such as the WHO, name the known alleles of certain genes, and named alleles are generally called types.
  • the target region comprises at least a portion of an HLA gene. Because of the large number of alleles of the HLA genes and their critical role in medical transplantation, the WHO has named all known alleles of the HLA genes, such as DRB3*01:01:01; a type refers to a named allele.
  • Typing is the process of determining the type of a target gene of a target individual using various methods.
  • each type provided in the so-called type set consists of position information and variation information relative to the reference sequence group.
  • the longest full gene sequence is selected as a reference sequence, and the variations of all coding region sequences and gene sequences
  • relative to the reference sequence are recorded, and the variations are associated with the types. Establishing the association between the variation information and the types makes it possible to use the detected variations as the basis for determining types in the subsequent steps, which facilitates genotyping.
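  • as an illustration of associating variations with types, the following minimal sketch records each allele as a set of substitutions relative to a reference sequence. The function names, the toy sequences, and the restriction to equal-length, substitution-only alleles are assumptions made for the example and are not taken from the method described here.

```python
def call_variants(reference: str, allele: str) -> frozenset:
    """Return the (position, ref_base, alt_base) substitutions of an allele.

    Assumption: the allele is already aligned base by base to the reference
    (same length, no insertions or deletions).
    """
    if len(reference) != len(allele):
        raise ValueError("this sketch only handles equal-length sequences")
    return frozenset(
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, allele))
        if ref != alt
    )


def build_type_set(reference: str, alleles: dict) -> dict:
    """Map each type name to its variations relative to the reference."""
    return {name: call_variants(reference, seq) for name, seq in alleles.items()}


# Hypothetical toy data: one reference sequence and two named alleles.
reference = "ACGTACGT"
alleles = {"type_A": "ACGTACGT", "type_B": "ACCTACGA"}
type_set = build_type_set(reference, alleles)
```

  • a type identical to the reference maps to an empty set, so every type is described purely by its positions and variations relative to the reference.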
  • the so-called aligning of the sequencing data to the reference sequence set yields an alignment result comprising the position information and variation information of the reads aligned to any reference sequence in the reference sequence set.
  • the sequencing data is not aligned directly to the reference sequence group, because the reference sequence group contains only the sequence of one allele of each gene; with the above method, the sequencing data is first aligned to the reference
  • sequence set, and the alignment result is then converted based on the reference sequence group. Since the reference sequence set contains all the alleles, the number of aligned reads can be increased, and data utilization is markedly improved.
  • the target region is a highly polymorphic region
  • the reference sequence set is constructed by: obtaining coding region sequences and full gene sequences of the target region; dividing the coding region sequences by exon to obtain a plurality of exon sequences; and extracting, from the full gene sequence of the type closest to each exon sequence, the K bp of sequence flanking the exon on each side, and adding it to both sides of the corresponding exon sequence, to obtain the reference sequence set.
  • K is the length of the read.
  • HLA genotypes and sequence data can be downloaded from the IMGT/HLA database.
  • the sequence data includes multiple coding region sequences and multiple gene sequences.
  • the format of the downloaded HLA genotype and sequence data can be modified to facilitate analysis in the subsequent steps.
  • each coding region sequence is segmented by exon, and the K bp on each side of the exon are extracted from the full-length sequence of the most similar gene and added to both sides of the corresponding exon sequence to form a reference sequence.
  • K depends on the length of the read.
  • the exon sequence is extended so that reads aligned to the edge of an exon are retained during alignment, which improves data utilization. It should be noted that K may vary depending on whether the reads are of uniform or unequal length.
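  • the flank extension described above can be sketched as follows. This is a simplified illustration: the function name and toy sequences are hypothetical, and the exon is assumed to occur exactly once, verbatim, in the chosen full gene sequence, whereas the text only requires the closest type.

```python
def extend_exon(exon: str, closest_gene: str, k: int) -> str:
    """Add up to k flanking bases from the full gene sequence to each side.

    Assumption: the exon occurs verbatim in closest_gene; a real pipeline
    would locate it through annotation or alignment instead of str.find.
    """
    start = closest_gene.find(exon)
    if start < 0:
        raise ValueError("exon not found in the chosen full gene sequence")
    end = start + len(exon)
    left = closest_gene[max(0, start - k):start]   # truncated at the gene start
    right = closest_gene[end:end + k]              # truncated at the gene end
    return left + exon + right


# Hypothetical toy gene: flank, exon, flank.
gene = "TTTTGGGGACGTACGTCCCCAAAA"
extended = extend_exon("ACGTACGT", gene, k=4)
```

  • choosing k close to the read length lets a read that overlaps the exon by only a few bases still align to the extended reference.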
  • the target typing region comprises HLA-DRB3, HLA-DRB4 and HLA-DRB5, and the inventors obtained 94 published HLA-DRB3, HLA-DRB4 or HLA-DRB5 from the IMGT/HLA database.
  • the so-called full sequence of the gene closest to the exon sequence means that the corresponding exon sequence on the entire sequence of the gene has the highest degree of matching with the exon sequence, that is, the sequence is most similar and the difference is the smallest. For example, if there are bases different from the reference genome in a plurality of specific positions on the exon sequence, and the corresponding positions of the corresponding exon regions in the full length sequence of the certain gene are also the same as the exon sequence, then it is determined The full length sequence of the gene is closest to the type of the exon sequence.
  • the so-called “alignment” means matching.
  • known alignment software may be used, for example SOAP, BWA, or TeraMap; this embodiment places no limitation on the choice.
  • during alignment, at most n base mismatches may be allowed per read pair or per read, for example n is 1 or 2. If more than n bases of a read pair are mismatched, the read pair is considered unable to align to the reference sequence; or, if the n mismatched bases are all located in one read of the pair, that read is considered unable to align to the reference sequence.
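  • the mismatch tolerance can be illustrated with a small check. The helper names and the gap-free comparison at a known position are assumptions for the example; real aligners such as those named above apply their own scoring.

```python
def count_mismatches(read: str, reference: str, pos: int) -> int:
    """Count mismatched bases of a read placed at pos on the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) < len(read):
        raise ValueError("read extends past the end of the reference")
    return sum(1 for a, b in zip(read, window) if a != b)


def pair_is_aligned(read1, pos1, read2, pos2, reference, n=2) -> bool:
    """Accept a read pair only if it has at most n mismatches in total."""
    total = (count_mismatches(read1, reference, pos1)
             + count_mismatches(read2, reference, pos2))
    return total <= n


reference = "ACGTACGTACGT"
accepted = pair_is_aligned("ACGT", 0, "ACGA", 4, reference, n=1)   # 1 mismatch
rejected = pair_is_aligned("AGGT", 0, "ACGA", 4, reference, n=1)   # 2 mismatches
```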
  • the segment is a read that supports the variation.
  • the alignment result is converted into an alignment result relative to the reference sequence group, and the converted alignment result is obtained. That is, all position information and variation information in the alignment result is converted into position information and variation information relative to the reference sequence group.
  • the converted alignment result and the type set obtained in S20 are then both expressed relative to the reference sequence group, which facilitates the subsequent comparison of the two for genotype determination.
  • the most complete full gene sequence of each target gene is selected as the reference sequence of that gene, and the alignment position information and variation information in the alignment result are converted into position information and variation information relative to the reference sequence group; the converted alignment result is used in the subsequent screening, which determines the genotype based on the detected variation information.
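  • one way to picture the conversion is to shift the read position by the offset of the exon reference on the gene reference and then re-derive the variations against the gene reference. The sketch below assumes a known offset and no insertions or deletions; both are simplifications introduced for the example, and all names are hypothetical.

```python
def convert_alignment(read: str, pos_in_exon_ref: int,
                      group_start: int, group_reference: str):
    """Re-express one aligned read relative to the gene-level reference.

    group_start is the (assumed known) coordinate on group_reference at
    which the first base of the exon-level reference maps.
    """
    group_pos = group_start + pos_in_exon_ref
    window = group_reference[group_pos:group_pos + len(read)]
    if len(window) < len(read):
        raise ValueError("converted read runs past the gene reference")
    variants = [
        (group_pos + i, ref, alt)
        for i, (ref, alt) in enumerate(zip(window, read))
        if ref != alt
    ]
    return group_pos, variants


# Hypothetical toy data: the exon-level reference starts at coordinate 2.
group_reference = "TTTTACGTACGTTTTT"
position, variants = convert_alignment("ACCT", 2, 2, group_reference)
```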
  • the reads aligned to the same reference sequence are assembled separately to obtain an assembly result, and the assembly result includes a plurality of haplotypes.
  • a haplotype, also called a haploid genotype, is a combination of single nucleotide polymorphisms that are associated with one another in a particular region of a chromosome and tend to be inherited together by progeny.
  • the step comprises: among the reads aligned to the same reference sequence, assembling reads that overlap and whose overlapping portions are completely identical, to obtain the plurality of haplotypes.
  • a reference sequence in the reference sequence group is a complete sequence containing one gene, and one gene of a eukaryote generally contains a plurality of exons; based on the reads aligned to the same reference sequence,
  • the plurality of haplotypes obtained by assembling reads with identical overlapping portions are essentially at least portions of at least one exon sequence that differ in base sequence or in length. According to an embodiment of the invention, after the assembly result is obtained, haplotypes that cover less than 95% of the exon are filtered out.
  • discarding the relatively short sequences improves the reliability of the data used in the subsequent analysis steps, reduces data complexity, and facilitates accurate typing.
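  • the assembly and the 95% coverage filter can be sketched as below. This is a deliberately simple greedy procedure written for illustration, not the assembler used by the inventors; reads are assumed to be given as (start, sequence) pairs in reference coordinates, and the names are hypothetical.

```python
def assemble(reads):
    """Greedily join reads whose overlapping portions are identical.

    reads: list of (start, sequence). Returns a list of (start, sequence)
    haplotypes. A read that is compatible with no haplotype starts a new one.
    """
    haplotypes = []
    for start, seq in sorted(reads):
        extended = False
        for i, (h_start, h_seq) in enumerate(haplotypes):
            if start >= h_start + len(h_seq):
                continue  # the read does not overlap this haplotype
            overlap = h_seq[start - h_start:]
            if seq[:len(overlap)] == overlap[:len(seq)]:
                haplotypes[i] = (h_start, h_seq + seq[len(overlap):])
                extended = True
        if not extended:
            haplotypes.append((start, seq))
    return haplotypes


def exon_coverage(haplotype, exon_start, exon_end):
    """Fraction of the exon interval [exon_start, exon_end) that is covered."""
    h_start, h_seq = haplotype
    h_end = h_start + len(h_seq)
    covered = max(0, min(h_end, exon_end) - max(h_start, exon_start))
    return covered / (exon_end - exon_start)


def filter_by_coverage(haplotypes, exon_start, exon_end, min_cov=0.95):
    """Keep haplotypes covering at least min_cov of the exon."""
    return [h for h in haplotypes
            if exon_coverage(h, exon_start, exon_end) >= min_cov]


reads = [(0, "ACGT"), (2, "GTAA"), (0, "ACCT"), (2, "CTAA")]
haplotypes = assemble(reads)
kept = filter_by_coverage(haplotypes + [(3, "TA")], 0, 6)
```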
  • the order of the step of obtaining the type set in S20 relative to S30 and S40 is not limited.
  • the step of obtaining the type set in S20 may be performed first, followed by S30 and S40; or S30 and S40 may be performed first, followed by the step of obtaining the type set in S20.
  • S50: determining the types supported by the haplotypes.
  • the variations on each haplotype are compared with the types in the type set to determine the types supported by the haplotype.
  • prior to performing this step, the following is performed: each haplotype is scored based on the assembly result of the reads aligned to the same reference sequence, and the haplotypes are screened based on the obtained scores to obtain candidate haplotypes; the candidate haplotypes then replace the haplotypes in the subsequent steps. This helps to reduce data interference and the amount of data that needs to be processed.
  • scoring each haplotype and screening the haplotypes based on the obtained scores to obtain candidate haplotypes includes: determining the score (Score) of the haplotype by a formula in which c is the coverage of the exon by the haplotype, N is the number of reads supporting the haplotype, and R is the reliability of the haplotype; Xi is the sequencing depth at position i of the haplotype, i is the index of the position on the haplotype, X is the average depth of the haplotype, and L is the length of the haplotype.
  • the reliability of a haplotype is judged from its average sequencing depth, the sequencing depth at each of its positions, and its length; the haplotype is then scored according to its reliability, its coverage of the corresponding exon, and the number of reads supporting it. The score is representative of the haplotype, which facilitates comparative evaluation of multiple haplotypes.
  • the so-called screening includes: among the haplotypes aligned to the same reference sequence, taking the haplotype with the highest score, and removing from the assembly result the haplotypes that satisfy the following condition: the haplotype is inconsistent with the sequence of the taken haplotype, and its sequencing depth at the inconsistent site is less than 20% of the sequencing depth at the corresponding site of the taken haplotype; the above steps are repeated until no more than 4 haplotypes have been taken, to obtain the candidate haplotypes. Since the target region is from a diploid and a region has at most two types (that is, when it is heterozygous), this step screens the haplotypes so that each exon has at most four corresponding haplotypes. Discarding haplotypes that do not match the real situation or are meaningless reduces the complexity of the data and facilitates rapid analysis and accurate typing in the subsequent steps.
  • the subsequent step is performed with the candidate haplotypes in place of the haplotypes, and comprises: comparing the variations on each candidate haplotype with the types in the type set; if they match exactly, the candidate haplotype is determined to support the type. In this way, the types supported by each candidate haplotype are determined, which facilitates further determination of the type based on the supporting candidate haplotypes.
  • candidate haplotypes that support the same type are merged, which speeds up the subsequent steps.
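  • the exact-match test can be sketched as a set comparison. The interpretation used here, that a haplotype supports a type when the type's variations inside the haplotype's span equal the haplotype's own variations, is an assumption made for the example; the names and toy data are hypothetical.

```python
def supports(hap_start, hap_end, hap_variants, type_variants) -> bool:
    """True if the type's variations within [hap_start, hap_end) match exactly."""
    restricted = {v for v in type_variants if hap_start <= v[0] < hap_end}
    return restricted == set(hap_variants)


def supported_types(hap_start, hap_end, hap_variants, type_set):
    """Names of all types in the type set that the haplotype supports."""
    return sorted(
        name for name, variants in type_set.items()
        if supports(hap_start, hap_end, hap_variants, variants)
    )


# Variations are (position, ref_base, alt_base) relative to the reference.
type_set = {
    "type_A": {(2, "G", "C")},
    "type_B": {(2, "G", "C"), (10, "T", "A")},
    "type_C": set(),
}
result = supported_types(0, 6, {(2, "G", "C")}, type_set)
```

  • a short haplotype can support several types at once when they differ only outside its span, which is why the later steps weigh the types by supporting reads.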
  • S60: determining a first candidate type group and a second candidate type group.
  • the types are divided into two groups according to the haplotypes supporting the types and the reads supporting the types, and the first candidate type group and the second candidate type group are obtained.
  • the coding sequence of the target region comprises a plurality of exons, and according to one embodiment of the invention, types for which the supported exons number fewer than 30% of the total number of exons are filtered out prior to performing this step. In this way, meaningless or less meaningful types in the type set are removed, which reduces data interference and the complexity of the typing and facilitates rapid and accurate typing in the subsequent steps.
  • the step includes: performing a first scoring of each type according to the haplotypes supporting the type and the reads supporting the type, and screening the types based on the obtained first scores to obtain a first candidate type and a second candidate type; then, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support the other types in the type set, assigning the other types in the type set to the first candidate type or the second candidate type, to obtain the first candidate type group and the second candidate type group.
  • the so-called first scoring of the types according to the haplotypes supporting the type and the reads supporting the type, and the screening based on the obtained first scores, includes: determining the first score TScore of each type using the formula
  • TScore = N × S, where N is the number of reads supporting the type and
  • S is the sum of the scores of the candidate haplotypes supporting the type;
  • all the types are combined in pairs, and the two types in the combination with the highest first score
  • are determined to be the first candidate type and the second candidate type, respectively.
  • the first score given by the above formula reflects how representative a type is, which benefits the subsequent type determination.
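  • a minimal sketch of the first scoring follows, assuming the first score is the product of the supporting read count N and the summed haplotype scores S, and assuming a pair of types is ranked by the sum of its two first scores. The pair-ranking rule is not stated in the text and is introduced here only for illustration, as are the names and numbers.

```python
from itertools import combinations


def first_score(n_reads: int, haplotype_scores) -> float:
    """First score of a type, read here as N multiplied by S."""
    return n_reads * sum(haplotype_scores)


def pick_candidates(tscores: dict):
    """Return the pair of types with the highest combined first score.

    Assumption: a pair is ranked by the sum of its two first scores.
    Requires at least two types.
    """
    best = max(combinations(sorted(tscores), 2),
               key=lambda pair: tscores[pair[0]] + tscores[pair[1]])
    first, second = sorted(best, key=lambda t: tscores[t], reverse=True)
    return first, second


# Hypothetical read counts and candidate-haplotype scores per type.
tscores = {
    "type_A": first_score(10, [1.0, 2.0]),
    "type_B": first_score(5, [1.0]),
    "type_C": first_score(8, [2.0]),
}
candidates = pick_candidates(tscores)
```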
  • assigning the other types in the type set to the first candidate type or the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support the other types in the type set,
  • to obtain the first candidate type group and the second candidate type group, includes: for each of the other types, comparing the sizes of a first intersection and a second intersection, and, based on the comparison, assigning the other type to the first candidate type or the second candidate type,
  • where the first intersection is the intersection of the reads supporting the other type and the reads supporting the first candidate type, and the second intersection is the intersection of the reads supporting the other type and the reads supporting the second candidate type.
  • depending on the comparison, the other type is placed in the same group as the first candidate type; otherwise it is placed in the group of the second candidate type.
  • in this way, the other types in the type set are divided into two groups, which facilitates subsequent screening and accurate typing.
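  • the grouping by read intersections can be sketched with Python sets. The rule that a type joins the first group when its first intersection is strictly larger, and the second group otherwise, is an assumption for the example; read identifiers and type names are hypothetical.

```python
def assign_groups(type_reads: dict, first: str, second: str):
    """Split all types into two groups around the two candidate types.

    type_reads maps each type name to the set of read IDs supporting it.
    """
    group1, group2 = [first], [second]
    for name, reads in type_reads.items():
        if name in (first, second):
            continue
        first_intersection = len(reads & type_reads[first])
        second_intersection = len(reads & type_reads[second])
        # Assumption: ties go to the second group.
        if first_intersection > second_intersection:
            group1.append(name)
        else:
            group2.append(name)
    return group1, group2


type_reads = {
    "type_A": {1, 2, 3, 4},
    "type_B": {5, 6, 7},
    "type_C": {1, 2, 5},
    "type_D": {6, 7, 1},
}
group1, group2 = assign_groups(type_reads, "type_A", "type_B")
```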
  • S70: determining the primary type and the secondary type.
  • the types in the first candidate type group and the second candidate type group are screened separately to determine the primary type and the secondary type.
  • the step includes: based on the reads supporting each type and the first score of the type, performing a second scoring of the types in the first candidate type group and the second candidate type group, and determining the primary type and the secondary type based on the obtained second scores.
  • in the second score, N* is the number of reads supporting a type in the first candidate type group, excluding the reads supporting the second candidate type, or the number of reads supporting a type in the second candidate type group, excluding the reads supporting the first candidate type;
  • the types with the highest second score in the first candidate type group and in the second candidate type group are the primary type and the secondary type.
  • the so-called primary and secondary types are relative concepts based on relative frequency.
  • the primary and secondary types are distinguished by the relative number of reads that support them.
  • the second scoring adjusts, that is, corrects, the first score of the previous step; the resulting second score is more representative of the type, and determining the primary type and the secondary type according to the second score is conducive to accurate typing in the subsequent steps.
  • S80: determining the genotype of the target region.
  • the genotype of the target region is determined based on the difference between the number of reads supporting the primary type and the number of reads supporting the secondary type.
  • the step includes: if the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type is greater than 0.1, determining that the target region is heterozygous,
  • the major allele and the minor allele constituting the heterozygote being the primary type and the secondary type, respectively; otherwise, determining that the target region is homozygous, the two alleles constituting the homozygote both being the primary type.
  • based on the intermediate results of the preceding steps and the analysis of a large amount of sample data, the inventors found that setting the threshold for the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type to 0.1 allows the genotype of the target region to be determined simply and accurately.
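  • the 0.1 threshold rule can be written as a short function. The sketch below follows the rule as stated; the function name, the representation of reads as sets of identifiers, and the handling of the case with no primary-only reads are assumptions for the example.

```python
def call_genotype(primary, secondary, primary_reads, secondary_reads,
                  threshold=0.1):
    """Return the two alleles of the target region.

    primary_reads and secondary_reads are sets of read IDs supporting the
    primary and the secondary type.
    """
    only_primary = len(primary_reads - secondary_reads)
    only_secondary = len(secondary_reads - primary_reads)
    # Written as a product so that zero primary-only reads cause no division
    # by zero; equivalent to only_secondary / only_primary > threshold.
    if only_secondary > threshold * only_primary:
        return (primary, secondary)   # heterozygous
    return (primary, primary)         # homozygous


primary_reads = set(range(100))
homozygous = call_genotype("P", "S", primary_reads, set(range(90, 105)))
heterozygous = call_genotype("P", "S", primary_reads, set(range(90, 120)))
```

  • in the first call, 5 secondary-only reads against 90 primary-only reads give a ratio below 0.1; in the second, 20 against 90 exceed it.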
  • any numerical value expressed in an accurate manner may represent a range, such as an interval including plus or minus 10% of the numerical value, unless otherwise specified.
  • the typing method in the above embodiment may further include the following steps:
  • S90: determining the copy number of the target region.
  • the average sequencing depth of at least one region whose copy number is fixed at 2 is taken as the reference depth,
  • the ratio of the sequencing depth of the target region to the reference depth is calculated,
  • and the copy number of the target region is determined according to the calculated ratio. According to an embodiment of the invention, this step uses sequencing depth information to determine the copy number of the target regions HLA-DRB3, 4 and/or 5.
  • the previously determined genotype of the target region is then corrected according to the copy number, so that the genotype of the target region is determined more accurately.
  • if the copy number of the target region is 0, it is determined that the target region does not exist; if the copy number of the target region is 1, it is determined that the target region is homozygous, and the two alleles constituting the homozygote are both the primary type; if the copy number of the target region is 2, the target region is determined to be heterozygous, and the major allele and the minor allele constituting the heterozygote are the primary type and the secondary type, respectively.
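  • the copy-number step can be sketched as follows, assuming that depth scales linearly with copy number so that the copy number is about twice the depth ratio, rounded to an integer. That mapping is an assumption for the example (the text only says the copy number is determined from the ratio), and the function names are hypothetical.

```python
def estimate_copy_number(target_depth: float, reference_depth: float) -> int:
    """Estimate copy number from depth relative to a copy-number-2 region.

    Assumption: copy number is approximately 2 * (target / reference).
    """
    if reference_depth <= 0:
        raise ValueError("reference depth must be positive")
    return round(2 * target_depth / reference_depth)


def correct_genotype(copy_number: int, primary: str, secondary: str):
    """Apply the copy-number rules for 0, 1 and 2 copies."""
    if copy_number == 0:
        return None                    # target region absent
    if copy_number == 1:
        return (primary, primary)      # reported as homozygous
    if copy_number == 2:
        return (primary, secondary)    # reported as heterozygous
    raise ValueError("copy numbers above 2 are not covered by these rules")


cn_two = estimate_copy_number(30.0, 30.0)
cn_one = estimate_copy_number(15.0, 30.0)
cn_zero = estimate_copy_number(1.0, 30.0)
```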
  • the specific numerical values referred to herein are statistically meaningful. Therefore, the values “0”, “1” and “2”, although expressed precisely, may each represent a range, for example the value plus or minus 10% or plus or minus 20%.
  • a computer readable medium for storing a computer executable program, wherein executing the program comprises performing all or part of the steps of the typing method in any of the above embodiments.
  • the storage medium may include: a read only memory, a random access memory, a magnetic disk or an optical disk, and the like.
  • a typing apparatus 100 includes: a data input unit 110 for inputting data; a data output unit 120 for outputting data; a storage unit 140 for storing data, including a computer executable program; and a processor 130, coupled to the data input unit, the data output unit, and the storage unit, for executing the computer executable program, wherein executing the program includes completing all or part of the steps of the typing method in any of the above embodiments.
  • an embodiment of the present invention further provides a typing system 1000, which can be used to implement the typing method in any of the above embodiments of the present invention.
  • the system 1000 includes: a data input module 1001 configured to input sequencing data of the sample to be tested, the sequencing data including a plurality of reads from the target region;
  • an alignment module 1003, including a first alignment module 1013 and a second alignment module 1023, the first alignment module 1013 being configured to align the reference sequence group to the baseline sequence group to obtain a type set, the type set including a plurality of types, and the second alignment module 1023 being configured to align the sequencing data from the data input module to the reference sequence group to obtain an alignment result;
  • the baseline sequence group comprises one or more baseline sequences and can completely cover all gene sequences of the target region, different baseline sequences comprising different full gene sequences; the reference sequence group comprises a plurality of reference sequences and can completely cover all exons of the coding region sequence of the target region, different reference sequences comprising different exons;
  • a transformation module 1005 configured to convert the alignment result into an alignment result relative to the baseline sequence group, to obtain a transformed alignment result;
  • an assembly module 1007 configured to assemble, based on the transformed alignment result, the reads aligned to the same baseline sequence, to obtain an assembly result including a plurality of haplotypes;
  • a haplotype-supported type determination module 1009 configured to compare the variants on the haplotypes with the types in the type set, to determine the types supported by the haplotypes;
  • a clustering module 1011 configured to divide the types into two groups, according to the haplotypes supporting the types and the reads supporting the types, to obtain a first candidate type group and a second candidate type group.
  • the target region comprises at least one of HLA-DRB3, HLA-DRB4 and HLA-DRB5.
  • the target region comprises HLA-DRB3, HLA-DRB4 and HLA-DRB5.
  • the reference sequence group in the alignment module 1003 is constructed by: obtaining the coding region sequences and the full gene sequences of the target region; dividing each coding region sequence by exon to obtain a plurality of exon sequences; and extracting the K bp sequences flanking each exon from the full gene sequence of the type closest to that exon sequence and adding them to both sides of the corresponding exon sequence, to obtain a reference sequence of the reference sequence group, where K is determined based on the read length.
  • a baseline sequence in the first alignment module 1013 and/or the transformation module 1005 is the full gene sequence of one gene in the target region.
  • the assembly module 1007 is configured to: for the reads aligned to the same baseline sequence, assemble reads that overlap and whose overlapping portions are completely identical, to obtain the plurality of haplotypes.
  • haplotypes that cover less than 95% of the exon are filtered out.
  • a candidate haplotype determination module 1008 is further included, configured to perform the following before the haplotype-supported type determination module 1009 determines the types supported by the haplotypes: based on the result of assembling the reads aligned to the same baseline sequence, scoring each of the haplotypes, and screening the haplotypes based on the obtained scores to obtain candidate haplotypes; the candidate haplotypes then replace the haplotypes.
  • each of the haplotypes is scored using the following formula to determine its score: where c is the coverage of the exon by the haplotype, N is the number of reads supporting the haplotype, R is the reliability of the haplotype, Xi is the sequencing depth at position i of the haplotype, i is the index of the position on the haplotype, X is the average depth of the haplotype, and L is the length of the haplotype.
  • screening the haplotypes based on the obtained scores, as performed in the candidate haplotype determination module 1008, includes: for the assembly result of the reads aligned to the same baseline sequence, taking out the haplotype with the highest score and removing from the assembly result every haplotype that satisfies the following condition: its sequence is inconsistent with that of the taken-out haplotype, and its sequencing depth at the inconsistent sites is less than 20% of the sequencing depth of the taken-out haplotype at the corresponding sites; the above is repeated, and at most 4 haplotypes are taken out as the candidate haplotypes.
  • the following is performed in the haplotype-supported type determination module 1009: comparing the variants on each candidate haplotype with the types in the type set, and, if they match completely, determining that the candidate haplotype supports the type.
  • the candidate haplotypes supporting the same type are merged.
  • a type filtering module 1010 is further included, configured to filter out, before the clustering module 1011 obtains the first candidate type group and the second candidate type group, the types for which the number of supported exons is less than 30% of the total number of exons.
  • the clustering module 1011 is configured to: perform a first scoring of the types according to the haplotypes supporting each type and the reads supporting each type, and screen the types based on the obtained first scores to obtain a first candidate type and a second candidate type; and, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support the other types in the type set, assign the other types in the type set to the first candidate type or the second candidate type, to obtain the first candidate type group and the second candidate type group.
  • the first score is performed on the type, based on the obtained type.
  • assigning the other types in the type set to the first candidate type or the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support those other types, to obtain the first candidate type group and the second candidate type group, includes: comparing, for each of the other types, the sizes of a first intersection and a second intersection, and assigning that type to the first candidate type or the second candidate type according to the result of the comparison, wherein the first intersection is the intersection of the reads supporting that other type with the reads supporting the first candidate type, and the second intersection is the intersection of the reads supporting that other type with the reads supporting the second candidate type.
  • the primary/secondary type determination module 1013 is configured to: perform a second scoring of the types in the first candidate type group and the second candidate type group, based on the reads supporting each type and the first score of each type, and determine the primary type and the secondary type based on the obtained second scores.
  • the genotype determination module 1015 is configured to: if the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type is greater than 0.1, determine that the target region is heterozygous, the major allele and the minor allele constituting the heterozygote being the primary type and the secondary type, respectively; otherwise, determine that the target region is homozygous, both alleles constituting the homozygote being the primary type.
  • a copy number determination module 1016 is further included, configured to: calculate the average sequencing depth of at least one region whose copy number is fixed at 2 as a reference depth, calculate the ratio of the sequencing depth of the target region to the reference depth, and determine the copy number of the target region according to the calculated ratio; and determine the genotype of the target region according to the copy number of the target region.
  • the copy number determination module 1016 determines the genotype of the target region according to the copy number of the target region as follows: if the copy number of the target region is 0, the target region is determined to be absent; if the copy number of the target region is 1, the target region is determined to be homozygous, and both alleles constituting the homozygote are the primary type; if the copy number of the target region is 2, the target region is determined to be heterozygous, and the major allele and the minor allele constituting the heterozygote are the primary type and the secondary type, respectively.
  • the typing method, apparatus and/or system in any of the above embodiments constructs a type set by constructing reference sequences and baseline sequences of the target region, constructs haplotypes based on the read information, and compares the types with the haplotypes to genotype the target region.
  • This typing method is suitable for genotyping of any region, and is particularly suitable for typing of highly polymorphic regions, for example for typing of HLA-DRB3, 4 and/or 5.
  • the method does not require PCR of the target gene, which reduces the experimental workload and difficulty and makes the design of application or research protocols more flexible.
  • the reagents, instruments and software involved in the following examples are conventional commercially available products or open source; for example, a sequencing library preparation kit is purchased from Illumina and the library is built according to the kit instructions.
  • Sequence data includes the coding region sequence and the full length sequence of the gene.
  • the coding region sequence is divided into exons, and the 100 bp sequences (or another length, depending on the length of the sequencing reads) on both sides of each exon are extracted from the full-length gene sequence of the closest type and added to both sides of the corresponding exon sequence to form a reference sequence.
  • the exon sequences are extended to ensure that reads at the exon edges are retained during alignment.
  • the longest full-length sequence of the gene was selected as the baseline sequence, the variants of all coding region sequences and full-length sequences of the gene relative to the baseline sequence were recorded, and the variants were associated with the types.
  • for each exon, reads that overlap and whose overlapping sequences are completely identical are assembled, and a plurality of haplotypes are obtained.
  • Haplotypes with low exon coverage are filtered out. The number of supporting reads and the coverage of each haplotype are obtained, and each haplotype is scored according to its number of supporting reads and its coverage.
  • the haplotypes are taken out in order of score from high to low. Each time a haplotype is taken out, the other haplotypes that conflict with its sequence and have low depth at the conflicting sites are removed, giving the candidate haplotypes. The variants of the candidate haplotypes are compared with the variants of each type. A type that matches exactly is a type supported by the haplotype, and one haplotype may support multiple types.
  • the best type obtained from the candidate read set with the larger number of reads is the primary type, and the other best type is the secondary type. If the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type is large, for example greater than 0.1, the gene is judged to be heterozygous and both the primary and secondary types are retained; otherwise it is judged to be homozygous and only the primary type is retained.
  • if the typing result for the target region obtained by performing only steps 1-5 is inconsistent with the result obtained by performing all of the above steps, the typing result after steps 6 and 7 prevails.
  • Sequence data includes the coding region sequence and the full length sequence of the gene.
  • the gene coding sequence was divided into exons, and the 100 bp sequences (or another length, depending on the length of the sequencing reads) on both sides of each exon were extracted from the full-length gene sequence of the closest type and added to both sides of the corresponding exon sequence to form a reference sequence.
  • the exon sequences are extended to ensure that reads at the exon edges are retained during alignment.
  • the longest full-length sequence of the gene was selected as the baseline sequence, the variants of all coding region sequences and full-length sequences of the gene relative to the baseline sequence were recorded, and the variants were associated with the types.
  • for each exon, reads that overlap and whose overlapping sequences are completely identical are assembled, and a plurality of haplotypes are obtained. Haplotypes with an exon coverage below 95% are filtered out. The number of reads, depth and coverage of each haplotype are obtained, and the score of the haplotype, Score, is calculated. Score calculation formula:
  • N: the number of reads supporting this haplotype
  • Xi: the depth at each position of the haplotype
  • the haplotypes are taken out in order of Score from high to low; each time a haplotype is taken out, other haplotypes that conflict with its sequence and whose depth at the conflicting sites is lower than 20% of that of the taken-out haplotype are removed.
  • a maximum of 4 haplotypes are taken out as candidate haplotypes. The variants of the candidate haplotypes are compared with the variants of each type. A type that matches exactly is a type supported by the haplotype, and one haplotype may support multiple types.
  • N: the number of reads supporting this type
  • N*: the number of reads supporting this type that belong to the other candidate read set
  • in each of the two groups, the type with the highest TScore_New is the best type.
  • the best type in the first group is the primary type, and the best type in the second group is the secondary type. If the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type is greater than 0.1, the gene is judged to be heterozygous and both the primary and secondary types are retained; otherwise it is judged to be homozygous and only the primary type is retained.
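
The final decision rules in the list above (the 0.1 read-ratio threshold and the copy-number correction) can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names are hypothetical, reads are represented by arbitrary hashable identifiers, and the mapping from depth ratio to copy number (twice the ratio, snapped to the nearest of 0, 1 and 2) is an assumption, since the text only says the copy number is determined "according to the calculated ratio".

```python
def exclusive_counts(primary_reads, secondary_reads):
    """Count the reads that support only one of the two best types."""
    only_primary = len(primary_reads - secondary_reads)
    only_secondary = len(secondary_reads - primary_reads)
    return only_primary, only_secondary


def zygosity_from_reads(primary_reads, secondary_reads, threshold=0.1):
    """Apply the read-ratio rule: heterozygous if the ratio exceeds the threshold."""
    only_primary, only_secondary = exclusive_counts(primary_reads, secondary_reads)
    if only_primary == 0:
        # This case is not covered by the text; the sketch refuses to guess.
        raise ValueError("no read supports only the primary type")
    ratio = only_secondary / only_primary
    return "heterozygous" if ratio > threshold else "homozygous"


def copy_number_from_depth(target_depth, reference_depth):
    """Estimate the copy number from depth (assumed mapping, see lead-in).

    The reference region has a fixed copy number of 2, so the estimate is
    taken as 2 * ratio and snapped to the nearest of 0, 1 and 2.
    """
    estimate = 2.0 * target_depth / reference_depth
    return min((0, 1, 2), key=lambda copies: abs(estimate - copies))


def genotype_from_copy_number(copies, primary, secondary):
    """Map a copy number to a genotype as stated in the text."""
    if copies == 0:
        return None  # target region absent
    if copies == 1:
        return (primary, primary)  # reported as homozygous for the primary type
    if copies == 2:
        return (primary, secondary)  # heterozygous
    raise ValueError("copy number outside the range covered by the text")
```

For instance, `genotype_from_copy_number(copy_number_from_depth(15.0, 30.0), "type_A", "type_B")` reports the region as homozygous for the placeholder `type_A`.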

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Disclosed is a typing method, comprising the steps of: obtaining sequencing data; aligning a reference sequence group to a baseline sequence group and obtaining a type set; aligning the sequencing data to the reference sequence group and obtaining an alignment result; converting the alignment result to a converted alignment result relative to the baseline sequence group; assembling reads aligned to the same baseline sequence and obtaining haplotypes; determining the types supported by the haplotypes; grouping the types and obtaining a first and a second candidate type group; respectively screening the types in the first and the second candidate type group to determine the primary and secondary types; and determining the genotype of the target region based on the difference in the number of reads supporting the primary and secondary types. This typing method is applicable to genotyping of any region, and is particularly applicable to the typing of highly polymorphic regions, such as the typing of HLA-DRB3, 4 and/or 5.

Description

Typing method and device
Technical field
The present invention relates to the field of bioinformatics, and in particular to a typing method and device.
Background art
HLA-DRB3, DRB4 and DRB5 are homologous genes encoding the β chain of HLA (human leukocyte antigen) class II molecules.
HLA class II molecules are expressed on the cell membrane of antigen-presenting cells and present peptides of extracellular proteins, playing a central role in the immune system. HLA-DRB3 has been reported to be associated with Crohn's disease, Graves' disease, type 1 diabetes and other conditions. HLA-DRB4 has been reported to be associated with childhood acute lymphoblastic leukemia, Hashimoto's thyroiditis, allergic granulomatous angiitis, vitiligo and other conditions. HLA-DRB5 has been reported to be associated with keloids, systemic lupus erythematosus, multiple sclerosis, narcolepsy and other conditions.
Typing HLA-DRB3, DRB4 and DRB5 is of important value for medical applications and disease research.
Existing typing methods for HLA-DRB3, 4 and 5 mainly combine exon PCR with gene sequencing, or long-fragment PCR with gene sequencing. They require PCR primer design, involve many experimental steps, and cannot be applied to high-throughput whole-genome sequencing or high-throughput chip-capture sequencing.
Summary of the invention
The present invention aims to solve at least one of the above problems, or to provide at least one alternative commercial means.
According to one aspect of the present invention, a typing method is provided, comprising: acquiring sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region; aligning a reference sequence group to a baseline sequence group to obtain a type set, the type set comprising a plurality of types, wherein the baseline sequence group comprises one or more baseline sequences and can completely cover all gene sequences of the target region, different baseline sequences comprising different full gene sequences, and the reference sequence group comprises a plurality of reference sequences and can completely cover all exons of the coding region sequence of the target region, different reference sequences comprising different exons; aligning the sequencing data to the reference sequence group to obtain an alignment result; converting the alignment result into an alignment result relative to the baseline sequence group to obtain a converted alignment result; based on the converted alignment result, separately assembling the reads aligned to the same baseline sequence to obtain an assembly result, the assembly result comprising a plurality of haplotypes; comparing the variants on the haplotypes with the types in the type set to determine the types supported by the haplotypes; dividing the types into two groups, according to the haplotypes supporting the types and the reads supporting the types, to obtain a first candidate type group and a second candidate type group; screening the types in the first candidate type group and in the second candidate type group, respectively, based on the haplotypes supporting the types and the reads supporting the types, to determine a primary type and a secondary type; and determining the genotype of the target region based on the difference between the number of reads supporting the primary type and the number of reads supporting the secondary type.
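
The grouping step of the method can be illustrated with the intersection rule given for the clustering module: each remaining type joins the candidate type whose supporting reads it shares more of. The Python sketch below is only an illustration under stated assumptions: the function name is hypothetical, read support is modelled as a dictionary mapping each type to a set of read identifiers, and ties are assigned to the first candidate, which the text does not specify.

```python
def group_types(read_support, first_candidate, second_candidate):
    """Split types into two groups around two candidate types.

    read_support maps each type name to the set of reads supporting it.
    Each other type goes to the candidate with the larger read intersection
    (ties go to the first candidate, an assumption of this sketch).
    """
    first_reads = read_support[first_candidate]
    second_reads = read_support[second_candidate]
    groups = {first_candidate: [first_candidate],
              second_candidate: [second_candidate]}
    for type_name, reads in read_support.items():
        if type_name in groups:
            continue  # the candidates seed their own groups
        first_overlap = len(reads & first_reads)
        second_overlap = len(reads & second_reads)
        if first_overlap >= second_overlap:
            groups[first_candidate].append(type_name)
        else:
            groups[second_candidate].append(type_name)
    return groups
```

With placeholder types `A` to `D`, where `C` shares two reads with `A` and one with `B`, `C` ends up in the group of `A`.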
According to another aspect of the present invention, a computer-readable medium is provided for storing a computer-executable program, wherein executing the program comprises carrying out the typing method of the above aspect of the invention. Those skilled in the art will understand that, when the computer-executable program is executed, all or part of the steps of the above typing method can be carried out by instructing the relevant hardware. The storage medium may include a read-only memory, a random access memory, a magnetic disk, an optical disk, and the like.
According to a further aspect of the present invention, a typing device is provided, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including a computer-executable program; and a processor, connected to the data input unit, the data output unit and the storage unit, for executing the computer-executable program, wherein executing the program comprises carrying out the above typing method.
According to yet another aspect of the present invention, a typing system is provided, comprising: a data input module for inputting sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region; an alignment module comprising a first alignment module and a second alignment module, the first alignment module being used to align a reference sequence group to a baseline sequence group to obtain a type set comprising a plurality of types, and the second alignment module being used to align the sequencing data from the data input module to the reference sequence group to obtain an alignment result, wherein the baseline sequence group comprises one or more baseline sequences and can completely cover all gene sequences of the target region, different baseline sequences comprising different full gene sequences, and the reference sequence group comprises a plurality of reference sequences and can completely cover all exons of the coding region sequence of the target region, different reference sequences comprising different exons; a transformation module for converting the alignment result into an alignment result relative to the baseline sequence group to obtain a converted alignment result; an assembly module for separately assembling, based on the converted alignment result, the reads aligned to the same baseline sequence to obtain an assembly result comprising a plurality of haplotypes; a haplotype-supported type determination module for comparing the variants on the haplotypes with the types in the type set to determine the types supported by the haplotypes; a clustering module for dividing the types into two groups, according to the haplotypes supporting the types and the reads supporting the types, to obtain a first candidate type group and a second candidate type group; a primary/secondary type determination module for screening the types in the first candidate type group and in the second candidate type group, respectively, based on the haplotypes supporting the types and the reads supporting the types, to determine a primary type and a secondary type; and a genotype determination module for determining the genotype of the target region based on the difference between the number of reads supporting the primary type and the number of reads supporting the secondary type.
The method of the present invention constructs reference sequences and baseline sequences for the target region, builds a type set, constructs haplotypes from the read information, and compares the types with the haplotypes to genotype the target region. The typing method of the invention is applicable to genotyping of any region and is particularly suited to typing highly polymorphic regions, for example HLA-DRB3, 4 and/or 5, and in particular to typing HLA-DRB3, 4 and/or 5 based on sequencing data that contains their sequence information in any form. The method does not require PCR or long-fragment PCR of the exons of the target gene, which reduces the experimental workload and difficulty and makes the design of application or research protocols more flexible.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the typing method in an embodiment of the present invention.
Fig. 2 is a flowchart of the typing method in an embodiment of the present invention.
Fig. 3 is a flowchart of the typing method in an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of the typing device in an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of the typing system in an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of the typing system in an embodiment of the present invention.
Detailed description
As shown in Fig. 1, a typing method provided according to an embodiment of the present invention comprises the following steps:
S10: Obtain sequencing data.
Sequencing data of the sample to be tested is obtained, the sequencing data comprising a plurality of reads from the target region.
The sequencing data is obtained by preparing a sequencing library from the nucleic acid of the sample to be tested and sequencing it on a sequencer. According to an embodiment of the invention, obtaining the sequencing data comprises: obtaining nucleic acid from the sample to be tested, preparing a sequencing library of the nucleic acid, and sequencing the sequencing library. The sequencing library is prepared according to the requirements of the chosen sequencing method. Depending on the sequencing platform selected, the sequencing method may be chosen from, but is not limited to, the HiSeq 2000/2500 sequencing platform of Illumina, the Ion Torrent platform of Life Technologies, the BGISEQ platform of BGI, and single-molecule sequencing platforms. Either single-end or paired-end sequencing may be used. The resulting raw data are the sequenced fragments, called reads.
The target region may be any genomic region of interest. According to an embodiment of the invention, the target region comprises members of the MHC (major histocompatibility complex) gene family. Mammalian MHC genes are highly polymorphic, and the human MHC is usually called HLA (human leucocyte antigen). According to one embodiment of the invention, the sample to be tested is from a human, and the target region comprises at least one of HLA-DRB3, HLA-DRB4 and HLA-DRB5. According to another embodiment of the invention, the target region comprises HLA-DRB3, HLA-DRB4 and HLA-DRB5. With the method of the invention, such highly polymorphic regions can be typed accurately without PCR, based on sequencing data in any form that contains HLA-DRB3, HLA-DRB4 and/or HLA-DRB5.
S20: Alignment.
The reference sequence group is aligned to the baseline sequence group to obtain a type set, the type set comprising a plurality of types.
The sequencing data is aligned to the reference sequence group to obtain an alignment result.
The two alignment processes above are performed independently and in no particular order. They yield two outputs, namely the type set and the alignment result.
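
How the two outputs relate can be sketched in Python: because each reference sequence has itself been placed on a baseline sequence, a read position on a reference sequence can be shifted into baseline coordinates, which is what the later conversion of the alignment result requires. The sketch below is an assumption-laden illustration only: it assumes a gap-free placement with a known 0-based offset, and the class, function and sequence names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferencePlacement:
    """Where a reference sequence sits on its baseline sequence (hypothetical)."""
    baseline_name: str
    baseline_start: int  # 0-based offset of the reference sequence on the baseline


def to_baseline(placements, reference_name, read_start, read_length):
    """Shift a read interval from reference coordinates to baseline coordinates.

    Assumes the reference sequence aligns to the baseline without gaps, so a
    constant offset is enough; real alignments with indels would need more.
    """
    placement = placements[reference_name]
    start = placement.baseline_start + read_start
    return placement.baseline_name, start, start + read_length
```

A read starting at position 120 of a reference sequence placed at baseline offset 5000 would then start at baseline position 5120.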
The baseline sequence group comprises one or more baseline sequences. A baseline sequence is a predetermined sequence containing the full length of a gene, and may be a previously obtained reference template for the biological category to which the sample to be tested belongs. For example, if the sample to be tested is from a human individual, the baseline sequences may be the full gene sequences of the genes in the target region contained in HG19 provided by the NCBI database. According to one embodiment of the invention, one baseline sequence is the full gene sequence of one gene in the target region. If a gene has several full gene sequences, the longest of them may be chosen as the baseline sequence of that gene. The baseline sequence group can completely cover all gene sequences of the target region, and different baseline sequences comprise different full gene sequences.
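
The rule of choosing the longest full gene sequence as the baseline sequence of a gene can be sketched as below. This is a minimal Python illustration; the function name and the dictionary layout (gene name mapped to a list of candidate sequences) are assumptions made for the example.

```python
def choose_baselines(full_sequences):
    """Pick, for each gene, the longest of its full gene sequences.

    full_sequences maps a gene name to a list of candidate sequences;
    genes without any candidate sequence are skipped.
    """
    return {
        gene: max(candidates, key=len)
        for gene, candidates in full_sequences.items()
        if candidates
    }
```

If several candidates share the maximum length, `max` keeps the first one encountered; the text does not say how such ties should be resolved.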
所称参考序列组包括多条参考序列,参考序列为预先确定的包含外显子的序列,可以是预先获得的待测样本所属生物类别的参考模板,例如,若待测样本来源的为人类个体,参考序列可选择NCBI数据库提供的HG19中包含的目标区域的外显子的序列,进一步地,也可以预先配置包含更多参考序列的资源库,例如依据待测样本来源个体的状态、地域等因素选择或是测定组装出更接近的序列作为参考序列。参考序列组能够完全覆盖目标区域的编码区序列的所有外显子,不同的参考序列包含不同的外显子。The reference sequence group includes a plurality of reference sequences, and the reference sequence is a predetermined sequence containing exons, and may be a reference template of a biological class to which the sample to be tested belongs in advance, for example, if the sample to be tested is a human individual. The reference sequence may select a sequence of exons of the target region included in the HG19 provided by the NCBI database. Further, a resource library including more reference sequences may be pre-configured, for example, according to the state, region, etc. of the sample source. Factor selection or assay to assemble a closer sequence as a reference sequence. The reference sequence set is capable of completely covering all exons of the coding region sequence of the target region, and the different reference sequences contain different exons.
A type is an allele. A given type is a given allele of a given gene, and is essentially a set of specific variations in a genomic region. International organizations such as the WHO name the known alleles of certain genes, and in general only named alleles are called types. According to one embodiment of the invention, the target region contains at least part of an HLA gene. Because HLA genes have a very large number of alleles and play a key role in medical transplantation, the WHO has named all known alleles of the HLA genes, for example DRB3*01:01:01, and a type refers to one such named allele. Typing is the process of determining, by any of various methods, the types of a target gene in a target individual.
The types in the type set are given as position information and variation information relative to the benchmark sequence set. According to one embodiment of the invention, the longest full gene sequence is selected as the benchmark sequence, the variations of all coding region sequences and full gene sequences relative to the benchmark sequence are recorded, and the variations are associated with types. Associating variation information with types makes it possible to establish type criteria from the detected variations later, which facilitates genotyping.
Aligning the sequencing data to the reference sequence set yields the alignment result, which contains the position information and variation information of the reads that align to any reference sequence in the reference sequence set.
In this embodiment the sequencing data are not aligned directly to the benchmark sequence set, because the benchmark sequence set contains the sequence of only one allele of each gene. Instead, the sequencing data are first aligned to the reference sequence set and the alignment result is then converted to benchmark-sequence coordinates. Because the reference sequence set contains all alleles, this increases the number of reads that align and markedly improves data utilization.
According to one embodiment of the invention, the target region is a highly polymorphic region and the reference sequence set is constructed by the following steps: obtaining the coding region sequences and full gene sequences containing the target region; splitting the coding region sequences by exon to obtain a plurality of exon sequences; and, from the full gene sequence of the type closest to an exon sequence, extracting K bp on each side of that exon and adding them to the two sides of the corresponding exon sequence, to obtain a reference sequence of the reference sequence set, where K is the read length. For example, to type HLA-DRB3, HLA-DRB4 and/or HLA-DRB5, the HLA types and sequence data can be downloaded from the IMGT/HLA database; the sequence data include multiple coding region sequences and multiple full gene sequences. The format of the downloaded HLA type and sequence data may be modified to simplify the later analysis steps. Each coding region sequence is then split by exon, and K bp on each side of the exon are extracted from the full-length gene sequence of the closest type and added to the two sides of the corresponding exon sequence to form a reference sequence. K depends on the read length. The exon sequences are extended so that reads aligning to the exon edges are retained during alignment, which lets this portion of the data be used and improves data utilization. Note that K may vary according to whether the reads have a uniform length or different lengths. Moreover, K need not be a fixed or exact value; it may be a range, for example the interval of plus or minus 10% around the length of any read, and the K added to the two sides of an exon sequence may take different values. According to one embodiment of the invention, the target typing region includes HLA-DRB3, HLA-DRB4 and HLA-DRB5. The inventors obtained from the IMGT/HLA database 94 published coding region sequences of HLA-DRB3, HLA-DRB4 or HLA-DRB5 and 7 full-length gene sequences of HLA-DRB3, HLA-DRB4 or HLA-DRB5. These coding region sequences and full-length gene sequences differ in length, in the number of exons they contain, and in the base sequence or structure at particular positions. The reference sequence set constructed from them contains 166 reference sequences.
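The flank-extension step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extend_exon` and the assumption that the exon coordinates on the closest full gene sequence are already known (0-based, end-exclusive) are hypothetical.

```python
def extend_exon(exon_seq, closest_gene, exon_start, exon_end, k):
    """Build one reference sequence: exon plus up to k flanking bases per side.

    exon_seq     -- exon taken from a coding region sequence
    closest_gene -- full gene sequence of the closest type (assumed given)
    exon_start, exon_end -- position of the matching exon on closest_gene,
                            0-based and end-exclusive (assumed known)
    k            -- flank length, roughly the read length
    """
    # Flanks come from the closest full gene sequence, clipped at its ends.
    left = closest_gene[max(0, exon_start - k):exon_start]
    right = closest_gene[exon_end:exon_end + k]
    return left + exon_seq + right


if __name__ == "__main__":
    gene = "AAAAACCCCCGGGGGTTTTT"
    print(extend_exon("CCTCC", gene, 5, 10, 3))
```

Taking the flanks from the gene sequence while keeping the exon from the coding region sequence mirrors the text: the exon carries the allele-specific bases, and the flanks only serve to retain reads that overhang the exon edge.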
The full gene sequence of the type closest to an exon sequence is the full gene sequence whose corresponding exon matches that exon sequence best, that is, the one with the most similar sequence and the fewest differences. For example, if an exon sequence has bases that differ from the reference genome at several specific positions, and the corresponding exon region of some full-length gene sequence has the same bases at those positions, that full-length gene sequence is determined to be closest in type to the exon sequence.
"Aligning to" means matching. The alignment itself can be performed with known alignment software such as SOAP, BWA or TeraMap; this embodiment places no restriction on the choice. During alignment, depending on the alignment parameters, a read or a read pair is allowed at most n base mismatches, for example with n set to 1 or 2. If more than n bases of the read pair are mismatched, the read pair is considered not to align to the reference sequence; or, if all of the mismatched bases lie in one read of the pair, only that read of the pair is considered not to align to the reference sequence.
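The mismatch tolerance can be illustrated with a small gapless check. This is only a sketch of the acceptance rule for a single read at a known position; real aligners such as BWA also search for the position and handle gaps, and the function names here are hypothetical.

```python
def count_mismatches(read, reference, pos):
    """Count mismatched bases when read is placed at reference[pos:]."""
    window = reference[pos:pos + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)


def aligns(read, reference, pos, n=2):
    """True if the read fits inside the reference with at most n mismatches."""
    if pos < 0 or pos + len(read) > len(reference):
        return False
    return count_mismatches(read, reference, pos) <= n


if __name__ == "__main__":
    ref = "TTACGTTT"
    print(aligns("ACGT", ref, 2, n=1))
    print(aligns("AGGA", ref, 2, n=1))
```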
When the match is a perfect match, for example when a read has no mismatch against the stretch of the reference sequence that carries a variation, so that the read contains the variant site or the signature sequence that the variation produces, the read is called a read supporting that variation.
S30 Obtaining the converted alignment result.
Convert the alignment result into an alignment result relative to the benchmark sequence set to obtain the converted alignment result. That is, all position information and variation information in the alignment result are converted into position information and variation information relative to the benchmark sequence set. The converted alignment result and the type set from S20 are then both expressed relative to the benchmark sequences, so the two can be compared later to determine the genotype.
According to one embodiment of the invention, the most complete full gene sequence of each target gene is selected as the benchmark sequence of that gene, and the alignment positions and variation information in the alignment result are converted into position information and variation information relative to the benchmark sequence set to obtain the converted alignment result. This allows the types to be screened later on the basis of the detected variations in order to determine the genotype.
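A minimal sketch of the coordinate conversion follows. It assumes, purely for illustration, that each reference sequence maps onto its benchmark sequence without gaps, so a single offset per reference sequence is enough; the patent does not state how the mapping is stored, and `Hit`, `to_benchmark` and the offset table are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class Hit(NamedTuple):
    ref_name: str                            # reference sequence the read aligned to
    pos: int                                 # read start on that reference sequence
    variants: Tuple[Tuple[int, str], ...]    # (position, alternate base) pairs


def to_benchmark(hit: Hit, offsets: Dict[str, Tuple[str, int]]):
    """Re-express a hit in benchmark coordinates.

    offsets maps a reference sequence name to (benchmark name, offset),
    where offset is the benchmark position of reference position 0.
    A gapless mapping is an assumption of this sketch.
    """
    benchmark, offset = offsets[hit.ref_name]
    shifted = tuple((p + offset, alt) for p, alt in hit.variants)
    return benchmark, hit.pos + offset, shifted


if __name__ == "__main__":
    table = {"exon2_ref_a": ("HLA-DRB3", 5000)}
    print(to_benchmark(Hit("exon2_ref_a", 30, ((42, "T"),)), table))
```

If the reference and benchmark sequences differ by insertions or deletions, a per-position lookup derived from the S20 alignment would be needed instead of a single offset.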
S40 Assembly.
Based on the converted alignment result, assemble separately the reads aligned to each benchmark sequence to obtain an assembly result, the assembly result including a plurality of haplotypes.
A haplotype is a combination of single nucleotide polymorphisms that lie in a particular region of one chromosome, are linked to one another, and tend to be inherited together by offspring.
The sequences can be assembled by known assembly methods; this embodiment places no restriction on the method. According to one embodiment of the invention, this step includes: for the reads aligned to the same reference sequence, performing the assembly with reads that overlap and whose overlapping portions are completely identical, to obtain the plurality of haplotypes.
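The rule "merge only reads whose overlap is completely identical" can be sketched for two reads already placed on the same coordinates. This is an illustrative fragment of an assembly step, not a full assembler; the `(start, sequence)` representation and the function name are assumptions.

```python
def merge_reads(a, b):
    """Merge two placed reads (start, seq) if they overlap identically.

    Returns the merged (start, seq), or None when the reads do not overlap
    or when their overlapping bases disagree.
    """
    (a_start, a_seq), (b_start, b_seq) = sorted([a, b])
    a_end = a_start + len(a_seq)
    if b_start >= a_end:
        return None  # no shared position
    overlap = min(a_end, b_start + len(b_seq)) - b_start
    shift = b_start - a_start
    if a_seq[shift:shift + overlap] != b_seq[:overlap]:
        return None  # overlap is not identical, likely a different haplotype
    return a_start, a_seq + b_seq[overlap:]


if __name__ == "__main__":
    print(merge_reads((0, "ACGTAC"), (3, "TACGG")))
    print(merge_reads((0, "ACGTAC"), (3, "TTCGG")))
```

Refusing to merge on any disagreement is what keeps reads from the two chromosome copies in separate haplotypes.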
Each benchmark sequence in the benchmark sequence set contains the full sequence of one gene, and a eukaryotic gene generally contains several exons. The haplotypes obtained by assembling, among the reads aligned to the same benchmark sequence, the reads that overlap with completely identical overlapping sequence are therefore, in substance, at least part of at least one exon sequence, differing in base sequence or in length. According to one embodiment of the invention, after the assembly result is obtained, haplotypes whose coverage of the exon is below 95% are filtered out. Discarding these relatively short sequences improves the reliability of the data used in the later analysis steps, reduces data complexity, and facilitates accurate typing.
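The 95% coverage filter can be sketched as below, assuming each haplotype and each exon is described by a half-open interval on the benchmark sequence. The interval representation and function names are illustrative assumptions.

```python
def exon_coverage(hap_start, hap_end, exon_start, exon_end):
    """Fraction of the exon [exon_start, exon_end) covered by the haplotype."""
    overlap = max(0, min(hap_end, exon_end) - max(hap_start, exon_start))
    return overlap / (exon_end - exon_start)


def filter_by_coverage(haplotypes, exon, min_coverage=0.95):
    """Keep haplotypes (start, end) that cover at least min_coverage of the exon."""
    exon_start, exon_end = exon
    return [
        h for h in haplotypes
        if exon_coverage(h[0], h[1], exon_start, exon_end) >= min_coverage
    ]


if __name__ == "__main__":
    haps = [(100, 200), (110, 200), (104, 230)]
    print(filter_by_coverage(haps, (100, 200)))
```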
Note that the step of obtaining the type set in S20 may be performed before or after S30 and S40. For example, the type set may be obtained in S20 first and S30 and S40 performed afterwards, or S30 and S40 may be performed first and the alignment that yields the type set in S20 performed afterwards.
S50 Determining the types supported by the haplotypes.
Compare the variations on the haplotypes with the types in the type set to determine the types supported by the haplotypes.
According to one embodiment of the invention, the following is performed before this step: based on the assembly results of the reads aligned to the same benchmark sequence, each haplotype is scored, and the haplotypes are screened by their scores to obtain candidate haplotypes; the candidate haplotypes then replace the haplotypes in the later steps. This reduces noise in the data and the amount of data to be processed.
According to yet another embodiment of the invention, scoring each haplotype based on the assembly results of the reads aligned to the same benchmark sequence, and screening the haplotypes by their scores to obtain candidate haplotypes, includes determining the score Score of the haplotype using the following formula,
Figure PCTCN2016074027-appb-000001
where c is the coverage of the exon by the haplotype, N is the number of reads supporting the haplotype, and R represents the reliability of the haplotype,
Figure PCTCN2016074027-appb-000002
Xi is the sequencing depth at position i of the haplotype, i is the index of a position on the haplotype, X is the average depth of the haplotype, and L is the length of the haplotype. With these formulas, the reliability of a haplotype is judged from its own average sequencing depth, the sequencing depth at each position, and its length. The haplotype is then given a score according to its reliability, its coverage of the corresponding exon, and the number of reads supporting it. The score characterizes the haplotype and allows several haplotypes to be compared and evaluated.
The screening includes: for the assembly result of the reads aligned to the same benchmark sequence, taking out the haplotype with the highest score, and then removing from the assembly result the haplotypes that satisfy the following condition: the sequence is inconsistent with that of the haplotype taken out, and the sequencing depth at the inconsistent site is below 20% of the sequencing depth at the corresponding site of the haplotype taken out. These steps are repeated until no more than 4 haplotypes have been taken out, which gives the candidate haplotypes. Because the target region comes from a diploid, a region has at most two types (when it is heterozygous). Screening the haplotypes in this way leaves each exon with at most 4 corresponding haplotypes, removes haplotypes that are unrealistic or uninformative, and reduces data complexity, so the later analysis runs faster and the typing is more accurate.
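The iterative screening can be sketched as follows. Several details are assumptions made for illustration: haplotypes are compared position by position over equal-length sequences, a haplotype is removed if any of its differing sites falls below the 20% depth threshold, and the scores are supplied from outside because the Score formula itself is given only as a figure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Haplotype:
    name: str
    seq: str
    depth: List[int]   # sequencing depth per position, same length as seq
    score: float       # precomputed Score (formula not reproduced here)


def is_weak_variant_of(h, best, ratio=0.2):
    """True if h differs from best at a site where h's depth is below ratio * best's."""
    diffs = [i for i, (a, b) in enumerate(zip(h.seq, best.seq)) if a != b]
    return any(h.depth[i] < ratio * best.depth[i] for i in diffs)


def screen(haplotypes, max_keep=4, ratio=0.2):
    """Repeatedly take the top-scoring haplotype and drop its weak variants."""
    remaining = sorted(haplotypes, key=lambda h: h.score, reverse=True)
    kept = []
    while remaining and len(kept) < max_keep:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [h for h in remaining if not is_weak_variant_of(h, best, ratio)]
    return kept


if __name__ == "__main__":
    haps = [
        Haplotype("a", "ACGT", [100, 100, 100, 100], 10.0),
        Haplotype("b", "ACTT", [100, 100, 10, 100], 5.0),
        Haplotype("c", "AGGT", [90, 60, 90, 90], 4.0),
    ]
    print([h.name for h in screen(haps)])
```

In the example, haplotype "b" differs from "a" at a site with depth 10 against 100 and is treated as a probable sequencing error, while "c" differs at a well-supported site and survives.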
According to one embodiment of the invention, the candidate haplotypes replace the haplotypes in the later steps, and this step includes: comparing the variations on each candidate haplotype with the types in the type set and, if they match exactly, determining that the candidate haplotype supports the type. Determining the types supported by the candidate haplotypes in this way allows the types to be evaluated further according to their candidate haplotype support.
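One way to read "exact match" is sketched below: within the span of the haplotype, the variations recorded for the type must be exactly the variations seen on the haplotype. Restricting the comparison to the haplotype's span is an assumption of this sketch (a haplotype covers roughly one exon, while a type describes the whole gene), and the `(position, base)` encoding is illustrative.

```python
def supports(hap_start, hap_end, hap_variants, type_variants):
    """True if the haplotype's variants equal the type's variants inside its span.

    hap_variants and type_variants are iterables of (position, alternate base)
    in benchmark coordinates; the span [hap_start, hap_end) is half-open.
    """
    expected = {v for v in type_variants if hap_start <= v[0] < hap_end}
    return expected == set(hap_variants)


def supported_types(hap_start, hap_end, hap_variants, type_set):
    """Names of all types in type_set (name -> variants) that the haplotype supports."""
    return [
        name for name, variants in type_set.items()
        if supports(hap_start, hap_end, hap_variants, variants)
    ]


if __name__ == "__main__":
    type_set = {
        "T1": {(105, "A"), (150, "T"), (320, "G")},
        "T2": {(105, "A"), (320, "G")},
    }
    print(supported_types(100, 200, {(105, "A"), (150, "T")}, type_set))
```

Because several types can share the same variations within one exon, a haplotype may support more than one type, which is why the later steps rank types by their overall support.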
According to one embodiment of the invention, after the types supported by the candidate haplotypes have been determined, the candidate haplotypes supporting the same type are merged. This speeds up the later steps.
S60 Determining a first candidate type group and a second candidate type group.
According to the haplotypes supporting the types and the reads supporting the types, divide the types into two groups to obtain a first candidate type group and a second candidate type group.
The coding sequence of the target region contains several exons. According to one embodiment of the invention, before this step is performed, types supported on fewer than 30% of the total number of exons are filtered out. Removing uninformative or weakly informative types from the type set under this condition reduces noise in the data and the complexity of the typing, so the later steps run faster and the typing is more accurate.
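The exon-support filter can be sketched as below. The input format, a mapping from each type to the set of exon numbers on which some haplotype supports it, is an assumption made for illustration.

```python
from typing import Dict, List, Set


def filter_types(supported_exons: Dict[str, Set[int]],
                 total_exons: int,
                 min_fraction: float = 0.3) -> List[str]:
    """Keep types supported on at least min_fraction of the exons."""
    return [
        name for name, exons in supported_exons.items()
        if len(exons) / total_exons >= min_fraction
    ]


if __name__ == "__main__":
    support = {"T1": {1, 2, 3}, "T2": {2}, "T3": {1, 2}}
    print(filter_types(support, total_exons=6))
```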
As shown in FIG. 2, according to one embodiment of the invention, this step includes: giving each type a first score according to the haplotypes supporting the type and the reads supporting the type, and screening the types by their first scores to obtain a first candidate type and a second candidate type; and, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support the other types in the type set, assigning each of the other types in the type set to the first candidate type or the second candidate type, to obtain the first candidate type group and the second candidate type group.
According to one embodiment of the invention, giving each type a first score according to the haplotypes and reads supporting the type, and screening the types by their first scores to obtain the first candidate type and the second candidate type, includes: determining the first score TScore of the type using the formula TScore = N × S, where N is the number of reads supporting the type and S is the sum of the scores of the candidate haplotypes supporting the type; and combining the first scores of all the types in pairs and determining the two types in the pair with the highest sum of first scores to be the first candidate type and the second candidate type. The first score assigned by this formula characterizes the type and supports the later type determination.
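The first score and the pairwise selection can be sketched as follows. The formula TScore = N × S is taken from the text; the type names, the dictionary layout, and the choice to report the higher-scoring member of the pair first are illustrative assumptions.

```python
from itertools import combinations


def tscore(n_reads, haplotype_scores):
    """First score of a type: supporting read count times summed haplotype scores."""
    return n_reads * sum(haplotype_scores)


def pick_candidates(tscores):
    """Return the pair of types whose first scores have the highest sum."""
    best = max(
        combinations(sorted(tscores), 2),
        key=lambda pair: tscores[pair[0]] + tscores[pair[1]],
    )
    # Report the higher-scoring type first (a convention of this sketch).
    return tuple(sorted(best, key=lambda t: tscores[t], reverse=True))


if __name__ == "__main__":
    scores = {
        "T1": tscore(120, [3.0, 2.0]),
        "T2": tscore(90, [2.5, 2.5]),
        "T3": tscore(30, [3.0]),
    }
    print(pick_candidates(scores))
```

Since the score of a pair is a plain sum, enumerating pairs gives the same result as taking the two highest-scoring types; the enumeration is kept here only to follow the wording of the text.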
According to one embodiment of the invention, assigning each of the other types in the type set to the first candidate type or the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support the other types, to obtain the first candidate type group and the second candidate type group, includes: for each of the other types, comparing the sizes of a first intersection and a second intersection and, according to the result, assigning that type to the first candidate type or the second candidate type. The first intersection is the intersection of the reads supporting the other type with the reads supporting the first candidate type, and the second intersection is the intersection of the reads supporting the other type with the reads supporting the second candidate type. If the first intersection is larger than the second, that is, if more of the reads supporting the other type fall among the reads supporting the first candidate type, the other type is placed in the same group as the first candidate type; otherwise it is placed in the group of the second candidate type. In this way the other types in the type set are divided into two groups around the first and second candidate types, which prepares the further screening needed for accurate typing.
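The grouping rule above translates directly into set intersections. In this sketch, reads are represented by identifiers in a set per type; that representation and the function name are assumptions. Ties go to the second group, following the "otherwise" in the text.

```python
def group_types(reads_by_type, first, second):
    """Split all types into two groups around the two candidate types.

    reads_by_type maps a type name to the set of read identifiers supporting it.
    """
    group1, group2 = [first], [second]
    reads1, reads2 = reads_by_type[first], reads_by_type[second]
    for name, reads in reads_by_type.items():
        if name in (first, second):
            continue
        # Larger shared support with the first candidate puts the type in group 1;
        # everything else, including ties, goes to group 2.
        if len(reads & reads1) > len(reads & reads2):
            group1.append(name)
        else:
            group2.append(name)
    return group1, group2


if __name__ == "__main__":
    reads = {
        "T1": {1, 2, 3, 4},
        "T2": {5, 6, 7},
        "T3": {1, 2, 5},
        "T4": {6, 7, 9},
    }
    print(group_types(reads, "T1", "T2"))
```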
S70 Determining the primary type and the secondary type.
Based on the haplotypes supporting the types and the reads supporting the types, screen the types in the first candidate type group and in the second candidate type group separately to determine a primary type and a secondary type.
According to one embodiment of the invention, this step includes: giving the types in the first candidate type group and in the second candidate type group a second score, based on the reads supporting each type and the first score of the type, and determining the primary type and the secondary type from the second scores.
According to one embodiment of the invention, the second score TScore_New of a type is determined using the formula TScore_New = N* × TScore, where N* is, for a type in the first candidate type group, the number of its supporting reads that fall outside the reads supporting the second candidate type, or, for a type in the second candidate type group, the number of its supporting reads that fall outside the reads supporting the first candidate type. The two types with the highest second scores in the first candidate type group and the second candidate type group are determined to be the primary type and the secondary type. Primary and secondary are relative terms based on relative frequency; here the two are distinguished by which has relatively more and which has relatively fewer supporting reads. The formula adjusts, that is, corrects, the first score assigned in the earlier step. The resulting second score characterizes the type better, and determining the primary and secondary types from the second scores supports accurate type determination later.
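The second score can be sketched as below. The formula TScore_New = N* × TScore comes from the text. Two points are interpretations made for this sketch: one best type is taken from each group, and the one with more supporting reads is labelled primary.

```python
def second_score(reads, other_candidate_reads, tscore):
    """TScore_New = N* x TScore, with N* the reads not shared with the other candidate."""
    n_star = len(reads - other_candidate_reads)
    return n_star * tscore


def best_in_group(group, reads_by_type, tscores, other_candidate):
    """Type in group with the highest second score against the opposite candidate."""
    other = reads_by_type[other_candidate]
    return max(group, key=lambda t: second_score(reads_by_type[t], other, tscores[t]))


def primary_and_secondary(group1, group2, reads_by_type, tscores):
    """Pick one type per group, then order the two by supporting read count."""
    first, second = group1[0], group2[0]   # the two candidate types lead their groups
    a = best_in_group(group1, reads_by_type, tscores, second)
    b = best_in_group(group2, reads_by_type, tscores, first)
    return tuple(sorted((a, b), key=lambda t: len(reads_by_type[t]), reverse=True))


if __name__ == "__main__":
    reads = {"T1": {1, 2, 3, 4}, "T2": {5, 6, 7}, "T3": {1, 2, 5}, "T4": {6, 7, 9}}
    scores = {"T1": 600.0, "T2": 450.0, "T3": 700.0, "T4": 100.0}
    print(primary_and_secondary(["T1", "T3"], ["T2", "T4"], reads, scores))
```

Multiplying by N* penalizes a type whose support consists mostly of reads that also support the opposite candidate, which is how the correction separates the two alleles.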
S80 Determining the genotype of the target region.
Determine the genotype of the target region based on the difference between the number of reads supporting the primary type and the number of reads supporting the secondary type.
According to one embodiment of the invention, this step includes: if the ratio of the number of reads supporting only the secondary type to the number of reads supporting only the primary type is greater than 0.1, determining that the target region is heterozygous, the major allele and the minor allele of the heterozygote being the primary type and the secondary type respectively; otherwise, determining that the target region is homozygous, both alleles of the homozygote being the primary type. From the intermediate results of the preceding steps and the analysis of a large number of samples, the inventors found that setting this threshold to 0.1 allows the genotype of the target region to be determined simply and accurately.
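The 0.1 ratio test can be sketched with read-identifier sets. Writing the comparison as a product avoids division by zero when no read supports only the primary type; that guard and the function name are additions of this sketch.

```python
def call_genotype(primary, secondary, primary_reads, secondary_reads, threshold=0.1):
    """Return the two alleles of the target region as a tuple.

    Heterozygous if (reads only on secondary) / (reads only on primary) > threshold,
    otherwise homozygous for the primary type.
    """
    only_primary = len(primary_reads - secondary_reads)
    only_secondary = len(secondary_reads - primary_reads)
    # Same test as only_secondary / only_primary > threshold, without dividing.
    if only_secondary > threshold * only_primary:
        return primary, secondary
    return primary, primary


if __name__ == "__main__":
    p_reads = set(range(100))
    print(call_genotype("T1", "T2", p_reads, set(range(90, 105))))
    print(call_genotype("T1", "T2", p_reads, set(range(90, 120))))
```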
Note that most of the specific values used in the invention are statistical in nature. Unless otherwise stated, any value expressed exactly may therefore represent a range, for example the interval of plus or minus 10% around that value.
As shown in FIG. 3, according to an embodiment of the invention, the typing method of the above embodiments may further include the following step:
S90 Determining the copy number of the target region.
Taking the average sequencing depth of at least one region whose copy number is fixed at 2 as the baseline depth, calculate the ratio of the sequencing depth of the target region to the baseline depth, and determine the copy number of the target region from the calculated ratio. According to one embodiment of the invention, this step uses sequencing depth information to determine the copy number of the target regions HLA-DRB3, HLA-DRB4 and/or HLA-DRB5.
The genotype determined for the target region is further corrected according to the copy number of the target region, so that the genotype is determined more accurately. According to one embodiment of the invention, if the copy number of the target region is 0, the target region is determined to be absent; if the copy number of the target region is 1, the target region is determined to be homozygous, both alleles of the homozygote being the primary type; if the copy number of the target region is 2, the target region is determined to be heterozygous, the major allele and the minor allele of the heterozygote being the primary type and the secondary type respectively. Note that the specific values involved here are statistical in nature, so the exactly stated values "0", "1" and "2" may each represent a range, for example the interval of plus or minus 10% or plus or minus 20% around the value.
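The copy-number step and the correction rule can be sketched as follows. The text does not state how the depth ratio is turned into a copy number; rounding twice the ratio to the nearest integer is an assumption of this sketch, as is leaving the call unchanged for copy numbers other than 0, 1 and 2.

```python
def estimate_copy_number(target_depth, baseline_depths):
    """Estimate copy number from depth relative to regions fixed at two copies."""
    baseline = sum(baseline_depths) / len(baseline_depths)
    # Baseline regions carry 2 copies, so a ratio of 1.0 corresponds to 2 copies.
    return round(2 * target_depth / baseline)


def refine_genotype(copy_number, primary, secondary, call):
    """Correct a genotype call using the copy number.

    Returns None when the region is absent; otherwise a tuple of alleles.
    """
    if copy_number == 0:
        return None
    if copy_number == 1:
        return primary, primary
    if copy_number == 2:
        return primary, secondary
    return call  # not covered by the text; keep the earlier call (assumption)


if __name__ == "__main__":
    cn = estimate_copy_number(31.0, [60.0, 62.0, 58.0])
    print(cn, refine_genotype(cn, "T1", "T2", ("T1", "T2")))
```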
Those skilled in the art will understand that all or some of the steps of the above typing method can be written as a program in a machine-readable language and stored in a storage medium. Another embodiment of the invention provides a computer-readable medium for storing a computer-executable program, execution of which includes carrying out all or some of the steps of the typing method of any of the above embodiments. Those skilled in the art will understand that, when the computer-executable program is executed, all or some of the steps of any of the above typing methods can be carried out by instructing the relevant hardware. The storage medium may include read-only memory, random-access memory, a magnetic disk, an optical disc, and the like.
As shown in FIG. 4, yet another embodiment of the invention provides a typing device 100. The device 100 includes: a data input unit 110 for inputting data; a data output unit 120 for outputting data; a storage unit 140 for storing data, including a computer-executable program; and a processor 130, connected to the data input unit, the data output unit and the storage unit, for executing the computer-executable program, execution of which includes carrying out all or some of the steps of the typing method of any of the above embodiments.
As shown in FIG. 5, an embodiment of the present invention further provides a typing system 1000 that can be used to carry out the typing method of any of the above embodiments. The system 1000 includes: a data input module 1001 for inputting sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region; an alignment module 1003 comprising a first alignment module 1013 and a second alignment module 1023, where the first alignment module 1013 aligns a reference sequence set to a baseline sequence set to obtain a type set containing a plurality of types, and the second alignment module 1023 aligns the sequencing data from the data input module to the reference sequence set to obtain an alignment result; the baseline sequence set contains one or more baseline sequences and completely covers all gene sequences of the target region, with different baseline sequences containing different full gene sequences; the reference sequence set contains a plurality of reference sequences and completely covers all exons of the coding region sequences of the target region, with different reference sequences containing different exons; a conversion module 1005 for converting the alignment result into an alignment result relative to the baseline sequence set, obtaining a converted alignment result; an assembly module 1007 for assembling, based on the converted alignment result, the reads aligned to the same baseline sequence, obtaining an assembly result comprising a plurality of haplotypes; a haplotype-supported type determination module 1009 for comparing the variants on each haplotype with the types in the type set to determine the types supported by that haplotype; a clustering module 1011 for dividing the types into two groups according to the haplotypes and reads that support them, obtaining a first candidate type group and a second candidate type group; a major/minor type determination module 1013 for screening the types in the first candidate type group and the second candidate type group, based on the haplotypes and reads that support them, to determine a major type and a minor type; and a genotype determination module 1015 for determining the genotype of the target region based on the difference between the number of reads supporting the major type and the number of reads supporting the minor type. The above description of the advantages and technical features of the typing method of any embodiment of the present invention applies equally to the typing system of this embodiment and is not repeated here. Those skilled in the art will understand that the functional modules of the system may contain sub-modules or be connected to other functional modules in order to implement optional or optimized steps of the typing method.
According to an embodiment of the present invention, the target region includes at least one of HLA-DRB3, HLA-DRB4 and HLA-DRB5.
According to an embodiment of the present invention, the target region includes HLA-DRB3, HLA-DRB4 and HLA-DRB5.
According to an embodiment of the present invention, the reference sequence set in the alignment module 1003 is constructed by the following steps: obtaining the coding region sequences and full gene sequences that contain the target region; splitting each coding region sequence by exon to obtain a plurality of exon sequences; and, from the full gene sequence of the type closest to each exon sequence, extracting K bp of sequence on each side of that exon and appending it to the two sides of the corresponding exon sequence, obtaining a reference sequence of the reference sequence set. K is determined by the read length.
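The flanking step can be illustrated with a minimal Python sketch. It assumes that the exon coordinates on the closest full gene sequence are already known; the function name `build_reference_sequences` and the 0-based, half-open coordinate convention are illustrative choices, not part of the described method.

```python
def build_reference_sequences(full_gene_seq, exon_coords, k):
    """Extend each exon by up to k bp of flanking sequence on both sides.

    full_gene_seq: full gene sequence of the closest type (assumed given).
    exon_coords:   list of (start, end) pairs, 0-based and half-open
                   (an assumed convention for this sketch).
    k:             flank length in bp, chosen from the read length.
    """
    references = []
    for start, end in exon_coords:
        # Clamp the flanks so they never run past the ends of the gene.
        left = max(0, start - k)
        right = min(len(full_gene_seq), end + k)
        references.append(full_gene_seq[left:right])
    return references
```

With flanks in place, a read that only partly overlaps an exon boundary can still align to the reference sequence instead of being discarded.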
According to an embodiment of the present invention, one baseline sequence in the first alignment module 1013 and/or the conversion module 1005 is the full gene sequence of one gene in the target region.
According to an embodiment of the present invention, the assembly module 1007 performs the following: for reads aligned to the same baseline sequence, assembling reads that overlap and whose overlapping portions are completely identical, to obtain the plurality of haplotypes.
According to an embodiment of the present invention, in the assembly module 1007, after the assembly result is obtained, haplotypes that cover less than 95% of the exon are filtered out.
According to an embodiment of the present invention, the system further includes a candidate haplotype determination module 1008 which, before the haplotype-supported type determination module 1009 determines the types supported by the haplotypes, performs the following: for the assembly result of the reads aligned to each baseline sequence, scoring each haplotype and screening the haplotypes by their scores to obtain candidate haplotypes; the candidate haplotypes then replace the haplotypes.
According to an embodiment of the present invention, in the candidate haplotype determination module 1008, each haplotype is scored using the following formula to determine its score, Score:
Figure PCTCN2016074027-appb-000003
where c is the coverage of the exon by the haplotype, N is the number of reads supporting the haplotype, and R is the reliability of the haplotype,
Figure PCTCN2016074027-appb-000004
Xi is the sequencing depth at position i of the haplotype, i is the index of the position on the haplotype, X is the average depth of the haplotype, and L is the length of the haplotype.
According to an embodiment of the present invention, screening the haplotypes by their scores in the candidate haplotype determination module 1008 includes: for the assembly result of the reads aligned to the same baseline sequence, taking out the haplotype with the highest score and then removing from the assembly result every haplotype that meets the following condition: its sequence is inconsistent with the haplotype taken out, and its sequencing depth at the inconsistent site is below 20% of the sequencing depth of the taken-out haplotype at that site. These steps are repeated until at most 4 haplotypes have been taken out, which gives the candidate haplotypes.
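The screening loop can be sketched in Python as follows. The data layout (a dict per haplotype holding `score`, per-position `bases` and per-position `depth`) is hypothetical, and treating a single low-depth inconsistent site as sufficient for removal is one reading of the condition above, not a confirmed detail.

```python
def conflicts_weakly(hap, picked, depth_fraction):
    """True if hap disagrees with picked at a site where hap's depth is low."""
    for pos, base in hap["bases"].items():
        if pos in picked["bases"] and picked["bases"][pos] != base:
            if hap["depth"][pos] < depth_fraction * picked["depth"][pos]:
                return True
    return False


def select_candidates(haplotypes, max_candidates=4, depth_fraction=0.2):
    """Greedily take the best-scoring haplotype, then drop weak conflicting ones."""
    remaining = sorted(haplotypes, key=lambda h: h["score"], reverse=True)
    candidates = []
    while remaining and len(candidates) < max_candidates:
        best = remaining.pop(0)
        candidates.append(best)
        remaining = [h for h in remaining
                     if not conflicts_weakly(h, best, depth_fraction)]
    return candidates
```

A conflicting haplotype with comparable depth survives the filter, which is what allows a genuine second allele to remain a candidate.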
According to an embodiment of the present invention, the haplotype-supported type determination module 1009 performs the following: comparing the variants on each candidate haplotype with the types in the type set and, if they match completely, determining that the candidate haplotype supports that type.
According to an embodiment of the present invention, in the haplotype-supported type determination module 1009, after the types supported by the candidate haplotypes are determined, candidate haplotypes that support the same type are merged.
According to an embodiment of the present invention, the system further includes a type filtering module 1010 which, before the clustering module 1011 obtains the first candidate type group and the second candidate type group, filters out types that are supported on fewer than 30% of the total number of exons.
According to an embodiment of the present invention, the clustering module 1011 performs the following: assigning each type a first score according to the haplotypes and reads that support it, and screening the types by their first scores to obtain a first candidate type and a second candidate type; then, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support the other types in the type set, assigning each of the other types to the first candidate type or the second candidate type, to obtain the first candidate type group and the second candidate type group.
According to an embodiment of the present invention, assigning the first score and screening the types by it in the clustering module 1011 includes: determining the first score TScore of each type with the formula TScore = N × S, where N is the number of reads supporting the type and S is the sum of the scores of the candidate haplotypes supporting the type; and forming all pairwise combinations of the types' first scores, the two types in the combination with the highest sum of first scores being the first candidate type and the second candidate type.
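As a rough illustration, TScore = N × S and the pairwise selection might be written as below. The input layout (`reads` as a set of read identifiers and `hap_scores` as a list per type) and both function names are assumptions made for this sketch; at least two types are required.

```python
from itertools import combinations


def first_scores(types):
    """TScore = N x S for every type.

    types: {name: {"reads": set of read ids, "hap_scores": list of scores}}
    """
    return {name: len(t["reads"]) * sum(t["hap_scores"])
            for name, t in types.items()}


def pick_candidate_pair(tscores):
    """Return the pair of types whose TScores sum highest, best type first."""
    best = max(combinations(sorted(tscores), 2),
               key=lambda pair: tscores[pair[0]] + tscores[pair[1]])
    first, second = sorted(best, key=lambda name: tscores[name], reverse=True)
    return first, second
```

Because the sum of two scores is maximized by the two largest scores, the pairwise search is equivalent to taking the top two types; the explicit combinations are kept only to mirror the wording of the text.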
According to an embodiment of the present invention, assigning the other types in the type set to the first candidate type or the second candidate type in the clustering module 1011 includes: for each of the other types, comparing the sizes of a first intersection and a second intersection and, according to the result, assigning that type to the first candidate type or the second candidate type, to obtain the first candidate type group and the second candidate type group. The first intersection is the intersection of the reads supporting that other type with the reads supporting the first candidate type; the second intersection is the intersection of the reads supporting that other type with the reads supporting the second candidate type.
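A minimal Python sketch of this grouping, using set intersections, is given below. The mapping `type_reads` from type name to a set of read identifiers is a hypothetical input, and sending ties to the second group follows the "otherwise" branch described in Example 2.

```python
def group_types(type_reads, first, second):
    """Split all types into two groups around the two candidate types.

    type_reads: {type name: set of ids of the reads supporting that type}
    first, second: names of the first and second candidate types.
    """
    group1, group2 = [first], [second]
    reads1, reads2 = type_reads[first], type_reads[second]
    for name, reads in type_reads.items():
        if name in (first, second):
            continue
        # Compare the first and second intersections by size.
        if len(reads & reads1) > len(reads & reads2):
            group1.append(name)
        else:
            group2.append(name)
    return group1, group2
```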
According to an embodiment of the present invention, the major/minor type determination module 1013 performs the following: assigning a second score to the types in the first candidate type group and the second candidate type group, based on the reads supporting each type and the type's first score, and determining the major type and the minor type from the second scores.
According to an embodiment of the present invention, the major/minor type determination module 1013 performs the following: determining the second score TScore_New of each type with the formula TScore_New = N* × TScore, where N* is, for a type in the first candidate type group, the number of its supporting reads that fall outside the reads supporting the second candidate type, or, for a type in the second candidate type group, the number of its supporting reads that fall outside the reads supporting the first candidate type; and determining the type with the highest second score in the first candidate type group and the type with the highest second score in the second candidate type group to be the major type and the minor type.
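The re-scoring TScore_New = N* × TScore can be sketched with a set difference. The names `second_scores` and `best_type` and the input dictionaries are hypothetical; the sketch is applied to one group at a time, passing in the reads that support the other group's candidate type.

```python
def second_scores(group, type_reads, tscores, other_candidate_reads):
    """TScore_New = N* x TScore for every type in one candidate group.

    N* counts the reads supporting a type that fall outside the reads
    supporting the other group's candidate type.
    """
    return {name: len(type_reads[name] - other_candidate_reads) * tscores[name]
            for name in group}


def best_type(scores):
    """Return the type with the highest second score."""
    return max(scores, key=scores.get)
```

Weighting by N* penalizes types whose support comes mostly from reads shared with the other allele.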
According to an embodiment of the present invention, the genotype determination module 1015 performs the following: if the ratio of the number of reads supporting only the minor type to the number of reads supporting only the major type is greater than 0.1, the target region is determined to be heterozygous, with the major type and the minor type as its major allele and minor allele respectively; otherwise the target region is determined to be homozygous, with both alleles being the major type.
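The zygosity rule can be sketched as follows, assuming the supporting reads of each type are available as sets of read identifiers. The comparison is written as a multiplication rather than a division so that the case of zero major-only reads does not raise an error; how that edge case should be handled is an assumption of this sketch, not something the text specifies.

```python
def call_zygosity(major_reads, minor_reads, threshold=0.1):
    """Return "heterozygous" or "homozygous" from exclusive read support."""
    only_major = len(major_reads - minor_reads)  # reads supporting only the major type
    only_minor = len(minor_reads - major_reads)  # reads supporting only the minor type
    # Equivalent to only_minor / only_major > threshold when only_major > 0.
    if only_minor > threshold * only_major:
        return "heterozygous"
    return "homozygous"
```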
As shown in FIG. 6, according to an embodiment of the present invention, the system further includes a copy number determination module 1016 that performs the following: taking the average sequencing depth of at least one region whose copy number is fixed at 2 as the baseline depth, calculating the ratio of the sequencing depth of the target region to the baseline depth, and determining the copy number of the target region from that ratio; and determining the genotype of the target region according to its copy number.
According to an embodiment of the present invention, determining the genotype of the target region according to its copy number with the copy number determination module 1016 includes: if the copy number of the target region is 0, determining that the target region is absent; if the copy number is 1, determining that the target region is homozygous, with both alleles being the major type; if the copy number is 2, determining that the target region is heterozygous, with the major type and the minor type as its major allele and minor allele respectively.
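A small Python sketch of the copy-number step follows. Example 1 states that a depth ratio close to 0, 0.5 or 1 corresponds to copy number 0, 1 or 2; choosing the nearest of these three values is an assumption of this sketch, since no explicit cut-offs are given, and both function names are hypothetical.

```python
def estimate_copy_number(target_depth, baseline_depth):
    """Map the depth ratio to copy number 0, 1 or 2 (nearest of 0, 0.5, 1)."""
    ratio = target_depth / baseline_depth
    return min((0, 1, 2), key=lambda cn: abs(ratio - cn / 2))


def genotype_from_copy_number(copy_number, major, minor):
    """Apply the copy-number rule of this embodiment; None means absent."""
    if copy_number == 0:
        return None
    if copy_number == 1:
        return (major, major)
    return (major, minor)
```

In practice the baseline depth would come from one or more regions known to be present in two copies, as the text describes.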
The typing method, device and/or system of any of the above embodiments genotypes the target region by constructing reference sequences and baseline sequences for the target region, building a type set, and building haplotypes from read information. The typing method is applicable to genotyping any region and is particularly suited to highly polymorphic regions, for example typing HLA-DRB3, HLA-DRB4 and/or HLA-DRB5. The method does not require PCR targeted at the gene of interest, which reduces experimental workload and difficulty and makes study or application designs more flexible.
Embodiments of the present invention are described in detail below. In the drawings of the embodiments, the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
Note that the terms "first", "second" and the like are used herein only for convenience of description. They are not to be understood as indicating or implying relative importance, nor as implying any order. In the description of the present invention, "a plurality" means two or more unless otherwise stated.
Unless otherwise stated, the reagents, instruments and software used in the following examples are conventional commercial products or open source; for example, sequencing library preparation kits are purchased from Illumina and libraries are built according to the kit instructions.
Example 1
Typing of human HLA-DRB3, HLA-DRB4 and HLA-DRB5 includes the following general steps:
1. Download HLA gene type and sequence data from the IMGT/HLA database. The sequence data include coding region sequences and full-length gene sequences.
2. Reformat the downloaded HLA type and sequence data for analysis. Split the coding region sequences by exon and, from the full-length gene sequence of the closest type, extract 100 bp (or another length, depending on the sequencing read length) on each side of each exon and append it to the two sides of the corresponding exon sequence to form a reference sequence. The exon sequences are extended so that reads at exon edges are retained during alignment. Choose the most complete full-length gene sequence as the baseline sequence, record the variants of all coding region sequences and full-length gene sequences relative to the baseline sequence, and associate the variants with types.
3. Align the sample sequencing data to the reference sequences to obtain the position and variant information of the aligned reads, and convert it into position and variant information relative to the baseline sequence.
4. Using the position and variant information of the aligned reads, assemble, for each exon, the reads that overlap and whose overlapping sequences are completely identical, obtaining a plurality of haplotypes. Filter out haplotypes with low exon coverage. Obtain the number of supporting reads and the coverage of each haplotype and score it accordingly. Take out haplotypes in descending order of score; each time a haplotype is taken out, remove the other haplotypes whose sequences conflict with it and whose depth at the conflicting site is low. The haplotypes taken out are the candidate haplotypes. Compare the variants of each candidate haplotype with the variants of each type; a type that matches completely is a type supported by that haplotype, and one haplotype may support several types.
5. Extract the candidate haplotypes of all exons together with their corresponding types and supporting reads. Filter out types supported on too few exons. Score each type according to the scores of its supporting haplotypes and the number of its supporting reads. Form all pairwise combinations of the type scores and select the combination with the highest sum as the candidate types. The two candidate types give two corresponding groups of candidate reads. Compare all possible types with these two groups of candidate reads, divide all types into two groups according to how the candidate reads support them, and re-score them by candidate reads; the highest-scoring type in each group is a best type. The best type obtained from the group of candidate reads with more reads is the major type, and the other best type is the minor type. If the ratio of the number of reads supporting only the minor type to the number of reads supporting only the major type is large, for example greater than 0.1, the gene is judged heterozygous and both the major and minor types are retained; otherwise it is judged homozygous and only the major type is retained.
Steps 6 and 7 below are optional.
6. Determine the copy number of HLA-DRB3, HLA-DRB4 and HLA-DRB5 from sequencing depth. Unlike most genes, the copy number of HLA-DRB3, HLA-DRB4 and HLA-DRB5 is not fixed: for any one of them, a human sample may carry 0, 1 or 2 copies. Take as the baseline depth either the sequencing depth of a region whose copy number is fixed at 2 (another gene or a non-coding sequence) or the average sequencing depth of some larger region, and calculate the ratio of the depth of HLA-DRB3, HLA-DRB4 or HLA-DRB5 to the baseline depth: if the ratio is close to 0, the copy number is 0; if it is close to 0.5, the copy number is 1; if it is close to 1, the copy number is 2.
7. Combine the retained type information with the copy number information to obtain the final HLA-DRB3, HLA-DRB4 and HLA-DRB5 types. If the copy number is 0, the gene is absent and has no type. If the copy number is 1, the major type among the retained types is the final type. If the copy number is 2, all retained types are the final types: the gene is homozygous if only a major type is retained and heterozygous if both a major and a minor type are retained.
If the typing result for the target region obtained with steps 1 to 5 alone differs from the result obtained with all of the above steps, the result obtained after steps 6 and 7 may be taken as correct.
Example 2
1. Download HLA gene type and sequence data from the IMGT/HLA database. The sequence data include coding region sequences and full-length gene sequences.
2. Reformat the downloaded HLA type and sequence data for analysis. Split the gene coding sequences by exon and, from the full-length gene sequence of the closest type, extract 100 bp (or another length, depending on the sequencing read length) on each side of each exon and append it to the two sides of the corresponding exon sequence to form a reference sequence. The exon sequences are extended so that reads at exon edges are retained during alignment. Choose the most complete full-length gene sequence as the baseline sequence, record the variants of all coding region sequences and full-length gene sequences relative to the baseline sequence, and associate the variants with types.
3. Use MHC-region capture high-throughput sequencing data from a YH cell line sample (a cell sample from the YanHuang project).
4. Align the sequencing data to the reference sequences to obtain the position and variant information of the aligned reads, and convert it into position and variant information relative to the baseline sequence.
5. Using the position and variant information of the aligned reads, assemble, for each exon, the reads that overlap and whose overlapping sequences are completely identical, obtaining a plurality of haplotypes. Filter out haplotypes that cover less than 95% of the exon. Obtain the number of supporting reads, the depth and the coverage of each haplotype and calculate its score, Score. Score is calculated as:
Figure PCTCN2016074027-appb-000005
c: coverage of the exon by the haplotype;
N: number of reads supporting the haplotype;
R: reliability of the haplotype, calculated as:
Figure PCTCN2016074027-appb-000006
Xi: depth at each position of the haplotype;
X: average depth of the haplotype;
L: sequence length of the haplotype.
Take out haplotypes in descending order of Score; each time a haplotype is taken out, remove the other haplotypes whose sequences conflict with it and whose depth at the conflicting site is below 20% of that of the haplotype taken out. Take out at most 4 haplotypes as the candidate haplotypes. Compare the variants of each candidate haplotype with the variants of each type; a type that matches completely is a type supported by that haplotype, and one haplotype may support several types.
6. Merge the information of haplotypes that support the same type. Filter out types supported on fewer than 30% of the total number of exons. Calculate the type score TScore as:
TScore = N × S
N: number of reads supporting the type;
S: sum of the scores of the haplotypes supporting the type.
Form all pairwise combinations of the types and select the combination with the highest sum of scores as the candidate types. The candidate type with the higher score gives the first group of candidate reads and the one with the lower score gives the second group. Compare every type with these two groups of candidate reads: if more of the reads supporting a type fall in the first group of candidate reads, assign the type to the first group, otherwise to the second group. All types are thereby divided into two groups according to the candidates. Calculate the new type score TScore_New as:
TScore_New = N* × TScore
N*: number of reads supporting the type that fall outside the other group's candidate reads.
The type with the highest TScore_New in each group is a best type. The best type of the first group is the major type and the best type of the second group is the minor type. If the ratio of the number of reads supporting only the minor type to the number of reads supporting only the major type is greater than 0.1, the gene is judged heterozygous and both the major and minor types are retained; otherwise it is judged homozygous and only the major type is retained.
7. Taking the average sequencing depth of a region whose copy number is fixed at 2 as the baseline depth, calculate the ratio of the depth of HLA-DRB3, HLA-DRB4 and HLA-DRB5 to the baseline depth: if the ratio is close to 0, the copy number is 0; if it is close to 0.5, the copy number is 1; if it is close to 1, the copy number is 2. For this sample, the copy number of HLA-DRB3 is 1, that of HLA-DRB4 is 0 and that of HLA-DRB5 is 1.
8. Combining the retained types with the copy numbers gives the final HLA-DRB3, HLA-DRB4 and HLA-DRB5 typing result for the YH cell line sample, shown in Table 1 below. "Blank" denotes a missing allele: a copy number of 0 is written as Blank/Blank, a copy number of 1 as DRB/Blank, and a copy number of 2 as DRB/DRB.
Table 1
HLA-DRB3               HLA-DRB4       HLA-DRB5
DRB3*02:02:01/Blank    Blank/Blank    DRB5*01:01:01/Blank
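The notation used in Table 1 can be reproduced with a short Python sketch. The function name `format_genotype` is hypothetical; for a copy number of 2 with no minor type retained, the major type is written twice, following the homozygous case of step 7 in Example 1.

```python
def format_genotype(copy_number, major=None, minor=None):
    """Write a typing result in the allele/allele notation of Table 1."""
    if copy_number == 0:
        return "Blank/Blank"
    if copy_number == 1:
        return f"{major}/Blank"
    # Copy number 2: homozygous if no minor type was retained.
    second = minor if minor is not None else major
    return f"{major}/{second}"


print(format_genotype(1, "DRB3*02:02:01"))
print(format_genotype(0))
```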
In this specification, a description that refers to "one embodiment", "some embodiments", "an example", "a specific example", "an implementation" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Such references do not necessarily all refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principles and spirit of the present invention. The scope of the invention is defined by the claims and their equivalents.

Claims (44)

  1. A typing method, comprising:
    obtaining sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region;
    aligning a reference sequence set to a baseline sequence set to obtain a type set, the type set comprising a plurality of types, wherein
    the baseline sequence set comprises one or more baseline sequences and completely covers all gene sequences of the target region, different baseline sequences comprising different full gene sequences, and
    the reference sequence set comprises a plurality of reference sequences and completely covers all exons of the coding region sequence of the target region, different reference sequences comprising different exons;
    aligning the sequencing data to the reference sequence set to obtain an alignment result;
    converting the alignment result into an alignment result relative to the baseline sequence set, to obtain a converted alignment result;
    based on the converted alignment result, separately assembling the reads aligned to the same baseline sequence to obtain an assembly result, the assembly result comprising a plurality of haplotypes;
    comparing the variants on the haplotypes with the types in the type set to determine the types supported by the haplotypes;
    dividing the types into two groups according to the haplotypes supporting the types and the reads supporting the types, to obtain a first candidate type group and a second candidate type group;
    screening the types in the first candidate type group and in the second candidate type group, respectively, based on the haplotypes supporting the types and the reads supporting the types, to determine a major type and a minor type; and
    determining the genotype of the target region based on the difference between the number of reads supporting the major type and the number of reads supporting the minor type.
  2. The method of claim 1, wherein the target region comprises at least one of HLA-DRB3, HLA-DRB4 and HLA-DRB5.
  3. The method of claim 1, wherein the target region comprises HLA-DRB3, HLA-DRB4 and HLA-DRB5.
  4. The method of claim 1, wherein the reference sequence set is constructed by:
    obtaining the coding region sequences and the full gene sequences covering the target region;
    splitting each coding region sequence by exon to obtain a plurality of exon sequences; and
    extracting, from the full gene sequence of the type closest to each exon sequence, the K bp on each side of that exon sequence and appending them to the two sides of the corresponding exon sequence, to obtain a reference sequence of the reference sequence set, wherein K is determined by the read length.
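By way of non-limiting illustration, the flanking step of claim 4 can be sketched in Python as follows. The function name `build_reference`, the 0-based half-open coordinates and the clipping at the gene ends are assumptions made for this example; the claim only states that K depends on the read length, so `k` is simply passed in.

```python
def build_reference(full_gene: str, exon_start: int, exon_end: int, k: int) -> str:
    """Return the exon [exon_start, exon_end) padded with up to k bp on each side.

    Coordinates are 0-based and half-open on the full gene sequence of the
    closest type (an assumed convention; the claim does not fix one).
    """
    if not (0 <= exon_start < exon_end <= len(full_gene)):
        raise ValueError("exon coordinates fall outside the gene sequence")
    if k < 0:
        raise ValueError("k must be non-negative")
    left = max(0, exon_start - k)               # clip at the start of the gene
    right = min(len(full_gene), exon_end + k)   # clip at the end of the gene
    return full_gene[left:right]
```

With a toy gene `"AAACCCGGGTTT"` and an exon at positions 3 to 9, `k = 2` yields the exon plus two flanking bases on each side.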
  5. The method of claim 1, wherein one baseline sequence is the full gene sequence of one gene in the target region.
  6. The method of claim 1, wherein separately assembling, based on the converted alignment result, the reads aligned to the same baseline sequence to obtain an assembly result comprising a plurality of haplotypes comprises:
    for the reads aligned to the same baseline sequence, performing the assembly using reads that have an overlapping portion and whose overlapping portions are completely identical, to obtain the plurality of haplotypes.
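By way of non-limiting illustration, the overlap condition of claim 6 can be sketched as a pairwise merge of two reads placed on the same full-gene (baseline) sequence. The representation of a read as a `(start, bases)` pair, the name `merge_if_consistent` and the absence of insertions and deletions are assumptions of this sketch; a complete assembler would apply such a test repeatedly to build haplotypes.

```python
from typing import Optional, Tuple

Read = Tuple[int, str]  # (0-based start on the baseline sequence, bases)

def merge_if_consistent(a: Read, b: Read) -> Optional[Read]:
    """Merge two placed reads if they overlap and agree at every overlapping base."""
    (sa, qa), (sb, qb) = sorted([a, b])        # make the first read the leftmost one
    ea, eb = sa + len(qa), sb + len(qb)
    if sb >= ea:                               # no overlapping portion
        return None
    overlap_end = min(ea, eb)
    if qa[sb - sa:overlap_end - sa] != qb[:overlap_end - sb]:
        return None                            # overlap is not completely identical
    return (sa, qa + qb[overlap_end - sb:])    # append what extends past the first read
```

Reads that disagree anywhere in the overlap are left unmerged, so they can seed a different haplotype.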
  7. The method of claim 1, wherein, after the assembly result is obtained, haplotypes whose coverage of the exon is below 95% are filtered out.
  8. The method of claim 1, wherein the following is performed before the types supported by the haplotypes are determined:
    scoring each haplotype based on the assembly result of the reads aligned to the same baseline sequence, and screening the haplotypes based on the haplotype scores obtained, to obtain candidate haplotypes; and
    replacing the haplotypes with the candidate haplotypes.
  9. The method of claim 8, wherein each haplotype is scored with the following formula to determine the score Score of the haplotype:
    Figure PCTCN2016074027-appb-100001
    wherein
    c is the coverage of the exon by the haplotype,
    N is the number of reads supporting the haplotype,
    R denotes the reliability of the haplotype,
    Figure PCTCN2016074027-appb-100002
    Xi is the sequencing depth at position i of the haplotype, i being the index of a position on the haplotype,
    X is the average depth of the haplotype, and
    L is the length of the haplotype.
  10. The method of claim 8, wherein screening the haplotypes based on the haplotype scores obtained comprises:
    for the assembly result of the reads aligned to the same baseline sequence, taking out the haplotype with the highest score, and then removing from the assembly result every haplotype that satisfies the following condition: its sequence is inconsistent with that of the haplotype taken out, and its sequencing depth at the inconsistent site is below 20% of the sequencing depth of the taken-out haplotype at the corresponding site; and
    repeating the above until at most 4 haplotypes have been taken out, to obtain the candidate haplotypes.
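By way of non-limiting illustration, the iterative selection of claim 10 can be sketched as below. The `Haplotype` container, the function names and the reading that a single low-depth discordant site is enough to discard a haplotype are assumptions of this sketch; haplotypes are assumed to be given on shared coordinates with equal lengths, and the scores are taken as already computed.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Haplotype:
    seq: str           # bases on shared baseline coordinates (equal lengths assumed)
    depth: List[int]   # per-position sequencing depth, same length as seq
    score: float       # haplotype score, computed elsewhere

def weakly_discordant(h: Haplotype, best: Haplotype, min_frac: float) -> bool:
    """True if h differs from best at a site where h's depth is below min_frac of best's."""
    return any(a != b and dh < min_frac * db
               for a, b, dh, db in zip(h.seq, best.seq, h.depth, best.depth))

def select_candidates(haps: List[Haplotype], max_keep: int = 4,
                      min_frac: float = 0.20) -> List[Haplotype]:
    """Repeatedly take the top-scoring haplotype and drop weakly supported conflicts."""
    pool = sorted(haps, key=lambda h: h.score, reverse=True)
    kept: List[Haplotype] = []
    while pool and len(kept) < max_keep:
        best = pool.pop(0)                     # highest remaining score
        kept.append(best)
        pool = [h for h in pool if not weakly_discordant(h, best, min_frac)]
    return kept
```

The defaults `max_keep=4` and `min_frac=0.20` mirror the limits stated in the claim.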
  11. The method of claim 8, wherein comparing the variants on each haplotype with the types in the type set to determine the types supported by the haplotype comprises:
    comparing the variants on each candidate haplotype with the types in the type set and, if they match exactly, determining that the candidate haplotype supports that type.
  12. The method of claim 11, wherein, after the types supported by the candidate haplotypes are determined, candidate haplotypes supporting the same type are merged.
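By way of non-limiting illustration, the exact-match test of claim 11 and the merging of claim 12 can be sketched as follows. Representing variants as `(position, ref, alt)` tuples, restricting each type's variants to the interval covered by the haplotype, and the function names are all assumptions of this sketch rather than details given in the claims.

```python
from typing import Dict, List, Set, Tuple

Variant = Tuple[int, str, str]  # (0-based position, reference base, alternate base)

def supported_types(hap_variants: Set[Variant], hap_span: Tuple[int, int],
                    type_variants: Dict[str, Set[Variant]]) -> List[str]:
    """Return the types whose variants inside hap_span equal the haplotype's variants."""
    start, end = hap_span
    matches = []
    for name, variants in type_variants.items():
        in_span = {v for v in variants if start <= v[0] < end}
        if in_span == hap_variants:            # exact match only, no partial credit
            matches.append(name)
    return matches

def merge_by_type(hap_to_types: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group candidate haplotype ids under each type they support."""
    merged: Dict[str, List[str]] = {}
    for hap_id, types in hap_to_types.items():
        for t in types:
            merged.setdefault(t, []).append(hap_id)
    return merged
```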
  13. The method of claim 8, wherein, before the types are divided into two groups according to the haplotypes supporting the types and the reads supporting the types to obtain the first candidate type group and the second candidate type group,
    types for which the number of supported exons is less than 30% of the total number of exons are filtered out.
  14. The method of claim 8, wherein dividing the types into two groups according to the haplotypes supporting the types and the reads supporting the types, to obtain the first candidate type group and the second candidate type group, comprises:
    assigning each type a first score according to the haplotypes supporting the type and the reads supporting the type, and screening the types based on the first scores obtained, to obtain a first candidate type and a second candidate type; and
    assigning each of the other types in the type set to the first candidate type or to the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support those other types, to obtain the first candidate type group and the second candidate type group.
  15. The method of claim 14, wherein assigning each type a first score according to the haplotypes supporting the type and the reads supporting the type, and screening the types based on the first scores obtained, to obtain the first candidate type and the second candidate type, comprises:
    determining the first score TScore of each type with the formula
    TScore = N × S, wherein
    N is the number of reads supporting the type, and
    S is the sum of the scores of the candidate haplotypes supporting the type; and
    combining the first scores of all the types in pairs, and determining the two types in the pair with the highest sum of first scores to be the first candidate type and the second candidate type, respectively.
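By way of non-limiting illustration, the first scoring of claim 15 can be sketched as below. The dictionary inputs, the function names and the convention that the higher-scoring member of the winning pair is called the first candidate are assumptions of this sketch.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

def first_scores(read_support: Dict[str, Set[str]],
                 hap_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """TScore = N x S: supporting-read count times the summed candidate haplotype scores."""
    return {t: len(read_support[t]) * sum(hap_scores[t]) for t in read_support}

def pick_candidates(tscore: Dict[str, float]) -> Tuple[str, str]:
    """Return the pair of types whose first scores have the highest sum."""
    if len(tscore) < 2:
        raise ValueError("at least two types are needed to form a pair")
    a, b = max(combinations(sorted(tscore), 2),
               key=lambda pair: tscore[pair[0]] + tscore[pair[1]])
    # Assumption: the higher-scoring member of the pair is the first candidate.
    return (a, b) if tscore[a] >= tscore[b] else (b, a)
```

Since the pair with the highest sum is always formed by the two highest individual scores, the pairwise search could be replaced by a sort; the explicit combination is kept here only to follow the wording of the claim.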
  16. The method of claim 14, wherein assigning each of the other types in the type set to the first candidate type or to the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support those other types, to obtain the first candidate type group and the second candidate type group, comprises:
    for each of the other types, comparing the sizes of a first intersection and a second intersection, and assigning that type to the first candidate type or to the second candidate type according to the result of the comparison, to obtain the first candidate type group and the second candidate type group, wherein
    the first intersection is the intersection of the reads supporting that other type with the reads supporting the first candidate type, and
    the second intersection is the intersection of the reads supporting that other type with the reads supporting the second candidate type.
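By way of non-limiting illustration, the intersection-based grouping of claim 16 can be sketched as follows. The function name `group_types`, the use of read identifiers in Python sets, and the rule that a tie goes to the first candidate are assumptions of this sketch; the claim only says that the assignment follows the comparison.

```python
from typing import Dict, List, Set

def group_types(read_support: Dict[str, Set[str]],
                first: str, second: str) -> Dict[str, List[str]]:
    """Assign every other type to the candidate with which it shares more reads."""
    groups: Dict[str, List[str]] = {first: [first], second: [second]}
    reads_first, reads_second = read_support[first], read_support[second]
    for t, reads in read_support.items():
        if t in (first, second):
            continue
        shared_first = len(reads & reads_first)      # size of the first intersection
        shared_second = len(reads & reads_second)    # size of the second intersection
        # Assumption: ties are resolved in favour of the first candidate.
        target = first if shared_first >= shared_second else second
        groups[target].append(t)
    return groups
```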
  17. The method of claim 14, wherein screening the types in the first candidate type group and in the second candidate type group, respectively, based on the haplotypes supporting the types and the reads supporting the types, to determine the major type and the minor type, comprises:
    assigning a second score to each type in the first candidate type group and in the second candidate type group, respectively, based on the reads supporting the type and the first score of the type, and determining the major type and the minor type based on the second scores obtained.
  18. The method of claim 17, comprising:
    determining the second score TScore_New of each type with the formula
    TScore_New = N* × TScore, wherein
    N* is, for a type in the first candidate type group, the number of its supporting reads that fall outside the reads supporting the second candidate type, or, for a type in the second candidate type group, the number of its supporting reads that fall outside the reads supporting the first candidate type; and
    determining the two types with the highest second scores in the first candidate type group and the second candidate type group to be the major type and the minor type.
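By way of non-limiting illustration, the second scoring of claim 18 can be sketched as below. The function names and data layout are assumptions, as is the interpretation that the top-scoring type of each group is taken and that the higher of the two becomes the major type.

```python
from typing import Dict, List, Set, Tuple

def second_scores(group: List[str], other_candidate_reads: Set[str],
                  read_support: Dict[str, Set[str]],
                  tscore: Dict[str, float]) -> Dict[str, float]:
    """TScore_New = N* x TScore, N* counting reads outside the other candidate's reads."""
    return {t: len(read_support[t] - other_candidate_reads) * tscore[t] for t in group}

def pick_major_minor(groups: Dict[str, List[str]], first: str, second: str,
                     read_support: Dict[str, Set[str]],
                     tscore: Dict[str, float]) -> Tuple[str, str]:
    """Return (major, minor): the best type of each group, ranked by second score."""
    s1 = second_scores(groups[first], read_support[second], read_support, tscore)
    s2 = second_scores(groups[second], read_support[first], read_support, tscore)
    best1 = max(s1, key=s1.get)
    best2 = max(s2, key=s2.get)
    # Assumption: the higher second score identifies the major type.
    ranked = sorted([(s1[best1], best1), (s2[best2], best2)], reverse=True)
    return ranked[0][1], ranked[1][1]
```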
  19. The method of claim 1, wherein determining the genotype of the target region based on the difference between the number of reads supporting the major type and the number of reads supporting the minor type comprises:
    if the ratio of the number of reads supporting only the minor type to the number of reads supporting only the major type is greater than 0.1, determining that the target region is heterozygous, the major allele and the minor allele of the heterozygote being the major type and the minor type, respectively;
    otherwise, determining that the target region is homozygous, both alleles of the homozygote being the major type.
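By way of non-limiting illustration, the ratio test of claim 19 can be sketched as follows. The function name `call_zygosity`, the set-based read identifiers and the error raised when no read supports only the major (primary) type are assumptions; the claim does not say how a zero denominator is handled.

```python
from typing import Set

def call_zygosity(major_reads: Set[str], minor_reads: Set[str],
                  threshold: float = 0.1) -> str:
    """Compare reads unique to the minor type with reads unique to the major type."""
    major_only = len(major_reads - minor_reads)
    minor_only = len(minor_reads - major_reads)
    if major_only == 0:
        # Not covered by the claim; treated as an error in this sketch.
        raise ValueError("no read supports only the major type")
    return "heterozygous" if minor_only / major_only > threshold else "homozygous"
```

For example, 3 minor-only reads against 17 major-only reads give a ratio of about 0.18, which exceeds 0.1, so the region is called heterozygous.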
  20. The method of any one of claims 1 to 19, further comprising:
    taking the average sequencing depth of at least one region whose copy number is fixed at 2 as a baseline depth, calculating the ratio of the sequencing depth of the target region to the baseline depth, and determining the copy number of the target region from the calculated ratio; and
    determining the genotype of the target region according to the copy number of the target region.
  21. The method of claim 20, wherein determining the genotype of the target region according to the copy number of the target region comprises:
    if the copy number of the target region is 0, determining that the target region is absent;
    if the copy number of the target region is 1, determining that the target region is homozygous, both alleles of the homozygote being the major type; and
    if the copy number of the target region is 2, determining that the target region is heterozygous, the major allele and the minor allele of the heterozygote being the major type and the minor type, respectively.
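By way of non-limiting illustration, the copy-number logic of claims 20 and 21 can be sketched as below. Converting the depth ratio to a copy number by rounding twice the ratio is an assumption of this sketch (the claims only say that the copy number is determined from the ratio), as are the function names and the `None` return value for an absent region.

```python
from typing import Optional, Tuple

def estimate_copy_number(target_depth: float, baseline_depth: float) -> int:
    """Scale the depth ratio by the baseline region's fixed copy number of 2."""
    if baseline_depth <= 0:
        raise ValueError("baseline depth must be positive")
    return round(2 * target_depth / baseline_depth)   # rounding is an assumed rule

def genotype_from_copy_number(cn: int, major: str,
                              minor: str) -> Optional[Tuple[str, str]]:
    """Map copy number 0, 1 or 2 to absent, major/major or major/minor."""
    if cn == 0:
        return None                  # target region absent
    if cn == 1:
        return (major, major)        # reported as homozygous for the major type
    if cn == 2:
        return (major, minor)        # heterozygous
    raise ValueError("copy numbers above 2 are not covered by the claim")
```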
  22. A computer-readable medium for storing a computer-executable program, wherein executing the program comprises carrying out the method of any one of claims 1 to 21.
  23. A typing device, comprising:
    a data input unit for inputting data;
    a data output unit for outputting data;
    a storage unit for storing data, including a computer-executable program; and
    a processor connected to the data input unit, the data output unit and the storage unit, for executing the computer-executable program, wherein executing the program comprises carrying out the method of any one of claims 1 to 21.
  24. A typing system, comprising:
    a data input module for inputting sequencing data of a sample to be tested, the sequencing data comprising a plurality of reads from a target region;
    an alignment module comprising a first alignment module and a second alignment module, wherein
    the first alignment module is for aligning a reference sequence set to a baseline sequence set to obtain a type set, the type set comprising a plurality of types,
    the second alignment module is for aligning the sequencing data from the data input module to the reference sequence set to obtain an alignment result,
    the baseline sequence set comprises one or more baseline sequences and completely covers all gene sequences of the target region, different baseline sequences comprising different full gene sequences, and
    the reference sequence set comprises a plurality of reference sequences and completely covers all exons of the coding region sequence of the target region, different reference sequences comprising different exons;
    a conversion module for converting the alignment result into an alignment result relative to the baseline sequence set, to obtain a converted alignment result;
    an assembly module for separately assembling, based on the converted alignment result, the reads aligned to the same baseline sequence to obtain an assembly result, the assembly result comprising a plurality of haplotypes;
    a haplotype-supported-type determination module for comparing the variants on the haplotypes with the types in the type set to determine the types supported by the haplotypes;
    a clustering module for dividing the types into two groups according to the haplotypes supporting the types and the reads supporting the types, to obtain a first candidate type group and a second candidate type group;
    a major/minor type determination module for screening the types in the first candidate type group and in the second candidate type group, respectively, based on the haplotypes supporting the types and the reads supporting the types, to determine a major type and a minor type; and
    a genotype determination module for determining the genotype of the target region based on the difference between the number of reads supporting the major type and the number of reads supporting the minor type.
  25. The system of claim 24, wherein the target region comprises at least one of HLA-DRB3, HLA-DRB4 and HLA-DRB5.
  26. The system of claim 24, wherein the target region comprises HLA-DRB3, HLA-DRB4 and HLA-DRB5.
  27. The system of claim 24, wherein the reference sequence set in the alignment module is constructed by:
    obtaining the coding region sequences and the full gene sequences covering the target region;
    splitting each coding region sequence by exon to obtain a plurality of exon sequences; and
    extracting, from the full gene sequence of the type closest to each exon sequence, the K bp on each side of that exon sequence and appending them to the two sides of the corresponding exon sequence, to obtain a reference sequence of the reference sequence set, wherein K is determined by the read length.
  28. The system of claim 24, wherein one baseline sequence in the alignment module and/or the conversion module is the full gene sequence of one gene in the target region.
  29. The system of claim 24, wherein the assembly module is used to perform the following:
    for the reads aligned to the same baseline sequence, performing the assembly using reads that have an overlapping portion and whose overlapping portions are completely identical, to obtain the plurality of haplotypes.
  30. The system of claim 24, wherein, in the assembly module, after the assembly result is obtained, haplotypes whose coverage of the exon is below 95% are filtered out.
  31. The system of claim 24, further comprising a candidate haplotype determination module for performing the following before the haplotype-supported-type determination module determines the types supported by the haplotypes:
    scoring each haplotype based on the assembly result of the reads aligned to the same baseline sequence, and screening the haplotypes based on the haplotype scores obtained, to obtain candidate haplotypes; and
    replacing the haplotypes with the candidate haplotypes.
  32. The system of claim 31, wherein, in the candidate haplotype determination module, each haplotype is scored with the following formula to determine the score Score of the haplotype:
    Figure PCTCN2016074027-appb-100003
    wherein
    c is the coverage of the exon by the haplotype,
    N is the number of reads supporting the haplotype,
    R denotes the reliability of the haplotype,
    Figure PCTCN2016074027-appb-100004
    Xi is the sequencing depth at position i of the haplotype, i being the index of a position on the haplotype,
    X is the average depth of the haplotype, and
    L is the length of the haplotype.
  33. The system of claim 31, wherein screening the haplotypes based on the haplotype scores obtained, as performed in the candidate haplotype determination module, comprises:
    for the assembly result of the reads aligned to the same baseline sequence, taking out the haplotype with the highest score, and then removing from the assembly result every haplotype that satisfies the following condition: its sequence is inconsistent with that of the haplotype taken out, and its sequencing depth at the inconsistent site is below 20% of the sequencing depth of the taken-out haplotype at the corresponding site; and
    repeating the above until at most 4 haplotypes have been taken out, to obtain the candidate haplotypes.
  34. The system of claim 31, wherein the following is performed in the haplotype-supported-type determination module:
    comparing the variants on each candidate haplotype with the types in the type set and, if they match exactly, determining that the candidate haplotype supports that type.
  35. The system of claim 34, wherein, in the haplotype-supported-type determination module, after the types supported by the candidate haplotypes are determined, candidate haplotypes supporting the same type are merged.
  36. The system of claim 31, further comprising a type filtering module for, before the clustering module obtains the first candidate type group and the second candidate type group,
    filtering out types for which the number of supported exons is less than 30% of the total number of exons.
  37. The system of claim 31, wherein the clustering module is used to perform the following:
    assigning each type a first score according to the haplotypes supporting the type and the reads supporting the type, and screening the types based on the first scores obtained, to obtain a first candidate type and a second candidate type; and
    assigning each of the other types in the type set to the first candidate type or to the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support those other types, to obtain the first candidate type group and the second candidate type group.
  38. The system of claim 37, wherein assigning each type a first score according to the haplotypes supporting the type and the reads supporting the type, and screening the types based on the first scores obtained, to obtain the first candidate type and the second candidate type, as performed in the clustering module, comprises:
    determining the first score TScore of each type with the formula
    TScore = N × S, wherein
    N is the number of reads supporting the type, and
    S is the sum of the scores of the candidate haplotypes supporting the type; and
    combining the first scores of all the types in pairs, and determining the two types in the pair with the highest sum of first scores to be the first candidate type and the second candidate type, respectively.
  39. The system of claim 37, wherein assigning each of the other types in the type set to the first candidate type or to the second candidate type, based on how the reads supporting the first candidate type and the reads supporting the second candidate type support those other types, to obtain the first candidate type group and the second candidate type group, as performed in the clustering module, comprises:
    for each of the other types, comparing the sizes of a first intersection and a second intersection, and assigning that type to the first candidate type or to the second candidate type according to the result of the comparison, to obtain the first candidate type group and the second candidate type group, wherein
    the first intersection is the intersection of the reads supporting that other type with the reads supporting the first candidate type, and
    the second intersection is the intersection of the reads supporting that other type with the reads supporting the second candidate type.
  40. The system of claim 37, wherein the major/minor type determination module is used to perform the following:
    assigning a second score to each type in the first candidate type group and in the second candidate type group, respectively, based on the reads supporting the type and the first score of the type, and determining the major type and the minor type based on the second scores obtained.
  41. The system of claim 40, wherein the following is performed in the major/minor type determination module:
    determining the second score TScore_New of each type with the formula
    TScore_New = N* × TScore, wherein
    N* is, for a type in the first candidate type group, the number of its supporting reads that fall outside the reads supporting the second candidate type, or, for a type in the second candidate type group, the number of its supporting reads that fall outside the reads supporting the first candidate type; and
    determining the two types with the highest second scores in the first candidate type group and the second candidate type group to be the major type and the minor type.
  42. The system of claim 24, wherein the genotype determination module is used to perform the following:
    if the ratio of the number of reads supporting only the minor type to the number of reads supporting only the major type is greater than 0.1, determining that the target region is heterozygous, the major allele and the minor allele of the heterozygote being the major type and the minor type, respectively;
    otherwise, determining that the target region is homozygous, both alleles of the homozygote being the major type.
  43. The system of any one of claims 24 to 42, further comprising a copy number determination module for performing the following:
    taking the average sequencing depth of at least one region whose copy number is fixed at 2 as a baseline depth, calculating the ratio of the sequencing depth of the target region to the baseline depth, and determining the copy number of the target region from the calculated ratio; and
    determining the genotype of the target region according to the copy number of the target region.
  44. The system of claim 43, wherein determining the genotype of the target region according to the copy number of the target region, as performed with the copy number determination module, comprises:
    if the copy number of the target region is 0, determining that the target region is absent;
    if the copy number of the target region is 1, determining that the target region is homozygous, both alleles of the homozygote being the major type; and
    if the copy number of the target region is 2, determining that the target region is heterozygous, the major allele and the minor allele of the heterozygote being the major type and the minor type, respectively.
PCT/CN2016/074027 2016-02-18 2016-02-18 Typing method and device WO2017139945A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/074027 WO2017139945A1 (en) 2016-02-18 2016-02-18 Typing method and device
CN201680067128.7A CN108350498B (en) 2016-02-18 2016-02-18 Parting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/074027 WO2017139945A1 (en) 2016-02-18 2016-02-18 Typing method and device

Publications (1)

Publication Number Publication Date
WO2017139945A1 true WO2017139945A1 (en) 2017-08-24

Family

ID=59624680

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/074027 WO2017139945A1 (en) 2016-02-18 2016-02-18 Typing method and device

Country Status (2)

Country Link
CN (1) CN108350498B (en)
WO (1) WO2017139945A1 (en)

Also Published As

Publication number Publication date
CN108350498A (en) 2018-07-31
CN108350498B (en) 2021-10-19

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16890179; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16890179; Country of ref document: EP; Kind code of ref document: A1)