CN111755068A - Method and device for identifying tumor purity and absolute copy number based on sequencing data - Google Patents

Publication number
CN111755068A
Authority
CN
China
Prior art keywords
purity
copy number
allele
data
baf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010567812.XA
Other languages
Chinese (zh)
Other versions
CN111755068B (en
Inventor
黄毅
杨玲
罗梓文
裴士美
易鑫
刘久成
吴玲清
李俊
刘青峰
林浩翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Guiinga Medical Laboratory
Original Assignee
Shenzhen Guiinga Medical Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Guiinga Medical Laboratory
Priority to CN202010567812.XA
Publication of CN111755068A
Application granted
Publication of CN111755068B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application discloses a method and a device for identifying tumor purity and absolute copy number based on sequencing data. Quality-controlled raw sequencing data (off-machine data) are aligned to a reference genome, variants are detected and annotated against population databases, and the preprocessed data of the tumor and normal samples are analyzed with purity prediction software to obtain purity and copy number information models. Among the models that conform to the normal distribution, the method further screens for the model whose high tumor cell fraction subclone regions have the highest probe support number, and defines the optimal model by combining this with the matching rate of BAF with the allele1 and allele2 copy numbers. The method quickly and efficiently corrects the model chosen by the purity detection software and obtains tumor purity and absolute copy number information more accurately. While ensuring accuracy, it avoids the laborious process of manual checking, saves labor cost, and lays a foundation for subsequent research on tumor genome evolution and intratumoral heterogeneity.

Description

Method and device for identifying tumor purity and absolute copy number based on sequencing data
Technical Field
The application relates to the technical field of tumor research, in particular to a method and a device for identifying tumor purity and absolute copy number based on sequencing data.
Background
Estimating tumor purity and ploidy facilitates the study of tumor genome evolution and intratumoral heterogeneity. Tumor development is accompanied by a large number of genomic variations, and determining chromosome copy number and allele ratio is the basis for understanding the structure and history of the tumor genome. Current genomic characterization techniques measure the somatic alterations of tumor samples in genomic units, i.e., DNA mass, and the significance of this measurement depends on the purity and overall ploidy of the tumor. Tumor purity affects the calculation of DNA copy number variation (CNV) to some extent; ideally, copy number should be expressed as copies per tumor cell. Using gene chips to measure somatic copy number alterations (SCNAs) has become a standard method for copy number analysis.
With the wide application of DNA next-generation sequencing (NGS) in genomics, CNV results obtained directly from NGS data are increasingly used in scientific research and clinical testing. CNV calculations, whether based on gene chips or NGS, are derived from two kinds of data: log2Ratio and B-Allele Frequency (BAF). Of these, log2Ratio is used to calculate CNV segments, and BAF is used to calculate loss of heterozygosity (LOH) and allelic imbalance.
The difficulty in predicting absolute copy number stems from the following three aspects: (1) tumor cells are almost always mixed with an unknown proportion of normal cells; (2) the actual DNA content of the tumor cells is unknown because of numerical and structural chromosomal abnormalities; (3) tumor cell populations may be heterogeneous because of ongoing subclonal evolution. In principle, the absolute copy number can be inferred by rescaling the relative data based on cytological measurements of the DNA mass of each tumor cell, or by single-cell sequencing methods. However, such methods are not suitable for large-scale use in the interpretation of tumor genomes.
Currently, the methods that can predict tumor purity are mostly limited to data generated by SNP arrays. For example, purity estimation software such as ABSOLUTE can predict tumor purity from low-coverage whole genome sequencing data and is one of the most common methods for estimating tumor purity at present. The method uses the CNV information of the tumor sample to estimate tumor purity; however, because of the complexity of tumor samples, it also needs to incorporate SNV information to reach higher accuracy. The software automatically scores each model against pre-designed cancer karyotypes, somatic mutation frequencies, and somatic copy number changes, and selects the model with the highest score. The highest-scoring model is not necessarily the optimal one, however, and the result must be verified manually, which complicates obtaining the purity information of a tumor sample.
Therefore, there is a need to develop a more effective scheme for accurately identifying tumor purity and absolute copy number to meet clinical and scientific needs.
Disclosure of Invention
The application aims to provide a novel method and a novel device for identifying tumor purity and absolute copy number based on sequencing data.
In order to achieve the purpose, the following technical scheme is adopted in the application:
A first aspect of the application discloses a method for identifying tumor purity and absolute copy number based on sequencing data, comprising the following steps:
a data preprocessing step, comprising performing quality control on the raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing variant site detection on the alignment files of the paired tumor and normal samples, and annotating the detected variant sites against population databases;
a purity and copy number identification step, comprising using the data obtained in the data preprocessing step as the input file of purity prediction software to obtain purity and copy number information models;
a step of judging whether a model conforms to the normal distribution, comprising judging whether each purity and copy number information model conforms to the normal distribution by comparing the model's ploidy probe support number distribution with its whole genome doubling (WGD) value, and deleting the purity and copy number information models that do not conform;
a high tumor cell fraction subclone region statistics step, comprising screening subclone regions in the purity and copy number information models that conform to the normal distribution, screening the selected subclone regions by purity, and accumulating to obtain the high tumor cell fraction subclone regions;
a step of calculating the matching rate of BAF with the allele1 and allele2 copy numbers, comprising performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, calculated as shown in Formula I,
Formula I: M = f ÷ (f + b)
In Formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number of segments whose BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number of segments whose BAF does not match the allele1 and allele2 copy numbers;
an optimal model judging step, comprising multiplying the cumulative probe support number of the high tumor cell fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, to obtain the final score S, the model with the highest score being the optimal purity and copy number information model, thereby obtaining accurate tumor purity and absolute copy number data,
Formula II: S = R × M
In Formula II, S represents the final score of the model judgment, R represents the cumulative probe support number of the high tumor cell fraction subclone regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers.
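As a rough illustration of how Formula I and Formula II combine to rank candidate models, the following Python sketch scores a few models and keeps the one with the highest S. The class name, field names, and the numbers are hypothetical and only serve the example; treating M as 0 when f + b = 0 is likewise an assumption that the text does not address.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateModel:
    """One purity/copy-number model that already passed the distribution check."""
    name: str
    r: int  # R: cumulative probe support number of high tumor cell fraction subclone regions
    f: int  # f: probe support number of segments whose BAF matches the allele copy numbers
    b: int  # b: probe support number of segments whose BAF does not match

    @property
    def m(self) -> float:
        """Formula I: M = f / (f + b)."""
        total = self.f + self.b
        # Assumption: a model with no supporting probes gets a matching rate of 0.
        return self.f / total if total else 0.0

    @property
    def s(self) -> float:
        """Formula II: S = R * M."""
        return self.r * self.m


def pick_optimal(models: List[CandidateModel]) -> CandidateModel:
    """Return the candidate with the highest final score S."""
    if not models:
        raise ValueError("no candidate models left after filtering")
    return max(models, key=lambda model: model.s)


# Hypothetical candidates for a single sample.
candidates = [
    CandidateModel("model_A", r=5000, f=900, b=100),
    CandidateModel("model_B", r=6000, f=600, b=400),
]
best = pick_optimal(candidates)
print(best.name, round(best.s, 2))
```

In this made-up case model_B has the larger R, but its lower matching rate gives it the smaller product, which is the trade-off that the multiplication in Formula II is meant to capture.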
The method provided by the application starts from sequencing data, in particular next-generation sequencing data, and first obtains purity and copy number information models through alignment to a reference genome, variant site detection, and evaluation by purity prediction software. The obtained models are then judged by comparing their probe support number distribution with whole genome doubling (WGD); next, the cumulative probe support number of the high tumor cell fraction subclone regions and the matching rate of BAF with the allele1 and allele2 copy numbers are calculated. In this way the optimal model of the purity prediction software is corrected quickly and efficiently, yielding accurate tumor purity and absolute copy number data.
It can be understood that the method obtains the optimal purity and copy number information model through data analysis. While ensuring accuracy, it avoids the laborious process of manual verification, greatly saves labor cost, solves the problem that the result models generated by existing purity estimation software require manual verification, and lays a foundation for subsequent research on tumor genome evolution and intratumoral heterogeneity.
It should further be noted that one key of the method of the present application is to identify the optimal model among the purity and copy number information models produced by existing purity prediction software, so as to obtain more accurate and effective tumor purity and absolute copy number data. The specific purity prediction software and its prediction method, as well as the upstream data processing, can follow existing purity prediction software and are not limited herein. To ensure the accuracy and effectiveness of the method, the steps of the preferred embodiments are defined in detail in the following technical solutions.
Preferably, in the data preprocessing step of the method of the present application, the variant site detection comprises single nucleotide mutation detection and/or insertion-deletion mutation detection.
Preferably, the variant site detection further comprises K-value filtering of the sequencing depth of the sites, where K ≥ 30×.
Preferably, the software used for population database annotation is VEP.
Preferably, the population database comprises at least one of the ESP6500 database, the 1000 Genomes Project database, and the ExAC human exome aggregation database.
Preferably, before the detected variant sites are annotated against the population databases, the population databases are filtered to remove variant sites with population frequency n, where 1% ≤ n ≤ 5%.
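The depth and population-frequency filters described above could be sketched in Python as follows. The record layout (`depth`, `pop_freq`) and the function name are hypothetical. The sketch also assumes one particular reading of the frequency rule, namely that a site is dropped when its population frequency is at or above a threshold n chosen within the 1% to 5% range; the text itself does not spell out the comparison.

```python
from typing import Dict, List

Variant = Dict[str, float]  # hypothetical record: {"depth": ..., "pop_freq": ...}


def filter_variants(
    variants: List[Variant],
    min_depth: float = 30.0,     # K >= 30x, as stated in the text
    max_pop_freq: float = 0.01,  # n, chosen here at the low end of 1%-5%
) -> List[Variant]:
    """Keep sites with sufficient depth that are not common in the population."""
    kept = []
    for variant in variants:
        if variant["depth"] < min_depth:
            continue  # fails the K-value depth filter
        # Assumption: a site absent from every database has frequency 0.
        if variant.get("pop_freq", 0.0) >= max_pop_freq:
            continue  # treated as a common population polymorphism
        kept.append(variant)
    return kept


sites = [
    {"depth": 120, "pop_freq": 0.0},
    {"depth": 25, "pop_freq": 0.0},   # too shallow
    {"depth": 80, "pop_freq": 0.03},  # above a 1% threshold, below a 5% one
]
print(len(filter_variants(sites)))
print(len(filter_variants(sites, max_pop_freq=0.05)))
```

Exposing both thresholds as parameters mirrors the text, which gives ranges rather than single fixed values.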
Preferably, in the purity and copy number identification step of the method of the present application, the purity prediction software is ABSOLUTE, PureCN, Sequenza, absCN-seq or ASCAT.
Preferably, the step of judging whether the model conforms to the normal distribution specifically comprises: if WGD = 0, the peak of the ploidy probe support number distribution should be at ploidy 2; if WGD = 1, the peaks should be at ploidy 2 and ploidy 4; if WGD = 2, the peaks should be at ploidy 4 and ploidy 8; and so on. If a purity and copy number information model does not follow this rule, it is judged not to conform to the normal distribution and is deleted.
It is noted that human chromosomes double and cell proliferation increases exponentially, e.g., 2, 4, 8, 16, and so on; therefore, WGD ≥ 3 is a very low-probability event. If WGD = 3, the peaks of the ploidy probe support number distribution should be at ploidy 8 and ploidy 16.
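The WGD rule above can be written compactly: for WGD = 0 the expected peak is at ploidy 2, and for WGD = k ≥ 1 the expected peaks are at 2^k and 2^(k+1). The Python sketch below encodes that rule. How a "peak" is detected is not specified in the text, so the sketch assumes that the peaks are simply the ploidy values with the largest probe support numbers; the function names are hypothetical.

```python
from typing import Dict, Set


def expected_peaks(wgd: int) -> Set[int]:
    """Ploidy values where the probe support distribution should peak."""
    if wgd < 0:
        raise ValueError("WGD must be non-negative")
    if wgd == 0:
        return {2}
    return {2 ** wgd, 2 ** (wgd + 1)}


def conforms(probe_support: Dict[int, int], wgd: int) -> bool:
    """Check a model's {ploidy: probe support number} table against the rule.

    Assumption: the observed peaks are the top-ranked ploidy values,
    taking as many as the rule expects (one for WGD = 0, otherwise two).
    """
    expected = expected_peaks(wgd)
    ranked = sorted(probe_support, key=probe_support.get, reverse=True)
    return set(ranked[: len(expected)]) == expected


# Hypothetical distributions for two candidate models with WGD = 1.
print(conforms({2: 500, 3: 100, 4: 400}, wgd=1))  # peaks at ploidy 2 and 4
print(conforms({2: 500, 3: 450, 4: 100}, wgd=1))  # second peak at ploidy 3
```

A model for which `conforms` returns False would be deleted before the later scoring steps.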
Preferably, the high tumor cell fraction subclone region statistics step of the method of the present application specifically comprises: judging a region with subclonal.allele1 = 0 and subclonal.allele2 = 0 to be a subclone region; screening all subclone regions by the N value, a region with N ≥ 0.9 being defined as a high tumor cell fraction subclone region; and accumulating the probe numbers of all selected high tumor cell fraction subclone regions to obtain the cumulative probe support number of the high tumor cell fraction subclone regions.
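A minimal Python sketch of this accumulation is given below. It assumes each segment of a model is available as a dictionary; the keys `flag_a1` and `flag_a2` (the per-allele indicators that the text requires to be 0), `ccf_a1` and `ccf_a2` (per-allele tumor cell fractions), and `n_probes` are hypothetical names. Following the detailed description, a region counts as high tumor cell fraction when either allele's fraction reaches the threshold N, with 0.9 used as the default.

```python
from typing import Dict, List

Segment = Dict[str, float]


def high_fraction_probe_support(segments: List[Segment], n_threshold: float = 0.9) -> int:
    """Accumulate R: the probe support number over high tumor cell fraction regions."""
    total = 0
    for seg in segments:
        # Region condition from the text: both per-allele indicators equal 0.
        is_region = seg["flag_a1"] == 0 and seg["flag_a2"] == 0
        # Either allele reaching the threshold N qualifies the region.
        is_high = seg["ccf_a1"] >= n_threshold or seg["ccf_a2"] >= n_threshold
        if is_region and is_high:
            total += int(seg["n_probes"])
    return total


# Hypothetical segments of one candidate model.
segments = [
    {"flag_a1": 0, "flag_a2": 0, "ccf_a1": 0.95, "ccf_a2": 0.50, "n_probes": 120},
    {"flag_a1": 0, "flag_a2": 1, "ccf_a1": 0.99, "ccf_a2": 0.99, "n_probes": 80},
    {"flag_a1": 0, "flag_a2": 0, "ccf_a1": 0.40, "ccf_a2": 0.60, "n_probes": 200},
    {"flag_a1": 0, "flag_a2": 0, "ccf_a1": 0.20, "ccf_a2": 0.90, "n_probes": 30},
]
print(high_fraction_probe_support(segments))
```

Only the first and last segments satisfy both conditions here, so the accumulated value is the sum of their probe counts.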
Preferably, in the step of calculating the matching rate of BAF with the allele1 and allele2 copy numbers, the matching conditions are as follows: if BAF = 0.5 and the allele1 copy number equals the allele2 copy number, the segment is judged a match; or, if BAF ≠ 0.5 and the allele1 copy number does not equal the allele2 copy number, the segment is judged a match; all other cases are mismatches.
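The matching condition reduces to one comparison: a segment matches when "BAF is balanced" and "the two allele copy numbers are equal" are either both true or both false. The Python sketch below applies it per segment and tallies the probe support numbers f and b used in the matching rate. The segment keys and function names are hypothetical, and the optional tolerance `tol` around 0.5 is an assumption added for noisy real data; with the default of 0 the comparison is exactly the one in the text.

```python
from typing import Dict, List, Tuple

Segment = Dict[str, float]


def baf_matches(baf: float, cn_a1: int, cn_a2: int, tol: float = 0.0) -> bool:
    """True when BAF balance and allele copy-number balance agree."""
    baf_balanced = abs(baf - 0.5) <= tol  # BAF = 0.5 (within tol, an assumption)
    cn_balanced = cn_a1 == cn_a2
    return baf_balanced == cn_balanced


def probe_support_counts(segments: List[Segment], tol: float = 0.0) -> Tuple[int, int]:
    """Return (f, b): probe support of matching and non-matching segments."""
    f = b = 0
    for seg in segments:
        if baf_matches(seg["baf"], int(seg["cn_a1"]), int(seg["cn_a2"]), tol):
            f += int(seg["n_probes"])
        else:
            b += int(seg["n_probes"])
    return f, b


# Hypothetical segments of one candidate model.
segments = [
    {"baf": 0.50, "cn_a1": 1, "cn_a2": 1, "n_probes": 100},  # match
    {"baf": 0.33, "cn_a1": 1, "cn_a2": 2, "n_probes": 50},   # match
    {"baf": 0.50, "cn_a1": 2, "cn_a2": 1, "n_probes": 30},   # mismatch
    {"baf": 0.25, "cn_a1": 1, "cn_a2": 1, "n_probes": 20},   # mismatch
]
f, b = probe_support_counts(segments)
print(f, b, f / (f + b))
```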
Preferably, in the data preprocessing step of the method of the present application, the quality control of the raw sequencing data of the tumor and normal samples specifically comprises filtering adapter sequences out of the raw data, and screening the filtered data on the percentage of bases with quality above 20, the percentage of bases with quality above 30, GC content, N content, average read length, and percentage of filtered bases.
It is understood that the above screening items and their parameter thresholds can be set according to specific test requirements with reference to existing raw-data processing methods, and are not specifically limited herein.
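Because the thresholds are left to the specific test requirements, a configurable check is one natural way to implement this screening. The Python sketch below compares a sample's quality-control metrics with per-metric bounds and reports which ones fail. All metric names and threshold values are hypothetical placeholders, not values taken from the application.

```python
from typing import Dict, List, Tuple

Bounds = Tuple[float, float]  # inclusive (minimum, maximum)

# Hypothetical thresholds; real values depend on the assay and platform.
EXAMPLE_BOUNDS: Dict[str, Bounds] = {
    "q20_pct": (90.0, 100.0),           # % of bases with quality above 20
    "q30_pct": (85.0, 100.0),           # % of bases with quality above 30
    "gc_pct": (35.0, 65.0),             # GC content
    "n_pct": (0.0, 1.0),                # N content
    "mean_read_length": (100.0, 151.0),
    "filtered_base_pct": (0.0, 20.0),   # % of bases removed by filtering
}


def failed_metrics(metrics: Dict[str, float], bounds: Dict[str, Bounds]) -> List[str]:
    """List the metrics that are missing or fall outside their bounds."""
    failed = []
    for name, (low, high) in bounds.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            failed.append(name)
    return failed


sample = {
    "q20_pct": 96.2,
    "q30_pct": 80.0,  # below the example threshold
    "gc_pct": 48.1,
    "n_pct": 0.01,
    "mean_read_length": 148.0,
    "filtered_base_pct": 3.5,
}
print(failed_metrics(sample, EXAMPLE_BOUNDS))
```

A sample would proceed to alignment only when the returned list is empty.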
Preferably, when the quality-controlled data are aligned to the reference genome, the method further comprises realigning the regions with potential sequence insertions or deletions found during alignment, and recalibrating the base quality values of the realigned files. In one implementation of the present application, the GATK RealignerTargetCreator and IndelRealigner modules are used to realign the regions with potential sequence insertions or deletions found during alignment, and the GATK BaseRecalibrator module is used to recalibrate the base quality values of the realigned files.
Preferably, after the quality-controlled data are aligned to the reference genome, the method further comprises deduplicating and sorting the alignment results, and screening the aligned data on mapping rate, capture efficiency, contamination rate, and average sequencing depth of the target region. Likewise, parameter thresholds for these screening items can be set according to specific test requirements and are not limited herein.
A second aspect of the application discloses a device for identifying tumor purity and absolute copy number based on sequencing data, comprising a data preprocessing module, a purity and copy number identification module, a module for judging whether a model conforms to the normal distribution, a high tumor cell fraction subclone region statistics module, a module for calculating the matching rate of BAF with the allele1 and allele2 copy numbers, and an optimal model judging module:
the data preprocessing module is used for performing quality control on the raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing variant site detection on the alignment files of the paired tumor and normal samples, and annotating the detected variant sites against population databases;
the purity and copy number identification module is used for taking the data obtained by the data preprocessing module as the input file of purity prediction software to obtain purity and copy number information models;
the module for judging whether a model conforms to the normal distribution is used for judging whether each purity and copy number information model conforms to the normal distribution by comparing the model's ploidy probe support number distribution with its whole genome doubling (WGD) value, and deleting the purity and copy number information models that do not conform;
the high tumor cell fraction subclone region statistics module is used for screening subclone regions in the purity and copy number information models that conform to the normal distribution, screening the selected subclone regions by purity, and accumulating to obtain the high tumor cell fraction subclone regions;
the module for calculating the matching rate of BAF with the allele1 and allele2 copy numbers is used for performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, calculated as shown in Formula I,
Formula I: M = f ÷ (f + b)
In Formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number of segments whose BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number of segments whose BAF does not match the allele1 and allele2 copy numbers;
the optimal model judging module is used for multiplying the cumulative probe support number of the high tumor cell fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, to obtain the final score S, the model with the highest score being the optimal purity and copy number information model, thereby obtaining accurate tumor purity and absolute copy number data,
Formula II: S = R × M
In Formula II, S represents the final score of the model judgment, R represents the cumulative probe support number of the high tumor cell fraction subclone regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers.
It should be noted that each module of the device of the present application implements the corresponding step of the method of the present application; therefore, for the specific implementation or parameter conditions of each module, reference may be made to the method. For example, the data preprocessing module can follow the data preprocessing step to perform variant site detection, K-value filtering, population database annotation, quality control, and the like; the purity and copy number identification module can follow the way the purity and copy number identification step obtains the purity and copy number information models; the module for judging whether a model conforms to the normal distribution can follow the specific judging rule of the corresponding step; the high tumor cell fraction subclone region statistics module can follow the subclone region judgment and N-value screening of the high tumor cell fraction subclone region statistics step; and the module for calculating the matching rate of BAF with the allele1 and allele2 copy numbers can follow the corresponding calculation step to perform the matching judgment and the like.
It should also be noted that each module of the device may be subdivided into units according to actual operating conditions and functions; for example, the data preprocessing module may include a quality control unit, an alignment unit, a variant site detection unit, and a population database annotation unit, as determined by product design or usage requirements, which are not limited herein.
A third aspect of the application discloses an apparatus for identifying tumor purity and absolute copy number based on sequencing data, the apparatus comprising a memory and a processor; a memory for storing a program; a processor for implementing the method of identifying tumor purity and absolute copy number based on sequencing data of the present application by executing a program stored in a memory.
A fourth aspect of the present application discloses a computer readable storage medium comprising a program executable by a processor to implement the method of identifying tumor purity and absolute copy number based on sequencing data of the present application.
Due to the adoption of the technical scheme, the beneficial effects of the application are as follows:
According to the method for identifying tumor purity and absolute copy number based on sequencing data, the purity and copy number information model output by the purity detection software is corrected quickly and efficiently, and tumor purity and absolute copy number information can be obtained more accurately. While ensuring accuracy, the method avoids the laborious process of manual verification, saves labor cost, and solves the problem that the result models generated by existing purity detection software require manual verification. It lays a foundation for subsequent research on tumor genome evolution and intratumoral heterogeneity.
Drawings
FIG. 1 is a block flow diagram of a method for identifying tumor purity and absolute copy number based on sequencing data according to an embodiment of the present application;
FIG. 2 is a graph showing the ploidy probe support number distribution of the optimal model obtained by the method for identifying tumor purity and absolute copy number based on sequencing data according to the embodiment of the present application;
FIG. 3 is a block diagram of an apparatus for identifying tumor purity and absolute copy number based on sequencing data according to an embodiment of the present disclosure.
Detailed Description
The present application will be described in further detail below with reference to the accompanying drawings by way of specific embodiments. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted, or replaced with other elements, materials, or methods, in different instances. In some instances, certain operations related to the present application are not shown or described in detail, in order to avoid obscuring the core of the present application with excessive description; a detailed description of these operations is not necessary for those skilled in the art, who can fully understand them from the description in this specification and the general knowledge of the art.
Purity estimation software, also called purity prediction software or purity detection software, can obtain purity and copy number information models through sequencing data analysis and score the models. In practical application, however, it is found that the purity prediction software usually outputs several models at once, and the model it scores highest is often not the optimal model; the optimal model must be selected manually from the results according to certain rules. This process consumes a large amount of labor, is inefficient, and leaves the results subject to human factors.
Therefore, the present application proposes that if the purity and copy number information models output by the purity prediction software are re-analyzed, scored, and evaluated so that the highest-scoring model is the optimal purity and copy number information model, manual verification and manual selection become unnecessary; labor cost is saved, and the accuracy of the results can be ensured to the greatest extent.
Based on the above research and recognition, the present application has developed a method for identifying tumor purity and absolute copy number based on sequencing data, as shown in fig. 1, which includes a data preprocessing step 11, a purity and copy number identification step 12, a step 13 of judging whether a model meets normal distribution, a high tumor cell fraction subclone region statistics step 14, a BAF and allele1 and allele2 copy number matching rate calculation step 15, and an optimal model judgment step 16.
The data preprocessing step 11 comprises performing quality control on the raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing variant site detection on the alignment files of the paired tumor and normal samples, and annotating the detected variant sites against population databases.
In one implementation of the present application, the quality control of the raw data specifically comprises filtering adapter sequences out of the sequencing data, screening the filtered data on the percentage of bases with quality above 20, the percentage of bases with quality above 30, GC content, N content, average read length, and percentage of filtered bases, and selecting the data that meet the set thresholds. During alignment, the GATK RealignerTargetCreator and IndelRealigner modules are used to realign the regions with potential sequence insertions or deletions found during alignment, and the GATK BaseRecalibrator module is used to recalibrate the base quality values of the realigned files. After alignment, the results are deduplicated and sorted, the aligned data are screened on mapping rate, capture efficiency, contamination rate, and average sequencing depth of the target region, and the data that meet the set thresholds are selected.
Furthermore, variant site detection mainly covers single nucleotide mutations and/or insertion-deletion mutations; after variant site detection, K-value filtering is applied to the sequencing depth of each site, where K ≥ 30×. Population database annotation is performed with VEP; the population databases include the ESP6500 database, the 1000 Genomes Project database, and the ExAC human exome aggregation database; and each database is filtered to remove variant sites with population frequency n, where 1% ≤ n ≤ 5%.
The purity and copy number identification step 12 comprises using the data obtained in the data preprocessing step as the input file of the purity prediction software to obtain purity and copy number information models.
In one implementation of the present application, the purity prediction software adopted is ABSOLUTE, PureCN, Sequenza, absCN-seq, or ASCAT.
Step 13 of judging whether the model conforms to the normal distribution comprises judging whether each purity and copy number information model conforms to the normal distribution by comparing the model's ploidy probe support number distribution with its whole genome doubling (WGD) value, and deleting the purity and copy number information models that do not conform.
In one implementation of the present application, each model generated for the sample is judged as follows: if the WGD of the model is 0, the peak of the ploidy probe support number distribution should be at ploidy 2; if WGD = 1, the peaks should be at ploidy 2 and ploidy 4; if WGD = 2, the peaks should be at ploidy 4 and ploidy 8; and so on. A model that does not follow this rule is excluded.
A high tumor cell fraction subclone region statistics step 14 comprises performing subclone region screening on the purity and copy number information models that conform to the normal distribution, performing purity screening on the screened subclone regions, and accumulating the results to obtain the high tumor cell fraction subclone regions.
In one implementation of the present application, a subclone region must satisfy both the allele1 condition and the allele2 condition, namely subclonal.allele1 = 0 and subclonal.allele2 = 0; a high tumor cell fraction region must satisfy ccf.allele1 ≥ N or ccf.allele2 ≥ N, where N ≥ 0.9.
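As a rough illustration of this screening, the sketch below accumulates the probe support number over segments that pass both conditions. The dictionary keys (`subclonal_allele1`, `ccf_allele1`, `probes` and so on) are hypothetical stand-ins for the corresponding columns of the purity prediction software output, and the segment values are made up.

```python
def high_ccf_probe_total(segments, n=0.9):
    """Sum probe support over regions with subclonal flags of 0 and ccf >= n on either allele."""
    total = 0.0
    for seg in segments:
        is_region = seg["subclonal_allele1"] == 0 and seg["subclonal_allele2"] == 0
        is_high_ccf = seg["ccf_allele1"] >= n or seg["ccf_allele2"] >= n
        if is_region and is_high_ccf:
            total += seg["probes"]
    return total


# Made-up segments for illustration only
segments = [
    {"subclonal_allele1": 0, "subclonal_allele2": 0, "ccf_allele1": 0.95, "ccf_allele2": 0.40, "probes": 120},
    {"subclonal_allele1": 1, "subclonal_allele2": 0, "ccf_allele1": 0.99, "ccf_allele2": 0.99, "probes": 80},
    {"subclonal_allele1": 0, "subclonal_allele2": 0, "ccf_allele1": 0.50, "ccf_allele2": 0.60, "probes": 50},
]
r_value = high_ccf_probe_total(segments)
print(r_value)
```

Only the first segment passes both conditions here, so the accumulated value corresponds to R in Formula II for this toy model.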
A BAF and allele1/allele2 copy number matching rate calculation step 15 comprises performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, using Formula I:
Formula I: M = f ÷ (f + b)
In Formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number of segments whose BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number of segments whose BAF does not match the allele1 and allele2 copy numbers.
In one implementation of the present application, BAF matches the allele1 and allele2 copy numbers under either of two conditions: BAF = 0.5 and allele1 copy number = allele2 copy number; or BAF ≠ 0.5 and allele1 copy number ≠ allele2 copy number. All other cases are mismatches.
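The matching rule and Formula I can be sketched as follows. The segment keys (`baf`, `cn1`, `cn2`, `probes`) are hypothetical, and the small tolerance used when comparing BAF with 0.5 is an assumption added to avoid exact floating-point comparison; the text itself only states BAF = 0.5.

```python
import math


def is_match(baf, cn_allele1, cn_allele2, tol=1e-6):
    """Match when BAF and the two allele copy numbers agree on balanced vs. imbalanced."""
    balanced_baf = math.isclose(baf, 0.5, abs_tol=tol)  # tolerance is an assumption
    balanced_cn = cn_allele1 == cn_allele2
    return balanced_baf == balanced_cn


def match_rate(segments):
    """Formula I: M = f / (f + b), weighted by probe support number."""
    f = sum(s["probes"] for s in segments if is_match(s["baf"], s["cn1"], s["cn2"]))
    b = sum(s["probes"] for s in segments if not is_match(s["baf"], s["cn1"], s["cn2"]))
    return f / (f + b) if (f + b) else 0.0


# Made-up segments for illustration only
segments = [
    {"baf": 0.50, "cn1": 1, "cn2": 1, "probes": 300},  # balanced BAF, equal copy numbers: match
    {"baf": 0.33, "cn1": 1, "cn2": 2, "probes": 100},  # imbalanced BAF, unequal copy numbers: match
    {"baf": 0.50, "cn1": 2, "cn2": 1, "probes": 100},  # balanced BAF, unequal copy numbers: mismatch
]
m_value = match_rate(segments)
print(m_value)
```

Returning 0.0 for an empty input is a design choice of this sketch so that a model without usable segments cannot win the later scoring step.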
An optimal model judgment step 16 comprises multiplying the accumulated probe support number of the high tumor cell fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, and then counting the final score S; the model with the highest score is the optimal purity and copy number information model, from which accurate tumor purity and absolute copy number data are obtained.
Formula II: S = R × M
In Formula II, S represents the final score of the model judgment, R represents the accumulated probe support number of the high tumor cell fraction subclone regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers.
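A minimal sketch of Formula II and the selection of the highest-scoring model is given below. The model names and the (R, M) values are made up for illustration, and the helper names are hypothetical.

```python
def score(r, m):
    """Formula II: S = R × M."""
    return r * m


def best_model(models):
    """models maps model name -> (R, M); returns the name with the highest score S."""
    return max(models, key=lambda name: score(*models[name]))


# Made-up candidate models for illustration only
candidates = {
    "model_a": (1000.0, 0.90),  # S = 900
    "model_b": (1200.0, 0.70),  # S = 840
    "model_c": (0.0, 1.00),     # S = 0: perfect match rate but no high-fraction regions
}
chosen = best_model(candidates)
print(chosen)
```

Because S is a product, a model needs both a large accumulated probe support number and a high matching rate to be selected; `model_c` shows that a perfect matching rate alone is not enough.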
Based on next-generation sequencing data, and addressing the clinically urgent need for tumor purity and copy number information, the present application provides a method that automatically selects a tumor model and thereby quickly obtains the tumor purity and copy number. The sequencing data are aligned to a reference genome to obtain an alignment result; variant detection is performed on the alignment result to obtain site frequencies; purity and copy number information is obtained through purity and copy number identification; the models generated for the sample are then judged for conformity to the normal distribution; the probe support numbers of the high tumor cell fraction subclone regions are accumulated; and the matching rate of BAF with the allele1 and allele2 copy numbers is calculated. Through these steps, the optimal model of the purity estimation software is corrected quickly and efficiently, and tumor purity and absolute copy number information are obtained more accurately. While preserving accuracy, the method avoids the complex process of manual verification, greatly reduces labor cost, and resolves the difficulty that the result models generated by existing purity estimation software require manual verification.
Those skilled in the art will appreciate that all or part of the functions of the above-described methods may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above method are implemented by means of a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated in a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above embodiments may be implemented.
Therefore, based on the method of the present application, the present application proposes an apparatus for identifying tumor purity and absolute copy number based on sequencing data, as shown in fig. 3, which includes a data preprocessing module 31, a purity and copy number identification module 32, a module for determining whether the model conforms to normal distribution 33, a high tumor cell fraction subclone region statistics module 34, a BAF and allele1 and allele2 copy number matching rate calculation module 35 and an optimal model determination module 36.
The data preprocessing module 31 is used for performing quality control on raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing variant site detection on the paired tumor and normal alignment files, and performing population database annotation on the detected variant sites. The purity and copy number identification module 32 is used for taking the data obtained by the data preprocessing module as the input file of the purity prediction software to obtain a purity and copy number information model. The module 33 for judging whether the model conforms to the normal distribution is used for judging whether the purity and copy number information model conforms to the normal distribution by comparing the ploidy-type probe support number distribution of the model with its whole genome doubling (WGD) value, and deleting any purity and copy number information model that does not conform to the normal distribution. The high tumor cell fraction subclone region statistics module 34 is used for performing subclone region screening on the purity and copy number information models that conform to the normal distribution, performing purity screening on the screened subclone regions, and accumulating the results to obtain the high tumor cell fraction subclone regions. The BAF and allele1/allele2 copy number matching rate calculation module 35 is used for performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, using Formula I,
Formula I: M = f ÷ (f + b)
In Formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number of segments whose BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number of segments whose BAF does not match the allele1 and allele2 copy numbers;
The optimal model judgment module 36 is used for multiplying the accumulated probe support number of the high tumor cell fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, and counting the final score S; the model with the highest score is the optimal purity and copy number information model, from which accurate tumor purity and absolute copy number data are obtained,
Formula II: S = R × M
In Formula II, S represents the final score of the model judgment, R represents the accumulated probe support number of the high tumor cell fraction subclone regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers.
The device can realize the identification of the tumor purity and the absolute copy number based on the sequencing data by utilizing the mutual coordination of the modules, and particularly, the modules of the device can realize the corresponding steps in the method for identifying the tumor purity and the absolute copy number based on the sequencing data, thereby realizing the automatic, fast and efficient correction of the optimal model of the purity detection software result.
In another implementation of the present application, there is also provided an apparatus for identifying tumor purity and absolute copy number based on sequencing data, the apparatus comprising a memory and a processor. The memory is used for storing a program. The processor is used for executing the program stored in the memory to implement the following method: a data preprocessing step, comprising performing quality control on raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing variant site detection on the paired tumor and normal alignment files, and performing population database annotation on the detected variant sites; a purity and copy number identification step, comprising using the data obtained in the data preprocessing step as the input file of purity prediction software to obtain a purity and copy number information model; a step of judging whether the model conforms to the normal distribution, comprising judging whether the purity and copy number information model conforms to the normal distribution by comparing the ploidy-type probe support number distribution of the model with its whole genome doubling (WGD) value, and deleting any purity and copy number information model that does not conform to the normal distribution; a high tumor cell fraction subclone region statistics step, comprising performing subclone region screening on the purity and copy number information models that conform to the normal distribution, performing purity screening on the screened subclone regions, and accumulating the results to obtain the high tumor cell fraction subclone regions; a BAF and allele1/allele2 copy number matching rate calculation step, comprising performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, calculated as shown in Formula I; and an optimal model judgment step, comprising multiplying the accumulated probe support number of the high tumor cell fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, and counting the final score S, the model with the highest score being the optimal purity and copy number information model, from which accurate tumor purity and absolute copy number data are obtained.
In another implementation, there is also provided a computer-readable storage medium including a program, the program being executable by a processor to perform the following method: a data preprocessing step, comprising performing quality control on raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing variant site detection on the paired tumor and normal alignment files, and performing population database annotation on the detected variant sites; a purity and copy number identification step, comprising using the data obtained in the data preprocessing step as the input file of purity prediction software to obtain a purity and copy number information model; a step of judging whether the model conforms to the normal distribution, comprising judging whether the purity and copy number information model conforms to the normal distribution by comparing the ploidy-type probe support number distribution of the model with its whole genome doubling (WGD) value, and deleting any purity and copy number information model that does not conform to the normal distribution; a high tumor cell fraction subclone region statistics step, comprising performing subclone region screening on the purity and copy number information models that conform to the normal distribution, performing purity screening on the screened subclone regions, and accumulating the results to obtain the high tumor cell fraction subclone regions; a BAF and allele1/allele2 copy number matching rate calculation step, comprising performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, calculated as shown in Formula I; and an optimal model judgment step, comprising multiplying the accumulated probe support number of the high tumor cell fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, and counting the final score S, the model with the highest score being the optimal purity and copy number information model, from which accurate tumor purity and absolute copy number data are obtained.
The terms and their abbreviations of the present application have the following meanings:
BAF is an abbreviation for B Allele Frequency. Copy number variation is abbreviated CNV; somatic copy number alterations are abbreviated SCNAs; whole genome doubling is abbreviated WGD; allele 1 is written allele1; allele 2 is written allele2; tumor cell fraction is abbreviated ccf; and the ploidy type is written ploidy.
The present application is described in further detail below with reference to specific embodiments and the attached drawings. The following examples are intended to be illustrative of the present application only and should not be construed as limiting the present application.
Examples
In this example, for one melanoma case, numbered 179008702TD, exome sequencing data from normal and tumor tissue were obtained, and the tumor purity and absolute copy number of the melanoma were identified from the sequencing data as follows:
Data preprocessing step: quality control is performed on the raw sequencing data of the tumor and normal samples; the quality-controlled data are aligned to a reference genome to obtain alignment files in BAM format; variant site detection is performed on the paired tumor and normal alignment files; and population database annotation is performed on the detected variant sites.
Quality control of the raw sequencing data comprises removing sequencing adapter sequences from the obtained sequencing data to obtain filtered data, performing quality control on the filtered data with the fastp software, and selecting data that meet the following thresholds: Q20 > 90%, Q30 > 85%, GC content > 40% and < 60%, N content < 10.00%, average read length > 90 bp and < 110 bp, and Clean_base_ratio > 80%. The quality-controlled data are aligned to the human reference genome (GRCh37) with the BWA-MEM software, and the alignment results are directly de-duplicated and sorted. The GATK RealignerTargetCreator and IndelRealigner modules are then used to realign regions with potential sequence insertions or deletions found during alignment, and the GATK BaseRecalibrator module is used to recalibrate the base quality values of the realigned files.
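The quality-control thresholds listed above can be expressed as a single check, sketched below. The metric names in the dictionary are hypothetical and are not the field names of the fastp report; mapping a real report onto them is left to the reader.

```python
def passes_qc(metrics):
    """Apply the thresholds of this example to one sample's summary metrics (fractions, bp)."""
    return (
        metrics["q20"] > 0.90
        and metrics["q30"] > 0.85
        and 0.40 < metrics["gc_content"] < 0.60
        and metrics["n_content"] < 0.10
        and 90 < metrics["mean_read_length"] < 110
        and metrics["clean_base_ratio"] > 0.80
    )


# Made-up metrics for illustration only
good_sample = {
    "q20": 0.97, "q30": 0.92, "gc_content": 0.48,
    "n_content": 0.0001, "mean_read_length": 100, "clean_base_ratio": 0.95,
}
low_q30_sample = dict(good_sample, q30=0.80)

print(passes_qc(good_sample), passes_qc(low_q30_sample))
```

All comparisons are strict, mirroring the ">" and "<" signs in the thresholds as stated.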
Variant site detection in this example applies a next-generation-sequencing-based variant detection method to the processed BAM files to obtain site frequency information. The paired-sample mode of GATK MuTect2 is used to detect variant sites in the melanoma sample, producing single nucleotide variants and insertion-deletion mutations in a VCF-format result file, and the variant sites are filtered on the condition of sequencing depth ≥ 30×. The filtered VCF file is annotated with the annotation software VEP against the ESP6500 database, the 1000 Genomes Project database and the ExAC human exome database, and variant sites with a population frequency ≥ 1‰ in any population database are filtered out.
Purity and copy number identification step: in this example, purity detection software is used to analyze the sample and obtain a purity and copy number information model. Specifically, the condition-filtered file is used as the input file of the tumor purity prediction software ABSOLUTE, and the program is run to obtain the purity and copy number information models.
Step of judging whether the model conforms to the normal distribution: each model generated by the ABSOLUTE software is judged; a model that does not conform to the normal distribution is discarded, and a model that conforms proceeds to the subsequent model screening calculations.
In this example, whether a purity and copy number information model conforms to the normal distribution is judged by comparing the ploidy-type probe support number distribution of the model with its whole genome doubling (WGD) value. Specifically, if WGD is 0, the peak of the ploidy-type probe support number distribution should be at ploidy 2; if WGD is 1, the peaks should be at ploidy 2 and ploidy 4; if WGD is 2, the peaks should be at ploidy 4 and ploidy 8, and so on. A model that does not follow this rule is judged not to conform to the normal distribution and is deleted.
High tumor cell fraction subclone region statistics step: this step accumulates the probe support numbers of the high tumor cell fraction subclone regions, that is, subclone region screening is performed on the purity and copy number information models that conform to the normal distribution, purity screening is performed on the screened subclone regions, and the results are accumulated to obtain the high tumor cell fraction subclone regions.
In this example, a region with subclonal.allele1 = 0 and subclonal.allele2 = 0 is judged to be a subclone region; all subclone regions are then screened against the N value, and those satisfying N ≥ 0.9 are defined as high tumor cell fraction subclone regions; the probe numbers of all screened high tumor cell fraction subclone regions are accumulated to obtain the accumulated probe support number of the high tumor cell fraction subclone regions.
BAF and allele1/allele2 copy number matching rate calculation step: consistency statistics are performed on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, using Formula I,
Formula I: M = f ÷ (f + b)
In Formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number of segments whose BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number of segments whose BAF does not match the allele1 and allele2 copy numbers.
In this example, BAF matches the allele1 and allele2 copy numbers under either of two conditions: BAF = 0.5 and allele1 copy number = allele2 copy number; or BAF ≠ 0.5 and allele1 copy number ≠ allele2 copy number. All other cases are mismatches.
Finally, the optimal model is obtained through the optimal model judgment step. Specifically, the accumulated probe support number of the high tumor cell fraction subclone regions is multiplied by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, and the final score S is counted; the model with the highest score is the optimal purity and copy number information model, from which accurate tumor purity and absolute copy number data are obtained,
Formula II: S = R × M
In Formula II, S represents the final score of the model judgment, R represents the accumulated probe support number of the high tumor cell fraction subclone regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers.
The calculation results for all models of the melanoma test sample of this example are shown in Table 3. The ploidy-type probe support numbers of Model_2 conform to the normal distribution: as shown in FIG. 2, which plots the ploidy-type probe support number distribution of the test sample, the peaks lie at ploidy 2 and ploidy 4, consistent with WGD = 1, so the model passes the normal distribution judgment. For Model_2, the accumulated probe support number of the high tumor cell fraction subclone regions is 62835.59, the matching rate of BAF with the allele1 and allele2 copy numbers is 0.9542, and the score is 59957.72. For Model_1, the accumulated probe support number of the high tumor cell fraction subclone regions is 3840.22, the matching rate of BAF with the allele1 and allele2 copy numbers is 1, and the score is 3840.22. Since the Model_2 score is greater than the Model_1 score and is the maximum among all models of the test sample, Model_2 is determined to be the optimal model for the sample, that is, purity is 0.66 and ploidy is 2.9.
Therefore, the optimal model obtained by the method of this example is Model_2, whose details are shown in Table 2: purity is 0.66 and ploidy is 2.9.
Meanwhile, in this example the same test sample was processed with the purity prediction software ABSOLUTE alone; its highest-scoring model was Model_1, whose details are shown in Table 1: purity is 0.36 and ploidy is 9.31.
TABLE 1 Model with the highest score from the purity prediction software ABSOLUTE
[Table 1 is provided as images in the original publication.]
TABLE 2 Optimal model obtained by the method for identifying tumor purity and absolute copy number based on sequencing data
[Table 2 is provided as images in the original publication.]
TABLE 3 Statistical results of all model calculations for the test sample
choose_index  model_name  alleleMatch_ratio  alleleMatch_length  alleleMisMatch_length  high_ccf_length  ploidy_main  purity  ploidy
1             Model_2     0.9542             1442.37             69.2973                62835.59         2            0.66    2.9
2             Model_3     1                  1281.30             0                      51403.32         6            0.39    8.08
3             Model_5     0.9864             2392.00             32.9407                35014.16         4            0.49    5.72
4             Model_7     0.997              1282.36             3.8978                 26546.38         5            0.43    7
5             Model_6     0.9968             1222.81             3.8978                 24137.71         7            0.75    8.73
6             Model_1     1                  767.55              0                      3840.22          8            0.36    9.31
7             Model_4     0.9911             2295.87             20.6861                0                6            0.96    7.65
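The figures in Table 3 can be re-derived with Formula I and Formula II. The sketch below assumes that alleleMatch_length corresponds to f, alleleMisMatch_length to b, and high_ccf_length to R, and that the matching rate is rounded to four decimals before scoring; these readings are inferred from the column names and from the score reported for Model_2, not stated in the text.

```python
# Rows copied from Table 3:
# (model_name, alleleMatch_length, alleleMisMatch_length, high_ccf_length)
TABLE_3 = [
    ("Model_2", 1442.37, 69.2973, 62835.59),
    ("Model_3", 1281.30, 0.0, 51403.32),
    ("Model_5", 2392.00, 32.9407, 35014.16),
    ("Model_7", 1282.36, 3.8978, 26546.38),
    ("Model_6", 1222.81, 3.8978, 24137.71),
    ("Model_1", 767.55, 0.0, 3840.22),
    ("Model_4", 2295.87, 20.6861, 0.0),
]


def rescore(rows):
    """Return (name, M, S) tuples sorted by descending score S."""
    scored = []
    for name, f, b, r in rows:
        m = round(f / (f + b), 4)        # Formula I, rounded as in the table (assumption)
        scored.append((name, m, r * m))  # Formula II
    return sorted(scored, key=lambda row: row[2], reverse=True)


ranking = rescore(TABLE_3)
for name, m, s in ranking:
    print(f"{name}  M={m:.4f}  S={s:.2f}")
```

Under these assumptions the ranking follows the choose_index order of Table 3, with Model_2 first and Model_4 last because its accumulated high tumor cell fraction probe support number is 0.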
Compared with the octaploid result predicted by the ABSOLUTE software, the tumor purity and absolute copy number obtained by the method of this example for identifying tumor purity and absolute copy number based on sequencing data are more realistic and credible.
The foregoing is a more detailed description of the present application in connection with specific embodiments thereof, and it is not intended that the present application be limited to the specific embodiments thereof. It will be apparent to those skilled in the art from this disclosure that many more simple derivations or substitutions can be made without departing from the spirit of the disclosure.

Claims (10)

1. A method for identifying tumor purity and absolute copy number based on sequencing data, characterized by comprising the following steps:
a data preprocessing step, comprising performing quality control on raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing variant site detection on the paired tumor and normal alignment files, and performing population database annotation on the detected variant sites;
a purity and copy number identification step, comprising using the data obtained in the data preprocessing step as the input file of purity prediction software to obtain a purity and copy number information model;
a step of judging whether the model conforms to the normal distribution, comprising judging whether the purity and copy number information model conforms to the normal distribution by comparing the ploidy-type probe support number distribution of the model with its whole genome doubling (WGD) value, and deleting any purity and copy number information model that does not conform to the normal distribution;
a high tumor cell fraction subclone region statistics step, comprising performing subclone region screening on the purity and copy number information models that conform to the normal distribution, performing purity screening on the screened subclone regions, and accumulating the results to obtain the high tumor cell fraction subclone regions;
a BAF and allele1/allele2 copy number matching rate calculation step, comprising performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software to obtain the proportion of consistent segments, using Formula I,
Formula I: M = f ÷ (f + b)
in Formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number of segments whose BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number of segments whose BAF does not match the allele1 and allele2 copy numbers;
an optimal model judgment step, comprising multiplying the accumulated probe support number of the high tumor cell fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, and counting the final score S, the model with the highest score being the optimal purity and copy number information model, thereby obtaining accurate tumor purity and absolute copy number data,
Formula II: S = R × M
in Formula II, S represents the final score of the model judgment, R represents the accumulated probe support number of the high tumor cell fraction subclone regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers.
2. The method of claim 1, wherein: in the data preprocessing step, the variant site detection comprises detection of single nucleotide variants and/or detection of insertion-deletion mutations;
preferably, the variant site detection further comprises filtering the sequencing depth of each site by a K value, wherein K ≥ 30×;
preferably, the software used for population database annotation is VEP;
preferably, the population database comprises at least one of the ESP6500 database, the 1000 Genomes Project database and the ExAC human exome database;
preferably, before the population database annotation of the detected variant sites, the method further comprises filtering the population database to remove variant sites with a population frequency of n, wherein 1% ≤ n ≤ 5%.
3. The method of claim 1, wherein: in the purity and copy number identification step, the purity prediction software is ABSOLUTE, PureCN, Sequenza, absCN-seq or ASCAT.
4. The method of claim 1, wherein: the step of judging whether the model conforms to the normal distribution specifically comprises: if WGD is 0, the peak of the ploidy-type probe support number distribution should be at ploidy 2; if WGD is 1, the peaks of the ploidy-type probe support number distribution should be at ploidy 2 and ploidy 4; if WGD is 2, the peaks of the ploidy-type probe support number distribution should be at ploidy 4 and ploidy 8, and so on; a purity and copy number information model that does not follow this rule is judged not to conform to the normal distribution and is deleted.
5. The method of claim 1, wherein: the high tumor cell fraction subclone region statistics step specifically comprises judging a region with subclonal.allele1 = 0 and subclonal.allele2 = 0 to be a subclone region; screening all subclone regions against the N value, those satisfying N ≥ 0.9 being defined as high tumor cell fraction subclone regions; and accumulating the probe numbers of all screened high tumor cell fraction subclone regions to obtain the accumulated probe support number of the high tumor cell fraction subclone regions.
6. The method of claim 1, wherein: in the BAF and allele1/allele2 copy number matching rate calculation step, BAF matches the allele1 and allele2 copy numbers under either of two conditions: BAF = 0.5 and allele1 copy number = allele2 copy number; or BAF ≠ 0.5 and allele1 copy number ≠ allele2 copy number; all other cases are mismatches.
7. The method according to any one of claims 1-6, wherein: in the data preprocessing step, the quality control of the raw sequencing data of the tumor and normal samples specifically comprises filtering adapter sequences from the raw sequencing data, and screening the filtered data on the proportion of bases with quality above 20, the proportion of bases with quality above 30, the GC content, the N content, the average read length and the proportion of bases retained after filtering;
preferably, when the quality-controlled data are aligned to the reference genome, the method further comprises realigning regions with potential sequence insertions or sequence deletions found during alignment, and recalibrating the base quality values of the realigned files;
preferably, after the quality-controlled data are aligned to the reference genome, the method further comprises de-duplicating and sorting the alignment results, and screening the alignment results on the mapping rate, the capture efficiency, the contamination rate and the average sequencing depth of the target region.
8. An apparatus for identifying tumor purity and absolute copy number based on sequencing data, characterized by comprising a data preprocessing module, a purity and copy number identification module, a module for judging whether the model conforms to a normal distribution, a high tumor cell fraction subclone region statistics module, a module for calculating the matching rate of BAF with the allele1 and allele2 copy numbers, and an optimal model judgment module, wherein:
the data preprocessing module is used for performing quality control on the raw sequencing data of the tumor and normal samples, aligning the quality-controlled data to a reference genome, performing mutation site detection on the paired alignment files of the tumor and normal samples, and annotating the detected mutation sites against population databases;
the purity and copy number identification module is used for taking the data obtained by the data preprocessing module as the input file of purity prediction software to obtain purity and copy number information models;
the module for judging whether the model conforms to a normal distribution is used for further judging whether each purity and copy number information model conforms to a normal distribution, by comparing the probe support number distribution of the model's ploidy with whole-genome doubling (WGD), and for deleting the purity and copy number information models that do not conform to a normal distribution;
the high tumor cell fraction subclone region statistics module is used for screening the subclone regions of each purity and copy number information model that conforms to a normal distribution, screening the selected subclone regions by purity, and accumulating the results to obtain the high tumor cell fraction subclone regions;
the module for calculating the matching rate of BAF with the allele1 and allele2 copy numbers is used for performing consistency statistics on the BAF and the allele1 and allele2 copy numbers calculated by the purity prediction software, to obtain the proportion of consistent segments, calculated as shown in Formula I,
Formula I: $M = f \div (f + b)$
In Formula I, M represents the matching rate of BAF with the allele1 and allele2 copy numbers, f represents the probe support number of segments whose BAF matches the allele1 and allele2 copy numbers, and b represents the probe support number of segments whose BAF does not match the allele1 and allele2 copy numbers;
the optimal model judgment module is used for multiplying the accumulated probe support number of the high tumor cell fraction subclone regions by the matching rate of BAF with the allele1 and allele2 copy numbers, as shown in Formula II, to obtain the final score S; the model with the highest score is the optimal purity and copy number information model, from which accurate tumor purity and absolute copy number data are obtained,
Formula II: $S = R \times M$
In Formula II, S represents the final score of the model judgment, R represents the accumulated probe support number of the high tumor cell fraction subclone regions, and M represents the matching rate of BAF with the allele1 and allele2 copy numbers.
9. An apparatus for identifying tumor purity and absolute copy number based on sequencing data, characterized in that the apparatus comprises a memory and a processor;
the memory is used for storing a program;
the processor is used for implementing the method of any one of claims 1-7 by executing the program stored in the memory.
10. A computer-readable storage medium, characterized in that the storage medium contains a program executable by a processor to implement the method of any one of claims 1-7.
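As an illustration only, the matching rule of claim 6 and the two formulas of claim 8 can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the `Segment` and `Model` structures, their field names, and the `tol` tolerance around BAF = 0.5 are hypothetical and do not appear in the claims, and the value R is taken as already computed by the subclone region statistics step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One copy-number segment (hypothetical structure, not from the claims)."""
    baf: float        # B-allele frequency reported for the segment
    cn_allele1: int   # allele1 copy number
    cn_allele2: int   # allele2 copy number
    probes: int       # probe support number of the segment


@dataclass
class Model:
    """One candidate purity and copy number model (hypothetical structure)."""
    name: str
    segments: List[Segment]
    high_tcf_probe_sum: int  # R: accumulated probe support of high-TCF subclone regions


def is_match(seg: Segment, tol: float = 0.0) -> bool:
    """Claim 6 rule: match when BAF balance agrees with allelic balance.

    tol is an assumption: with tol = 0.0 the test is the literal BAF = 0.5.
    """
    balanced_baf = abs(seg.baf - 0.5) <= tol
    balanced_cn = seg.cn_allele1 == seg.cn_allele2
    return balanced_baf == balanced_cn


def matching_rate(segments: List[Segment], tol: float = 0.0) -> float:
    """Formula I: M = f / (f + b), weighted by probe support number."""
    f = sum(s.probes for s in segments if is_match(s, tol))
    b = sum(s.probes for s in segments if not is_match(s, tol))
    total = f + b
    return f / total if total else 0.0  # empty input scores 0 (assumption)


def final_score(model: Model, tol: float = 0.0) -> float:
    """Formula II: S = R * M."""
    return model.high_tcf_probe_sum * matching_rate(model.segments, tol)


def best_model(models: List[Model], tol: float = 0.0) -> Model:
    """Pick the candidate model with the highest final score S."""
    return max(models, key=lambda m: final_score(m, tol))
```

Measured BAF values are rarely exactly 0.5, so a practical pipeline would probably pass a non-zero `tol`; the appropriate value is not given in the claims and would have to be chosen by the user.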
CN202010567812.XA 2020-06-19 2020-06-19 Method and device for identifying tumor purity and absolute copy number based on sequencing data Active CN111755068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010567812.XA CN111755068B (en) 2020-06-19 2020-06-19 Method and device for identifying tumor purity and absolute copy number based on sequencing data

Publications (2)

Publication Number Publication Date
CN111755068A true CN111755068A (en) 2020-10-09
CN111755068B CN111755068B (en) 2021-02-19

Family

ID=72676360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010567812.XA Active CN111755068B (en) 2020-06-19 2020-06-19 Method and device for identifying tumor purity and absolute copy number based on sequencing data

Country Status (1)

Country Link
CN (1) CN111755068B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160019336A1 (en) * 2006-05-03 2016-01-21 Population Diagnostics, Inc. Evaluating genetic disorders
WO2010033578A2 (en) * 2008-09-20 2010-03-25 The Board Of Trustees Of The Leland Stanford Junior University Noninvasive diagnosis of fetal aneuploidy by sequencing
CN103619871A (en) * 2011-04-22 2014-03-05 惠氏有限责任公司 Compositions relating to a mutant clostridium difficile toxin and methods thereof
CN103014029A (en) * 2012-12-28 2013-04-03 贵州省烟草科学研究院 Nicotiana tabacum isoflavone reductase-like (NtIRL) gene and application thereof
CN109196359A (en) * 2016-02-29 2019-01-11 基础医疗股份有限公司 For assessing the method and system of Tumor mutations load
CN108473975A (en) * 2016-11-17 2018-08-31 领星生物科技(上海)有限公司 The system and method for detecting tumor development
CN108256294A (en) * 2016-12-29 2018-07-06 安诺优达基因科技(北京)有限公司 A kind of device for being used to detect somatic mutation
US10381105B1 (en) * 2017-01-24 2019-08-13 Bao Personalized beauty system
CN111212849A (en) * 2017-06-30 2020-05-29 纪念斯隆-凯特林癌症中心 Compositions and methods for adoptive cell therapy of cancer
CN108733975A (en) * 2018-03-29 2018-11-02 深圳裕策生物科技有限公司 Tumor colonies mutation detection method, device and storage medium based on the sequencing of two generations
CN109033749A (en) * 2018-06-29 2018-12-18 深圳裕策生物科技有限公司 A kind of Tumor mutations load testing method, device and storage medium
CN110189796A (en) * 2019-05-27 2019-08-30 新疆农业大学 A kind of sheep full-length genome resurveys sequence analysis method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
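CAO X et al.: "De novo transcriptome sequencing and analysis of Euphorbia pekinensis Rupr and identification of genes involved in diterpenoid biosynthesis", 《PLANT GENE》 *
慧芳 et al.: "Application of transcriptome sequencing technology in medicinal plant research", 《中草药》 (Chinese Traditional and Herbal Drugs) *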

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669906A (en) * 2020-11-25 2021-04-16 深圳华大基因股份有限公司 Detection method, device, terminal device and computer-readable storage medium for measuring genome instability
CN113658638A (en) * 2021-08-20 2021-11-16 江苏先声医学诊断有限公司 Detection method and quality control system for homologous recombination defects based on NGS platform
CN113658638B (en) * 2021-08-20 2022-06-03 江苏先声医学诊断有限公司 Detection method and quality control system for homologous recombination defects based on NGS platform
CN113889187A (en) * 2021-09-24 2022-01-04 上海仁东医学检验所有限公司 Single-sample allele copy number variation detection method, probe set and kit
WO2023087553A1 (en) * 2021-11-18 2023-05-25 上海思路迪生物医学科技有限公司 Cnv determination processing method and apparatus, and electronic device and storage medium
CN115404275A (en) * 2022-08-17 2022-11-29 中山大学·深圳 Method for evaluating tumor purity based on nanopore sequencing technology

Also Published As

Publication number Publication date
CN111755068B (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN111755068B (en) Method and device for identifying tumor purity and absolute copy number based on sequencing data
CN108573125B (en) Method for detecting genome copy number variation and device comprising same
CN105886616B (en) Efficient specific sgRNA recognition site guide sequence for pig gene editing and screening method thereof
CN107480470B (en) Known variation detection method and device based on Bayesian and Poisson distribution test
CN109411015B (en) Tumor mutation load detection device based on circulating tumor DNA and storage medium
RU2654575C2 (en) Method for detecting chromosomal structural abnormalities and device therefor
CN109949861B (en) Tumor mutation load detection method, device and storage medium
US20220130488A1 (en) Methods for detecting copy-number variations in next-generation sequencing
CN108920899B (en) Single exon copy number variation prediction method based on target region sequencing
CN110808081B (en) Model construction method for identifying tumor purity sample and application
JP6066924B2 (en) DNA sequence data analysis method
CN111916150A (en) Method and device for detecting genome copy number variation
CN110060733B (en) Second-generation sequencing tumor somatic variation detection device based on single sample
KR20200107774A (en) How to align targeting nucleic acid sequencing data
CN115064209B (en) Malignant cell identification method and system
CN110621785A (en) Method and device for typing diploid genome haploid based on third generation capture sequencing
CN113674803A (en) Detection method of copy number variation and application thereof
CN105483210A (en) RNA (ribonucleic acid) editing locus detection method
CN113789371A (en) Method for detecting copy number variation based on batch correction
KR101770962B1 (en) A method and apparatus of providing information on a genomic sequence based personal marker
CN114067908B (en) Method, device and storage medium for evaluating single-sample homologous recombination defects
CN114990202B (en) Application of SNP (Single nucleotide polymorphism) locus in evaluation of genome abnormality and method for evaluating genome abnormality
WO2017083310A1 (en) A normalization method for sample assays
CN114067909B (en) Method, device and storage medium for correcting homologous recombination defect score
Kielpinski et al. Reproducible analysis of sequencing-based RNA structure probing data with user-friendly tools

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant