CN115064211A - ctDNA prediction method based on whole genome methylation sequencing and application thereof - Google Patents

ctDNA prediction method based on whole genome methylation sequencing and application thereof Download PDF

Info

Publication number
CN115064211A
CN115064211A CN202210977623.9A CN202210977623A CN115064211A CN 115064211 A CN115064211 A CN 115064211A CN 202210977623 A CN202210977623 A CN 202210977623A CN 115064211 A CN115064211 A CN 115064211A
Authority
CN
China
Prior art keywords
sample
methylation
block
principal component
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210977623.9A
Other languages
Chinese (zh)
Other versions
CN115064211B (en
Inventor
韩天澄
李苏星
李溪
李宇龙
洪媛媛
张琦
黄宇
陈维之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhenyue Biotechnology Jiangsu Co ltd
Zhenhe Beijing Biotechnology Co ltd
Original Assignee
Zhenyue Biotechnology Jiangsu Co ltd
Zhenhe Beijing Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhenyue Biotechnology Jiangsu Co ltd, Zhenhe Beijing Biotechnology Co ltd filed Critical Zhenyue Biotechnology Jiangsu Co ltd
Priority to CN202210977623.9A priority Critical patent/CN115064211B/en
Publication of CN115064211A publication Critical patent/CN115064211A/en
Application granted granted Critical
Publication of CN115064211B publication Critical patent/CN115064211B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a ctDNA prediction method based on whole genome methylation sequencing and application thereof. The prediction method comprises the following steps: dividing a genome into blocks with equal length according to a human reference genome FASTA file and a BED file, and generating a block list file capable of calculating methylation levels; calculating the methylation level of a training set sample and a sample to be tested on each block, reducing the methylation level of the training set sample and the sample to be tested, calculating a rotation matrix according to the training set sample by using a principal component analysis method, generating a training set principal component matrix and a sample principal component matrix to be tested by using the rotation matrix, constructing a methylation model by using the training set principal component matrix, and predicting ctDNA in the sample to be tested by using the trained model. The method overcomes the defects of inaccurate methylation level calculation and low sensitivity in low-depth methylation sequencing, and has important application value in tumor screening or real-time monitoring.

Description

ctDNA prediction method based on whole genome methylation sequencing and application thereof
Technical Field
The invention belongs to the technical field of biomedicine, and particularly relates to a ctDNA prediction method based on whole genome methylation sequencing and application thereof.
Background
Circulating tumor DNA (ctDNA) is one of circulating cell-free DNA (cfDNA) that is produced by tumor cells due to secretion, apoptosis, or necrosis. The half-life of the ctDNA in blood is short, and the ctDNA carries the characteristic of partial tumor cells, has good compliance compared with tissue biopsy, and can be used for early screening or real-time monitoring of tumor patients. Methylation, which is an important link in regulation of gene expression, can also affect the stability of the genome, in addition to Single Nucleotide Polymorphisms (SNPs), insertion-deletion markers (indels), and Copy Number Variations (CNVs). There may be significant differences between ctDNA of tumor patients and cfDNA of healthy people for methylation status at some specific sites or regions. In addition to the methylation status of specific sites or regions, there are also reports in the literature of differences in the global methylation levels between tumor patients and healthy people across the whole genome. Thus, by detecting methylation status from plasma, the presence of ctDNA in plasma can be identified at an early stage of tumorigenesis or in post-tumor monitoring, providing a data basis for early diagnosis or recurrence prediction of subsequent cancer.
In recent years, methylation sequencing has been applied to improve the detection sensitivity of ctDNA to some extent, but mostly achieved by reducing the sequencing region and increasing the sequencing depth. Under the strategy, the number of detected methylation sites and the size of the region are limited, and some important regions with detection significance are easy to miss. The whole genome methylation sequencing has lower depth but more complete coverage sites, and can easily capture the change of methylation level in the whole genome.
Since each sequencing mode requires bisulfite or enzymatic conversion of the DNA prior to sequencing, the conversion efficiency has a large impact on the accuracy of the methylation level calculation. Since the occupancy of ctDNA in patient plasma samples is usually very low, the signal of ctDNA is very easily buried in noise, and there is still a large optimization space for the sensitivity and specificity of detection based on low-depth whole-genome methylation sequencing.
In addition, the jagged end phenomenon has been reported in some studies. This phenomenon occurs mainly in cfDNA samples, mainly reflected by the inconsistency in DNA double strand break positions. In the double-strand library construction, the sawtooth ends are trimmed or repaired into flat ends (the starting and ending positions of double strands are consistent), and when the repair is completed, the used cytosines are all cytosines without methyl group modification. Without additional processing, these cytosines would be judged as unmethylated in the sequencing, resulting in an underestimation of the methylation level at the corresponding position.
Therefore, the method for predicting the ctDNA can be provided quickly, conveniently and accurately, the probability of the ctDNA existing in the sample is evaluated, and the method has important significance for early screening or real-time monitoring of tumor patients.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a ctDNA prediction method based on whole genome methylation sequencing and application thereof. The prediction method can be applied to a plasma sample, and overcomes the technical defects of inaccurate methylation level calculation and low sensitivity in the conventional low-depth methylation sequencing.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a ctDNA prediction method based on whole genome methylation sequencing, the prediction method comprising the following steps:
dividing a genome into blocks with equal length according to a human reference genome FASTA file and a BED file, and generating a block list file capable of calculating methylation levels;
calculating the methylation level of a training set sample and a sample to be detected on each block, reducing the dimension of the methylation levels of the training set sample and the sample to be detected, calculating a rotation matrix according to the training set sample by using a principal component analysis method, and generating a training set principal component matrix and a sample to be detected by using the rotation matrix;
and (3) carrying out methylation model construction by utilizing the training set principal component matrix, and predicting ctDNA in the sample to be tested by using the trained model.
Preferably, the dividing of the genome into blocks of equal length is performed by:
dividing the genome into blocks with equal length according to a FASTA file and a BED file of the human reference genome, and filtering according to the position information of all CpG sites in the whole genome range and the proportion of multiple alignment regions to generate a block list capable of calculating the methylation level.
Preferably, the calculation method of the methylation level comprises:
respectively obtaining capture sequencing FASTQ files of training set samples and samples to be tested, carrying out sequence comparison, respectively generating training set samples and original Bam files of the samples to be tested, carrying out sequencing and marking repeated processing on the original Bam files, respectively generating sequenced and de-duplicated Bam files of the training set samples and the samples to be tested, and recording the sequenced and de-duplicated Bam files as sample samplesjAfter reorderingA Bam file;
mixing the samplejCalculating samples by using the reordered Bam file and the block list capable of calculating the methylation level as input filesjRecorded in a list fileIGreat-length blockR 1 ,⋯,R I Corrected methylation levels on (v), and combining corrected methylation levels of all samples to oneILine ofJMethylation level matrix of columnsX
Preferably, the training set samples are plasma samples of normal population and plasma samples of patients; the sample to be detected is a plasma sample of a detected person;
preferably, the calculation of the methylation level further comprises the step of subjecting the sample tojIn a blockR i Upper fragment setS 1 ,⋯,S N Each fragment inS n Cutting off the front of the beginning of sequence 2 on each fragment XM tagAInformation of individual bases.
In the invention, the block is cut off by the sawtooth tail end so as to reduce the sawtooth tail end phenomenonR i Average methylation level ofbeta i,j Block, blockR i Average conversion efficiency ofBS i,j Fragment, fragmentS n Methylation level ofbeta i,j,n Fragment of (A) A. sup. beta. CelastrusS n Transformation efficiency ofBS i,j,n The impact of the calculation.
Preferably, the samples are calculatedjRecorded in a list fileIGreat-length blockR 1 ,⋯,R I Methods for correcting the level of post-methylation on (H) } include:
the first scheme is as follows: calculating block conversion efficiency according to the state of cytosine on the non-methylation site, and correcting the block methylation level by using the block conversion efficiency;
or scheme two: calculating the conversion efficiency of the fragments according to the state of cytosine on the non-methylation site of each original gene fragment, and calculating the methylation level of the block after filtering the fragments with low conversion rate.
Preferably, in the first embodiment, the formula for calculating the methylation level is as follows:
Figure 372017DEST_PATH_IMAGE001
wherein,beta_corrected i,j is a samplejIn a blockR i Corrected average methylation level of;
beta i,j is a samplejIn a blockR i Average methylation level of;
BS i,j is a samplejIn a blockR i Average conversion efficiency of (a).
Preferably, in the second scheme, the calculation formula of the methylation level is as follows:
Figure 329609DEST_PATH_IMAGE002
Figure 535462DEST_PATH_IMAGE003
wherein,MFR i,j is a samplejIn a blockR i The ratio of methylated fragments of (a);
UFR i,j is a samplejIn a blockR i The fraction of unmethylated fragments of (a);
N convertedi,j is composed of
Figure 527689DEST_PATH_IMAGE004
The statistical number of segments of (a) is,BS i,j,n is a samplejIn a blockR i Upper segmentnThe conversion efficiency of (a);
N meth i,j is the statistical number of methylated fragments;
N unmeth i,j is the statistical number of unmethylated fragments.
Preferably, the criteria for the methylated fragments and unmethylated fragments are:
Figure 648878DEST_PATH_IMAGE005
is judged to be a methylated fragment,
Figure 93766DEST_PATH_IMAGE006
is judged to be a non-methylated fragment,beta i,j,n is a samplejIn a blockR i Upper segmentnThe average methylation level of (a) is,
Figure 86999DEST_PATH_IMAGE007
preferably, the training set principal component matrix is obtained by the following method:
methylation level matrix of training set samplesXAs an input file, performing principal component analysis and outputting a corresponding rotation matrixVAnd training set principal component matrixY(ii) a Calculating each principal componentY p Contribution to the total varianceλ p And before calculationmThe total contribution rate of the principal components is extracted to exceedtFront ofmA number of principal components into a list of principal components,
Figure 199311DEST_PATH_IMAGE008
(ii) a Output rotation matrixVTraining set principal component matrixYAnd a list of principal components.
Preferably, the principal component matrix of the sample to be detected is obtained by the following method:
methylation level matrix of sample to be testedXAnd the rotation matrixVAs an input file, performing principal component analysis and outputting a corresponding principal component matrix of the sample to be detectedY
Preferably, the methylation model construction by using the training set principal component matrix is carried out by the following steps:
firstly, a principal component matrix of a training set is extracted according to the principal component listYBefore all samples inmPrincipal component to new matrixY'New matrix will be formedY'As the input characteristic of the support vector machine model, modeling is carried out by adopting a bagging method to obtain a model based onKIntegrated model established by weak classifiersMModel (C)MIn the method, the weight of each weak classifier is equal, and the final predicted value isKThe average of the weak classifier predictors.
Preferably, the prediction of ctDNA in the sample to be tested is performed by the following steps:
according to the trained modelMExtracting principal component matrix of sample to be tested according to the principal component listYBefore all samples to be testedmPrincipal component to new matrixY'And will beY'Input to the trained modelMAnd obtaining the predicted value of each sample to be tested.
In a second aspect, the present invention provides a ctDNA prediction apparatus based on whole genome methylation sequencing, the prediction apparatus comprising:
a methylation block partitioning module: dividing a genome into blocks with equal length according to a human reference genome FASTA file and a BED file, and generating a block list file capable of calculating methylation levels;
block methylation level calculation and correction module: calculating the methylation level of the training set sample and the sample to be tested on each block;
a methylation block feature dimension reduction module: reducing the dimensionality of the methylation levels of the training set samples and the samples to be detected, calculating a rotation matrix according to the training set samples by using a principal component analysis method, and generating a training set principal component matrix and a sample principal component matrix to be detected by using the rotation matrix;
methylation model construction module: carrying out methylation model construction by utilizing a training set principal component matrix, and training a model;
methylation model prediction module: and predicting ctDNA in the sample to be tested by using the trained model.
In the present invention, the input file of the methylation block partitioning module includes: human reference genomic sequence FASTA files and reference genomic chromosome BED files.
The methylation block division module divides the human reference genome according to input parametersLDividing the corresponding block length to obtain blocksR i And filtering according to the CpG site number and the proportion of the multiple alignment areas and outputting.
In the present invention, the output file of the methylation block dividing module comprises: a block list file, which is preserved after two-step filteringIGreat-length blockR 1 ,⋯,R I }。
Preferably, in the methylation block partitioning module, the partitioning of the genome into blocks of equal length is performed by the following steps:
dividing the genome into blocks with equal length according to a FASTA file and a BED file of the human reference genome, and filtering according to the position information of all CpG sites in the whole genome range and the proportion of multiple alignment regions to generate a block list capable of calculating the methylation level.
In the present invention, the input file of the block methylation level calculation and correction module includes: sample(s)jThe method comprises the steps of reordering Bam files and block list files generated by a methylated block partitioning module, wherein the samples comprise training set samples and samples to be detected;
in the present invention, the output file of the block methylation level calculation and correction module includes: a methylation level matrix, the methylation level matrixXIs oneILine ofJA matrix of columns, where rows represent blocks and columns represent samples.
Preferably, in the block methylation level calculation and correction module, the calculation method of the methylation level comprises:
respectively obtaining capture sequencing FASTQ files of a training set sample and a sample to be tested, and performing sequence comparisonRespectively generating training set samples and original Bam files of samples to be detected, sequencing and marking the original Bam files repeatedly, respectively generating training set samples and sequenced and de-duplicated Bam files of samples to be detected, and recording the sequenced and de-duplicated Bam files as samplesjRemoving the reordered Bam file;
mixing the samplejCalculating samples by using the reordered Bam file and the block list capable of calculating the methylation level as input filesjRecorded in a list fileIGreat-length blockR 1 ,⋯,R I Corrected methylation levels on (v), and combining corrected methylation levels of all samples to oneILine ofJMethylation level matrix of columnsX
Preferably, the training set samples are plasma samples of a normal population and plasma samples of a patient; the sample to be detected is a plasma sample of a detected person;
preferably, the calculation of the methylation level further comprises the step of subjecting the sample tojIn a blockR i Upper fragment setS 1 ,⋯,S N Each fragment inS n Cutting off the front of the beginning of sequence 2 on each fragment XM tagAInformation of individual bases.
Preferably, the samples are calculatedjRecorded in a list fileIGreat-length blockR 1 ,⋯,R I Methods for correcting the level of post-methylation on (H) } include:
the first scheme is as follows: calculating block conversion efficiency according to the state of cytosine on the non-methylation site, and correcting the block methylation level by using the block conversion efficiency;
or scheme two: calculating the fragment conversion efficiency according to the state of cytosine on the non-methylation site of each original gene fragment, and calculating the methylation level of the block after filtering the low-conversion-rate fragment.
Preferably, in the first embodiment, the formula for calculating the methylation level is as follows:
Figure 472161DEST_PATH_IMAGE010
wherein,beta_corrected i,j is a samplejIn a blockR i Corrected average methylation level of;
beta i,j is a samplejIn a blockR i Average methylation level of;
BS i,j is a samplejIn a blockR i Average conversion efficiency of (2).
Preferably, in the second scheme, the calculation formula of the methylation level is as follows:
Figure 420657DEST_PATH_IMAGE011
Figure 968313DEST_PATH_IMAGE012
wherein,MFR i,j is a samplejIn a blockR i The ratio of methylated fragments of (a);
UFR i,j is a samplejIn a blockR i The fraction of unmethylated fragments of (a);
N convertedi,j is composed of
Figure 935131DEST_PATH_IMAGE004
The statistical number of segments of (a) is,BS i,j,n is a samplejIn a blockR i Upper segmentnThe conversion efficiency of (a);
N meth i,j is the statistical number of methylated fragments;
N unmeth i,j is the statistical number of unmethylated fragments.
Preferably, the criteria for the methylated fragments and unmethylated fragments are:
Figure 175620DEST_PATH_IMAGE013
the fragment of (a) is judged to be a methylated fragment,
Figure 844368DEST_PATH_IMAGE006
is judged to be a non-methylated fragment,
Figure 461294DEST_PATH_IMAGE014
in the present invention, the input file of the methylation block feature dimension reduction module comprises: training set sample methylation level matrixX(ii) a The output file includes: training set principal component matrixYPrincipal component list and rotation matrixV
Or, the input file of the methylated block feature dimension reduction module includes: methylation level matrix of sample to be testedXRotation matrixV(ii) a The output file includes: principal component matrix of sample to be testedYAnd a principal component list.
Preferably, in the methylated block feature dimension reduction module, the training set principal component matrix is obtained by the following method:
methylation level matrix of training set samplesXAs an input file, performing principal component analysis and outputting a corresponding rotation matrixVAnd training set principal component matrixY(ii) a Calculating each principal componentY p Contribution to the total varianceλ p And before calculationmThe total contribution rate of the principal components is extracted to exceedtFront ofmA plurality of principal components into a list of principal components,
Figure 282619DEST_PATH_IMAGE008
(ii) a Output rotation matrixVTraining set principal component matrixYAnd a list of principal components.
The calculation formula of the cumulative contribution rate is shown in formula 8:
Figure 959588DEST_PATH_IMAGE015
preferably, in the methylation block feature dimension reduction module, the principal component matrix of the sample to be detected is obtained by the following method:
methylation level matrix of sample to be testedXAnd the rotation matrixVAs an input file, performing principal component analysis and outputting a corresponding principal component matrix of the sample to be detectedY
In the present invention, the input file of the methylation model building module comprises: training set principal component matrixYThe method comprises a main component list and a sample list, wherein the sample list needs to contain grouping information of all training set samples, the sample list comprises sample names and sample grouping two columns, and each row records one sample.
In the present invention, the output file of the methylation model building module comprises: trained modelMAnd predicting probability results of the training set.
Preferably, in the methylation model building module, the methylation model building by using the training set principal component matrix is performed by the following steps:
firstly, a principal component matrix of a training set is extracted according to the principal component listYBefore all samples inmPrincipal component to new matrixY'New matrix will be formedY'As the input characteristic of the support vector machine model, modeling is carried out by adopting a bagging method to obtain a model based onKIntegrated model established by weak classifiersMModel (C)MIn the method, the weight of each weak classifier is equal, and the final predicted value isKThe average of the weak classifier predictors.
In the present invention, the input file of the methylation model prediction module comprises: principal component matrix of sample to be testedYTrained modelMAnd a sample list to be tested, wherein the sample list file to be tested is used as an input file, and all samples to be tested need to be contained.
In the present invention, the output file of the methylation model prediction module comprises: the prediction value of the sample to be tested, which is a value between 0 and 1, represents the probability that the sample is derived from the patient's plasma group, i.e. the result of the probability that plasma contains ctDNA.
Preferably, in the methylation model prediction module, the prediction of ctDNA in the sample to be tested is performed by the following steps:
according to the trained modelMExtracting principal component matrix of the sample to be tested according to the principal component listYBefore all samples to be testedmA principal component to a new matrixY'And will beY'Input to the trained modelMAnd obtaining the predicted value of each sample to be tested.
In the present invention, the key points of the ctDNA detection device based on whole genome methylation sequencing are as follows:
(1) capturing genome-wide fluctuations in methylation levels for low-depth genome-wide methylation sequencing designs
Compared with high-depth capture sequencing, the low-depth whole genome methylation sequencing method has the advantages that the experimental operation is simpler and more convenient, the sequencing cost is not high, the coverage area is more complete, the problem of signal omission caused by too few capture intervals can be avoided, and the methylation level in the whole genome range can be better reflected.
(2) Methylation levels were calculated in equal-length blocks
The invention carries out scheme design aiming at the data characteristics of the whole genome methylation sequencing, and considers that the depth of the whole genome methylation sequencing is low, and the accuracy of calculating the methylation level of a single methylation site or a shorter interval is lower under the low depth. According to the method, the regions in the whole genome range are divided into blocks according to the same length, the coverage area of each block is wider, the number of methylation sites is larger, the requirement on the depth is reduced by increasing the number of the sites, and the defect of low accuracy rate caused by low depth in the traditional analysis method is overcome. In addition, compared with the calculation of the methylation level of a single methylation site, the block methylation level calculation speed is higher, the number of features is reduced, and the time cost is greatly saved.
(3) Removing blocks containing multiple comparison regions to reduce adverse effects of comparison errors
When the equilong blocks are divided, the invention also carries out statistics on the length of the multiple comparison area on each block, removes the blocks with over high occupation ratio of the multiple comparison area, reduces the noise caused by comparison errors, influences the methylation level calculation and the model construction, and improves the accuracy of the model.
(4) Excision of the jagged ends for double-stranded library construction
The block methylation level calculation and correction module considers the influence of the sawtooth tail end when calculating the methylation level, cuts off the base of the repair and alignment end of each fragment in advance, avoids the underestimation of the methylation level caused by the repair step, improves the calculation accuracy of the methylation level and improves the sensitivity of the model.
(5) Various protocols are provided for correcting methylation levels based on conversion efficiency
The block methylation level calculation and correction module provided by the invention additionally considers the influence of the conversion efficiency on the methylation level on the basis of the traditional methylation level calculation mode, and provides two solutions. The scheme corrects the block average methylation level, and counteracts a part of influence of conversion efficiency on the calculation of the methylation level; and in the second scheme, the low-conversion-rate fragments are filtered, so that the influence of the conversion efficiency on judging the methylation state of the fragments is reduced. Both schemes improve the accuracy of methylation level calculation and the sensitivity of the model.
(6) Principal component analysis and dimension reduction, model construction time reduction and over-fitting avoidance
Compared with common methylation data analysis methods and models, the method does not perform differential screening on methylation regions, does not distinguish healthy human plasma from patient plasma, performs feature dimension reduction by using principal component analysis in a methylation block feature dimension reduction module, extracts principal components with the strongest difference interpretability in all samples to model, can avoid the problem of overfitting to a certain extent, and greatly shortens the time for model construction and prediction.
(7) The plasma sample is used for training the support vector machine, the model has good performance, and the prediction result is easy to interpret
In the invention, healthy human plasma and patient plasma are used as training sets in a methylation model building module, principal components obtained after principal component analysis dimensionality reduction are used as variables, a support vector machine integration model is trained, and the similarity between a sample to be tested and the healthy human or patient plasma is quantified. The finally obtained model has high sensitivity and specificity and strong generalization capability. The predicted value output by the model represents the probability that the sample to be tested contains ctDNA, and whether the sample to be tested contains ctDNA can be judged according to the threshold value obtained in the training set.
In a third aspect, the present invention provides a ctDNA prediction method based on whole genome methylation sequencing in the first aspect and/or a ctDNA prediction model based on whole genome methylation sequencing in the second aspect, and an application of the ctDNA prediction method based on whole genome methylation sequencing in preparation of a ctDNA detection product.
In a fourth aspect, the present invention provides a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the ctDNA prediction method based on whole genome methylation sequencing of the first aspect when executing the computer program.
In a fifth aspect, the present invention provides a computer readable storage medium storing a computer program, which when executed by a processor, implements the steps of the whole genome methylation sequencing-based ctDNA prediction method of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
the invention carries out targeted optimization on the methylation level calculation of the low-depth whole genome methylation sequencing data of the plasma sample: by cutting off the sawtooth tail end and using the corrected block methylation level, the real methylation level of the sample in the whole genome range can be quickly and accurately calculated; on the basis of optimizing the calculation of the methylation level, the methylation level in the whole genome range is used for extracting principal components for dimensionality reduction, and a support vector machine integration model is constructed according to a small number of principal components, so that the probability of ctDNA existing in a plasma sample to be detected can be predicted more quickly and accurately, and overfitting is avoided. The invention was validated using a retrospective cohort of low-depth plasma whole genome methylation sequencing data. The test results show that at low depth, the invention can obtain higher sensitivity and specificity in the test set.
Drawings
Fig. 1 is a main flow diagram of the present invention.
FIG. 2 is a detection ROC curve for lung cancer in the test set.
FIG. 3 is a ROC curve for detection of colorectal cancer in the test set.
FIG. 4 is a detection ROC curve for pancreatic cancer in the test pool.
FIG. 5 is a detection ROC curve for gastric cancer in the test set.
FIG. 6 is a ROC curve for detection of liver cancer in the test set.
FIG. 7 is a detection ROC curve for breast cancer in the test set.
FIG. 8 is a detection ROC curve for esophageal cancer in the test pool.
FIG. 9 is a detection ROC curve for ovarian cancer in the test panel.
FIG. 10 is a detection ROC curve for a subset of cancer species tested.
Detailed Description
The technical solution of the present invention is further explained by the following embodiments. It should be understood by those skilled in the art that the examples are only for the understanding of the present invention and should not be construed as the specific limitations of the present invention.
The examples do not show the specific techniques or conditions, according to the technical or conditions described in the literature in the field, or according to the product specifications. The reagents or apparatus used are conventional products commercially available from normal sources, not indicated by the manufacturer.
Detailed Description
The technical principle of the invention is as follows: the method aims at the characteristics of the whole genome methylation sequencing data, divides the whole genome into blocks with equal length, and calculates the block methylation level aiming at each block after removing the blocks with poor comparison quality. In order to overcome the influence of the jagged ends on the calculation of the methylation level, the invention specially aims at the excision of the jagged ends of a double-stranded library-building and double-end sequencing sample.
To overcome the effect of conversion efficiency on methylation levels, the present invention provides two approaches in calculating block methylation levels:
the first scheme is as follows: calculating block conversion efficiency according to the state of cytosine on the non-methylation site, and correcting the block methylation level by using the block conversion efficiency;
scheme II: calculating the fragment conversion efficiency according to the state of cytosine on the non-methylation site of each original gene fragment, and calculating the methylation level of the block after filtering the low-conversion-rate fragment.
After the corrected block methylation level is obtained, the method uses a group of healthy human plasma samples and a group of tumor patient plasma samples as training sets, uses a principal component analysis method to reduce the dimension, and uses a small number of features after the dimension reduction as input variables of a support vector machine to establish a methylation model.
And calculating the block methylation level and dimension reduction after correction for each sample to be detected in the same way as the training set, predicting the sample by using a methylation model, obtaining the probability of ctDNA in the sample to be detected, and prompting the possibility of ctDNA in the sample.
The implementation process of the invention comprises the following steps: the main flow diagram of the solution of the invention is shown in fig. 1.
The technical scheme of the invention mainly comprises the following steps:
(1) acquiring capture sequencing FASTQ files of plasma samples of normal people (subsequently expressed as baseline samples) and plasma samples of detected persons (subsequently expressed as samples to be detected), and performing sequence comparison by using a genome comparison tool Bismark to generate original Bam files;
(2) sequencing and marking repeated processing are carried out on the original Bam file by utilizing SAMtools and a Bismark tool, and the sequenced and deduplicated Bam file is generated;
(3) by utilizing the methylation block dividing module, a genome is divided into blocks with equal length according to a human reference genome FASTA file and a BED file, the blocks with too few CpG sites and poor comparison quality are removed, and a block list capable of calculating the methylation level is generated;
(4) the block methylation level calculation and correction module of the invention is utilized to calculate the corrected methylation levels of the training set samples and the samples to be detected on each block. In the calculation process, the invention eliminates the influence of terminal sawtooth, and corrects the methylation level according to the conversion efficiency. The corrected calculation results are two matrices: correcting the block methylation level matrix of the training set sample and correcting the block methylation level matrix of the sample to be detected;
(5) the method comprises the steps of utilizing a methylation block characteristic dimension reduction module to reduce the dimension of a block methylation level after correction of a training set sample, using a principal component analysis method to calculate a rotation matrix according to the training set sample, and utilizing the rotation matrix to generate a training set principal component matrix and a to-be-detected sample principal component matrix;
(6) training the model according to the principal component matrix of the training set sample by using the methylation model construction module to construct a methylation model;
(7) by utilizing the methylation model prediction module, the methylation model is used for predicting the sample to be detected, and the probability score of the ctDNA contained in the sample to be detected is finally given, wherein the probability score can reflect whether the methylation level of the sample to be detected is abnormal or not and prompts whether the ctDNA exists in the plasma or not.
The technical scheme of the invention mainly comprises the following modules:
(1) methylation block division module
This module uses as input files the human reference genomic sequence FASTA file and the reference genomic chromosome BED file (recording the start and stop positions of each chromosome on the genome), and requires that the block length be provided as an input parameterL. Dividing the human reference genome according to a certain block length, filtering according to a certain rule, and outputting a block list for recording block chromosomes, start and end position informationA file.
First, using a human reference genome sequence FASTA file, positional information of all CpG sites within the whole genome is extracted.
The FASTA file was base-converted twice separately: converting cytosine (C) into thymine (T) at one time, keeping other bases unchanged, and recording as a reference sequence A; guanine (Guanine, G) is converted into Adenine (Adenine, a) at another time, and other bases are kept unchanged and recorded as a reference sequence B. Counting the number of occurrences of every 100 base sequences in the whole reference sequence A (or reference sequence B) by sliding with 100 base units, and recording the position of the region where multiple alignment phenomena (window sequences occur more than 2 times) exist.
The autosomes on the whole genome are divided into a plurality of blocks with equal length according to the input block length.
For each blockR i Counting blocks according to the position information of all CpG sites in the whole genomeR i Number of CpG sites covered above. If it isR i If the number of CpG in (A) exceeds 100, the number of CpG in (A) is increasedR i Is reserved otherwiseR i Is removed.
For retentionR i Then, according to the position of the region with multiple comparison phenomena, statistics is performedR i The proportion of the above multiple alignment regions. If the ratio is less than 5%, thenR i Is reserved otherwiseR i Is removed.
After two-step filtration, the filtrate is retainedIGreat-length blockR 1 ,⋯,R I The chromosomal, start and stop positions of each block will be according to "chromosome: the format of start-stop is output in list form.
(2) Block methylation level calculation and correction module
The module uses the samplejReordered Bam files (alignment software)Requires using BisMark comparisons) and the block list file generated by the inventive methylated block partitioning module as input files, sample calculation is performed according to a specific methodjRecorded in a list fileIGreat-length blockR 1 ,⋯,R I Corrected methylation levels on.
In calculating samplesjIn each blockR i At the methylation level of (A), first from the samplejExtracting and comparing blocks from the Bam fileR i And combining two sequences derived from the same original gene fragment (i.e., sequence 1 and sequence 2, generated by paired-end sequencing and recording the sequence of the same original gene fragment): the XM tags of sequence 1 and sequence 2 were combined into fragment XM tags according to their alignment to the start and stop positions on the reference genome.
The XM tag is a tag which is simultaneously output to a Bam file when the BisMark performs alignment, and records the base state of a sequence at each cytosine (C) position. Depending on the downstream bases, the distribution of C over the genome comprises three forms: CpG, CHH and CHG. Wherein CpG represents that one base downstream of C is Guanine (Guanine, G), CHH represents that two bases downstream of C are both H (bases except Guanine), and CHG represents that two bases downstream of C are H and G. The different forms of C differ in the letter references, the upper case representing the base of the sequence at the corresponding position as C, the lower case representing the base at the corresponding position as thymine (T): c at CpG is represented by Z or Z, C at CHH by H or H, C at CHG by X or X.
To reduce the number of fragments of the sawtooth-end phenomenon pairs in the double-stranded library constructionS n Methylation level ofbeta i,j,n Fragment, fragmentS n Transformation efficiency ofBS i,j,n Block, blockR i Average methylation level ofbeta i,j And blockR i Average conversion efficiency ofBS i,j The resulting effect is calculated for each pair of sequences 1And cutting off the front of one end which is possibly filled in the library building process by using the fragment XM label generated by combining the sequence 2AInformation of individual bases.
For a set of segments over a blockS 1 ,⋯,S N Each fragment inS n Calculating fragments from the CpG tag after base removalS n Methylation level ofbeta i,j,n
Figure 866364DEST_PATH_IMAGE016
From the base-removed CHH and CHG tags, fragments were countedS n Transformation efficiency ofBS i,j,n
Figure 21402DEST_PATH_IMAGE017
WhereinZ n z n H n h n X n Andx n respectively represent a fragmentS n Number of labels above Z, Z, H, H, X and X.
According to the samplejIn a blockR i Last allS 1 ,⋯,S N }, calculate samplesjIn a blockR i Average methylation level ofbeta i,j
Figure 710616DEST_PATH_IMAGE018
And a samplejIn a blockR i Average conversion efficiency ofBS i,j
Figure 558486DEST_PATH_IMAGE019
Sample(s)jIn a blockR i Corrected average methylation level ofbeta_corrected i,j Can be based onbeta i,j AndBS i,j and (3) calculating:
Figure 952559DEST_PATH_IMAGE020
in addition, the block methylation level can be calculated by calculating samplesjIn a blockR i Ratio of methylated fragments ofMFR i,j And fraction of unmethylated fragmentsUFR i,j And (4) obtaining.
In the calculation, firstly, the conversion rate higher than the threshold value (taking 0.95 as an example, meeting the requirement of the threshold value) is selected
Figure 911287DEST_PATH_IMAGE021
) The statistical number of which is recorded asN convertedi,j . Then according toN convertedi,j Of segments of stripsbeta i,j,n Whether or not it is higher than the threshold valuebeta methy
Figure 441626DEST_PATH_IMAGE022
Taking the 0.9 threshold as an example, namely, satisfying
Figure 709665DEST_PATH_IMAGE023
) Determining whether each fragment is a methylated fragment or not, or is below a thresholdbeta unmethy
Figure 591033DEST_PATH_IMAGE024
Take a threshold of 0.1 as an example, that is, satisfy
Figure 353453DEST_PATH_IMAGE025
) Judging whether the fragment is an unmethylated fragment. The number of methylated fragments is recorded asN meth i,j The number of unmethylated fragments is recorded asN unmeth i,j
Finally, the samplejIn a blockR i The methylation fragment ratio formula is as follows:
Figure 3877DEST_PATH_IMAGE026
sample(s)jIn a blockR i The ratio of unmethylated fragments of (A) is as follows:
Figure 927971DEST_PATH_IMAGE027
for each samplejThe final output corrected block methylation level is one length ofIThe vector of (2). If an average methylation level is used, then the output:
Figure 47368DEST_PATH_IMAGE028
if the methylated fragment fraction is used, the following is output:
Figure 347899DEST_PATH_IMAGE029
if a proportion of unmethylated fragments is used, the output is:
Figure 118409DEST_PATH_IMAGE030
merging the corrected block methylation levels of all samples to oneILine ofJMatrix of columnsXWherein, the rows represent the blocks, the columns represent the samples, and the samples are written into the file and output to the file.
(3) Methylation block feature dimension reduction module
When the methylation block feature dimension reduction module is used for the first time, a methylation level matrix output by the block methylation level calculation and correction module needs to be providedXThe file serves as an input file. All subsequent building blocks need to be included in the matrixTraining set samples of type.
To pairIAnalyzing the main component of the methylation level of each block, and recording the corresponding rotation matrixVAnd each training set sample isPPrincipal component matrix of values on the principal componentsY
In obtaining a rotation matrixVAnd a principal component matrixYThen, each principal component is calculatedY p Contribution to the total varianceλ p And before calculationmThe cumulative contribution ratio of each principal component, and the calculation formula of the cumulative contribution ratio is formula 8. Record the cumulative contribution rate of extraction exceedst
Figure 213404DEST_PATH_IMAGE031
Front ofmA principal component into a principal component list.
Figure 69364DEST_PATH_IMAGE032
The methylation block feature dimension reduction module outputs a rotation matrix when being used for the first timeVPrincipal component matrixYAnd a principal component list.
When obtaining the rotation matrixVThereafter, the module can be subsequently re-run at the time of inputXAdditionally providing rotation matrix at the same time of matrix fileVA file. At this time, the module can be directly usedVArray methylation levelXConversion into principal component matrixY. At this time, the module outputs only the principal component matrixYInto an output file.
(4) Methylation model building block
The module uses principal component matrix output by the methylation block feature dimension reduction moduleYAnd the main component list and the sample list are used as input files. In the matrixYIt is necessary to contain all training set samples used to construct the model, including healthy human plasma and patient plasma. The principal component list contains the topmThe name of each principal component specifies the number of principal components used. The sample list needs to contain grouping information of all training set samples, including sample name, two columns of sample grouping, eachThe line records one sample.
Before training the model, the principal component matrix is extracted firstYFront of all training set samplesmPrincipal component to new matrixY'. Will new matrixY'As input features for the support vector machine model.
During modeling, a bagging method is adopted, and the modeling is respectively carried outKRandom sampling with playback, buildKA support vector machine weak classifier. And performing parameter optimization, training and prediction on each weak classifier independently. The final model is based onKIntegrated model established by weak classifiersM. Model (model)MIn the method, the weight of each weak classifier is equal, and the final predicted value isKThe average of the weak classifier predictors.
And when the model is trained, the methylation model building module predicts each training set sample and records a corresponding predicted value. The predictive value is a value between 0 and 1 representing the probability that the sample originates from a patient plasma group, i.e. the probability that plasma contains ctDNA.
The methylation model construction module finally outputs a trained modelMAnd the prediction probability result of the training set is sent to a file. Wherein the modelMThe method is characterized in that a specific format is adopted to store the data as a binary file, the prediction probability of a training set is a table containing three columns, the three columns are respectively a sample name, a group to which the sample belongs and the sample prediction probability, and information of a training set sample is recorded in each row.
(5) Methylation model prediction module
The methylation model prediction module uses the principal component matrix output by the methylation block feature dimension reduction moduleYModel output by file and methylation model building moduleMAnd the list file of the samples to be detected is used as an input file, and all the samples to be detected need to be contained in the matrix.
The methylation model prediction module can be based on the modelMExtracting principal component matrix from the recorded model variablesYBefore all samples to be testedmPrincipal component to new matrixY'And will beY'Input into the modelMAnd obtaining the predicted value of each sample to be tested. The prediction value is between 0 and 1A value representing the probability that the sample was derived from the patient's plasma group, i.e. the probability that the plasma contains ctDNA.
And the methylation model prediction module finally outputs the prediction probability result of the sample to be detected to a file. The file contains two columns of tables, namely a sample name and a sample prediction probability, and each row records information of a sample to be detected.
The technical improvement points of the invention are as follows: the methylation level is calculated by using the block as a unit, so that the calculation efficiency is improved, and the calculation of the methylation level in the low-depth sample is more stable; the sawtooth tail end is cut off, and the influence of tail end repair on methylation level calculation is reduced; correcting according to the conversion efficiency, and improving the accuracy of the calculation of the plasma methylation level; before modeling, feature dimensionality reduction is carried out, overfitting is avoided, and modeling and prediction speeds are improved; and the sensitivity and specificity are higher by using a support vector machine integration model.
497 healthy human plasma without cancer history and 801 plasma of various cancer patients with different stages are selected retrospectively for verification of technical scheme and randomly grouped into training set and verification set. The cancer types of the patient include breast cancer, colorectal cancer, esophageal cancer, gastric cancer, liver cancer, lung cancer, pancreatic cancer and ovarian cancer.
The training set included 352 healthy people and 561 cancer patients (46 breast, 105 colorectal, 43 esophageal, 80 stomach, 79 liver, 110 lung, 83 pancreatic, 15 ovarian).
The validation set included 145 healthy people and 240 cancer patients (20 breast, 45 colorectal, 18 esophageal, 34 gastric, 34 liver, 47 lung, 36 pancreatic, 6 ovarian).
Example 1 procedure for verifying technical solutions
The experimental procedure is as follows:
1. plasma cfDNA extraction
1.1 Each subject 10mL whole blood in Kangshi EDTA blood collection tube, at 4 degrees C and 1600g speed centrifugal 10min to make plasma, blood cells layer. The supernatant plasma was transferred to a fresh centrifuge tube and centrifuged again at 12000rpm at 4 ℃ for 15min to remove cell debris. 4mL of plasma was obtained and stored frozen at-80 ℃ until use.
1.2 plasma samples were thawed and 15. mu.L Proteinase K (Proteinase K, 20mg/mL, theroscientific cat # EO 0492) and 50. mu.L SDS (20%) were added to each 1mL sample. The plasma volume was less than 4mL and was made up with PBS.
1.3 turn over and mix evenly, incubate 20min at 60 ℃, then ice-wash for 5 min.
1.4 extraction of cfDNA Using MagMAX Cell-Free DNA Isolation kit (thermosientific cat # A29319).
1.5 extracted concentration and quality of cfDNA was measured using Bioanalyzer 2100 (Agilent Technologies).
2. cfDNA library construction
Using the methylation library construction Kit NEBNext Enzymatic Methyl-seq Kit (NEB, cat # E7120), 5-methylcytosine (5-mC) was converted to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC) by TET2 enzyme at a starting amount of 5-30ng cfDNA, and unmethylated cytosine (C) was deaminated to uracil (U) by APOBEC enzyme, followed by amplification and pooling.
The specific library construction process is as follows:
2.1 preparation of internal reference
50 μ L of CpG fully methylated pUC19 DNA and 50 μ L of CpG fully unmethylated lambda DNA were mixed and added to a 100 μ L disruption tube and disrupted using an M220 disruptor (Covaris). When constructing a library, 0.001ng of pUC19 DNA and 0.02 ng of lambda DNA were added to cfDNA to be tested.
2.2 preparation of cfDNA samples
The initial amount of cfDNA sample was 5-30ng, with no interruption required.
2.3 end repair
2.3.1 mixing on ice, the reaction system is shown in Table 1;
TABLE 1
Figure 157275DEST_PATH_IMAGE033
The Enzyme mixture described in Table 1 is NEBNext Ultra II End Prep Enzyme Mix.
2.3.2 the reaction system was placed on a PCR instrument and the end-point repair reaction was carried out under the conditions shown in Table 2.
TABLE 2
Figure 782291DEST_PATH_IMAGE034
2.4 connection adapter (Joint)
2.4.1 working on ice, the following components were added to the 60. mu.L reaction system of the previous step as shown in Table 3;
TABLE 3
Figure 48188DEST_PATH_IMAGE035
Incubate at 2.4.220 deg.C for 15 min.
2.5 post ligation purification
2.5.1 after the reaction of the previous step is completed, the Sample is taken out, 110. mu.L of NEBNext Sample Purification Beads are added, and the mixture is immediately blown and mixed by using a pipette.
2.5.2 incubate for 5min at room temperature.
2.5.3 the centrifuge tube is placed on a magnetic frame for 5min to clarify the liquid, and the supernatant is discarded.
2.5.4 Add 200. mu.L of Ready-made 80% ethanol, incubate for 30 s and discard. The 200 μ L80% ethanol wash step was repeated once.
2.5.5 sucking out residual ethanol from the bottom of the centrifuge tube with 10 μ L pipette, and drying at room temperature for 3-5min until ethanol is completely volatilized.
2.5.6 the tube was removed from the magnetic frame, 29. mu.L of Elution Buffer was added, mixed well by shaking, and incubated at room temperature for 1 min.
2.5.7 centrifuging for a short time, placing the centrifuge tube on a magnetic frame for 3 min to clarify the liquid, and placing 28 μ L into a new PCR tube.
2.65-Methylcytosine and 5-hydroxymethylcytosine Oxidation reactions
The following reaction procedure was carried out using NEBNext enzymic Methyl-seq Kit (NEB, cat # E7120).
2.6.1 TET2 Reaction Buffer Supplement Dry powder 400. mu.L TET2 Reaction Buffer was added and mixed well.
2.6.2 on ice the following components were added to 28. mu.L of adaptor-linked DNA described above as shown in Table 4:
TABLE 4
Figure 391444DEST_PATH_IMAGE036
2.6.3A 500 mM Fe (II) solution was diluted 1: 1250. The prepared Fe (II) solution is added into the product mixed in the previous step, and the dosage of each component is shown in the table 5.
TABLE 5
Figure 33778DEST_PATH_IMAGE037
Mix well and incubate at 37 ℃ for 1 h.
2.6.4 after the reaction is complete, move to ice and add 1. mu.L of Stop Reagent and mix well. The amounts of the components used are shown in table 6.
TABLE 6
Figure 778880DEST_PATH_IMAGE038
2.6.5 the mixed samples of 2.6.4 were incubated under the conditions shown in Table 7.
TABLE 7
Figure 981059DEST_PATH_IMAGE039
2.7 post-Oxidation purification
2.7.1 after the reaction of the previous step is completed, the Sample is taken out, 90. mu.L of NEBNext Sample Purification Beads are added, and the mixture is immediately blown and mixed by using a pipette.
2.7.2 incubate for 5min at room temperature.
2.7.3 placing the centrifuge tube on a magnetic frame for 5min to clarify the liquid, and discarding the supernatant.
2.7.4 mu.L of freshly prepared 80% ethanol was added, incubated for 30 s and discarded. The 200 μ L80% ethanol wash step was repeated once.
2.7.5 sucking out residual ethanol from the bottom of the centrifuge tube with 10 μ L pipette, and drying at room temperature for 3-5min until ethanol is completely volatilized.
2.7.6 the tube was removed from the magnetic stand, 17. mu.L of Elution Buffer (Elution Buffer) was added, and the mixture was shaken and mixed. Incubate at room temperature for 1 min.
2.7.7 centrifuging for a short time, placing the centrifuge tube on a magnetic frame for 3 min to clarify the liquid, and placing 16 μ L into a new PCR tube.
2.8 DNA denaturation
2.8.1 fresh 0.1M NaOH was prepared.
2.8.2 preheating the PCR apparatus to a temperature of 50 ℃ in advance.
2.8.3 Add 4. mu.L of 0.1M NaOH to 16. mu.L of purified product from above and mix well.
Incubation was carried out at 2.8.450 ℃ for 10 min.
2.8.5 immediately after the reaction was completed, the mixture was put on ice.
2.9 cytosine deamination
2.9.1 the following components were added to the above denatured DNA of 20. mu.L on ice and mixed well. The amounts of the components used are shown in Table 8.
TABLE 8
Figure 811611DEST_PATH_IMAGE040
2.9.2 the reaction was stopped by incubation on a PCR instrument at 37 ℃ for 3h and then switched to 4 ℃.
2.10 post-deamination purification
2.10.1 after the reaction in the previous step is completed, the Sample is taken out, 100. mu.L of magnetic Beads (NEBNext Sample Purification Beads) are added, and the mixture is immediately pipetted and mixed.
2.10.2 incubate for 5min at room temperature.
2.10.3 placing the centrifuge tube on a magnetic frame for 5min to clarify the liquid, and discarding the supernatant.
2.10.4 mu.L of freshly prepared 80% ethanol was added, incubated for 30 s and discarded. The 200. mu.L 80% ethanol wash step was repeated once.
2.10.5 sucking out residual ethanol from the bottom of the centrifuge tube with 10 μ L pipette, and drying at room temperature for 3-5min until ethanol is completely volatilized.
2.10.6 the tube was removed from the magnetic stand, 21. mu.L of Elution Buffer (Elution Buffer) was added, mixed well with shaking, and incubated at room temperature for 1 min.
2.10.7 centrifuging for a short time, placing the centrifuge tube on a magnetic frame for 3 min to clarify the liquid, and placing 20 μ L into a new PCR tube.
2.11 library PCR amplification
2.11.1 the following components were added to 20. mu.L of the DNA deaminated in the above step on ice in the amounts shown in Table 9.
TABLE 9
Figure 992057DEST_PATH_IMAGE041
2.11.2 after mixing well, PCR reactions were carried out on a PCR machine, the procedure of which is shown in Table 10.
TABLE 10
Figure 591666DEST_PATH_IMAGE042
2.12 post PCR purification
2.12.1 after the reaction in the previous step is completed, the Sample is taken out, 45. mu.L of magnetic Beads (NEBNext Sample Purification Beads) are added, and the mixture is immediately pipetted and mixed.
2.12.2 was incubated at room temperature for 5 min.
2.12.3 placing the centrifuge tube on a magnetic frame for 5min to clarify the liquid, and discarding the supernatant.
2.12.4 mu.L of freshly prepared 80% ethanol was added, incubated for 30 s and discarded. The 200 μ L80% ethanol wash step was repeated once.
2.12.5 sucking out residual ethanol from the bottom of the centrifuge tube with 10 μ L pipette, and drying at room temperature for 3-5min until ethanol is completely volatilized.
2.12.6 the tube was removed from the magnetic frame, 21. mu.L of agitation Buffer was added, and the mixture was shaken and mixed. Incubate at room temperature for 1 min.
2.12.7 centrifuging for a short time, placing the centrifuge tube on a magnetic frame for 3 min to clarify the liquid, and placing 20 μ L into a new PCR tube.
2.13 library quantification
The constructed library was quantified using a Qubit high sensitivity reagent (theroscientific cat # Q32854) with a library yield of greater than 400ng for subsequent on-board sequencing.
3. Library sequencing
100ng of the library was mixed with 10% PhiX DNA (Illumina cat # FC-110-3001) and loaded on the machine for PE100 sequencing on the Novaseq 6000 (Illumina) platform.
Example 2 raw letter analysis procedure
1. Processing off-line FASTQ data into Bam files usable by each module
1.1 removing the joint
Calling Trimmomatic-0.36 to align each pair of FASTQ files as paired Reads to hg19 human reference genome sequence, and generating initial Bam files without using other parameter options except M parameter and ID of specified Reads Group.
1.2 alignment
Call Bismark-v0.19.0 align each pair of FASTQ files after linker removal as paired reads to hg19 human reference genomic sequence and Lambda DNA reference genomic sequence, generating the initial Bam file.
1.3 De-weighting
And calling a deduplicate module of Bismark-v0.19.0 to perform deduplication processing on the initial Bam file to generate a deduplicated Bam file.
1.4 ordering tags
And calling a sort module of SAMtools-1.3, sorting the duplicate-removed Bam files, and generating the sorted Bam files. Then, an AddOrRepleReadGroups module of Picard-2.1.0 is called to mark and group the sorted Bam files.
1.5 screening
And calling a clipOverlap module of the BamHutil-1.0.14 to screen the Bam files after the marks are grouped, removing overlapped pair reads, and generating the Bam files. And calling SAMtools-1.3 view to filter the comparison quality of the overlapped Bam files, and generating the Bam files with the filtered comparison quality by adopting 'q 20' as a parameter.
1.6 extraction of mate sequences
In order to keep the data volume of each sample consistent, 30,000,000 pairs of matching sequences are randomly extracted from the Bam file after filtering and aligning the quality of each sample to a new Bam file to serve as a final Bam file.
1.7 building an index
And calling an index module of SAMtools-1.3 to establish an index for the finally generated Bam file, and generating a bai file paired with the finally generated Bam file.
2. Methylation analysis
2.1 partitioning the methylation Block
Using human reference genome sequence FASTA file and reference genome chromosome BED file (recording start and end positions of each chromosome on the genome) as input files of the methylation block partitioning module, and setting block length parametersL=1,000,000。
Firstly, the position information of all CpG sites in the whole genome range is extracted by using a human reference genome sequence FASTA file, and the position information of all regions in which the multiple alignment phenomenon exists in the whole genome range is calculated.
The human reference genome is divided into 2,734 blocks, each block comprising 1,000,000 bases. Counting the number of CpG sites in each block and the length of the area with multiple alignment phenomena, filtering the blocks with the number of CpG sites less than 100 or the total length of multiple alignment areas more than 50000 bases, and finally keeping 1,846 blocks. These blocks will be used for subsequent analysis and saved in a block list file.
2.2 Generation of corrected Block average methylation level matrices for training and test set samples, respectively
And taking the final Bam file of each sample in the training set and the sample in the test set and the block list file generated in 2.1 as block methylation level calculation and correction module input files.
From each training set or test set sample in turnjRespectively extracting each block compared in the Bam from the Bam fileR i And combining XM tags derived from two matched sequences of the same original gene fragment as a fragment XM tag.
The XM tag is a tag which is simultaneously output to a Bam file when the BisMark performs alignment, and records the base state of a sequence at each cytosine (C) position. Depending on the downstream bases, the distribution of C over the genome comprises three forms: CpG, CHH and CHG. Wherein CpG represents that one base downstream of C is Guanine (Guanine, G), CHH represents that two bases downstream of C are both H (bases except Guanine), and CHG represents that two bases downstream of C are H and G. The different forms of C differ in their label letters, with upper case letters representing the base of the sequence at the corresponding position as C, and lower case letters representing the base at the corresponding position as thymine (T): c at CpG is represented by Z or Z, C at CHH by H or H, C at CHG by X or X.
For the samplejIn a blockR i Upper fragment setS 1 ,⋯,S N Each fragment inS n Excising the first 50 bases of the start of sequence 2 on each fragment XM tag to reduce the jagged end phenomenon for the blockR i Average methylation level ofbeta i,j And blockR i Average conversion efficiency ofBS i,j The impact of the calculation.
According to the samplejIn a blockR i Upper fragment setS 1 ,⋯,S N Calculate samples }jIn a blockR i Average methylation level ofbeta i,j
Figure 199365DEST_PATH_IMAGE043
And samplesjIn a blockR i Average conversion efficiency ofBS i,j
Figure 766481DEST_PATH_IMAGE044
WhereinZ n z n H n h n X n Andx n respectively represent a fragmentS n Number of labels above Z, Z, H, H, X and X.
Sample(s)jIn a blockR i Corrected average methylation level ofbeta_corrected i,j Can be based onbeta i,j And withBS i,j And (3) calculating:
Figure 111137DEST_PATH_IMAGE045
for each samplejThe final output corrected block average methylation level is one length ofIThe vector of (a):
Figure 752203DEST_PATH_IMAGE046
the corrected block average methylation levels for all training set samples are combined into a 1,864 row 913 column matrix, and the corrected block average methylation levels for all test set samples are combined into a 1,864 row 385 column matrix, where rows represent blocks and columns represent samples. The two matrixes are respectively output as a training set corrected block average methylation level matrix file and a test set corrected block average methylation level matrix file.
2.3 generating the corrected block methylated segment proportion matrix of the training set and the test set samples respectively
And taking the final Bam file of each sample in the training set and the sample in the test set and the block list file generated in 2.1 as block methylation level calculation and correction module input files.
From each training set or test set sample in turnjRespectively extracting each block compared in the Bam from the Bam fileR i And merging XM tags derived from two matched sequences of the same original gene fragmentAs fragment XM tags.
For the samplejIn a blockR i Upper fragment setS 1 ,⋯,S N Each fragment inS n Cutting off the first 50 bases of the start of sequence 2 on the XM tag of each fragment to reduce the jagged end phenomenon on the fragmentS n Methylation level ofbeta i,j,n Fragment of (A) A. sup. beta. CelastrusS n Transformation efficiency ofBS i,j The impact of the calculation.
Counting fragments based on CpG on XM tag of the fragment after base removalS n Methylation level ofbeta i,j,n
Figure 530803DEST_PATH_IMAGE047
Counting fragments based on CHH and CHG tags on XM tags of fragments after removal of basesS n Transformation efficiency ofBS i,j,n
Figure 70369DEST_PATH_IMAGE048
WhereinZ n z n H n h n X n Andx n respectively represent a fragmentS n Number of labels above Z, Z, H, H, X and X.
In calculating samplesjIn a blockR i Proportion of methylated fragments ofMFR i,j When the conversion rate is higher than the threshold value (taking 0.95 as an example, the conversion rate is satisfied with
Figure 858196DEST_PATH_IMAGE004
) The statistical number of which is recorded asN convertedi,j . Then according toN convertedi,j Strip sheetOf segmentsbeta i,j,n Whether it is higher than the threshold (for example, 0.9 threshold, i.e., satisfied
Figure 166818DEST_PATH_IMAGE049
) And judging whether each fragment is a methylated fragment. The number of methylated fragments is recorded asN meth i,j
Finally, the samplejIn a blockR i Ratio of methylated fragments of
Figure 864122DEST_PATH_IMAGE050
For each samplejThe final output corrected block methylation fragment ratio is a length ofIThe vector of (a):
Figure 625404DEST_PATH_IMAGE051
the corrected block methylation fragment occupancy of all training set samples is combined into a matrix of 1,864 rows and 913 columns, and the corrected block methylation fragment occupancy of all test set samples is combined into a matrix of 1,864 rows and 385 columns, where rows represent blocks and columns represent samples. The two matrixes are respectively output as a block methylated segment proportion matrix file after the training set is corrected and a methylated segment proportion matrix file after the test set is corrected.
2.4 using the training set to reduce the dimension of the methylation block feature to obtain the rotation matrix and the principal component matrix of the training set
And (3) using the block methylated segment proportion matrix file after the training set obtained in the step 2.3 is corrected as an input file of the methylated block feature dimension reduction module. Principal component analysis was performed on the corrected methylation fragment ratios of 1,864 blocks of 913 samples in the training set, yielding 913 principal components in total. This step simultaneously generates a rotation matrixVAnd a 913 rows and 913 columns of training set principal component matricesYAnd generating a training set principal component file. Where the rows are training set samples and the columns are principal components.
At the moment of obtaining rotationMatrix arrayVAnd a principal component matrixYThen, each principal component is calculatedY p Contribution to the total varianceλ p And before calculationmThe cumulative contribution ratio of each principal component, and the calculation formula of the cumulative contribution ratio is formula 8. And recording and extracting the first 66 principal components with the accumulated contribution rate exceeding 95% into a principal component list file. These 66 principal components will be used for the construction of the methylation model.
Figure 216923DEST_PATH_IMAGE052
2.5 obtaining principal component matrix of the test set after dimension reduction
And using the block methylated segment ratio matrix file after the correction of the test set obtained in the step 2.3 and the rotation matrix file obtained in the step 2.4 as input files of the methylated block characteristic dimension reduction module. The corrected methylation fragment ratios of the 1,864 blocks of 385 samples in the test set were converted to 913 principal components. This step simultaneously generates a rotation matrixVAnd a 385 rows 913 columns of test set principal component matrices and generates a test set principal component matrix file. Where the rows are training set samples and the columns are principal components.
2.6 methylation modeling Using principal Components
And using the principal component matrix file and the principal component list file of the training set obtained in the step 2.4 as input files of the methylation model building module, and simultaneously providing a grouping information list of the training set.
First, a principal component matrix is extractedYThe first 66 principal components of all the training set samples in the matrix are transformed into a new matrixY'In (3), the new matrix isY'As an input feature. The training set is converted into two classification variables, namely healthy human 0 and patient 1, which are used as dependent variables of the model.
During modeling, a bagging method is adopted, random sampling with replacement is carried out for 13 times respectively, and 13 weak classifiers of the support vector machine are established. And performing parameter optimization, training and prediction on each weak classifier independently. The final model is an integrated model established based on 13 weak classifiersM. Model (model)MIn each case weakThe weights of the classifiers are equal, and the final predicted value is the average value of the predicted values of the 13 weak classifiers and represents the probability of ctDNA existing in the sample.
And predicting each training set sample while training the model, and recording and outputting the predicted value of each training set sample to a training set prediction result file.
The support vector machine integration model generated in the stepMWill be stored as a binary file for subsequent prediction of test set samples.
2.7 prediction of test set samples with trained models
Using the test set principal component matrix file obtained in 2.5 and the integration model obtained in 2.6MAs an input file for the methylation model prediction module.
First 66 principal components of all training set samples in the training set principal component matrix are extracted into a new matrix, and the new matrix is used as an input feature. Support vector machine integration modelMThe 13 weak classifiers predict each prediction set sample according to 66 main components respectively, and the final prediction value is the average value of the prediction values of the 13 weak classifiers and represents the probability of ctDNA existing in the sample. The predicted values are recorded and output to a prediction set prediction results file.
Under the condition that the specificity of the training set is 95%, the specificity of the kit for healthy people in the test set reaches 97.24%, the detection sensitivity for a single cancer species is shown in Table 11, and the detection sensitivity for all 8 cancer species reaches 72.92%.
TABLE 11
Figure 380051DEST_PATH_IMAGE053
As shown in fig. 2-10, the ROC curve Area (AUC) for detection of a single cancer species in the test set of the present invention reached 84.94% -100%, and the AUC performance reached 93.31% for detection of all 8 cancer species, indicating good cancer detection performance.
In conclusion, the ctDNA detection device of the present invention can quickly, easily and accurately calculate the methylation level in the whole genome range, and evaluate the probability of ctDNA existing in the sample. The detection device can be applied to plasma samples, overcomes the technical defects of inaccurate methylation level calculation and low sensitivity in the conventional low-depth methylation sequencing, and has important application value in early screening or real-time monitoring of tumors.
The applicant declares that the above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and it should be understood by those skilled in the art that any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are within the scope and disclosure of the present invention.

Claims (10)

1. A ctDNA prediction method based on whole genome methylation sequencing is characterized by comprising the following steps:
dividing a genome into blocks with equal length according to a human reference genome FASTA file and a BED file, and generating a block list file capable of calculating methylation levels;
calculating the methylation level of a training set sample and a sample to be detected on each block, reducing the methylation level of the training set sample and the sample to be detected, calculating a rotation matrix according to the training set sample by using a principal component analysis method, and generating a training set principal component matrix and a sample to be detected by using the rotation matrix;
and (3) carrying out methylation model construction by utilizing the training set principal component matrix, and predicting ctDNA in the sample to be tested by using the trained model.
2. The ctDNA prediction method based on whole genome methylation sequencing of claim 1, wherein the dividing of the genome into blocks of equal length is performed by the following steps:
dividing the genome into blocks with equal length according to a FASTA file and a BED file of the human reference genome, and filtering according to the position information of all CpG sites in the whole genome range and the proportion of multiple alignment regions to generate a block list capable of calculating the methylation level.
3. The whole genome methylation sequencing-based ctDNA prediction method according to claim 1, wherein the methylation level calculation method comprises:
respectively obtaining capture sequencing FASTQ files of a training set sample and a sample to be tested, carrying out sequence comparison, respectively generating a training set sample and an original Bam file of the sample to be tested, carrying out sequencing and marking repeated processing on the original Bam file, respectively generating a training set sample and a Bam file of the sample to be tested after sequencing and de-duplicating, and recording as a samplejRemoving the reordered Bam file;
mixing the samplejCalculating samples by using the reordered Bam file and the block list capable of calculating the methylation level as input filesjRecorded in a list fileIGreat-length blockR 1 ,⋯,R I Corrected methylation levels on (v), and combining corrected methylation levels of all samples to oneILine ofJMethylation level matrix of columnsX
The training set samples are plasma samples of normal people and plasma samples of patients; the sample to be detected is a plasma sample of a detected person;
calculating the methylation level also includesjIn a blockR i Upper fragment setS 1 ,⋯,S N Each fragment inS n Cutting off the front of the beginning of sequence 2 on each fragment XM tagAInformation of individual bases;
computing samplesjRecorded in a list fileIGreat-length blockR 1 ,⋯,R I Methods for correcting the level of post-methylation on (H) } include:
the first scheme is as follows: calculating block conversion efficiency according to the state of cytosine on the non-methylation site, and correcting the block methylation level by using the block conversion efficiency;
or scheme two: calculating the conversion efficiency of the fragments according to the state of cytosine on the non-methylation site of each original gene fragment, and calculating the methylation level of the block after filtering the fragments with low conversion rate.
4. The whole genome methylation sequencing-based ctDNA prediction method according to claim 3, wherein in the first scheme, the calculation formula of methylation level is as follows:
Figure 519985DEST_PATH_IMAGE001
wherein,beta_corrected i,j is a samplejIn a blockR i Corrected average methylation level of;
beta i,j is a samplejIn a blockR i Average methylation level of;
BS i,j is a samplejIn a blockR i Average conversion efficiency of (a);
in the second scheme, the formula for calculating the methylation level is as follows:
Figure 171547DEST_PATH_IMAGE002
Figure 420125DEST_PATH_IMAGE003
wherein,MFR i,j is a samplejIn a blockR i The ratio of methylated fragments of (a);
UFR i,j is a samplejIn a blockR i The fraction of unmethylated fragments of (a);
N convertedi,j is composed of
Figure 291699DEST_PATH_IMAGE004
The statistical number of segments of (a) is,BS i,j,n is a samplejIn a blockR i Upper segmentnThe conversion efficiency of (a);
N meth i,j is the statistical number of methylated fragments;
N unmeth i,j is the statistical number of unmethylated fragments;
the judgment standard of the methylated fragment and the unmethylated fragment is as follows:
Figure 574912DEST_PATH_IMAGE005
the fragment of (a) is judged to be a methylated fragment,
Figure 866217DEST_PATH_IMAGE006
is judged to be a non-methylated fragment,beta i,j,n is a samplejIn a blockR i Upper segmentnThe average methylation level of (a) is,
Figure 867671DEST_PATH_IMAGE007
5. the ctDNA prediction method based on whole genome methylation sequencing of claim 1, wherein the training set principal component matrix is obtained by:
methylation level matrix of training set samplesXAs an input file, performing principal component analysis and outputting a corresponding rotation matrixVAnd training set principal component matrixY(ii) a Calculating each principal componentY p Contribution to the total varianceλ p And before calculationmThe total contribution rate of the principal components is extracted to exceedtFront ofmA plurality of principal components into a list of principal components,
Figure 535412DEST_PATH_IMAGE008
(ii) a Output rotation matrixVTraining set principal component matrixYAnd a list of principal components;
the principal component matrix of the sample to be detected is obtained by adopting the following method:
methylation level matrix of sample to be testedXAnd the rotation matrixVAs an input file, performing principal component analysis and outputting a corresponding principal component matrix of the sample to be detectedY
The methylation model construction by utilizing the training set principal component matrix is carried out by adopting the following steps:
firstly, a principal component matrix of a training set is extracted according to the principal component listYBefore all samples inmPrincipal component to new matrixY'New matrix will be formedY'As the input characteristic of the support vector machine model, modeling is carried out by adopting a bagging method to obtain a model based onKIntegrated model established by weak classifiersMModel (A)MIn the method, the weight of each weak classifier is equal, and the final predicted value isKAverage value of the weak classifier predicted values;
the prediction of ctDNA in a sample to be detected is carried out by adopting the following steps:
according to the trained modelMExtracting principal component matrix of sample to be tested according to the principal component listYBefore all samples to be testedmPrincipal component to new matrixY'And will beY'Input to the trained modelMAnd obtaining the predicted value of each sample to be tested.
6. A ctDNA prediction apparatus based on whole genome methylation sequencing, the prediction apparatus comprising:
a methylation block partitioning module: dividing a genome into blocks with equal length according to a human reference genome FASTA file and a BED file, and generating a block list file capable of calculating methylation levels;
block methylation level calculation and correction module: used for calculating the methylation level of the training set sample and the sample to be tested on each block;
a methylation block feature dimension reduction module: reducing the dimensionality of the methylation levels of the training set samples and the samples to be detected, calculating a rotation matrix according to the training set samples by using a principal component analysis method, and generating a training set principal component matrix and a sample principal component matrix to be detected by using the rotation matrix;
methylation model construction module: constructing a methylation model by utilizing a training set principal component matrix, and training the model;
methylation model prediction module: and predicting ctDNA in the sample to be tested by using the trained model.
7. The ctDNA prediction apparatus based on whole genome methylation sequencing according to claim 6, wherein in the methylation block partitioning module, the partitioning of the genome into blocks with equal length is performed by the following steps:
dividing a genome into blocks with equal length according to a FASTA file and a BED file of a human reference genome, and filtering according to the position information of all CpG sites in the whole genome range and the ratio of multiple comparison areas to generate a block list capable of calculating the methylation level;
in the block methylation level calculation and correction module, the calculation method of the methylation level comprises the following steps:
respectively obtaining capture sequencing FASTQ files of a training set sample and a sample to be tested, carrying out sequence comparison, respectively generating a training set sample and an original Bam file of the sample to be tested, carrying out sequencing and marking repeated processing on the original Bam file, respectively generating a training set sample and a Bam file of the sample to be tested after sequencing and de-duplicating, and recording as a samplejRemoving the reordered Bam file; mixing the samplejCalculating samples by using the reordered Bam file and the block list capable of calculating the methylation level as input filesjRecorded in a list fileIBlock of equal lengthR 1 ,⋯,R I Corrected methylation levels on (v), and combining corrected methylation levels of all samples to oneILine ofJFirst of a lineRudimentary horizontal matrixX
The training set samples are plasma samples of normal people and plasma samples of patients; the sample to be detected is a plasma sample of a detected person;
calculating the methylation level also includesjIn a blockR i Upper fragment setS 1 ,⋯,S N Each fragment inS n Cutting off the front of the beginning of sequence 2 on each fragment XM tagAInformation of individual bases;
computing samplesjRecorded in a list fileIGreat-length blockR 1 ,⋯,R I Methods for correcting the level of post-methylation on (H) } include: the first scheme is as follows: calculating block conversion efficiency according to the state of cytosine on the non-methylation site, and correcting the block methylation level by using the block conversion efficiency;
or scheme two: calculating fragment conversion efficiency according to the state of cytosine on the non-methylation site of each original gene fragment, and calculating the methylation level of a block after filtering the low-conversion-rate fragment;
in the methylation block feature dimension reduction module, the training set principal component matrix is obtained by adopting the following method:
methylation level matrix of training set samplesXAs an input file, performing principal component analysis and outputting a corresponding rotation matrixVAnd training set principal component matrixY(ii) a Calculating each principal componentY p Contribution to the total varianceλ p And before calculationmThe total contribution rate of the principal components is extracted to exceedtFront ofmA plurality of principal components into a list of principal components,
Figure 187979DEST_PATH_IMAGE008
(ii) a Output rotation matrixVTraining set principal component matrixYAnd a principal component list;
in the methylation block feature dimension reduction module, the principal component matrix of the sample to be detected is obtained by adopting the following method:
methylation level matrix of sample to be testedXAnd the rotation matrixVAs an input file, performing principal component analysis and outputting a corresponding principal component matrix of the sample to be testedY
8. Use of the whole genome methylation sequencing-based ctDNA prediction method of any one of claims 1-5 and/or the whole genome methylation sequencing-based ctDNA prediction apparatus of claim 6 or 7 for the preparation of a ctDNA detection product.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the whole genome methylation sequencing-based ctDNA prediction method of claims 1-5 when executing the computer program.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the whole genome methylation sequencing-based ctDNA prediction method of claims 1-5.
CN202210977623.9A 2022-08-15 2022-08-15 ctDNA prediction method and device based on whole genome methylation sequencing Active CN115064211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210977623.9A CN115064211B (en) 2022-08-15 2022-08-15 ctDNA prediction method and device based on whole genome methylation sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210977623.9A CN115064211B (en) 2022-08-15 2022-08-15 ctDNA prediction method and device based on whole genome methylation sequencing

Publications (2)

Publication Number Publication Date
CN115064211A true CN115064211A (en) 2022-09-16
CN115064211B CN115064211B (en) 2023-01-24

Family

ID=83208023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210977623.9A Active CN115064211B (en) 2022-08-15 2022-08-15 ctDNA prediction method and device based on whole genome methylation sequencing

Country Status (1)

Country Link
CN (1) CN115064211B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115323058A (en) * 2022-10-17 2022-11-11 深圳市睿法生物科技有限公司 Cancer species localization method based on ctDNA methylation mode
CN115376616A (en) * 2022-10-24 2022-11-22 臻和(北京)生物科技有限公司 Multi-classification method and device based on cfDNA (cfDNA) multiomics
CN115678964A (en) * 2022-11-08 2023-02-03 广州女娲生命科技有限公司 Noninvasive screening method of preimplantation embryos based on embryo culture solution
CN116153418A (en) * 2023-04-18 2023-05-23 臻和(北京)生物科技有限公司 Method, apparatus, device and storage medium for correcting whole genome methylation sequencing data batch effect
CN117423388A (en) * 2023-12-19 2024-01-19 北京求臻医疗器械有限公司 Methylation-level-based multi-cancer detection system and electronic equipment

Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120149593A1 (en) * 2009-01-23 2012-06-14 Hicks James B Methods and arrays for profiling dna methylation
WO2014043763A1 (en) * 2012-09-20 2014-03-27 The Chinese University Of Hong Kong Non-invasive determination of methylome of fetus or tumor from plasma
US20150087529A1 (en) * 2013-09-20 2015-03-26 The Chinese University Of Hong Kong Sequencing analysis of circulating dna to detect and monitor autoimmune diseases
CN104899474A (en) * 2015-06-09 2015-09-09 大连三生科技发展有限公司 Method and system for rectifying MB-seq methylation level based on ridge regression
US20170175205A1 (en) * 2015-12-17 2017-06-22 Illumina, Inc. Distinguishing methylation levels in complex biological samples
US20180066317A1 (en) * 2015-03-11 2018-03-08 Deutsches Krebsforschungszentrum Stiftung des öffentlichen Rechts Dna-methylation based method for classifying tumor species
CN109852672A (en) * 2017-11-30 2019-06-07 深圳豪石生物科技有限公司 A method of screening acute myeloid leukemia DNA methylation prognostic marker
US20190177792A1 (en) * 2016-08-05 2019-06-13 The Broad Institute, Inc. Methods for genome characterization
CN110322928A (en) * 2019-08-16 2019-10-11 河海大学常州校区 DNA methylation spectrum detection method
US20190316209A1 (en) * 2018-04-13 2019-10-17 Grail, Inc. Multi-Assay Prediction Model for Cancer Detection
WO2020094775A1 (en) * 2018-11-07 2020-05-14 Cancer Research Technology Limited Enhanced detection of target dna by fragment size analysis
CN111755072A (en) * 2020-08-04 2020-10-09 深圳吉因加医学检验实验室 Method and device for simultaneously detecting methylation level, genome variation and insertion fragment
CN112397150A (en) * 2021-01-20 2021-02-23 臻和(北京)生物科技有限公司 ctDNA methylation level prediction device and method based on target region capture sequencing
CN112397151A (en) * 2021-01-21 2021-02-23 臻和(北京)生物科技有限公司 Methylation marker screening and evaluating method and device based on target capture sequencing
CN112951418A (en) * 2021-05-17 2021-06-11 臻和(北京)生物科技有限公司 Method and device for evaluating methylation of linked regions based on liquid biopsy, terminal equipment and storage medium
CN113257350A (en) * 2021-06-10 2021-08-13 臻和(北京)生物科技有限公司 ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device
US20210313006A1 (en) * 2020-03-31 2021-10-07 Grail, Inc. Cancer Classification with Genomic Region Modeling
CN113539355A (en) * 2021-07-15 2021-10-22 云康信息科技(上海)有限公司 Tissue-specific source for predicting cfDNA (deoxyribonucleic acid), related disease probability evaluation system and application
CN113903401A (en) * 2021-12-10 2022-01-07 臻和(北京)生物科技有限公司 ctDNA length-based analysis method and system
CN114045345A (en) * 2022-01-07 2022-02-15 臻和(北京)生物科技有限公司 Free DNA-based genome canceration information detection system and detection method
CN114703284A (en) * 2022-04-15 2022-07-05 北京莱盟君泰国际医疗技术开发有限公司 Blood free DNA methylation quantitative detection method and application thereof
US20220228209A1 (en) * 2021-01-20 2022-07-21 Genecast Biotechnology Co., Ltd. Dna methylation sequencing analysis methods
CN114868191A (en) * 2019-10-11 2022-08-05 格瑞尔有限责任公司 Cancer classification using tissue of origin threshold

Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120149593A1 (en) * 2009-01-23 2012-06-14 Hicks James B Methods and arrays for profiling dna methylation
WO2014043763A1 (en) * 2012-09-20 2014-03-27 The Chinese University Of Hong Kong Non-invasive determination of methylome of fetus or tumor from plasma
US20150087529A1 (en) * 2013-09-20 2015-03-26 The Chinese University Of Hong Kong Sequencing analysis of circulating dna to detect and monitor autoimmune diseases
US20180066317A1 (en) * 2015-03-11 2018-03-08 Deutsches Krebsforschungszentrum Stiftung des öffentlichen Rechts Dna-methylation based method for classifying tumor species
CN104899474A (en) * 2015-06-09 2015-09-09 大连三生科技发展有限公司 Method and system for rectifying MB-seq methylation level based on ridge regression
US20170175205A1 (en) * 2015-12-17 2017-06-22 Illumina, Inc. Distinguishing methylation levels in complex biological samples
US20190177792A1 (en) * 2016-08-05 2019-06-13 The Broad Institute, Inc. Methods for genome characterization
CN109852672A (en) * 2017-11-30 2019-06-07 深圳豪石生物科技有限公司 A method of screening acute myeloid leukemia DNA methylation prognostic marker
US20190316209A1 (en) * 2018-04-13 2019-10-17 Grail, Inc. Multi-Assay Prediction Model for Cancer Detection
WO2020094775A1 (en) * 2018-11-07 2020-05-14 Cancer Research Technology Limited Enhanced detection of target dna by fragment size analysis
CN110322928A (en) * 2019-08-16 2019-10-11 河海大学常州校区 DNA methylation spectrum detection method
CN114868191A (en) * 2019-10-11 2022-08-05 格瑞尔有限责任公司 Cancer classification using tissue of origin threshold
US20210313006A1 (en) * 2020-03-31 2021-10-07 Grail, Inc. Cancer Classification with Genomic Region Modeling
CN111755072A (en) * 2020-08-04 2020-10-09 深圳吉因加医学检验实验室 Method and device for simultaneously detecting methylation level, genome variation and insertion fragment
CN112397150A (en) * 2021-01-20 2021-02-23 臻和(北京)生物科技有限公司 ctDNA methylation level prediction device and method based on target region capture sequencing
US20220228209A1 (en) * 2021-01-20 2022-07-21 Genecast Biotechnology Co., Ltd. Dna methylation sequencing analysis methods
CN112397151A (en) * 2021-01-21 2021-02-23 臻和(北京)生物科技有限公司 Methylation marker screening and evaluating method and device based on target capture sequencing
CN112951418A (en) * 2021-05-17 2021-06-11 臻和(北京)生物科技有限公司 Method and device for evaluating methylation of linked regions based on liquid biopsy, terminal equipment and storage medium
CN113257350A (en) * 2021-06-10 2021-08-13 臻和(北京)生物科技有限公司 ctDNA mutation degree analysis method and device based on liquid biopsy and ctDNA performance analysis device
CN113539355A (en) * 2021-07-15 2021-10-22 云康信息科技(上海)有限公司 Tissue-specific source for predicting cfDNA (deoxyribonucleic acid), related disease probability evaluation system and application
CN113903401A (en) * 2021-12-10 2022-01-07 臻和(北京)生物科技有限公司 ctDNA length-based analysis method and system
CN114045345A (en) * 2022-01-07 2022-02-15 臻和(北京)生物科技有限公司 Free DNA-based genome canceration information detection system and detection method
CN114703284A (en) * 2022-04-15 2022-07-05 北京莱盟君泰国际医疗技术开发有限公司 Blood free DNA methylation quantitative detection method and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
凡时财等: "人类基因组DNA甲基化数据分析的研究现状", 《中国科学:生命科学》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115323058A (en) * 2022-10-17 2022-11-11 深圳市睿法生物科技有限公司 Cancer species localization method based on ctDNA methylation mode
CN115376616A (en) * 2022-10-24 2022-11-22 臻和(北京)生物科技有限公司 Multi-classification method and device based on cfDNA (cfDNA) multiomics
CN115376616B (en) * 2022-10-24 2023-04-28 臻和(北京)生物科技有限公司 Multi-classification method and device based on cfDNA multiunit science
CN115678964A (en) * 2022-11-08 2023-02-03 广州女娲生命科技有限公司 Noninvasive screening method of preimplantation embryos based on embryo culture solution
CN116153418A (en) * 2023-04-18 2023-05-23 臻和(北京)生物科技有限公司 Method, apparatus, device and storage medium for correcting whole genome methylation sequencing data batch effect
CN117423388A (en) * 2023-12-19 2024-01-19 北京求臻医疗器械有限公司 Methylation-level-based multi-cancer detection system and electronic equipment
CN117423388B (en) * 2023-12-19 2024-03-22 北京求臻医疗器械有限公司 Methylation-level-based multi-cancer detection system and electronic equipment

Also Published As

Publication number Publication date
CN115064211B (en) 2023-01-24

Similar Documents

Publication Publication Date Title
CN115064211B (en) ctDNA prediction method and device based on whole genome methylation sequencing
CN106909806B (en) The method and apparatus of fixed point detection variation
KR101795124B1 (en) Method and system for detecting copy number variation
CN109767810B (en) High-throughput sequencing data analysis method and device
CN112669901A (en) Chromosome copy number variation detection device based on low-depth high-throughput genome sequencing
CN114045345B (en) Free DNA-based genome canceration information detection system and detection method
CN112397151B (en) Methylation marker screening and evaluating method and device based on target capture sequencing
CN113168886A (en) Systems and methods for germline and somatic variant calling using neural networks
CN112218957A (en) Systems and methods for determining tumor fraction in cell-free nucleic acids
CN110846411A (en) Method for distinguishing gene mutation types of single tumor sample based on next generation sequencing
EP3973080A1 (en) Systems and methods for determining whether a subject has a cancer condition using transfer learning
CN112397150B (en) ctDNA methylation level prediction device and method based on target region capture sequencing
CN111968701A (en) Method and device for detecting somatic copy number variation of designated genome region
CN107480470A (en) Known the variation method for detecting and device examined based on Bayes and Poisson distribution
CN109712671B (en) Gene detection device based on ctDNA, storage medium and computer system
CN117524301B (en) Copy number variation detection method, device and computer readable medium
CN109461473B (en) Method and device for acquiring concentration of free DNA of fetus
CN110970091A (en) Label quality control method and device
CN110373458B (en) Kit and analysis system for thalassemia detection
KR102347463B1 (en) Method and appartus for detecting false positive variants in nucleic acid sequencing analysis
CN117059173A (en) Method for identifying copy number variation accurate breakpoint and application thereof
CN112837748A (en) System and method for distinguishing tumors of different anatomical origins
WO2019132010A1 (en) Method, apparatus and program for estimating base type in base sequence
CN115831355A (en) Early tumor screening method for multiple cancer species WGS
CN114974432A (en) Screening method of biomarker and related application thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant