Disclosure of Invention
The invention aims at overcoming the defects of the prior art and provides a detection and prediction method for microsatellite instability (MSI), a related site screening method and a related model construction method. The inventor finds that a reasonable algorithm fitting MSI distribution can greatly improve the MSI detection performance of the existing NGS (next generation sequencing) method, and more importantly, compared with the existing method, the method breaks through the limitation that the existing NGS method needs to have a control sample, and has excellent performance in panel analysis application and higher sensitivity, specificity and accuracy. According to the principle of the method, the method can be embedded into various known or to-be-developed gene combinations (panel), simple repeated sequence quantification (repeat) is carried out on a microsatellite region in a panel detection range, an MSI predicted value of each MSI site is fitted by utilizing the algorithm in the method based on the quantitative result of the microsatellite region, the MSI detection site is noise-reduced by utilizing a random forest algorithm, and an MSI classification model is constructed (figure 1), so that accurate prediction can be realized.
In order to achieve the above object, the present invention provides a screening method of a microsatellite instability (MSI) detection site or a combination of microsatellite instability detection sites, the screening method comprising:
i) A step of determining a microsatellite instability prediction value of a candidate microsatellite locus in a sample; and/or
ii) screening the sites or combinations of sites for microsatellite instability detection using one or more indicators including a microsatellite instability prediction of candidate microsatellite sites;
the microsatellite instability prediction value is a quantification value of the difference in abundance between the major allele (allele) type and the minor allele type in candidate microsatellite loci.
In particular embodiments of the invention, the abundance of an allele can be calculated by simple repeat sequencing read length number (repeat count).
In particular embodiments of the invention, the abundance of an allele can be calculated by the number of reads (reads) supported by the genotype.
Optionally, the abundance of the allele falls within the interval 0-1 by normalization; preferably, the method of allele abundance normalization may be to divide the number of read support for the allele in the candidate microsatellite locus by the sum of the number of read support for all alleles for that locus.
In the present invention, the calculation is performed by selecting the allelic type with the repeat (repeat) length between 0 and 50 bp.
In a specific embodiment of the present invention, the screening method further comprises the step of calculating a microsatellite instability prediction value of the sample from the microsatellite instability prediction values of all candidate microsatellite loci in the sample.
Further, the microsatellite instability prediction value of the sample is calculated according to the following formula:
wherein f (X) represents a microsatellite instability predicted value of the sample, X i1 Representing normalized major allele abundance, X i2 Indicating normalized secondary genotype abundance, i is site number or sequence, n is number of sites.
In particular embodiments of the invention, the candidate microsatellite loci may be microsatellite loci in the whole genome, microsatellite loci in an exome, microsatellite loci contained in one or more regions in the genome, or microsatellite loci of interest.
In particular embodiments of the invention, the candidate microsatellite loci may be selected from the group consisting of single nucleotide repeats, dinucleotide repeats and/or complex repeats.
In a specific embodiment of the invention, the screening method further comprises a method of denoising the detection site by machine learning.
Further, the noise reduction method may be selected from random forest (PCA), principal Component Analysis (PCA), linear Discriminant Analysis (LDA), ridge regression, lasso regression, neural network, or algorithms derived from the above methods; random forests or algorithms derived from random forests are preferred.
Further, the algorithm derived from random forests may be selected from an isolated Forest (Isolation Forest) algorithm, a TRTE (Totally Random Trees Embedding) algorithm, or an extreme tree (ExtraTrees) algorithm.
In a specific embodiment of the invention, the microsatellite instability predicted value is used as input data for machine learning.
The screening method of the present invention can use only samples derived from tumor patients or tumor tissues; the screening methods of the present invention may not use samples derived from healthy individuals or normal tissues.
The invention also provides a construction method of the microsatellite instability (MSI) detection model, which comprises the following steps:
i) A step of determining a microsatellite instability prediction value of a candidate microsatellite locus in a sample; and/or
ii) constructing a microsatellite instability detection model using one or more indicators including a microsatellite instability prediction value of the candidate microsatellite locus;
The predicted value of microsatellite instability is a quantified value of the difference in abundance between the major and minor genotypes in the candidate microsatellite loci.
In a specific embodiment of the invention, the abundance of an allele is calculated by simple repeat sequencing read length number (repeat count).
In a specific embodiment of the invention, the abundance of an allele is calculated by the genotype reads (reads) support number.
Optionally, the abundance of the allele falls within the interval 0-1 by normalization; preferably, the abundance of an allele is normalized by dividing the number of reads for the allele in the candidate microsatellite locus by the sum of the number of reads for all alleles for that locus.
In a specific embodiment of the invention, the calculation can be performed by selecting the allelic type with a repeat (repeat) length between 0 and 50 bp.
In a specific embodiment of the present invention, the construction method further comprises the step of calculating a microsatellite instability prediction value of the sample from the microsatellite instability prediction values of all candidate microsatellite loci in the sample.
Further, the microsatellite instability prediction value of the sample may be calculated according to the following formula:
Wherein f (X) represents a microsatellite instability predicted value of the sample, X i1 Representing normalized major allele abundance, X i2 Indicating normalized secondary genotype abundance, i is site number or sequence, n is number of sites.
In particular embodiments of the invention, the candidate microsatellite loci may be microsatellite loci in the whole genome, microsatellite loci in an exome, microsatellite loci contained in one or more regions in the genome, or microsatellite loci of interest.
In particular embodiments of the invention, the candidate microsatellite loci may be selected from the group consisting of single nucleotide repeats, dinucleotide repeats and/or complex repeats.
In a specific embodiment of the present invention, the construction method may further comprise a method of denoising the detection sites used for the model by machine learning.
Further, the noise reduction method may be selected from random forest (PCA), principal Component Analysis (PCA), linear Discriminant Analysis (LDA), ridge regression, lasso regression, neural network, or algorithms derived from the above methods; random forests or algorithms derived from random forests are preferred.
Further, the algorithm derived from random forests may be selected from an isolated Forest (Isolation Forest) algorithm, a TRTE (Totally Random Trees Embedding) algorithm, or an extreme tree (ExtraTrees) algorithm.
In a specific embodiment of the invention, the microsatellite instability predicted value is used as input data for machine learning.
The construction method of the present invention may use only samples derived from tumor patients or tumor tissues; the construction method of the present invention may not use samples derived from healthy individuals or normal tissues.
In the present invention, a construction method of a microsatellite instability (MSI) detection model is also provided, and the construction method uses the microsatellite instability detection sites or detection site combinations obtained by the screening method of the present invention to construct a model.
Further, the construction method may be a machine learning method.
The invention also provides a method for detecting or predicting microsatellite instability, which comprises the following steps:
i) The method comprises the step of screening microsatellite instability detection sites or combination of detection sites; and/or
ii) detecting the microsatellite instability detection sites or the combination of detection sites obtained by screening by using the screening method of the invention; and/or
iii) The microsatellite instability detection model constructed by the construction method is used for detection.
The invention also provides a microsatellite instability detection marker combination, which is a microsatellite marker combination at a detection site obtained by the screening method of the invention.
Further, the detection marker combinations of the present invention may comprise the markers in table 2, table 5, table 6 or table 7.
The invention also provides application of the reagent for specifically detecting the microsatellite instability detection sites or detection site combinations in the preparation of microsatellite instability detection kits and/or tumor accompanying diagnosis kits, wherein the microsatellite instability detection sites or detection site combinations are microsatellite instability detection sites or detection site combinations obtained by adopting the screening method, or detection marker combinations or microsatellite instability detection sites or detection site combinations in the detection model constructed by the construction method.
The invention also provides a microsatellite instability detection kit and/or a tumor accompanying diagnosis kit, wherein the kit comprises a reagent for specifically detecting microsatellite instability detection sites or detection site combinations, and the microsatellite instability detection sites or detection site combinations are microsatellite instability detection sites or detection site combinations obtained by adopting the screening method, or detection marker combinations or microsatellite instability detection sites or detection site combinations in a detection model constructed by the construction method.
The present invention also provides a system or device for microsatellite instability detection and/or tumor companion diagnosis, the system or device comprising:
the acquisition module is used for acquiring the measurement data of a microsatellite instability detection site or detection site combination of a subject, wherein the microsatellite instability detection site or detection site combination is obtained by adopting the screening method of the invention, or the detection marker combination is obtained by adopting the detection marker combination of the invention, or the microsatellite instability detection site or detection site combination in the detection model constructed by the construction method of the invention, the measurement data is a microsatellite instability prediction value of the detection site, and the microsatellite instability prediction value is a quantification value of the abundance difference between a major allele type and a minor allele type in candidate microsatellite sites;
the data analysis module is used for inputting the detection data of the microsatellite instability detection sites or the detection site combinations into the detection model constructed by the construction method of the invention to obtain detection results.
In particular embodiments of the invention, the system or apparatus may further comprise: a sequencing module for sequencing the subject.
In a specific embodiment of the invention, the system or apparatus further comprises: a diagnostic module for generating tumor-associated diagnostic results and/or treatment recommendations.
The present invention also provides a computer-readable storage medium including a stored computer program, the computer program comprising:
i) A program for performing the screening method of the microsatellite instability detection site or combination of microsatellite instability detection sites of the present invention; and/or
ii) a program for executing the construction method of the microsatellite instability detection model of the present invention; and/or
iii) A program for performing the microsatellite instability detection or prediction method of the present invention.
The present invention also provides an apparatus comprising a processor, a memory and a computer program stored in the memory, the computer program comprising:
i) A program for performing the screening method of the microsatellite instability detection site or combination of microsatellite instability detection sites of the present invention; and/or
ii) a program for executing the construction method of the microsatellite instability detection model of the present invention; and/or
iii) A program for performing the microsatellite instability detection or prediction method of the present invention.
The invention also provides a method for quantifying microsatellite instability, which uses a microsatellite instability predicted value, wherein the microsatellite instability predicted value is a quantized value of abundance difference between a major allele type and a minor allele type in a microsatellite locus.
In a specific embodiment of the invention, the abundance of an allele is calculated by simple repeat sequencing read length number (repeat count).
In a specific embodiment of the invention, the abundance of an allele is calculated by the genotype reads (reads) support number.
Optionally, the abundance of the allele falls within the interval 0-1 by normalization; preferably, the abundance of an allele is normalized by dividing the number of reads supported for the allele in the microsatellite locus by the sum of the number of reads supported for all alleles at that locus.
In a specific embodiment of the invention, the calculation can be performed by selecting the allelic type with a repeat (repeat) length between 0 and 50 bp.
In a specific embodiment of the present invention, the quantification method further comprises the step of calculating a microsatellite instability prediction value of the sample from the microsatellite instability prediction values of all target microsatellite loci in the sample.
Further, the microsatellite instability prediction value of the sample is calculated according to the following formula:
wherein f (X) represents a microsatellite instability predicted value of the sample, X i1 Representing normalized major allele abundance, X i2 Indicating normalized secondary genotype abundance, i is site number or sequence, n is number of sites.
In particular embodiments of the invention, the microsatellite loci are microsatellite loci in the whole genome, microsatellite loci in an exome, microsatellite loci contained in one or more regions in the genome, or microsatellite loci of interest.
In a specific embodiment of the invention, the microsatellite loci are selected from the group consisting of single nucleotide repeats, dinucleotide repeats and/or complex repeats.
The methods of the invention may further comprise the step of sequencing to determine the nucleotide sequence of the sample.
The methods of the invention may use only samples derived from tumor patients or tumor tissue; the methods of the invention may not use samples derived from healthy individuals or normal tissue.
The technical scheme of the invention can be used in various diagnostic and non-diagnostic application scenes of cancers. The technical scheme of the invention can be applied to tumors of any stage, such as very early tumor, medium tumor and late tumor; preferably for early stage tumors or very early stage tumors.
The beneficial effects of the invention at least comprise the following aspects:
(1) The accuracy is high, and the model performance is excellent. The sensitivity, specificity and AUC can reach 1, the prediction result is completely consistent with the PCR method serving as a gold standard, and the detection rate can reach 100% when the LOD is detected and limited to 20%, 15% and 10% compared with the existing NGS method.
(2) The MSI state can be accurately and stably detected only based on a single tumor sample, the limitation caused by the loss or sampling difficulty of a control sample is eliminated, the sequencing cost is saved on the technical level, the operation steps are simplified, and the detection speed is improved.
(3) The method can flexibly embed detection flows of different gene detection panel products, flexibly screen MSI detection sites according to panel sizes, has excellent MSI detection performance and is flexible and wide in application. For a brand new panel product, all MSI detection sites in the panel detection range can be selected, effective detection sites in the panel detection range can be screened out by adopting the method, and MSI prediction is carried out on a sample of the panel by using the algorithm and the model of the invention. For other panel products, if the detection range of the panel product is contained in the existing panel screened by the method, the loci in the panel range can be directly extracted from the detection loci screened by the existing panel, and the loci can be used for MSI prediction; for the panel products with the detection range different from the existing panel, the method can be adopted to screen out high-efficiency sites from the head and construct a prediction model.
Detailed Description
Unless otherwise indicated, all terms used herein have the meaning commonly found in the art, and all reagents used are conventional commercial reagents in the art.
The term "genotype reads (or read length) support" as used herein refers to the number of reads obtained by sequencing that correspond to the sequence of a genotype. For microsatellite instability, the genotype is a simple series of repeat species present at the same position on the chromosome, each simple series of repeat species being referred to as a genotype. The term "simple repeat sequencing read length number" (repeatcount) refers to the number of read supports per simple repeat.
The term "sensitivity" in the present invention may refer to the number of true positives divided by the sum of the number of true positives and false negatives, and may be used to characterize the ability to correctly identify a population that is truly suffering from cancer.
The term "specificity" in the present invention may refer to the number of true negatives divided by the sum of the number of true negatives and false positives, and may be used to characterize the ability to correctly identify a population that is truly free of cancer.
The term "ROC" or "ROC curve" in the present invention may refer to a subject's working characteristic curve (receiver operating characteristic curve) that may be used to characterize the performance of a classifier. ROC curves can be generated by plotting sensitivity versus specificity at various threshold settings.
The term "AUC" in the present invention may refer to the area under the ROC curve and may be used to characterize the performance of cancer screening/prediction. AUC ranges from 0.5 to 1.0, with values closer to 1.0 indicating better screening/predictive performance of the method.
The technical contents of the present invention will be described in detail with reference to the accompanying drawings and specific embodiments. It will be appreciated by those skilled in the art that the following examples are illustrative of the present invention and should not be construed as limiting the scope of the invention.
Example 1
Clinical patient tissue samples, including tumor test samples and paired healthy human control (normal) samples, were collected for testing the MSI detection methods of the present invention and validated using PCR MSI detection methods and compared to other MSI detection tools known in the art. The PCR MSI detection result is used as a gold standard, and the consistency of the result with the MSI detection method is compared to evaluate the performance of the MSI detection method. It should be noted that the MSI detection method proposed by the present invention does not require the use of control samples, which are used in the examples section for PCR validation and other MSI detection software analysis.
As for MSI sites used for the test, a large gene combination (panel) containing 654 genes related to tumors such as rectal cancer, gastric cancer, ovarian cancer and pulmonary sarcoma cancer, covering 550 MSI sites was used for the test in this example.
The kit for detecting MSI by the PCR method is a Promega kit, MSI state information of each sample is obtained by PCR detection, and the result is used as a gold standard. The results of PCR were divided into two groups: samples detected as MSI-H (microsatellite highly unstable) are grouped and scored as MSI-H (positive); samples detected as MSI-L (microsatellite low instability) and MSS (microsatellite stability) were grouped and scored as MSS (negative). The purpose of such grouping is to correlate the PCR detection results with the results of NGS-based MSI detection methods (NGS detection methods have only two types of detection results, MSI-H and MSS, as reported in the prior art).
Example 2
The 4 gene combinations of example 1 were subjected to exon probe capture library sequencing, and 150bp Pair-End mode sequencing (Read 1:151, read2:151; index1:8, index 2:8) was performed according to the instrument standard protocol using a gene sequencer (NextSeq CN 500), finally fastq format second generation sequencing data was obtained as raw data (raw data).
And performing quality control on the obtained second-generation sequencing data by using quality control software fastp, filtering sequencing joints, low-quality bases, sequencing error fragments and the like, and filtering to obtain high-quality data (clean data).
Comparing the clean data with a reference genome hg19 by using comparison software bwa to obtain corresponding specific position information on each DNA fragment genome; the data were then deduplicated and base corrected using genecore software.
Based on the known repeat (repeat) locations of the reference genome detection region, the number of reads (or read lengths) supported for each repeat type is obtained. First, a genomic MSI candidate locus bed file is generated using the published MANTIS self-contained tool RepeatFinder, loci are read from the bed file, and the differences generated by the 0-/1-base indexes are aligned using the supplied reference genomic file. Wherein, the microsatellite repeated sequence with the length of 10-100, single base and more than 5 times of repeated times is selected. Secondly, quality control is carried out on reads matched with all repeat types on each loci, the average base quality of each read is not less than 20, the average base quantity of each read falling in a loci interval is not less than 25, the length of the reads (only counting non-shearing parts and not counting soft-clips or hard-clips) is greater than 35bp, the support number of the minimum repeat reads is greater than 1, and other involved filtering steps are software default parameters. Finally, counting the number of reads qualified for quality control for each repeat type at all sites of tumor patient and control, respectively, generating a simple repeat sequencing read length number (repeat count) result for subsequent analysis (since MANTIS is an analysis of paired samples, tumor and control samples need to be input simultaneously, and the method of the invention does not need to use control samples, so in order not to affect software operation, tumor samples are repeatedly counted once as control samples, and finally the statistical result of tumor samples is used).
Example 3
From the repeat result obtained in example 2, the allele type representing the position is obtained at each locus, the difference between the primary and secondary abundance values is calculated as the predicted value of the microsatellite instability level for each loci, and the average of the (1-unstable level predicted values) of all loci in the sample is used as the predicted value of microsatellite instability (MSIscore) of the sample, and the formula of the non-parametric algorithm is as follows:
wherein X is i1 Representing the main peak value, X i2 The secondary peak is represented, i is the loci order, n is the loci number.
Specifically, counting the repeat count of each detection position with the repeat length between 0 and 50bp, normalizing the repeat count to be within a range of 0 to 1, wherein the normalization mode is that the number of reads support of each repeat type of the position is divided by the sum of the numbers of reads support of all repeat types of the position, and sorting according to normalized numerical values from large to small, wherein the maximum value is a main peak value, and the secondary value is a secondary peak value.
Example 4
153 samples of tumor tissue among the samples collected in example 1 were prepared according to 8:2 and the MSI state information (MSI-H/MSS) measured by the PCR method is randomly and hierarchically sampled and divided into 122 cases of training sets and 31 cases of test sets, and then initial training set samples are further processed according to the training sets: the validation set was 8:2 and MSI status information (MSI-H/MSS) are randomly layered sampled 20 times to obtain samples of 20 sampling results (fig. 2), and 97 samples of a training set and 25 samples of a verification set are obtained in each sampling. The above samples did not contain any control samples.
The following is performed on each sample taken, with 550 candidate sites covered by the large panel of example 1 for each sample, the MSIscore is calculated using the parameterisation algorithm of example 3, and the set of samples taken each time is sorted into a matrix of training and validation sets containing the MSIscore for each sample at each detection site. For the matrix sorted by each sampling set, the weight of each candidate site is calculated by using a random forest algorithm, all candidate sites are ordered according to the times that the weight is not 0 in 20 samples, and 10 gradients are divided according to the times that the weight occurs to be greater than or equal to 0-9 (as shown in table 1).
TABLE 1 weight gradient partitioning results
Number of times of occurrence of weight
|
>=0
|
>=1
|
>=2
|
>=3
|
>=4
|
>=5
|
>=6
|
>=7
|
>=8
|
>=9
|
Detecting point count
|
550
|
381
|
265
|
200
|
156
|
129
|
108
|
98
|
83
|
71 |
And carrying out prediction classification on the training set according to the detection site construction model selected by each gradient, dividing a threshold value by the training set, carrying out prediction classification on the verification set, and evaluating the performance of the evaluation model by using the prediction classification effect of the verification set. According to the result of each gradient test, drawing ROC curve to evaluate the performance of each model, considering the classification performance and stability, selecting the model of 200 detection sites contained in the gradient with the number of times of weight occurrence > =3, and defining the threshold according to the test result of the training set on the selected candidate site. The detection site selection results are shown in the following table:
TABLE 2 detection site screening results
Table 2 (subsequent)
Table 2 (subsequent)
The performance of the ROC curve evaluation model is drawn according to 97 training set sample prediction data, and the result is shown in figure 3, and the sensitivity, the specificity and the AUC of the model constructed by the method are all 1, so that the accurate prediction of MSI is realized.
Further, 158 samples were also used to evaluate model performance, with MSIscore > = threshold determined to be MSI-H and vice versa for MSS. The results are shown in table 3 and fig. 4:
table 3 model predictive performance:
MSI-H samples in 158 samples are correctly detected by the model constructed by the invention, and the MSI-H samples are completely consistent with the PCR detection result, and the consistency of the MSI-H samples and the PCR detection result is up to 100%.
To verify the highest sensitivity range of the method of the invention, we collected MSI-positive (MSI-H) reference KM12 and MSI-negative (MSS) reference NA12878, mixed MSI-H and MSS cell lines in proportion, diluted MSI-H cell line concentrations (20%, 15%, 10%, 5% respectively) to determine the lowest MSI detection limit. MSI was detected on-machine sequencing of the above cell samples using the large panel of example 1 as the sequencing panel. As shown in Table 4, MSI-H detection rate can reach 100% under the conditions of LOD 20%, LOD 15% and LOD 10%; in the case of LOD 5%, MSI-H detection rate can still reach 83%.
Table 4 detection rate at different LODs:
dilution concentration (%)
|
20
|
15
|
10
|
5
|
Detection rate (%)
|
100
|
100
|
100
|
83 |
Example 5
In order to evaluate the generalization performance of the model, three genes panel of medium, small and micro size are introduced, which respectively comprise 457, 86 and 31 genes related to tumors such as rectal cancer, gastric cancer, ovarian cancer, lung sarcoma cancer and the like, and 146, 109 and 41 MSI sites are respectively covered. 173 new samples were sequenced on-machine using the three panels described above, respectively.
Of the 200 MSI sites selected in example 4, 61, 33, and 7 sites were found in the middle, small, and mini panels, respectively, and the results are shown in tables 5-6 and SEQ ID Nos. 1-7 in Table 7.
Table 5 detection site screening results (medium panel):
table 5 (subsequent)
Table 6 detection site screening results (mini):
watch 6 (subsequent)
Table 7 detection site screening results (mini):
sequencing data from 173 new samples were analyzed using the above sites and the results are shown in table 8. For medium-sized, small-sized and miniature panels, all 11 MSI-H samples in 173 samples can be accurately detected by directly using 61, 33 or 7 screened sites, and the sensitivity and the specificity reach 1.
Table 8 model predictive performance in medium, small, mini panel:
Example 6
To test the performance improvement produced by the method of the present invention over prior art methods, it was compared to prior art MSI detection models. MANTIS, MSISensor and mSINGS are the three most commonly used MSI detection methods in the prior art, wherein the sensitivity and specificity of MANTIS are considered to be optimal (Performance evaluation for rapid detection of pan-cancer microsatellite instability with MANTIS.Oncostarget, 2017, vol.8, (No. 5), pp: 7452-7463), thus MANTIS was selected as a comparative example, MSI was predicted based on the sequencing data of the aforementioned 173 samples in 3 panels, and the MSI detection sites used by MANTIS software were all MSI detection sites covered by the 3 panel gene regions. The results are shown in Table 9:
table 9 model predictive performance in medium, small, mini panel:
therefore, compared with the prior art, the MSI detection model is constructed by adopting the method provided by the invention, and higher sensitivity and specificity can be obtained under the condition of using fewer detection sites.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.