CN111304303B

CN111304303B - Method for predicting microsatellite instability and application thereof

Info

Publication number: CN111304303B
Application number: CN202010098341.2A
Authority: CN
Inventors: 吴书昌; 白健; 王寅
Original assignee: Fujian Herui Gene Technology Co ltd
Current assignee: Fujian Herui Gene Technology Co ltd
Priority date: 2020-02-18
Filing date: 2020-02-18
Publication date: 2023-05-05
Anticipated expiration: 2040-02-18
Also published as: CN111304303A

Abstract

The invention discloses a microsatellite instability detection and prediction method, a related site screening method, a model construction method and application. The site screening method comprises the following steps: i) A step of determining a microsatellite instability prediction value of a candidate microsatellite locus in a sample; and/or ii) screening the sites or combinations of sites for microsatellite instability detection using one or more indicators including a microsatellite instability prediction value for candidate microsatellite sites; the microsatellite instability prediction value is a quantification value of the difference in abundance between the major allele (allele) type and the minor allele type in candidate microsatellite loci. The method disclosed by the invention has high accuracy, and the MSI state can be accurately and stably detected only on the basis of a single tumor sample.

Description

Method for predicting microsatellite instability and application thereof

Technical Field

The invention belongs to the technical field of gene detection, and particularly relates to a microsatellite instability detection and prediction method, a related site screening method, a model construction method, a marker combination, a kit, a system, a device, a computer-readable storage medium, equipment and a related microsatellite instability quantification method.

Background

Microsatellite (microsatellite) sequences are simple tandem repeat regions that exist at millions of loci (loci) on the human genome. Microsatellite instability (Microsatellite instability, MSI) refers to the phenomenon of microsatellite sequence lengthening or shortening due to DNA mismatch repair defects (dMMR). Such somatic mutations can cause the inactivation of oncogenes or the disruption of other non-coding regulatory sequences, thereby acting as carcinogenesis. MSI development has so far become a hotspot of clinical concern, mainly due to the impact of immunotherapy on the clinic. Cancer therapies have evolved extensively over the past few years, recognizing that different cancer subtypes with unique molecular phenotypes can be treated with novel targeting therapies. MSI is reported to be present as one of these unique molecular phenotypes in a variety of cancers, including colorectal cancer, endometrial cancer, gastric cancer, prostate cancer, ovarian cancer, glioblastoma, and the like. Among the various types of cancer, particularly colorectal cancer, MSI status, particularly high levels of MSI status (MSI-H), has been considered a prognostic biomarker, and subsequently studies have also reported that MSI status was used as a prognostic biomarker for a pan-cancerous immune screening point blocking response, and the U.S. Food and Drug Administration (FDA) approved pembrolizumab as an inhibitor of apoptosis 1 (PD-1) in 2017 for the treatment of all advanced solid tumor patients with high levels of microsatellite instability (MSI-H) or dhmr without other treatment options. Clinical trials indicate that immune checkpoint inhibitors can improve the efficacy of a variety of solid tumors, making MSI-H and dMMR the first full-cancer predictors of therapeutic response. Therefore, the exploration of tools and methods for detecting MSI status is of great importance in the clinical diagnosis and prognosis of tumors.

Three methods of MSI detection have been developed up to now, immunohistochemistry (Immunohistochemical stains, IHC), polymerase chain reaction (polymerase chain reaction, PCR) and second generation sequencing (next-generation sequencing, NGS). IHC method uses FFPE tumor tissue section to determine whether there is mismatch repair function defect in cell by judging the expression condition of 4 DNA mismatch repair proteins (MLH 1, PMS2, MSH2, MSH 6). If the result shows that any protein expression is absent, judging as dMMR; if all four proteins are expressed, they are interpreted as normal mismatch repair (pMMR). The method has the advantages of wide applicability and capability of determining which MMR proteins are lost in tumors, however, IHC has the problems that some qualitative protein changes cannot be detected, MMR results are occasionally reported wrong, and result interpretation is different from person to person. The PCR method is to identify the presence or absence of dMMR in tumor cells by comparing the change in microsatellite sequence length of tumor and non-tumor tissue DNA, and is widely recognized as a gold standard diagnostic tool for MSI detection. At least 5 sites are usually detected, 1 site instability is referred to as low level microsatellite instability (MSI-L), 2 and 2 sites instability is referred to as MSI-H, and 5 sites are stable, i.e., microsatellite stability (MSS). MSI-L and MSS are equivalent to the concept of pMMR, and MSI-H is equivalent to the concept of dMMR. The PCR detection MSI method not only makes up for the leak that IHC cannot detect MSI caused by non-truncated missense mutation, but also has good repeatability, but also has the limitation that the site of gene combination (panel) is less, the flux is lower, the MSI detection sensitivity caused by dinucleotide repeated sequence is lower, and the like. NGS methods are methods for MSI detection of a sample to be tested over a whole genome or whole exome range based on second generation sequencing. MSI is used as a biomarker to distinguish MSI mutation phenotype and is not used for finding out the determined MSI mutation position, so that the NGS method solves the problems of low PCR flux and small panel, can be used for detecting common sequencing data with target mutation detection, tumor mutation load (TMB) and the like, and the detection site can be automatically adjusted according to the detection purpose, so that the detection method is time-saving, labor-saving, flexible and convenient. In the NGS method reported so far, the PCR detection result is generally used as a gold standard, and the consistency of the two results is compared as a standard for evaluating the performance of the NGS detection method. However, the NGS methods are many and mostly pairing methods, such as the published MSI detection tool mSINGS, MSIsensor, MANTIS.

Along with the wide clinical application of MSI, compared with PCR and IHG methods, NGS has the advantages of high speed, high flux and wide coverage detection sites, and becomes the development trend of MSI clinical detection. The method has higher requirements on accuracy and feasibility, particularly the excellent performance requirements in panel detection are more urgent, however, most of the reported NGS methods detect MSI based on comparison of tumor tissues and control tissues, the detection standards are uneven, performance in panel application is not ideal, and clinical application of the NGS method is limited. By considering and researching the advantages of the existing NGS method and combining with panel analysis specific case data, a new algorithm can be utilized to obtain an MSI detection method suitable for panel analysis based on single tumor samples.

Disclosure of Invention

The invention aims at overcoming the defects of the prior art and provides a detection and prediction method for microsatellite instability (MSI), a related site screening method and a related model construction method. The inventor finds that a reasonable algorithm fitting MSI distribution can greatly improve the MSI detection performance of the existing NGS (next generation sequencing) method, and more importantly, compared with the existing method, the method breaks through the limitation that the existing NGS method needs to have a control sample, and has excellent performance in panel analysis application and higher sensitivity, specificity and accuracy. According to the principle of the method, the method can be embedded into various known or to-be-developed gene combinations (panel), simple repeated sequence quantification (repeat) is carried out on a microsatellite region in a panel detection range, an MSI predicted value of each MSI site is fitted by utilizing the algorithm in the method based on the quantitative result of the microsatellite region, the MSI detection site is noise-reduced by utilizing a random forest algorithm, and an MSI classification model is constructed (figure 1), so that accurate prediction can be realized.

In order to achieve the above object, the present invention provides a screening method of a microsatellite instability (MSI) detection site or a combination of microsatellite instability detection sites, the screening method comprising:

i) A step of determining a microsatellite instability prediction value of a candidate microsatellite locus in a sample; and/or

ii) screening the sites or combinations of sites for microsatellite instability detection using one or more indicators including a microsatellite instability prediction of candidate microsatellite sites;

the microsatellite instability prediction value is a quantification value of the difference in abundance between the major allele (allele) type and the minor allele type in candidate microsatellite loci.

In particular embodiments of the invention, the abundance of an allele can be calculated by simple repeat sequencing read length number (repeat count).

In particular embodiments of the invention, the abundance of an allele can be calculated by the number of reads (reads) supported by the genotype.

Optionally, the abundance of the allele falls within the interval 0-1 by normalization; preferably, the method of allele abundance normalization may be to divide the number of read support for the allele in the candidate microsatellite locus by the sum of the number of read support for all alleles for that locus.

In the present invention, the calculation is performed by selecting the allelic type with the repeat (repeat) length between 0 and 50 bp.

In a specific embodiment of the present invention, the screening method further comprises the step of calculating a microsatellite instability prediction value of the sample from the microsatellite instability prediction values of all candidate microsatellite loci in the sample.

Further, the microsatellite instability prediction value of the sample is calculated according to the following formula:

wherein f (X) represents a microsatellite instability predicted value of the sample, X _i1 Representing normalized major allele abundance, X _i2 Indicating normalized secondary genotype abundance, i is site number or sequence, n is number of sites.

In particular embodiments of the invention, the candidate microsatellite loci may be microsatellite loci in the whole genome, microsatellite loci in an exome, microsatellite loci contained in one or more regions in the genome, or microsatellite loci of interest.

In particular embodiments of the invention, the candidate microsatellite loci may be selected from the group consisting of single nucleotide repeats, dinucleotide repeats and/or complex repeats.

In a specific embodiment of the invention, the screening method further comprises a method of denoising the detection site by machine learning.

Further, the noise reduction method may be selected from random forest (PCA), principal Component Analysis (PCA), linear Discriminant Analysis (LDA), ridge regression, lasso regression, neural network, or algorithms derived from the above methods; random forests or algorithms derived from random forests are preferred.

Further, the algorithm derived from random forests may be selected from an isolated Forest (Isolation Forest) algorithm, a TRTE (Totally Random Trees Embedding) algorithm, or an extreme tree (ExtraTrees) algorithm.

In a specific embodiment of the invention, the microsatellite instability predicted value is used as input data for machine learning.

The screening method of the present invention can use only samples derived from tumor patients or tumor tissues; the screening methods of the present invention may not use samples derived from healthy individuals or normal tissues.

The invention also provides a construction method of the microsatellite instability (MSI) detection model, which comprises the following steps:

ii) constructing a microsatellite instability detection model using one or more indicators including a microsatellite instability prediction value of the candidate microsatellite locus;

The predicted value of microsatellite instability is a quantified value of the difference in abundance between the major and minor genotypes in the candidate microsatellite loci.

In a specific embodiment of the invention, the abundance of an allele is calculated by simple repeat sequencing read length number (repeat count).

In a specific embodiment of the invention, the abundance of an allele is calculated by the genotype reads (reads) support number.

Optionally, the abundance of the allele falls within the interval 0-1 by normalization; preferably, the abundance of an allele is normalized by dividing the number of reads for the allele in the candidate microsatellite locus by the sum of the number of reads for all alleles for that locus.

In a specific embodiment of the invention, the calculation can be performed by selecting the allelic type with a repeat (repeat) length between 0 and 50 bp.

In a specific embodiment of the present invention, the construction method further comprises the step of calculating a microsatellite instability prediction value of the sample from the microsatellite instability prediction values of all candidate microsatellite loci in the sample.

Further, the microsatellite instability prediction value of the sample may be calculated according to the following formula:

In a specific embodiment of the present invention, the construction method may further comprise a method of denoising the detection sites used for the model by machine learning.

The construction method of the present invention may use only samples derived from tumor patients or tumor tissues; the construction method of the present invention may not use samples derived from healthy individuals or normal tissues.

In the present invention, a construction method of a microsatellite instability (MSI) detection model is also provided, and the construction method uses the microsatellite instability detection sites or detection site combinations obtained by the screening method of the present invention to construct a model.

Further, the construction method may be a machine learning method.

The invention also provides a method for detecting or predicting microsatellite instability, which comprises the following steps:

i) The method comprises the step of screening microsatellite instability detection sites or combination of detection sites; and/or

ii) detecting the microsatellite instability detection sites or the combination of detection sites obtained by screening by using the screening method of the invention; and/or

iii) The microsatellite instability detection model constructed by the construction method is used for detection.

The invention also provides a microsatellite instability detection marker combination, which is a microsatellite marker combination at a detection site obtained by the screening method of the invention.

Further, the detection marker combinations of the present invention may comprise the markers in table 2, table 5, table 6 or table 7.

The invention also provides application of the reagent for specifically detecting the microsatellite instability detection sites or detection site combinations in the preparation of microsatellite instability detection kits and/or tumor accompanying diagnosis kits, wherein the microsatellite instability detection sites or detection site combinations are microsatellite instability detection sites or detection site combinations obtained by adopting the screening method, or detection marker combinations or microsatellite instability detection sites or detection site combinations in the detection model constructed by the construction method.

The invention also provides a microsatellite instability detection kit and/or a tumor accompanying diagnosis kit, wherein the kit comprises a reagent for specifically detecting microsatellite instability detection sites or detection site combinations, and the microsatellite instability detection sites or detection site combinations are microsatellite instability detection sites or detection site combinations obtained by adopting the screening method, or detection marker combinations or microsatellite instability detection sites or detection site combinations in a detection model constructed by the construction method.

The present invention also provides a system or device for microsatellite instability detection and/or tumor companion diagnosis, the system or device comprising:

the acquisition module is used for acquiring the measurement data of a microsatellite instability detection site or detection site combination of a subject, wherein the microsatellite instability detection site or detection site combination is obtained by adopting the screening method of the invention, or the detection marker combination is obtained by adopting the detection marker combination of the invention, or the microsatellite instability detection site or detection site combination in the detection model constructed by the construction method of the invention, the measurement data is a microsatellite instability prediction value of the detection site, and the microsatellite instability prediction value is a quantification value of the abundance difference between a major allele type and a minor allele type in candidate microsatellite sites;

the data analysis module is used for inputting the detection data of the microsatellite instability detection sites or the detection site combinations into the detection model constructed by the construction method of the invention to obtain detection results.

In particular embodiments of the invention, the system or apparatus may further comprise: a sequencing module for sequencing the subject.

In a specific embodiment of the invention, the system or apparatus further comprises: a diagnostic module for generating tumor-associated diagnostic results and/or treatment recommendations.

The present invention also provides a computer-readable storage medium including a stored computer program, the computer program comprising:

i) A program for performing the screening method of the microsatellite instability detection site or combination of microsatellite instability detection sites of the present invention; and/or

ii) a program for executing the construction method of the microsatellite instability detection model of the present invention; and/or

iii) A program for performing the microsatellite instability detection or prediction method of the present invention.

The present invention also provides an apparatus comprising a processor, a memory and a computer program stored in the memory, the computer program comprising:

The invention also provides a method for quantifying microsatellite instability, which uses a microsatellite instability predicted value, wherein the microsatellite instability predicted value is a quantized value of abundance difference between a major allele type and a minor allele type in a microsatellite locus.

Optionally, the abundance of the allele falls within the interval 0-1 by normalization; preferably, the abundance of an allele is normalized by dividing the number of reads supported for the allele in the microsatellite locus by the sum of the number of reads supported for all alleles at that locus.

In a specific embodiment of the present invention, the quantification method further comprises the step of calculating a microsatellite instability prediction value of the sample from the microsatellite instability prediction values of all target microsatellite loci in the sample.

In particular embodiments of the invention, the microsatellite loci are microsatellite loci in the whole genome, microsatellite loci in an exome, microsatellite loci contained in one or more regions in the genome, or microsatellite loci of interest.

In a specific embodiment of the invention, the microsatellite loci are selected from the group consisting of single nucleotide repeats, dinucleotide repeats and/or complex repeats.

The methods of the invention may further comprise the step of sequencing to determine the nucleotide sequence of the sample.

The methods of the invention may use only samples derived from tumor patients or tumor tissue; the methods of the invention may not use samples derived from healthy individuals or normal tissue.

The technical scheme of the invention can be used in various diagnostic and non-diagnostic application scenes of cancers. The technical scheme of the invention can be applied to tumors of any stage, such as very early tumor, medium tumor and late tumor; preferably for early stage tumors or very early stage tumors.

The beneficial effects of the invention at least comprise the following aspects:

(1) The accuracy is high, and the model performance is excellent. The sensitivity, specificity and AUC can reach 1, the prediction result is completely consistent with the PCR method serving as a gold standard, and the detection rate can reach 100% when the LOD is detected and limited to 20%, 15% and 10% compared with the existing NGS method.

(2) The MSI state can be accurately and stably detected only based on a single tumor sample, the limitation caused by the loss or sampling difficulty of a control sample is eliminated, the sequencing cost is saved on the technical level, the operation steps are simplified, and the detection speed is improved.

(3) The method can flexibly embed detection flows of different gene detection panel products, flexibly screen MSI detection sites according to panel sizes, has excellent MSI detection performance and is flexible and wide in application. For a brand new panel product, all MSI detection sites in the panel detection range can be selected, effective detection sites in the panel detection range can be screened out by adopting the method, and MSI prediction is carried out on a sample of the panel by using the algorithm and the model of the invention. For other panel products, if the detection range of the panel product is contained in the existing panel screened by the method, the loci in the panel range can be directly extracted from the detection loci screened by the existing panel, and the loci can be used for MSI prediction; for the panel products with the detection range different from the existing panel, the method can be adopted to screen out high-efficiency sites from the head and construct a prediction model.

Drawings

FIG. 1 is a schematic diagram of a process for predicting the instability of a microsatellite according to an embodiment of the present invention.

FIG. 2 is a flow chart of the construction and performance evaluation of a microsatellite instability classification model according to an embodiment of the present invention.

FIG. 3 shows the results of performance evaluation (training set samples) of the model of the present invention. The left graph is a model ROC curve; the right plot is a classification scatter plot.

FIG. 4 shows the results of performance evaluation of the inventive model (158 samples). The left graph is a data summary; the right plot is a classification scatter plot.

Detailed Description

Unless otherwise indicated, all terms used herein have the meaning commonly found in the art, and all reagents used are conventional commercial reagents in the art.

The term "genotype reads (or read length) support" as used herein refers to the number of reads obtained by sequencing that correspond to the sequence of a genotype. For microsatellite instability, the genotype is a simple series of repeat species present at the same position on the chromosome, each simple series of repeat species being referred to as a genotype. The term "simple repeat sequencing read length number" (repeatcount) refers to the number of read supports per simple repeat.

The term "sensitivity" in the present invention may refer to the number of true positives divided by the sum of the number of true positives and false negatives, and may be used to characterize the ability to correctly identify a population that is truly suffering from cancer.

The term "specificity" in the present invention may refer to the number of true negatives divided by the sum of the number of true negatives and false positives, and may be used to characterize the ability to correctly identify a population that is truly free of cancer.

The term "ROC" or "ROC curve" in the present invention may refer to a subject's working characteristic curve (receiver operating characteristic curve) that may be used to characterize the performance of a classifier. ROC curves can be generated by plotting sensitivity versus specificity at various threshold settings.

The term "AUC" in the present invention may refer to the area under the ROC curve and may be used to characterize the performance of cancer screening/prediction. AUC ranges from 0.5 to 1.0, with values closer to 1.0 indicating better screening/predictive performance of the method.

The technical contents of the present invention will be described in detail with reference to the accompanying drawings and specific embodiments. It will be appreciated by those skilled in the art that the following examples are illustrative of the present invention and should not be construed as limiting the scope of the invention.

Example 1

Clinical patient tissue samples, including tumor test samples and paired healthy human control (normal) samples, were collected for testing the MSI detection methods of the present invention and validated using PCR MSI detection methods and compared to other MSI detection tools known in the art. The PCR MSI detection result is used as a gold standard, and the consistency of the result with the MSI detection method is compared to evaluate the performance of the MSI detection method. It should be noted that the MSI detection method proposed by the present invention does not require the use of control samples, which are used in the examples section for PCR validation and other MSI detection software analysis.

As for MSI sites used for the test, a large gene combination (panel) containing 654 genes related to tumors such as rectal cancer, gastric cancer, ovarian cancer and pulmonary sarcoma cancer, covering 550 MSI sites was used for the test in this example.

The kit for detecting MSI by the PCR method is a Promega kit, MSI state information of each sample is obtained by PCR detection, and the result is used as a gold standard. The results of PCR were divided into two groups: samples detected as MSI-H (microsatellite highly unstable) are grouped and scored as MSI-H (positive); samples detected as MSI-L (microsatellite low instability) and MSS (microsatellite stability) were grouped and scored as MSS (negative). The purpose of such grouping is to correlate the PCR detection results with the results of NGS-based MSI detection methods (NGS detection methods have only two types of detection results, MSI-H and MSS, as reported in the prior art).

Example 2

The 4 gene combinations of example 1 were subjected to exon probe capture library sequencing, and 150bp Pair-End mode sequencing (Read 1:151, read2:151; index1:8, index 2:8) was performed according to the instrument standard protocol using a gene sequencer (NextSeq CN 500), finally fastq format second generation sequencing data was obtained as raw data (raw data).

And performing quality control on the obtained second-generation sequencing data by using quality control software fastp, filtering sequencing joints, low-quality bases, sequencing error fragments and the like, and filtering to obtain high-quality data (clean data).

Comparing the clean data with a reference genome hg19 by using comparison software bwa to obtain corresponding specific position information on each DNA fragment genome; the data were then deduplicated and base corrected using genecore software.

Based on the known repeat (repeat) locations of the reference genome detection region, the number of reads (or read lengths) supported for each repeat type is obtained. First, a genomic MSI candidate locus bed file is generated using the published MANTIS self-contained tool RepeatFinder, loci are read from the bed file, and the differences generated by the 0-/1-base indexes are aligned using the supplied reference genomic file. Wherein, the microsatellite repeated sequence with the length of 10-100, single base and more than 5 times of repeated times is selected. Secondly, quality control is carried out on reads matched with all repeat types on each loci, the average base quality of each read is not less than 20, the average base quantity of each read falling in a loci interval is not less than 25, the length of the reads (only counting non-shearing parts and not counting soft-clips or hard-clips) is greater than 35bp, the support number of the minimum repeat reads is greater than 1, and other involved filtering steps are software default parameters. Finally, counting the number of reads qualified for quality control for each repeat type at all sites of tumor patient and control, respectively, generating a simple repeat sequencing read length number (repeat count) result for subsequent analysis (since MANTIS is an analysis of paired samples, tumor and control samples need to be input simultaneously, and the method of the invention does not need to use control samples, so in order not to affect software operation, tumor samples are repeatedly counted once as control samples, and finally the statistical result of tumor samples is used).

Example 3

From the repeat result obtained in example 2, the allele type representing the position is obtained at each locus, the difference between the primary and secondary abundance values is calculated as the predicted value of the microsatellite instability level for each loci, and the average of the (1-unstable level predicted values) of all loci in the sample is used as the predicted value of microsatellite instability (MSIscore) of the sample, and the formula of the non-parametric algorithm is as follows:

wherein X is _i1 Representing the main peak value, X _i2 The secondary peak is represented, i is the loci order, n is the loci number.

Specifically, counting the repeat count of each detection position with the repeat length between 0 and 50bp, normalizing the repeat count to be within a range of 0 to 1, wherein the normalization mode is that the number of reads support of each repeat type of the position is divided by the sum of the numbers of reads support of all repeat types of the position, and sorting according to normalized numerical values from large to small, wherein the maximum value is a main peak value, and the secondary value is a secondary peak value.

Example 4

153 samples of tumor tissue among the samples collected in example 1 were prepared according to 8:2 and the MSI state information (MSI-H/MSS) measured by the PCR method is randomly and hierarchically sampled and divided into 122 cases of training sets and 31 cases of test sets, and then initial training set samples are further processed according to the training sets: the validation set was 8:2 and MSI status information (MSI-H/MSS) are randomly layered sampled 20 times to obtain samples of 20 sampling results (fig. 2), and 97 samples of a training set and 25 samples of a verification set are obtained in each sampling. The above samples did not contain any control samples.

The following is performed on each sample taken, with 550 candidate sites covered by the large panel of example 1 for each sample, the MSIscore is calculated using the parameterisation algorithm of example 3, and the set of samples taken each time is sorted into a matrix of training and validation sets containing the MSIscore for each sample at each detection site. For the matrix sorted by each sampling set, the weight of each candidate site is calculated by using a random forest algorithm, all candidate sites are ordered according to the times that the weight is not 0 in 20 samples, and 10 gradients are divided according to the times that the weight occurs to be greater than or equal to 0-9 (as shown in table 1).

TABLE 1 weight gradient partitioning results

Number of times of occurrence of weight	>＝0	>＝1	>＝2	>＝3	>＝4	>＝5	>＝6	>＝7	>＝8	>＝9
											Detecting point count	550	381	265	200	156	129	108	98	83	71

And carrying out prediction classification on the training set according to the detection site construction model selected by each gradient, dividing a threshold value by the training set, carrying out prediction classification on the verification set, and evaluating the performance of the evaluation model by using the prediction classification effect of the verification set. According to the result of each gradient test, drawing ROC curve to evaluate the performance of each model, considering the classification performance and stability, selecting the model of 200 detection sites contained in the gradient with the number of times of weight occurrence > =3, and defining the threshold according to the test result of the training set on the selected candidate site. The detection site selection results are shown in the following table:

TABLE 2 detection site screening results

Table 2 (subsequent)

Table 2 (subsequent)

The performance of the ROC curve evaluation model is drawn according to 97 training set sample prediction data, and the result is shown in figure 3, and the sensitivity, the specificity and the AUC of the model constructed by the method are all 1, so that the accurate prediction of MSI is realized.

Further, 158 samples were also used to evaluate model performance, with MSIscore > = threshold determined to be MSI-H and vice versa for MSS. The results are shown in table 3 and fig. 4:

table 3 model predictive performance:

MSI-H samples in 158 samples are correctly detected by the model constructed by the invention, and the MSI-H samples are completely consistent with the PCR detection result, and the consistency of the MSI-H samples and the PCR detection result is up to 100%.

To verify the highest sensitivity range of the method of the invention, we collected MSI-positive (MSI-H) reference KM12 and MSI-negative (MSS) reference NA12878, mixed MSI-H and MSS cell lines in proportion, diluted MSI-H cell line concentrations (20%, 15%, 10%, 5% respectively) to determine the lowest MSI detection limit. MSI was detected on-machine sequencing of the above cell samples using the large panel of example 1 as the sequencing panel. As shown in Table 4, MSI-H detection rate can reach 100% under the conditions of LOD 20%, LOD 15% and LOD 10%; in the case of LOD 5%, MSI-H detection rate can still reach 83%.

Table 4 detection rate at different LODs:

dilution concentration (%)	20	15	10	5
					Detection rate (%)	100	100	100	83

Example 5

In order to evaluate the generalization performance of the model, three genes panel of medium, small and micro size are introduced, which respectively comprise 457, 86 and 31 genes related to tumors such as rectal cancer, gastric cancer, ovarian cancer, lung sarcoma cancer and the like, and 146, 109 and 41 MSI sites are respectively covered. 173 new samples were sequenced on-machine using the three panels described above, respectively.

Of the 200 MSI sites selected in example 4, 61, 33, and 7 sites were found in the middle, small, and mini panels, respectively, and the results are shown in tables 5-6 and SEQ ID Nos. 1-7 in Table 7.

Table 5 detection site screening results (medium panel):

table 5 (subsequent)

Table 6 detection site screening results (mini):

watch 6 (subsequent)

Table 7 detection site screening results (mini):

sequencing data from 173 new samples were analyzed using the above sites and the results are shown in table 8. For medium-sized, small-sized and miniature panels, all 11 MSI-H samples in 173 samples can be accurately detected by directly using 61, 33 or 7 screened sites, and the sensitivity and the specificity reach 1.

Table 8 model predictive performance in medium, small, mini panel:

Example 6

To test the performance improvement produced by the method of the present invention over prior art methods, it was compared to prior art MSI detection models. MANTIS, MSISensor and mSINGS are the three most commonly used MSI detection methods in the prior art, wherein the sensitivity and specificity of MANTIS are considered to be optimal (Performance evaluation for rapid detection of pan-cancer microsatellite instability with MANTIS.Oncostarget, 2017, vol.8, (No. 5), pp: 7452-7463), thus MANTIS was selected as a comparative example, MSI was predicted based on the sequencing data of the aforementioned 173 samples in 3 panels, and the MSI detection sites used by MANTIS software were all MSI detection sites covered by the 3 panel gene regions. The results are shown in Table 9:

table 9 model predictive performance in medium, small, mini panel:

therefore, compared with the prior art, the MSI detection model is constructed by adopting the method provided by the invention, and higher sensitivity and specificity can be obtained under the condition of using fewer detection sites.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A screening method for a microsatellite instability detection site or a combination of microsatellite instability detection sites, the screening method comprising:

the step of determining the microsatellite instability predicted value of the candidate microsatellite loci in the sample comprises the following steps: carrying out exon probe capturing, library building and sequencing on the gene combination covering the candidate sites, and carrying out 150bp Pair-End mode sequencing according to instrument standard operation rules by using a gene sequencer to finally obtain fastq format second generation sequencing data serving as original data raw data; performing quality control on the obtained second generation sequencing data by using quality control software fastp, filtering a sequencing joint, low-quality bases and a sequencing error fragment, and filtering to obtain high-quality data clean data; comparing the clean data with a reference genome hg19 by using comparison software bwa to obtain corresponding specific position information on each DNA fragment genome; then using genecore software to perform de-duplication and base correction treatment on the comparison data;

Obtaining the read support number of each repeat type according to the repeat position of the known reference genome detection region; firstly, generating a genome MSI candidate locus bed file by using published software MANTIS self-contained tool repeat, reading locus from the bed file, and using the provided reference genome file to compare and correct the difference generated by 0-/1-base indexes; wherein, selecting microsatellite repeated sequences with the length of 10-100, single base and more than 5 times of repeated times; secondly, quality control is carried out on reads matched with all repeat types on each loci, the average base quality of each read is not less than 20, the average base quantity of each read falling in a loci interval is not less than 25, the length of the reads is more than 35bp, the minimum repeat number is more than 1, and other involved filtering steps are software default parameters; finally, counting the number of reads qualified in quality control corresponding to each repeat type at all sites of the tumor patient, and generating a simple repeat sequence sequencing read long number repeat count result for subsequent analysis;

the predicted value of the microsatellite instability of the candidate microsatellite locus is a quantized value of the abundance difference between the major allele type and the minor allele type in the candidate microsatellite locus, and the calculation formula is as follows: microsatellite instability prediction value of site=1- |x _i1 -X _i2 I (I); wherein i is site number or site order, X _i1 Representing normalized major allele abundance, X _i2 Representing normalized minor genotype abundance;

the screening method further comprises: calculating the microsatellite instability predicted value of the sample according to the following formula by using the microsatellite instability predicted values of all candidate microsatellite loci in the sample:

wherein f (X) represents a microsatellite instability predicted value of the sample, X _i1 Representing normalized major allele abundance, X _i2 Representing normalized secondary genotype abundance, i being the site number or the site order, n being the number of sites;

the screening method further comprises the steps of: and carrying out prediction classification on the training set according to the selected detection site construction model, dividing a threshold value by the training set, and carrying out prediction classification on the verification set.

2. The screening method of claim 1, wherein the abundance of an allele is calculated by simple repeat sequencing read a long number of repeat counts.

3. The screening method of claim 2, wherein the abundance of the allele is calculated by genotyping reads support numbers.

4. The screening method of claim 2, wherein the abundance of the allele is within the interval 0-1 by normalization; the method of allele abundance normalization is to divide the number of reads for the allele in the candidate microsatellite locus by the sum of the number of reads for all alleles for that locus.

5. The screening method according to claim 2, wherein the calculation is performed by selecting the allele type having the repeat length of between 0 and 50 bp.

6. The screening method of any one of claims 1-5, wherein the candidate microsatellite loci are microsatellite loci in a whole genome, microsatellite loci in an exome, microsatellite loci contained in one or more regions in a genome, or microsatellite loci of interest.

7. The screening method of any one of claims 1-5, wherein the candidate microsatellite loci are selected from the group consisting of single nucleotide repeats, dinucleotide repeats and/or complex repeats.

8. The screening method of any one of claims 1-5, further comprising a method of denoising the detection site by machine learning.

9. The screening method of claim 8, wherein the noise reduction method is selected from random forest, principal component analysis, linear discriminant analysis, ridge regression, lasso regression, neural network, or algorithms derived from the above methods.

10. The screening method of claim 9, wherein the noise reduction method is selected from a random forest or an algorithm derived from a random forest.

11. The screening method of claim 10, wherein the algorithm derived from random forests is selected from an isolated forest algorithm, a TRTE algorithm, or a limit tree algorithm.

12. The screening method according to claim 8, wherein the microsatellite instability prediction value is used as input data for machine learning.

13. The screening method according to any one of claims 1 to 5, wherein the screening method uses only samples derived from tumor patients or tumor tissues; or the screening method does not use samples derived from healthy individuals or normal tissue.

14. The construction method of the microsatellite instability detection model is characterized by comprising the following steps:

the predicted value of the microsatellite instability is a quantized value of the abundance difference between the major allele type and the minor allele type in candidate microsatellite loci; the step of determining a predicted value of microsatellite instability of a candidate microsatellite locus in a sample and the quantized value in the screening method according to claim 1;

The microsatellite instability detection model comprises: calculating the microsatellite instability predicted value of a sample according to the microsatellite instability predicted values of all candidate microsatellite loci in the sample, wherein the microsatellite instability predicted value of the sample is calculated according to the following formula:

the construction method further comprises the following steps: and carrying out prediction classification on the training set according to the selected detection site construction model, dividing a threshold value by the training set, and carrying out prediction classification on the verification set.

15. The method of construction of claim 14, wherein the abundance of the allele is calculated by simple repeat sequencing read a long number of repeat counts.

16. The method of claim 15, wherein the abundance of the allele is calculated by genotyping reads support numbers.

17. The method of construction according to any one of claims 14 to 16, wherein the abundance of the allele is within the interval 0-1 by normalization; the method of allele abundance normalization is to divide the number of reads for the allele in the candidate microsatellite locus by the sum of the number of reads for all alleles for that locus.

18. Construction method according to any one of claims 14 to 16, characterized in that the calculation is performed by selecting the allelic type with repeat length between 0 and 50 bp.

19. The method of construction according to any one of claims 14 to 16, wherein the candidate microsatellite loci are microsatellite loci in the whole genome, microsatellite loci in an exome, microsatellite loci contained in one or more regions in the genome or microsatellite loci of interest.

20. The method of any one of claims 14 to 16, wherein the candidate microsatellite loci are selected from the group consisting of single nucleotide repeats, dinucleotide repeats and/or complex repeats.

21. The method of construction according to any one of claims 14 to 16, further comprising a method of denoising the detection sites used for the model by machine learning.

22. The method of construction according to claim 21, wherein the noise reduction method is selected from random forest, principal component analysis, linear discriminant analysis, ridge regression, lasso regression, neural network or algorithms derived from the above methods.

23. Construction method according to claim 22, characterized in that the noise reduction method is selected from random forests or algorithms derived from random forests.

24. The method of claim 23, wherein the random forest derived algorithm is selected from an isolated forest algorithm, a TRTE algorithm, or a limit tree algorithm.

25. The method of claim 21, wherein the microsatellite instability prediction value is used as input data for machine learning.

26. The method of construction according to any one of claims 14 to 16, wherein the method of construction uses only samples derived from tumor patients or tumor tissue; or the construction method does not use samples derived from healthy individuals or normal tissues.

27. A method for constructing a model for detecting a microsatellite instability, characterized in that the method uses the microsatellite instability detection site or combination of detection sites obtained by screening according to any one of claims 1 to 13.

28. The building method of claim 27, wherein the building method is a machine learning method.

29. A method for detecting or predicting microsatellite instability, the method comprising:

i) A step comprising screening for a microsatellite instability detection site or combination of detection sites according to the screening method of any one of claims 1-13; and/or

ii) detecting the microsatellite instability detection sites or combinations of detection sites obtained by screening using the screening method according to any one of claims 1 to 13; and/or

iii) Detecting using the microsatellite instability detection model constructed by the construction method according to any one of claims 14 to 28;

the detection or prediction method is not directed to the diagnosis or treatment of a disease.

30. Use of a reagent for specifically detecting a microsatellite instability detection site or a combination of detection sites in the preparation of a microsatellite instability detection kit and/or a tumor accompanying diagnostic kit, characterized in that the microsatellite instability detection site or combination of detection sites is a microsatellite instability detection site or combination of detection sites obtained according to the screening method of any one of claims 1 to 13 or a microsatellite instability detection site or combination of detection sites in a detection model constructed according to the construction method of any one of claims 14 to 28.

31. A device for microsatellite instability detection and/or tumor-associated diagnosis, the device comprising:

an acquisition module for acquiring measurement data of a microsatellite instability detection site or a combination of detection sites of a subject, the microsatellite instability detection site or the combination of detection sites being a microsatellite instability detection site or a combination of detection sites obtained according to the screening method of any one of claims 1 to 13 or a microsatellite instability detection site or a combination of detection sites in a detection model constructed according to the construction method of any one of claims 14 to 28, the measurement data being a microsatellite instability prediction value of a detection site, the microsatellite instability prediction value being a quantification value of an abundance difference between a major allele type and a minor allele type in a candidate microsatellite site;

A data analysis module for inputting the measurement data of the microsatellite instability detection sites or the combination of detection sites into a detection model constructed according to the construction method of any one of claims 14 to 28 to obtain a detection result.

32. The apparatus of claim 31, wherein the apparatus further comprises: a sequencing module for sequencing the subject.

33. The apparatus according to claim 31 or 32, characterized in that the apparatus further comprises: a diagnostic module for generating tumor-associated diagnostic results and/or treatment recommendations.

34. A computer-readable storage medium, the computer-readable storage medium comprising a stored computer program, the computer program comprising:

i) A program for performing a screening method of a microsatellite instability detection site or a combination of microsatellite instability detection sites according to any one of claims 1-13; and/or

ii) a program for executing the construction method of the microsatellite instability detection model according to any one of claims 14 to 28; and/or

iii) Program for performing a method for detecting or predicting microsatellite instability according to claim 29.

35. An apparatus for detecting microsatellite instability comprising a processor, a memory and a computer program stored in said memory, said computer program comprising:

36. A method for quantifying microsatellite instability, characterized in that a microsatellite instability predicted value is used for quantifying the abundance difference between a major allele type and a minor allele type in a microsatellite locus; the quantization value is the quantization value in the screening method of claim 1.

37. The method of quantification of claim 36, wherein the abundance of the allele is calculated by simple repeat sequencing read a long number of repeat counts.

38. The method of quantification of claim 37, wherein the abundance of the allele is calculated by genotyping reads support numbers.

39. The method of quantification according to any of the claims 36 to 38, wherein the abundance of the allele is within the interval 0-1 by normalization; the abundance of an allele is normalized by dividing the number of reads for the allele in the microsatellite locus by the sum of the number of reads for all alleles at that locus.

40. The method of quantification according to any of the claims 36 to 38, wherein the calculation is performed by selecting the allelic type with repeat length between 0 and 50 bp.

41. The method of any one of claims 36 to 38, further comprising the step of calculating a microsatellite instability prediction value for the sample from the microsatellite instability prediction values for all target microsatellite loci in the sample.

42. The method of claim 41, wherein the sample's microsatellite instability prediction is calculated according to the following equation:

43. The method of quantification of any of claims 36 to 38, wherein the microsatellite loci are microsatellite loci in a whole genome, microsatellite loci in an exome, microsatellite loci contained in one or more regions in a genome or microsatellite loci of interest.

44. The method of quantification according to any of the claims 36 to 38, wherein the microsatellite loci are selected from the group consisting of single nucleotide repeats, dinucleotide repeats and/or complex repeats.