CN117198406B - Feature screening method, system, electronic equipment and medium

Info

Publication number
CN117198406B
CN117198406B
Authority
CN
China
Prior art keywords
feature
candidate
node
gene
candidate feature
Prior art date
Legal status
Active
Application number
CN202311222677.5A
Other languages
Chinese (zh)
Other versions
CN117198406A (en)
Inventor
王聃
许春萍
腾飞
Current Assignee
Yicon Beijing Medical Science And Technology Co ltd
Original Assignee
Yicon Beijing Medical Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Yicon Beijing Medical Science And Technology Co ltd
Priority to CN202311222677.5A
Publication of CN117198406A
Application granted
Publication of CN117198406B
Status: Active
Anticipated expiration


Abstract

The invention discloses a feature screening method, system, electronic device and medium based on a meta-heuristic algorithm and a graph neural network interpreter. In particular, the unbiased and stable intra-feature-domain importance of each candidate feature is creatively incorporated into a heterogeneous graph as key node data, and the graph neural network interpreter is used to integrate the heterogeneous graph and its topology to obtain the unbiased and stable inter-feature-domain importance of each candidate feature, so that an optimal feature combination that is small in number, non-redundant, interpretable and highly predictive can be screened from the dense, redundant, ultra-high-dimensional and large-scale candidate features of a plurality of feature domains, effectively overcoming the defects of the prior art and achieving positive technical effects.

Description

Feature screening method, system, electronic equipment and medium
Technical Field
The invention belongs to the field of biological medicine, and particularly relates to a feature screening method, a feature screening system, electronic equipment and a medium.
Background
High-throughput omics sequencing has become a powerful means for exploring and mining biomarkers and is widely applied to clinical diagnosis, treatment, prognosis, medication and other aspects. Multi-omics integration methods can integrate and mine multi-dimensional, multi-layer omics data, enabling a more comprehensive and deeper understanding of the molecular mechanisms behind diseases. However, how to comprehensively, accurately and at scale integrate multi-omics data of different granularities, high heterogeneity and high noise remains to be solved and improved.
A number of candidate biomarkers can be obtained from omics data by statistical or bioinformatic analysis. Because these candidate biomarkers are often numerous and noisy, carry a certain rate of false positives, and exhibit broad and complex interaction effects, an optimal biomarker combination needs to be screened from them in order to further reduce detection cost and improve prediction efficiency.
If candidate biomarkers are regarded as candidate features, screening an optimal biomarker combination can also be regarded as the machine learning problem of screening an optimal feature combination from dense, redundant, ultra-high-dimensional and large-scale candidate features. Ultra-high-dimensional, large-scale candidate features not only cause gradient explosion and the curse of dimensionality, but also increase prediction cost and reduce prediction efficiency. However, how to account, in an unbiased way, for the importance and interpretability of each candidate biomarker (candidate feature) within a single omics set while comprehensively considering the interaction effects between different candidate biomarkers (candidate features), so as to screen out optimal biomarker combinations (optimal feature combinations) that are small in number, non-redundant, interpretable and highly predictive, remains to be solved and improved.
The invention creatively provides a feature screening method based on a meta-heuristic algorithm and a graph neural network interpreter, which can screen out an optimal feature combination that is small in number, non-redundant, interpretable and highly predictive from the candidate features of a plurality of feature domains, effectively overcoming the defects of the prior art and achieving positive technical effects.
Disclosure of Invention
The present invention aims to solve the above-mentioned technical problems in the related art by providing a feature screening method, a feature screening system, an electronic device and a medium.
In order to achieve the above purpose, the technical scheme adopted by the embodiment of the invention is as follows:
In a first aspect, the present invention provides a feature screening method, including:
constructing a training set, wherein the training set comprises a plurality of samples, each sample comprises a category label, candidate features and candidate feature data corresponding to the candidate features, and the candidate features together with their corresponding candidate feature data form feature domains;
constructing a corresponding heterogeneous graph for each sample, wherein the category label of the heterogeneous graph is the category label contained in the sample; the heterogeneous graph comprises nodes and node data of a plurality of node types and edges and edge data of a plurality of edge types; each node type corresponds to a feature domain, each node corresponds to a candidate feature, and the node data of each node comprise the candidate feature data corresponding to that candidate feature and the intra-feature-domain importance of that candidate feature; each edge type represents a relationship between two node types, each edge represents a relationship between two nodes, and the edge data of each edge comprise a weight between the two nodes;
constructing a graph neural network model for category label prediction, inputting the heterogeneous graph corresponding to each sample into the graph neural network model, and training the graph neural network model with a loss function to obtain a trained graph neural network model;
inputting the heterogeneous graph corresponding to each sample and the trained graph neural network model into a graph neural network interpreter to obtain the inter-feature-domain importance of each candidate feature;
and constructing a machine learning model for category label prediction, training the machine learning model by using the inter-feature-domain importance of each candidate feature and the category label, candidate features and candidate feature data of each sample, and screening to obtain an optimal feature combination and an optimal machine learning model that uses the optimal feature combination.
Further, the intra-feature-domain importance of a candidate feature is calculated as follows:
for a feature domain, obtaining each sample together with its category label and the candidate features and candidate feature data of the feature domain;
constructing a classifier model;
based on a meta-heuristic algorithm, training the classifier model over multiple iterations using each sample and its category label and the candidate features and candidate feature data of the feature domain, so as to obtain the importance of each candidate feature of the feature domain at each iteration;
and summing the importance of each candidate feature of the feature domain over all iterations and sorting in descending order to obtain the importance ranking position of each candidate feature of the feature domain, i.e., the intra-feature-domain importance of the candidate feature.
Further, the intra-feature-domain importance of the candidate features may be further normalized and updated.
Further, the normalization method comprises a Min-Max method.
Further, the graph neural network model for category label prediction comprises: U cascaded GCN layers of different depths, V cascaded GAT layers of different depths, a concatenation layer, a global pooling layer, a plurality of fully connected layers and a Softmax layer; the 1st GCN layer receives the heterogeneous graph corresponding to each sample as input and computes the output of the 1st GCN layer, and the i-th GCN layer receives the output of the (i-1)-th GCN layer and computes the output of the i-th GCN layer; the 1st GAT layer receives the heterogeneous graph corresponding to each sample as input and computes the output of the 1st GAT layer, and the j-th GAT layer receives the output of the (j-1)-th GAT layer and computes the output of the j-th GAT layer; i takes values i = 2 to U, j takes values j = 2 to V, and U and V are integers not less than 2; the concatenation layer receives the outputs of the U GCN layers and of the V GAT layers and concatenates them; the global pooling layer receives the output of the concatenation layer, performs a global pooling operation and produces an output; the plurality of fully connected layers receive the output of the global pooling layer, perform non-linear fusion and produce an output; the Softmax layer receives the output of the plurality of fully connected layers and, after computation, produces an output used to compute the loss function.
Further, the output of each GCN layer and each GAT layer may be further followed by an activation operation.
Further, the activation operation includes a ReLU activation operation.
Further, the global pooling operation is performed using the Global Add Pooling method.
Further, the output of the global pooling layer may be further passed through a Dropout layer with a set probability to perform a drop operation, followed by an activation operation.
Further, the set probability is 0.2.
Further, the activation operation includes a ReLU activation operation.
Further, the loss function includes a cross entropy loss function.
Further, the Softmax layer may further apply a logarithmic transformation.
Further, inputting the heterogeneous graph corresponding to each sample and the trained graph neural network model into a graph neural network interpreter to obtain the inter-feature-domain importance of each candidate feature comprises the following steps:
the graph neural network interpreter is GNNExplainer; the heterogeneous graph corresponding to each sample and the trained graph neural network model are input into GNNExplainer, the importance of each node in the heterogeneous graph is calculated and sorted in descending order, and the importance ranking position of the candidate feature corresponding to each node is thereby obtained, i.e., the inter-feature-domain importance of each candidate feature.
Further, the inter-feature-domain importance of the candidate features may be further normalized and updated.
Further, constructing a machine learning model for category label prediction, training the machine learning model by using the inter-feature-domain importance of each candidate feature and the category label, candidate features and candidate feature data of each sample, and screening to obtain an optimal feature combination and an optimal machine learning model that uses the optimal feature combination comprises the following steps:
sorting the inter-feature-domain importance of each candidate feature in descending order and taking all non-empty subsets of the top K candidate features, i.e., 2^K - 1 candidate feature combinations, wherein K is an integer;
for each candidate feature combination, training the machine learning model using the category label of each sample and the candidate features and candidate feature data of the candidate feature combination, evaluating the trained machine learning model and calculating a performance index;
selecting the trained machine learning model with the best performance index as the optimal machine learning model, the candidate feature combination used by the optimal machine learning model being the optimal feature combination;
the performance index may be AUC - RMSE + SPE, wherein AUC denotes the area under the ROC curve, RMSE denotes the root mean square error and SPE denotes the specificity; the performance index is optimal when its value is largest.
Further, the machine learning model includes an ensemble learning model.
Further, the ensemble learning model includes CatBoost.
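To make the combination search concrete, the following is a minimal Python sketch of this step, assuming pandas/scikit-learn-style inputs and the CatBoost classifier named above; the train/test split, the value of K, the CatBoost hyperparameters and the helper names are illustrative rather than taken from the patent.

```python
from itertools import combinations

import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split


def performance_index(y_true, y_prob, y_pred):
    """AUC - RMSE + SPE, as defined in the text (larger is better)."""
    auc = roc_auc_score(y_true, y_prob)
    rmse = np.sqrt(mean_squared_error(y_true, y_prob))
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    spe = tn / (tn + fp)
    return auc - rmse + spe


def search_best_combination(X, y, ranked_features, k=6):
    """Evaluate all 2**k - 1 non-empty subsets of the top-k ranked features.

    X: pandas DataFrame of candidate feature data; y: binary 0/1 labels.
    ranked_features: column names sorted by inter-feature-domain importance.
    """
    top_k = ranked_features[:k]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    best_score, best_combo, best_model = -np.inf, None, None
    for r in range(1, k + 1):
        for combo in combinations(top_k, r):
            cols = list(combo)
            model = CatBoostClassifier(iterations=300, verbose=False, random_seed=0)
            model.fit(X_tr[cols], y_tr)
            prob = model.predict_proba(X_te[cols])[:, 1]
            pred = model.predict(X_te[cols]).astype(int)
            score = performance_index(y_te, prob, pred)
            if score > best_score:
                best_score, best_combo, best_model = score, cols, model
    return best_combo, best_model, best_score
```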
Further, the category label is an MSI category label, and the MSI category label comprises MSI-H, MSI-L and MSS;
The feature domains comprise a methylation site feature domain and a gene feature domain, the candidate features comprise methylation site candidate features and gene candidate features, the methylation site candidate features come from the methylome and belong to the methylation site feature domain, the gene candidate features come from the transcriptome and belong to the gene feature domain, the candidate feature data of the methylation site candidate features are methylation degree values, and the candidate feature data of the gene candidate features are gene expression values;
The heterograms of node types include a methylation site node type and a gene node type, the methylation site node type representing the methylation site feature domain, the gene node type representing the gene feature domain, the methylation site node type comprising a methylation site candidate feature node, the gene node type comprising the gene candidate feature node, the methylation site candidate feature node representing the methylation site candidate feature, the gene candidate feature node representing the gene candidate feature, the node data of the methylation site candidate feature node comprising candidate feature data of the methylation site candidate feature and intra-feature importance of the methylation site candidate feature, the node data of the gene candidate feature node comprising candidate feature data of the gene candidate feature and intra-feature importance of the gene candidate feature;
The edge types of the heterograms include methylation site node type-methylation site node type edge types, gene node type-gene node type edge types, and methylation site node type-gene node type edge types; the methylation site node type-methylation site node type edge type comprises methylation site candidate feature nodes-methylation site candidate feature node edges, and represents the relationship between two methylation site types; the gene node type-gene node type edge type comprises a gene candidate characteristic node-gene candidate characteristic node edge, and represents the relationship between two gene node types; the methylation site node type-gene node type edge type includes a methylation site candidate feature node-gene candidate feature node edge representing a relationship between the methylation site type and the gene node type.
Further, the optimal feature combination comprises 4 methylation sites, cg14598950, cg27331401, cg05428436 and cg15048832, and 2 genes, RPL22L1 and MSH4, and the optimal machine learning model is used for MSI category label prediction.
In a second aspect, the present invention further provides a feature screening system, including:
the training set construction module is used for constructing a training set, wherein the training set comprises a plurality of samples, each sample in the plurality of samples comprises a category label, a candidate feature and candidate feature data corresponding to the candidate feature, and the candidate feature data corresponding to the candidate feature form a feature domain;
The heterogram construction module is used for constructing a corresponding heterogram for each sample, and the class label of the heterogram is the class label contained in each sample; the heterogeneous graph comprises nodes and node data with a plurality of node types and edges and edge data with a plurality of edge types; each node type corresponds to a feature domain, each node corresponds to a candidate feature, and each node data comprises candidate feature data corresponding to the candidate feature and the importance of the candidate feature in the feature domain; each edge type represents a relationship between two node types, each edge represents a relationship between the two nodes, and each edge data comprises a weight between the two nodes;
The graph neural network model module is used for constructing a graph neural network model for predicting the category labels, inputting the heterogeneous graph corresponding to each sample into the graph neural network model, training the graph neural network model by using the loss function, and obtaining a trained graph neural network model;
the graph neural network interpreter module is used for inputting the heterogeneous graph corresponding to each sample and the trained graph neural network model into the graph neural network interpreter to obtain the inter-feature-domain importance of each candidate feature;
And the optimal feature combination screening module is used for constructing a machine learning model for predicting the category labels, training the machine learning model by utilizing the importance among feature domains of each candidate feature, the category label of each sample, the candidate feature and the candidate feature data, and screening to obtain an optimal feature combination and an optimal machine learning model by utilizing the optimal feature combination.
In a third aspect, the present invention also provides an apparatus comprising:
A memory: for storing program instructions;
A processor: for executing the program instructions which, when executed, implement the feature screening method according to any one of the above first aspects, or the optimal feature combination obtained by the feature screening method according to any one of the above first aspects, or the optimal machine learning model using the optimal feature combination obtained by the feature screening method according to any one of the above first aspects, or the feature screening system according to the above second aspect.
In a fourth aspect, the present invention further provides a computer-readable storage medium having program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the feature screening method according to any one of the first aspects, or the optimal feature combination obtained by the feature screening method according to any one of the first aspects, or the optimal machine learning model using the optimal feature combination obtained by the feature screening method according to any one of the first aspects, or the feature screening system according to the second aspect.
The beneficial effects of the invention include the following:
1) A heterogeneous graph is constructed, in which edges between candidate feature nodes represent the broad and complex interaction effects among the candidate features of a plurality of feature domains, while the unbiased and stable intra-feature-domain importance of each candidate feature is incorporated into the heterogeneous graph as key node data;
2) The graph neural network model is built from a plurality of cascaded GCN layers and GAT layers of different depths, and the outputs of the shallow and deep GCN layers and GAT layers are concatenated and fused, so that both generalized and specific information in the heterogeneous graph, as well as information of different granularities and dimensions, can be learned; this improves the representation capability of the graph neural network model for the heterogeneous graph while avoiding problems such as gradient explosion, gradient vanishing, over-smoothing and overfitting;
3) The graph neural network interpreter is used to integrate the heterogeneous graph and its topology to obtain the unbiased and stable inter-feature-domain importance of each candidate feature, from which an optimal feature combination that is small in number, non-redundant, interpretable and highly predictive is screened;
In summary, through the combination of the above technical schemes, in particular by creatively incorporating the unbiased and stable intra-feature-domain importance of each candidate feature into the heterogeneous graph as key node data and using the graph neural network interpreter to integrate the heterogeneous graph and its topology to obtain the unbiased and stable inter-feature-domain importance of each candidate feature, repeated research and experiments show that an optimal feature combination that is small in number, non-redundant, interpretable and highly predictive can be screened from dense, redundant, ultra-high-dimensional and large-scale candidate features, and that the optimal feature combination and the optimal machine learning model using it have excellent prediction performance, strong generalization capability and resistance to overfitting; the innovative scheme of the invention therefore achieves unexpected and beneficial positive effects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a feature screening method of the present invention;
FIG. 2 is a schematic diagram of the graph neural network model in an embodiment of the present invention;
FIG. 3 is a schematic view of the ROC curve obtained by cross-validation on the training set using the optimal ensemble learning model with the optimal feature combination screened in embodiment 1 of the present invention; wherein the X-axis False Positive Rate represents the false positive rate, the Y-axis True Positive Rate represents the true positive rate, and AUC is the area under the ROC curve;
FIG. 4 is a schematic diagram of the confusion matrix obtained by cross-validation on the training set using the optimal ensemble learning model with the optimal feature combination screened in embodiment 1 of the present invention; wherein the columns (Truth) represent the true MSI category labels of the samples and the rows (Prediction) represent the predicted MSI category labels of the samples;
FIG. 5 is a schematic view of the ROC curve obtained by independently testing the optimal ensemble learning model with the optimal feature combination screened in embodiment 1 of the present invention on a first independent test set; wherein the X-axis False Positive Rate represents the false positive rate, the Y-axis True Positive Rate represents the true positive rate, and AUC is the area under the ROC curve;
FIG. 6 is a schematic view of the ROC curve obtained by independently testing the optimal ensemble learning model with the optimal feature combination screened in embodiment 1 of the present invention on a second independent test set; wherein the X-axis False Positive Rate represents the false positive rate, the Y-axis True Positive Rate represents the true positive rate, and AUC is the area under the ROC curve;
FIG. 7 is a schematic view of the ROC curve obtained by independently testing the optimal ensemble learning model with the optimal feature combination screened in embodiment 1 of the present invention on a third independent test set; wherein the X-axis False Positive Rate represents the false positive rate, the Y-axis True Positive Rate represents the true positive rate, and AUC is the area under the ROC curve;
FIG. 8 is a graph comparing the performance of the optimal ensemble learning model using the screened optimal feature combination with other existing MSI class prediction models in embodiment 1 of the present invention; wherein the column name ACC represents accuracy, SEN represents sensitivity, SPE represents specificity and MCC represents the Matthews correlation coefficient; the MSI class prediction tools comprise the optimal ensemble learning model of embodiment 1, MSIsensor, MANTIS, MIRMMR, PreMSIm, mSINGS and MSIsensor-pro, and the numerical values are the corresponding performance index values of the corresponding MSI class prediction tools.
Detailed Description
The following provides definitions of some of the terms used in this specification. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
In some of the flows described in the specification and claims of the present invention and in the foregoing figures, a plurality of operations occurring in a particular order are included, but it should be understood that the operations may be performed out of order or performed in parallel, with the order of operations such as 101, 102, etc., being merely used to distinguish between the various operations, the order of the operations themselves not representing any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
High-throughput omics sequencing has become a powerful means for exploring and mining biomarkers and is widely applied to clinical diagnosis, treatment, prognosis, medication and other aspects. Omics mainly includes the genome, transcriptome, proteome, metabolome, and so on. Based on clinical cohort samples of a disease, single-omics sequencing can comprehensively and deeply characterize the disease at the corresponding omics level along dimensions such as occurrence, progression, prognosis, survival, staging and typing, and treatment response, and can further explore and mine biomarkers at that omics level that are closely related to the disease and have indicative or predictive significance. For example, the genome mainly studies the association of genes and genetic variations such as mutations and copy number variations with disease, the transcriptome mainly studies gene expression at the transcriptional level and disease-related abnormal expression patterns, the proteome mainly studies the relation between protein expression and post-translational modifications such as phosphorylation and acetylation and disease, and the metabolome mainly studies disease-related changes in metabolites and specific metabolic pathways.
Meanwhile, the occurrence and development of diseases are not driven by mechanisms at a single omics level alone, but are the complex result of mutual regulation and deep influence across multiple omics levels. Multi-omics integration methods can integrate and mine multi-dimensional, multi-layer omics data, enabling a more comprehensive and deeper understanding of the molecular mechanisms behind diseases. For example, through integrated analysis of the genome and transcriptome of cancer patients, not only can abnormal gene mutations that drive cancer progression or recurrence be mined at the genomic level, but it can further be understood which gene mutations affect gene expression and pathway activity at the transcriptional level and thereby affect cellular phenotype and microenvironment and drive tumors. However, how to comprehensively, accurately and at scale represent and integrate multi-omics data of different granularities, high heterogeneity and high noise remains to be solved.
A clinical biomarker may be a single biomolecule, or may comprise a combination of multiple biomolecules from a single omics set or from multiple omics sets (i.e., a biomarker panel) to further enhance its indicative and predictive efficacy. A number of candidate biomarkers can be obtained from omics data by statistical analysis or bioinformatic mining; because these candidate biomarkers are numerous and noisy, carry a certain rate of false positives and exhibit broad and complex interaction effects, an optimal biomarker combination needs to be further screened from them in order to reduce detection cost and improve prediction efficiency. Thus, screening an optimal biomarker combination from several candidate biomarkers can also be regarded as the machine learning problem of screening an optimal feature combination from several candidate features, where a candidate biomarker corresponds to a candidate feature, an optimal biomarker combination corresponds to an optimal feature combination, and the omics set from which a candidate biomarker originates can be regarded as the feature domain of the candidate feature. However, how to account, in an unbiased way, for both the importance and the interpretability (Explainability or Interpretability) of the candidate biomarkers (candidate features) of each single omics set (feature domain) and the interaction effects between different candidate biomarkers (candidate features), and thereby screen out optimal biomarker combinations (optimal feature combinations) that are small in number, non-redundant, interpretable and highly predictive, remains to be solved.
In order to solve the above-mentioned shortcomings in the prior art, as shown in fig. 1, the present invention provides a feature screening method, which includes:
constructing a training set, wherein the training set comprises a plurality of samples, each sample comprises a category label, candidate features and candidate feature data corresponding to the candidate features, and the candidate features together with their corresponding candidate feature data form feature domains;
constructing a corresponding heterogeneous graph for each sample, wherein the category label of the heterogeneous graph is the category label contained in the sample; the heterogeneous graph comprises nodes and node data of a plurality of node types and edges and edge data of a plurality of edge types; each node type corresponds to a feature domain, each node corresponds to a candidate feature, and the node data of each node comprise the candidate feature data corresponding to that candidate feature and the intra-feature-domain importance of that candidate feature; each edge type represents a relationship between two node types, each edge represents a relationship between two nodes, and the edge data of each edge comprise a weight between the two nodes;
constructing a graph neural network model for category label prediction, inputting the heterogeneous graph corresponding to each sample into the graph neural network model, and training the graph neural network model with a loss function to obtain a trained graph neural network model;
inputting the heterogeneous graph corresponding to each sample and the trained graph neural network model into a graph neural network interpreter to obtain the inter-feature-domain importance of each candidate feature;
and constructing a machine learning model for category label prediction, training the machine learning model by using the inter-feature-domain importance of each candidate feature and the category label, candidate features and candidate feature data of each sample, and screening to obtain an optimal feature combination and an optimal machine learning model that uses the optimal feature combination.
The invention is further illustrated below in connection with specific embodiments. It should be understood that the particular embodiments described herein are presented by way of example and not limitation. The principal features of the invention may be used in various embodiments without departing from the scope of the invention.
Embodiment 1 of the invention provides a feature screening method, namely a method for screening methylation site candidate features from the methylome and gene candidate features from the transcriptome to obtain an optimal feature combination for MSI (microsatellite instability) category label prediction. An optimal combination of 6 features is obtained, and cross-validation and independent tests show that the obtained optimal feature combination has excellent prediction performance. The 6 features of the optimal combination include 4 methylation sites (namely methylation sites cg14598950 and cg27331401, jointly associated with the genes EPM2AIP1 and MLH1, methylation site cg05428436 of the gene LNP1, and methylation site cg15048832; the methylation sites are from the Illumina Infinium HumanMethylation450 (450K) BeadChip) and 2 genes (namely RPL22L1 and MSH4). The area under the ROC curve (AUC) obtained by cross-validation is 0.99, and the AUCs obtained by independent testing on three independent test sets are 0.93, 0.94 and 0.91, respectively.
In embodiment 1, MSI (microsatellite instability) refers to an abnormal change in the length of a DNA microsatellite region caused by an increased rate of insertion or deletion mutations in that region during DNA replication when the DNA mismatch repair system (MMR) fails. MSI and MMR abnormalities are usually driven by mutations of MMR-related regulatory genes or by factors such as hypermethylation of the MLH1 gene promoter region. This mechanism is a form of genetic instability manifested at the genetic level and is common in many different cancers, such as colorectal cancer and endometrial cancer. MSI is generally divided into three classes of differing degree, namely MSI-H (high microsatellite instability), MSI-L (low microsatellite instability) and MSS (microsatellite stability); MSI-H is the class of greatest clinical interest, so the classes can be merged and further divided into two classes, MSI-H and MSS/MSI-L. Clinically, MSI has been regarded as an important biomarker for tumor immunotherapy and prognostic survival. Tumor patients with MSI-H typically have a higher tumor mutational burden (TMB) and greater MMR dysfunction than tumor patients with MSS/MSI-L, respond better to immunotherapy (e.g., PD-1/PD-L1 inhibitors), and have better prognosis and longer survival.
On the one hand, at the level of molecular mechanism, mutations of MMR-related regulatory genes can cause their abnormal expression and thereby lead to MMR dysfunction and the occurrence of MSI; transcriptome data representing gene expression therefore has, on its own, the potential to predict MSI and can serve as one source of candidate features for MSI prediction. On the other hand, abnormal methylation modification is also one of the driving factors of MSI, so methylome data describing the degree of methylation of the methylation sites corresponding to genes can serve as another source of candidate features for MSI prediction. Although methylation can repress gene expression, so that the degree of methylation of a gene tends to be negatively correlated with its overall expression, gene expression is simultaneously affected by a combination of multiple regulatory mechanisms, and the expression of each gene is simultaneously affected by multiple methylation sites. Therefore, embodiment 1 treats the methylation site candidate features from the methylome and the gene candidate features from the transcriptome as two feature domains, namely a methylation site feature domain and a gene feature domain, uses the nodes and edges of the heterogeneous graph to represent and integrate the candidate features of the two feature domains, converts the MSI class prediction problem of a sample into a graph-level classification problem of the corresponding heterogeneous graph, and inputs the heterogeneous graph into the graph neural network model, which is trained for MSI class prediction.
In order to screen, from the candidate features (i.e., methylation sites and genes) derived from the methylome and transcriptome, an optimal feature combination that is small in number, non-redundant, interpretable and highly predictive for MSI class prediction, embodiment 1 first uses a meta-heuristic algorithm separately on the methylation site candidate features and on the gene candidate features, with the MSI classes as the prediction target, to obtain the unbiased and stable intra-feature-domain importance of each methylation site candidate feature and each gene candidate feature, which is used as the second item of node data of each node in the heterogeneous graph; it then inputs the heterogeneous graph and the trained graph neural network model into the graph neural network interpreter GNNExplainer to obtain the inter-feature-domain importance of each candidate feature; finally, based on the inter-feature-domain importance of the methylation site candidate features and gene candidate features, the ensemble learning model CatBoost is used to screen out an optimal feature combination that is small in number, non-redundant, interpretable and highly predictive, and the optimal ensemble learning model using the optimal feature combination is used for MSI category label prediction.
The feature screening method in embodiment 1 specifically includes:
S101: constructing a training set, wherein the training set comprises a plurality of samples, each sample in the plurality of samples comprises a category label, candidate features and candidate feature data corresponding to the candidate features, the candidate features and the candidate feature data corresponding to the candidate features form a feature domain, the category label is an MSI category label, and the MSI category label comprises MSI-H, MSI-L and MSS.
Constructing a corresponding heterogeneous graph for each sample, wherein the category label of the heterogeneous graph is the category label contained in the sample; the heterogeneous graph comprises nodes and node data of a plurality of node types and edges and edge data of a plurality of edge types; each node type corresponds to a feature domain, each node corresponds to a candidate feature, and the node data of each node comprise the candidate feature data corresponding to that candidate feature and the intra-feature-domain importance of that candidate feature; each edge type represents a relationship between two node types, each edge represents a relationship between two nodes, and the edge data of each edge comprise a weight between the two nodes.
The feature domains comprise a methylation site feature domain and a gene feature domain, the candidate features comprise methylation site candidate features and gene candidate features, the methylation site candidate features come from the methylome and belong to the methylation site feature domain, the gene candidate features come from the transcriptome and belong to the gene feature domain, the candidate feature data of the methylation site candidate features are methylation degree values, and the candidate feature data of the gene candidate features are gene expression values.
The heterographic node types include a methylation site node type and a gene node type, the methylation site node type representing the methylation site feature domain, the gene node type representing the gene feature domain, the methylation site node type comprising a methylation site candidate feature node, the gene node type comprising the gene candidate feature node, the methylation site candidate feature node representing the methylation site candidate feature, the gene candidate feature node representing the gene candidate feature, the node data of the methylation site candidate feature node comprising candidate feature data of the methylation site candidate feature and intra-feature importance of the methylation site candidate feature, the node data of the gene candidate feature node comprising candidate feature data of the gene candidate feature and intra-feature importance of the gene candidate feature.
The edge types of the heterograms include methylation site node type-methylation site node type edge types, gene node type-gene node type edge types, and methylation site node type-gene node type edge types; the methylation site node type-methylation site node type edge type comprises methylation site candidate feature nodes-methylation site candidate feature node edges, and represents the relationship between two methylation site types; the gene node type-gene node type edge type comprises a gene candidate characteristic node-gene candidate characteristic node edge, and represents the relationship between two gene node types; the methylation site node type-gene node type edge type includes a methylation site candidate feature node-gene candidate feature node edge representing a relationship between the methylation site type and the gene node type.
S102: constructing a graph neural network model for predicting the category labels, inputting the heterogeneous graph corresponding to each sample into the graph neural network model, training the graph neural network model by using the loss function, and obtaining a trained graph neural network model;
S103: inputting the heterogeneous graph corresponding to each sample and the trained graph neural network model into a graph neural network interpreter to obtain the inter-feature-domain importance of each candidate feature;
S104: constructing a machine learning model for category label prediction, training the machine learning model by using the inter-feature-domain importance of each candidate feature and the category label, candidate features and candidate feature data of each sample, and screening to obtain an optimal feature combination and an optimal machine learning model that uses the optimal feature combination.
In S101, the heterogeneous graph (also called a heterogeneous network or heterogeneous information network) refers to a graph that contains nodes and node data of several node types and edges of several edge types, and can be used to describe the complex and rich interactions among the various node types, thus providing a powerful tool for multi-omics data integration, representation, analysis and mining.
In step S101, the samples in the training set for MSI category prediction and the category label of each sample are first constructed. In embodiment 1, 639 clinical tumor patient samples from The Cancer Genome Atlas (TCGA) database were selected as the training set, of which 453 samples are labeled MSS, 97 samples are labeled MSI-L and 89 samples are labeled MSI-H; the MSI category labels of all samples were determined by the PCR method. Considering that MSI-H is an important biomarker of response to tumor immunotherapy and good prognosis, the MSI category labels are further merged into two classes, namely MSI-H and MSS/MSI-L, so that the 639 samples of the training set can be divided into 550 samples labeled MSS/MSI-L and 89 samples labeled MSI-H. The training set samples cover several tumor types, with 146 colon adenocarcinoma (COAD), 62 esophageal carcinoma (ESCA), 72 rectal adenocarcinoma (READ), 342 gastric adenocarcinoma (STAD) and 17 endometrial carcinoma (UCEC) samples, which ensures the heterogeneity of the data in the training set and the generalization capability of the graph neural network model.
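As an illustration of the label construction and class merging described above, the following is a minimal sketch assuming the clinical annotations are available as a pandas DataFrame; the file name and column names are hypothetical.

```python
import pandas as pd

# Hypothetical clinical table: one row per TCGA sample with a PCR-derived
# MSI call in {"MSI-H", "MSI-L", "MSS"}.
clinical = pd.read_csv("tcga_clinical.csv", index_col="sample_id")

# Merge MSI-L and MSS into a single negative class, keeping MSI-H as positive.
clinical["label"] = (clinical["msi_status"] == "MSI-H").astype(int)

# The text reports 550 MSS/MSI-L samples versus 89 MSI-H samples for this cohort.
print(clinical["label"].value_counts())
```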
Next, the candidate features and candidate feature data of each sample in the training set are constructed. The candidate features of each sample comprise methylation site candidate features and gene candidate features; the methylation site candidate features come from the methylome and belong to the methylation site feature domain, the gene candidate features come from the transcriptome and belong to the gene feature domain, and the corresponding candidate feature data are the methylation site candidate feature data and gene candidate feature data, respectively. The methylation site candidate feature data are methylation degree values, and the gene candidate feature data are gene expression values.
In the TCGA database, several different types of omics data, including genome, transcriptome, methylome, metabolome and so on, are available for each sample in the training set. The transcriptome data are gene expression data covering all human genes at the transcriptional level obtained by RNA-seq sequencing, with log2(TPM+1) used as the quantitative index, where the TPM of the i-th gene is calculated as TPM_i = 10^6 x (C_i / S_i) / sum_j (C_j / S_j), C_i is the read count of the i-th gene and S_i is the effective length of the i-th gene; a pseudo-count of 1 is added to the TPM before the log2 transformation to obtain log2(TPM+1). The methylome data are methylation degree data of the methylation sites corresponding to all human genes, detected with the Illumina Infinium HumanMethylation450 (450K) BeadChip, with the beta value used as the quantitative index.
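A minimal sketch of the log2(TPM+1) quantification described above, vectorized over the genes of one sample; it assumes the raw read counts and effective gene lengths are already available as NumPy arrays.

```python
import numpy as np


def log2_tpm(counts, effective_lengths):
    """counts: (n_genes,) read counts; effective_lengths: (n_genes,) in bases.

    TPM_i = 1e6 * (C_i / S_i) / sum_j (C_j / S_j), followed by log2(TPM + 1).
    """
    rate = counts / effective_lengths
    tpm = rate / rate.sum() * 1e6
    return np.log2(tpm + 1.0)
```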
In order to reduce computational complexity and improve computational efficiency, embodiment 1 uses differential statistical analysis to screen, as methylation site candidate features and gene candidate features respectively, the methylation sites whose methylation degree values differ significantly and the genes whose gene expression values differ significantly between samples labeled MSI-H and samples labeled MSS/MSI-L. Specifically, to obtain the methylation sites whose methylation degree values differ significantly between the two groups of samples, embodiment 1 first extracts the methylation sites within a certain number of base pairs (bp) upstream and downstream of the transcription start site of each gene, and then, after a Kruskal-Wallis test and a chi-square (χ2) test, jointly screens 2775 significantly different methylation sites as methylation site candidate features according to the methylation site screening criterion, the methylation site candidate feature data being the methylation degree values (quantified as beta values). Here, the upstream range may be chosen as 2000 bp, or as another integer or interval such as 1000, 1500 or 2500 bp, and the downstream range may likewise be chosen as 2000 bp or as another integer or interval such as 1000, 1500 or 2500 bp. The chi-square test is performed by first discretizing the methylation degree value (i.e., the beta value) of each methylation site into three methylation levels (low, medium and high) using thresholds of 0.2, 0.6 and 1.0, respectively, then calculating the p-value from the frequency distribution of the methylation levels between the samples labeled MSI-H and the samples labeled MSS/MSI-L, and finally correcting the p-value with the FDR (False Discovery Rate) method to obtain the corrected p-value. The methylation site screening criterion is that, for each methylation site, the site is regarded as significantly different if the p-value obtained from the Kruskal-Wallis test or the corrected p-value obtained from the chi-square test is less than 0.05 and the absolute value of the difference in mean methylation degree (beta value) between the samples labeled MSI-H and the samples labeled MSS/MSI-L is greater than 0.1.
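The screening of differentially methylated sites described above can be sketched as follows, assuming SciPy and statsmodels; the boundary handling of the beta-value discretization and the helper name are interpretation choices, not taken verbatim from the patent.

```python
import numpy as np
from scipy.stats import chi2_contingency, kruskal
from statsmodels.stats.multitest import multipletests


def screen_methylation_sites(beta, labels, alpha=0.05, min_delta=0.1):
    """beta: (n_samples, n_sites) beta values; labels: 1 = MSI-H, 0 = MSS/MSI-L."""
    pos, neg = beta[labels == 1], beta[labels == 0]

    # Kruskal-Wallis p-value per site.
    kw_p = np.array([kruskal(pos[:, j], neg[:, j]).pvalue
                     for j in range(beta.shape[1])])

    # Chi-square p-value per site on methylation levels discretized at 0.2 / 0.6,
    # then FDR (Benjamini-Hochberg) correction.
    levels = np.digitize(beta, [0.2, 0.6])          # 0 = low, 1 = medium, 2 = high
    chi_p = []
    for j in range(beta.shape[1]):
        table = np.array([[np.sum(levels[labels == g, j] == lv)
                           for lv in (0, 1, 2)] for g in (1, 0)])
        table = table[:, table.sum(axis=0) > 0]     # drop empty level columns
        chi_p.append(chi2_contingency(table)[1] if table.shape[1] > 1 else 1.0)
    chi_p = multipletests(chi_p, method="fdr_bh")[1]

    # Criterion from the text: (KW p < 0.05 OR corrected chi-square p < 0.05)
    # AND |mean beta difference| > 0.1.
    delta = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    keep = ((kw_p < alpha) | (chi_p < alpha)) & (delta > min_delta)
    return np.where(keep)[0]
```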
In order to obtain the genes whose gene expression values differ significantly between the samples labeled MSI-H and the samples labeled MSS/MSI-L, embodiment 1 uses the differential expression analysis software DESeq2 to compute the expression differences between the two groups of samples and, according to the gene screening criterion (i.e., |log2(fold change)| greater than 1.5 and FDR-corrected p-value less than 0.05, where the fold change denotes the ratio between the gene expression value of each gene in the samples labeled MSI-H and its gene expression value in the samples labeled MSS/MSI-L), obtains 1632 significantly differentially expressed genes as gene candidate features, the gene candidate feature data being the gene expression values (quantified as log2(TPM+1)).
Finally, the heterogeneous graphs are constructed, with each sample in the training set represented by one heterogeneous graph. The node data of the methylation site candidate feature nodes contained in the heterogeneous graph include the intra-feature-domain importance of the corresponding methylation site candidate features, which is calculated as follows: a classifier model is constructed; based on a meta-heuristic algorithm, the classifier model is trained over repeated iterations using the MSI category labels of the samples and the methylation site candidate features and their candidate feature data (i.e., the methylation degree values, quantified as beta values) to obtain the importance of each methylation site candidate feature at each iteration; the importance of each methylation site candidate feature is summed over the iterations and sorted in descending order to obtain the importance ranking position of each methylation site candidate feature, i.e., the intra-feature-domain importance of the methylation site candidate feature. The node data of the gene candidate feature nodes contained in the heterogeneous graph include the intra-feature-domain importance of the corresponding gene candidate features, which is calculated as follows: a classifier model is constructed; based on a meta-heuristic algorithm, the classifier model is trained over multiple iterations using the MSI category labels of the samples and the gene candidate features and their candidate feature data (i.e., the gene expression values, quantified as log2(TPM+1)) to obtain the importance of each gene candidate feature at each iteration; the importance of each gene candidate feature is summed over the iterations and sorted in descending order to obtain the importance ranking position of each gene candidate feature, i.e., the intra-feature-domain importance of the gene candidate feature. The classifier model may be a decision tree model, the meta-heuristic algorithm may be a genetic algorithm, and the number of iterations may be 100. The intra-feature-domain importance may be further normalized and updated, for example with the Min-Max method.
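A simplified sketch of the intra-feature-domain importance computation for one feature domain: it keeps the accumulate-then-rank logic of the text but replaces the genetic-algorithm search with plain random feature subsampling, and converts the rank position into a Min-Max score in which larger values mean more important; these substitutions are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def intra_domain_importance(X, y, n_iter=100, subset_frac=0.5, seed=0):
    """X: (n_samples, n_features) data of ONE feature domain; y: MSI labels.

    The random feature subsampling below stands in for the genetic-algorithm
    search used in the patent; only the sum-then-rank logic follows the text.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    summed = np.zeros(n_features)

    for _ in range(n_iter):
        cols = rng.choice(n_features, size=max(1, int(subset_frac * n_features)),
                          replace=False)
        tree = DecisionTreeClassifier(random_state=0).fit(X[:, cols], y)
        summed[cols] += tree.feature_importances_

    # Descending rank position: the most important feature gets rank 1.
    order = np.argsort(-summed)
    rank = np.empty(n_features)
    rank[order] = np.arange(1, n_features + 1)

    # Min-Max conversion of the rank so that higher score = more important
    # (an interpretation choice for how the normalized rank is stored).
    return (rank.max() - rank) / (rank.max() - rank.min())
```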
The edges contained in the heterogeneous graph comprise methylation site candidate feature node-methylation site candidate feature node edges, gene candidate feature node-gene candidate feature node edges and methylation site candidate feature node-gene candidate feature node edges, and the edge data of these edges are edge weights. The edge weight of a methylation site candidate feature node-methylation site candidate feature node edge is quantified by the Spearman correlation coefficient calculated from the methylation site candidate feature data of the two methylation site candidate feature nodes, the edge weight of a gene candidate feature node-gene candidate feature node edge is quantified by the Spearman correlation coefficient calculated from the gene candidate feature data of the two gene candidate feature nodes, and the edge weight of a methylation site candidate feature node-gene candidate feature node edge is quantified by the Spearman correlation coefficient calculated from the methylation site candidate feature data of the methylation site candidate feature node and the gene candidate feature data of the gene candidate feature node. To reduce the computational burden, weakly correlated edges with a Spearman correlation coefficient of less than 0.6 are further removed in embodiment 1. Optionally, to satisfy the scale-free topology assumption of the heterogeneous graph, the edge weights in the heterogeneous graph are further power-transformed and normalized, implemented as follows: each edge weight is raised to the power of 2, the Min-Max method is then used for normalization, and the normalized values are used to update the edge weights.
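A sketch of the edge construction, flattening the typed node sets into one feature matrix for brevity; taking the absolute Spearman correlation and squaring the weight as the power transform are interpretation choices flagged in the comments.

```python
import numpy as np
from scipy.stats import spearmanr


def build_edges(node_values, threshold=0.6):
    """node_values: (n_samples, n_nodes) candidate feature data, with the
    methylation-site columns and gene columns stacked side by side (n_nodes > 2).

    Returns (edge_index, edge_weight) in COO form, keeping only edges whose
    |Spearman rho| >= threshold, with weights squared (assumed soft-threshold
    power of 2) and Min-Max normalized as described in the text.
    """
    rho, _ = spearmanr(node_values)                   # (n_nodes, n_nodes)
    src, dst = np.where(np.triu(np.abs(rho) >= threshold, k=1))
    w = np.abs(rho[src, dst]) ** 2                    # power transform (assumed)
    if w.size:
        w = (w - w.min()) / (w.max() - w.min() + 1e-12)  # Min-Max normalization
    # Undirected graph: store both directions.
    edge_index = np.concatenate([np.stack([src, dst]), np.stack([dst, src])], axis=1)
    edge_weight = np.concatenate([w, w])
    return edge_index, edge_weight
```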
In S102, as shown in FIG. 2, the graph neural network model comprises U cascaded GCN layers of different depths, V cascaded GAT layers of different depths, a concatenation (Concat or Cat) layer, a global pooling layer, a plurality of fully connected (FC) layers and a Softmax layer. The 1st GCN layer receives the heterogeneous graph as input and computes the output of the 1st GCN layer, and the i-th GCN layer receives the output of the (i-1)-th GCN layer and computes the output of the i-th GCN layer. Similarly, the 1st GAT layer receives the heterogeneous graph corresponding to each sample as input and computes the output of the 1st GAT layer, and the j-th GAT layer receives the output of the (j-1)-th GAT layer and computes the output of the j-th GAT layer. Here i takes values i = 2 to U, j takes values j = 2 to V, and U and V are integers not less than 2. The output of each GCN layer and each GAT layer may further be followed by a ReLU activation operation. The concatenation layer receives and concatenates the outputs of the GCN layers of different depths and the GAT layers of different depths. The global pooling layer receives the output of the concatenation layer, performs a global pooling operation and produces an output; the global pooling operation may be performed with the Global Add Pooling method. The output of the global pooling layer may further be passed through a Dropout layer with a set probability of 0.2, followed by a ReLU activation operation. The fully connected layers receive the output of the global pooling layer and output the result after non-linear fusion. The Softmax layer receives the output of the fully connected layers and, after computation, produces an output used to compute the loss function. To make the loss function decrease quickly and reduce oscillations during model training, the Softmax layer may optionally be replaced with a LogSoftmax layer that further applies a logarithmic transformation to the output. When training the graph neural network model, the optimizer is preferably the Adam optimizer, the loss function is preferably the cross-entropy loss function, the batch size (batch_size) is set to 4, the learning rate is set to 0.001 and the number of iterations (epochs) is set to 200; the trained graph neural network model is obtained after training.
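A simplified PyTorch Geometric sketch of the model in FIG. 2, treating the heterogeneous graph as a homogeneous graph whose two-dimensional node features are [candidate feature value, intra-feature-domain importance] and omitting edge weights for brevity; the layer widths and U = V = 2 are illustrative, not taken from the patent.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, GCNConv, global_add_pool


class FeatureScreeningGNN(torch.nn.Module):
    """U cascaded GCN layers and V cascaded GAT layers run in parallel; all
    layer outputs are concatenated, globally pooled, passed through
    Dropout + fully connected layers, and finished with LogSoftmax."""

    def __init__(self, in_dim=2, hidden=64, num_classes=2, U=2, V=2):
        super().__init__()
        self.gcns = torch.nn.ModuleList(
            [GCNConv(in_dim if i == 0 else hidden, hidden) for i in range(U)])
        self.gats = torch.nn.ModuleList(
            [GATConv(in_dim if j == 0 else hidden, hidden) for j in range(V)])
        self.fc1 = torch.nn.Linear((U + V) * hidden, hidden)
        self.fc2 = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        outs = []
        h = x
        for conv in self.gcns:                      # cascaded GCN branch
            h = F.relu(conv(h, edge_index))
            outs.append(h)
        h = x
        for conv in self.gats:                      # cascaded GAT branch
            h = F.relu(conv(h, edge_index))
            outs.append(h)
        h = torch.cat(outs, dim=-1)                 # concatenation layer
        h = global_add_pool(h, batch)               # Global Add Pooling
        h = F.relu(F.dropout(h, p=0.2, training=self.training))
        h = F.relu(self.fc1(h))
        return F.log_softmax(self.fc2(h), dim=-1)   # LogSoftmax output


# Training skeleton: Adam (lr=0.001), batch size 4, 200 epochs, NLL loss
# (equivalent to cross-entropy since the model outputs log-probabilities).
# model = FeatureScreeningGNN()
# opt = torch.optim.Adam(model.parameters(), lr=0.001)
# for epoch in range(200):
#     for data in loader:                          # torch_geometric DataLoader
#         opt.zero_grad()
#         out = model(data.x, data.edge_index, data.batch)
#         loss = F.nll_loss(out, data.y)
#         loss.backward()
#         opt.step()
```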
In the graph neural network model, the GCN layer aggregates the features of each node and its adjacent neighbors by average weighting based on the adjacency matrix and the identity matrix of the heterogeneous graph, thereby capturing the local feature information and topology information of each node and its neighbors in the heterogeneous graph; at the same time, the GCN layer aggregates the information of each node and its neighbors using a parameter-shared weight matrix, thereby simulating translational invariance over the structure of the heterogeneous graph. The GAT layer differs from the GCN layer in that it dynamically assigns attention coefficients to each node and its adjacent neighbors in the heterogeneous graph based on an attention mechanism and performs weighted aggregation of the features of the node and its neighbors based on these attention coefficients, thereby effectively capturing the specific feature information and topology information of the node and its neighbors. The graph neural network model adopts both GCN layers and GAT layers: on the one hand, the GCN layers can learn general information about the neighborhood structure of nodes in the heterogeneous graph through translational invariance; on the other hand, the GAT layers can learn specific information about node neighborhoods in the heterogeneous graph through the dynamic weights of the attention mechanism. This enhances the representation capability of the graph neural network model for the heterogeneous graph and further improves its generalization capability.
In order to improve the representation capability of the GCN layers and GAT layers for nodes in the heterogeneous graph, and to perceive and aggregate feature information from neighbors over a wider range around each node, the graph neural network model stacks multiple GCN layers and GAT layers to increase depth. However, as the stacking depth of the GCN and GAT layers increases, simply adding more layers easily leads to problems such as over-smoothing, vanishing gradients, exploding gradients and overfitting. Therefore, the graph neural network model uses the splicing layer to splice the outputs of the GCN layers of different depths and the outputs of the GAT layers of different depths, and then passes the result sequentially through the global pooling layer, the Dropout layer and the plurality of fully connected layers, thereby further realizing information aggregation and nonlinear fusion of the outputs of GCN and GAT layers at different depths. This helps the graph neural network model capture both global and local topological structure information of the heterogeneous graph, improves its multi-level and multi-scale representation capability on the heterogeneous graph, and at the same time alleviates problems such as vanishing gradients, exploding gradients, over-smoothing and overfitting.
In S103, in order to evaluate the relative contribution of the methylation site candidate features and gene candidate features represented by each node in the heterogeneous graph to the MSI class label prediction performance of the trained graph neural network model, the importance of each node in the heterogeneous graph is obtained using the graph neural network interpreter GNNExplainer. GNNExplainer is a graph neural network interpreter that uses node masking and edge masking to calculate the importance of each node in the heterogeneous graph, and thereby explains which nodes, and which edges between nodes, matter for classification prediction at the graph level, while also considering the local and global interaction effects between candidate features based on the structure of the heterogeneous graph. This is specifically implemented as follows: the heterogeneous graph corresponding to each sample and the trained graph neural network model are first input into GNNExplainer, the importance of each node in the heterogeneous graph is then calculated and sorted in descending order, and the importance ranking positions of the methylation site candidate features corresponding to the methylation site candidate feature nodes and of the gene candidate features corresponding to the gene candidate feature nodes are thereby obtained, i.e., the feature inter-domain importance of each methylation site candidate feature and each gene candidate feature. The feature inter-domain importance may be further normalized and updated using the Min-Max method.
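A minimal sketch of this step, using the GNNExplainer implementation shipped with recent PyTorch Geometric versions, is shown below; the exact interface varies between library versions, and the variable names (model, data) are placeholders, so this is an assumption-laden illustration rather than the patented procedure.

```python
import torch
from torch_geometric.explain import Explainer, GNNExplainer

explainer = Explainer(
    model=model,                                 # trained graph neural network
    algorithm=GNNExplainer(epochs=200),
    explanation_type='model',
    node_mask_type='attributes',
    edge_mask_type='object',
    model_config=dict(mode='multiclass_classification',
                      task_level='graph', return_type='log_probs'),
)
explanation = explainer(data.x, data.edge_index, batch=data.batch)

# Per-node importance: sum the learned node-feature mask over feature columns,
# then rank in descending order; the ranking positions give the feature
# inter-domain importance, which can be Min-Max normalized afterwards.
node_importance = explanation.node_mask.sum(dim=-1)
ranking = torch.argsort(node_importance, descending=True)
```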
In S104, the specific implementation of selecting the optimal feature combination from the candidate features based on an ensemble learning model, using the feature inter-domain importance of the methylation site candidate features and gene candidate features, is as follows: first, a machine learning model for predicting MSI category labels is constructed; the machine learning model can be an ensemble learning model, preferably CatBoost. The methylation site candidate features and gene candidate features are sorted in descending order of the feature inter-domain importance obtained in step S103, the first K (for example the first 100) candidate features are selected, and all non-empty subsets of the first K candidate features, i.e., K²-1 (i.e., 9999) candidate feature combinations, are obtained. Then, for each candidate feature combination, the ensemble learning model is trained using the feature inter-domain importance of each candidate feature in the combination, the category label of each sample, the candidate features and the candidate feature data; when training the ensemble learning model, a hyper-parameter tuning method, such as randomized grid search with cross-validation, can be used to tune the hyper-parameters of the ensemble learning model, and performance evaluation and calculation of performance indices, including the area under the ROC curve (AUC), the root mean square error (RMSE) and the specificity (SPE), are carried out on the trained ensemble learning model. Finally, the performance index is considered optimal when the value of AUC-RMSE+SPE is maximal; the trained ensemble learning model with the optimal performance index is selected as the optimal ensemble learning model, and the candidate feature combination used by the optimal ensemble learning model is the optimal feature combination.
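A hedged sketch of this selection loop is given below; the data variables (X, y, the combination generator) and the hyper-parameter grid are placeholders, while CatBoost, randomized search with cross-validation, and the AUC-RMSE+SPE score follow the description above.

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_predict
from sklearn.metrics import roc_auc_score, mean_squared_error, confusion_matrix

def score_combination(X, y, param_distributions, n_iter=20, cv=5):
    """Tune CatBoost on one candidate feature combination and return the
    AUC - RMSE + SPE score together with the fitted model (y is 0/1)."""
    search = RandomizedSearchCV(CatBoostClassifier(verbose=0),
                                param_distributions, n_iter=n_iter,
                                cv=cv, scoring="roc_auc", random_state=0)
    search.fit(X, y)
    proba = cross_val_predict(search.best_estimator_, X, y,
                              cv=cv, method="predict_proba")[:, 1]
    tn, fp, fn, tp = confusion_matrix(y, (proba >= 0.5).astype(int)).ravel()
    auc = roc_auc_score(y, proba)
    rmse = mean_squared_error(y, proba) ** 0.5
    spe = tn / (tn + fp)
    return auc - rmse + spe, search.best_estimator_

# The best candidate feature combination maximizes the score:
# best_score, best_model = max(
#     (score_combination(X[list(c)], y, params) for c in combinations),
#     key=lambda t: t[0])
```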
The feature screening method of embodiment 1 was applied to screen methylation site candidate features and gene candidate features from the methylation group and transcriptome, respectively, to obtain an optimal feature combination for predicting MSI category. The optimal feature combination obtained by the final screening included 6 features: 4 methylation sites (i.e., methylation sites cg14598950 and cg27331401 of gene EPM2AIP1, corresponding to MLH, methylation site cg05428436 of gene LNP1, and methylation site cg15048832; the methylation sites are from the Illumina Infinium HumanMethylation450K BeadChip) and 2 genes (i.e., RPL22L1 and MSH4). As shown in fig. 3, the area under the 10-fold cross-validated prediction performance ROC curve of the optimal ensemble learning model was 0.99, indicating that the prediction efficacy of the optimal feature combination used by the optimal ensemble learning model is excellent. As shown in fig. 4, in the confusion matrix obtained by 10-fold cross-validation, the true positive ratio (TP, i.e., the proportion of samples that are actually MSI-H and correctly predicted as MSI-H) is 0.978, the false negative ratio (FN, i.e., the proportion of samples that are actually MSI-H but incorrectly predicted as MSS/MSI-L) is 0.004, the false positive ratio (FP, i.e., the proportion of samples that are actually MSS/MSI-L but incorrectly predicted as MSI-H) is 0.022, and the true negative ratio (TN, i.e., the proportion of samples that are actually MSS/MSI-L and correctly predicted as MSS/MSI-L) is 0.996, indicating that even when the MSI category labels are imbalanced, the optimal ensemble learning model using the optimal feature combination achieves balanced and excellent prediction performance across the different MSI categories.
In order to evaluate whether the optimal ensemble learning model using the optimal feature combination is overfitted (overfitting), the optimal ensemble learning model was further tested independently on 3 different independent test sets, which demonstrates that the optimal feature combination and the corresponding optimal ensemble learning model achieve excellent MSI category prediction performance, strong generalization capability and strong resistance to overfitting.
In a specific embodiment, 3 different independent test sets are constructed for independent testing of the optimal ensemble learning model. The first independent test set is an independent uterine and ovarian carcinosarcoma cohort containing both methylation group data (GEO accession GSE136790) and transcriptome data (GEO accession GSE128630) from the omics sequencing database GEO (Gene Expression Omnibus); the second independent test set is an independent endometrial tumor cohort containing both methylation group data and transcriptome data from the oncology database CPTAC (Clinical Proteomic Tumor Analysis Consortium) (CPTAC project CPTAC-3); the third independent test set is a pan-cancer independent cohort from the TCGA database that contains methylation group data, transcriptome data and genomic data. The MSI class labels of the samples in the first independent test set were detected using a PCR method. The MSI class labels of the samples in the second independent test set were detected using 5 MSI class prediction tools, with the criterion that if at least 3 of the MSI class detection methods give an MSI-H result, the MSI class label of the sample is considered to be MSI-H, and otherwise MSS/MSI-L. The MSI category labels of the samples in the third independent test set were predicted from the genomic data using the MSI category prediction tool MSIsensor. The MSI class labels of all three independent test sets were detected using paired paracancerous tissue samples as controls. MSIsensor is a software tool for predicting MSI class based on genomic data; it calculates an MSI score by comparing abnormal changes (e.g., insertions or deletions) in microsatellite loci on the genomes of tumor samples and paracancerous tissue control samples, and considers the MSI class label to be MSI-H when the MSI score is greater than 10, and MSS/MSI-L otherwise. The first independent test set includes 87 samples, with 24 samples labeled MSI-H and 63 samples labeled MSS/MSI-L. The second independent test set includes 100 samples, with 25 samples labeled MSI-H and 75 samples labeled MSS/MSI-L. The third independent test set includes 6620 samples, with 58 samples labeled MSI-H and 6562 samples labeled MSS/MSI-L.
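As a compact restatement of the labelling rules just described (the function and variable names below are illustrative only):

```python
def consensus_label(tool_calls):
    """MSI-H if at least 3 of the 5 MSI prediction tools call MSI-H."""
    return "MSI-H" if sum(c == "MSI-H" for c in tool_calls) >= 3 else "MSS/MSI-L"

def msisensor_label(msi_score, threshold=10.0):
    """MSIsensor-style rule: an MSI score above 10 is called MSI-H."""
    return "MSI-H" if msi_score > threshold else "MSS/MSI-L"
```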
The 6620 samples in the third independent test set were pan-cancer samples covering more than 30 tumor types, including adrenocortical carcinoma (ACC), bladder carcinoma (BLCA), breast carcinoma (BRCA), cervical squamous cell carcinoma (CESC), cholangiocarcinoma (CHOL), colon adenocarcinoma (COAD), diffuse large B-cell lymphoma (DLBC), esophageal carcinoma (ESCA), glioblastoma (GBM), head and neck squamous cell carcinoma (HNSC), kidney chromophobe carcinoma (KICH), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), low-grade glioma (LGG), hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), mesothelioma (MESO), ovarian serous cystadenocarcinoma (OV), pancreatic adenocarcinoma (PAAD), pheochromocytoma and paraganglioma (PCPG), prostate adenocarcinoma (PRAD), rectal adenocarcinoma (READ), soft tissue sarcoma (SARC), skin melanoma (SKCM), gastric adenocarcinoma (STAD), testicular germ cell tumor (TGCT), thyroid carcinoma (THCA), thymoma (THYM), endometrial carcinoma (UCEC), uterine carcinosarcoma (UCS) and uveal melanoma (UVM). The third independent test set has a large sample scale, complex clinical situations and high data heterogeneity, so the corresponding MSI category prediction is more difficult and places higher demands on the generalization capability and overfitting resistance of the MSI category prediction model. As shown in fig. 5, fig. 6 and fig. 7, the areas under the prediction performance ROC curves of the optimal ensemble learning model using the optimal feature combination on the first, second and third independent test sets are 0.93, 0.94 and 0.91, respectively, all with AUC greater than 0.90. Especially considering the sample size and data heterogeneity of the third independent test set, these independent test results fully indicate that the optimal ensemble learning model using the optimal feature combination not only has excellent MSI category prediction performance and strong generalization capability, but also does not overfit.
To further demonstrate the efficacy of the optimal feature combination screened by the feature screening method of embodiment 1 for MSI category prediction, a comparison dataset was further constructed in embodiment 1 to compare the performance of the optimal ensemble learning model using the optimal feature combination with other existing MSI category prediction tools. The other existing MSI category prediction tools here include MSISensor, MANTIS, MIRMMR, PreMSIm, mSING and MSISensor-pro. According to the types of feature data required for MSI class prediction, these tools can be classified into 3 classes: tools based on WES (Whole Exome Sequencing) data (MSIsensor, MANTIS, mSING and MSIsensor-pro), tools based on marker gene expression (PreMSIm), and tools based on both methylation and mutation (MIRMMR). The optimal ensemble learning model using the optimal feature combination can be considered a 4th class, i.e., a tool based on both methylation and gene expression. In order to compare the performance of these different MSI category prediction tools, the samples in the comparison dataset need to be both representative and heterogeneous to a certain degree, while also covering the types of feature data required by the different MSI category prediction tools. Therefore, in embodiment 1, 426 samples that simultaneously contain these different types of feature data and whose MSI categories have been verified by the PCR method were selected from the TCGA database to construct the comparison dataset; the category labels of 47 samples in the comparison dataset are MSI-H and the category labels of 259 samples are MSS/MSI-L, so it can be regarded as a class-imbalanced dataset.
The WES-based tools quantitatively calculate abnormalities of microsatellite loci (i.e., MSI scores) from WES data by comparing tumor samples with paired paracancerous samples as controls in order to infer the MSI status of the patient; the MSI scores are continuous values, which still require a subjectively set threshold for distinguishing MSI categories. WES-based MSI tools have a higher computational load, and because tumor samples and paired paracancerous tissue samples need to be analyzed in comparison, the sampling requirements are more demanding and the detection costs are higher. The marker gene expression-based tool predicts MSI status using the expression of 15 marker genes closely related to MMR function; compared with the WES-based tools, it has relatively lower detection cost and computational load, but also lower detection efficiency. PreMSIm constructs an MSI class prediction model based on these 15 marker genes using a KNN model, but does not deeply explore the interaction effects and feature importance among the marker genes. MIRMMR further integrates gene methylation data and MMR-related gene mutation data to predict MSI category. MIRMMR uses the methylation of 35 genes and more than 2,000 point mutation features corresponding to these genes, and predicts MSI category with a logistic regression model after preprocessing and conversion. The logistic regression model is prone to overfitting when there are many features and does not consider interaction effects between features, so the feature representation capability, prediction capability, generalization capability and overfitting resistance of MIRMMR are limited. In summary, the existing MSI category prediction tools either have high detection costs, or high computational load, or an excessive number of features, or ignore the interaction effects between features, which limits their prediction performance as well as their generalization and anti-overfitting capabilities.
In embodiment 1, the predicted MSI category label of each sample in the comparison dataset was obtained using MSISensor, MANTIS, MIRMMR, PreMSIm, mSING and MSISensor-pro, and using the optimal ensemble learning model with the optimal feature combination, respectively, and compared with the true MSI category label, and a number of performance metrics were calculated, including the accuracy (ACC), sensitivity (SEN), specificity (SPE) and Matthews correlation coefficient (MCC). In terms of the numbers of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) samples, the calculation formulas of ACC, SEN, SPE and MCC are as follows:
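ACC = (TP + TN) / (TP + TN + FP + FN)

SEN = TP / (TP + FN)

SPE = TN / (TN + FP)

MCC = (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))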
Wherein TP, TN, FP and FN represent the number of true positive samples, the number of true negative samples, the number of false positive samples and the number of false negative samples, respectively. The accuracy ACC is the proportion of correctly predicted samples (including TP and TN) and is used to evaluate the overall prediction performance of the model. The sensitivity SEN is the proportion of correctly identified positive samples (TP) and is used to measure the ability of the model to identify true positive samples. The specificity SPE is the proportion of negative samples (TN) correctly identified by the model and is used to measure the ability of the model to correctly identify negative samples. The MCC is an index that comprehensively considers the model's discrimination of true positive, true negative, false positive and false negative samples across categories, and thus better reflects the prediction performance of the model on class-imbalanced datasets.
As shown in fig. 8, the performance comparison results of the different MSI category prediction tools on the comparison dataset show that, across the performance indices ACC, SEN, SPE and MCC, the performance of the optimal ensemble learning model using the optimal feature combination is significantly better than that of the other 5 MSI category prediction tools, with the exception of MSISensor, whose performance is comparable. However, MSISensor uses thousands of microsatellite loci as features, which is computationally intensive, and requires paired paracancerous tissue as a control, so the amount of tissue required for detection is relatively large and the detection cost increases. The optimal ensemble learning model using the optimal feature combination needs only 6 features to achieve matched prediction performance, which greatly reduces the computational load, improves the generalization capability and overfitting resistance, and, since paired paracancerous tissue samples are not needed as controls, significantly reduces the amount of tissue and the cost required for detection. Taken together, the comprehensive performance of the optimal ensemble learning model using the optimal feature combination is clearly better than that of MSISensor. In addition, the performance indices MCC, SEN and SPE show that the optimal ensemble learning model using the optimal feature combination achieves balanced and excellent prediction performance even on imbalanced MSI categories.
From the perspective of the number of features required to predict the MSI category: among the existing MSI category prediction tools, the WES-based MSI tools quantitatively calculate microsatellite locus abnormalities from WES data by comparing tumor samples against paired paracancerous tissue samples as controls to predict the MSI category of a patient, and the number of microsatellite locus features is usually in the thousands. The marker gene expression-based tool PreMSIm predicts the MSI category from the expression of 15 marker genes closely related to MMR function. MIRMMR further integrates gene methylation and MMR-related gene mutation data to predict the MSI category, comprising the methylation of 35 genes and more than 2,000 point mutation features corresponding to these genes. The optimal feature combination used by the optimal ensemble learning model in embodiment 1 has the smallest number of features yet achieves the best prediction performance, and its cross-validation performance differs little from its independent test performance on multiple independent test sets, further verifying the excellent prediction performance, generalization performance and anti-overfitting performance of the optimal feature combination for MSI category prediction screened by the feature screening method of embodiment 1.
Embodiment 1 of the invention further applies the feature screening method to screen the optimal feature combination for MSI category prediction from candidate features of the methylation group and transcriptome, which further demonstrates that embodiment 1 of the invention has the following technical advantages:
1) A heterogeneous graph is constructed in which the edges between methylation site candidate feature nodes and gene candidate feature nodes represent the wide and complex interaction effects between the methylation site candidate features belonging to the methylation site feature domain and the gene candidate features belonging to the gene feature domain, while the unbiased and stable intra-feature-domain importance of each methylation site candidate feature and each gene candidate feature is incorporated into the heterogeneous graph as key node data;
2) The graph neural network model is built from multiple cascaded GCN layers and GAT layers of different depths, and the outputs of the shallow and deep GCN and GAT layers are spliced and fused, so that both generalized and specific information in the heterogeneous graph, as well as information of different granularities and dimensions, can be learned; this improves the representation capability of the graph neural network model for the heterogeneous graph while avoiding problems such as exploding gradients, vanishing gradients, over-smoothing and overfitting;
3) The graph neural network interpreter is used to integrate and topologically derive, based on the heterogeneous graph, the unbiased and stable feature inter-domain importance of each methylation site candidate feature and each gene candidate feature, from which an optimal feature combination that is small in number, non-redundant, interpretable and highly predictive is obtained by screening;
4) The optimal feature combination for MSI category prediction obtained by screening comprises 4 methylation sites (cg14598950, cg27331401, cg05428436 and cg15048832) and 2 genes (RPL22L1 and MSH4), and the optimal ensemble learning model using the optimal feature combination delivers excellent MSI category prediction performance, strong generalization capability and strong overfitting resistance.
It should be noted that, in a specific embodiment, the candidate features may be derived from only one feature domain; for example, the candidate features may include only methylation site candidate features derived from the methylation site feature domain, or only gene candidate features derived from the gene feature domain. The corresponding heterogeneous graph can then be regarded as a graph containing only one node type and one edge type, for example containing only methylation site candidate feature nodes and the edges between them, or only gene candidate feature nodes and the edges between them. After the heterogeneous graph is input into the graph neural network model and trained, the importance of the candidate features obtained using the graph neural network interpreter still takes into account the interaction effects between candidate features within the same feature domain; it may consider only the importance of the methylation site candidate features within the methylation site feature domain, or only the importance of the gene candidate features within the gene feature domain, and the optimal feature combination finally obtained by screening may accordingly comprise only methylation site candidate features or only gene candidate features.
In a specific embodiment, the methylation group data and transcriptome data can be self-generated sequencing data or sequencing data published in databases, for example the TCGA, ICGC, COSMIC, cBioPortal, CGWB, GEO, UALCAN, MethHC or MethyCancer databases, preferably the TCGA database.
In a specific embodiment, the MSI class label may be two classes (MSI-H and MSS/MSI-L) or three classes (MSI-H, MSI-L and MSS), and the MSI class label of a sample may be detected by the PCR method or by other MSI class prediction tools, including MSISensor, MANTIS, MIRMMR, PreMSIm, mSING, MSISensor-pro, etc.
In one embodiment, the gene screening criteria are determined using a differential-expression value and a p-value index, e.g., the value is greater than 1.5 and the p-value is less than 0.05.
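Purely as an illustration of such a filter, and assuming a pandas DataFrame with hypothetical columns de_value (the differential-expression statistic referred to above) and p_value, which are not names from the original:

```python
import pandas as pd

def screen_genes(df: pd.DataFrame, value_cutoff=1.5, p_cutoff=0.05):
    """Keep genes whose screening value exceeds 1.5 and whose p-value is
    below 0.05, following the criterion described above (column names assumed)."""
    return df[(df["de_value"] > value_cutoff) & (df["p_value"] < p_cutoff)]
```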
In one embodiment, the loss function includes, but is not limited to, the cross-entropy loss (Cross Entropy Loss) function, the KL divergence loss (KL Div Loss) function, and the binary cross-entropy loss (BCE Loss) function, with the cross-entropy loss function being preferred.
The invention provides a feature screening system, comprising:
the training set construction module is used for constructing a training set, wherein the training set comprises a plurality of samples, each sample in the plurality of samples comprises a category label, a candidate feature and candidate feature data corresponding to the candidate feature, and the candidate feature data corresponding to the candidate feature form a feature domain;
The heterogram construction module is used for constructing a corresponding heterogram for each sample, and the class label of the heterogram is the class label contained in each sample; the heterogeneous graph comprises nodes and node data with a plurality of node types and edges and edge data with a plurality of edge types; each node type corresponds to a feature domain, each node corresponds to a candidate feature, and each node data comprises candidate feature data corresponding to the candidate feature and the importance of the candidate feature in the feature domain; each edge type represents a relationship between two node types, each edge represents a relationship between the two nodes, and each edge data comprises a weight between the two nodes;
The graph neural network model module is used for constructing a graph neural network model for predicting the category labels, inputting the heterogeneous graph corresponding to each sample into the graph neural network model, training the graph neural network model by using the loss function, and obtaining a trained graph neural network model;
the graph neural network interpreter module is used for inputting the heterogram corresponding to each sample and the trained graph neural network model into the graph neural network interpreter to obtain the importance among feature domains of each candidate feature;
And the optimal feature combination screening module is used for constructing a machine learning model for predicting the category labels, training the machine learning model by utilizing the importance among feature domains of each candidate feature, the category label of each sample, the candidate feature and the candidate feature data, and screening to obtain an optimal feature combination and an optimal machine learning model by utilizing the optimal feature combination.
The present invention provides an apparatus comprising:
A memory: for storing program instructions;
A processor: for executing the program instructions, wherein when the program instructions are executed, the above feature screening method is implemented, or the optimal feature combination obtained by the above feature screening method is implemented, or the optimal machine learning model using the optimal feature combination obtained by the above feature screening method is implemented, or the above feature screening system is implemented.
In one embodiment, the device is a computer device, which may be a terminal, including a processor, a memory, connected by a system bus; and further comprises a network interface, a display screen and an input device. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
The above-described arrangements are only partial structures associated with the present application and do not constitute limitations of the computer apparatus to which the present application is applied, and a particular computer apparatus may include more or less components than those shown in the drawings, or may combine certain components, or have different arrangements of components.
The present invention provides a computer readable storage medium having stored thereon executable instructions which, when executed by a processor, implement the above feature screening method, or the optimal feature combination obtained by the above feature screening method, or the optimal machine learning model using the optimal feature combination obtained by the above feature screening method, or the above feature screening system.
In one embodiment, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In one embodiment, the executable instructions may be in the form of programs, software modules, scripts, or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In one embodiment, the executable instructions may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a non-volatile computer readable storage medium, and the computer program may include the flow of the embodiments of the methods described above when executed. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
In summary, the feature screening method, system, electronic equipment and medium of the invention can, based on a meta-heuristic algorithm and a graph neural network interpreter, creatively incorporate the unbiased and stable intra-feature-domain importance of each candidate feature into the heterogeneous graph as key node data, and can use the graph neural network interpreter to integrate and topologically derive the unbiased and stable feature inter-domain importance of each candidate feature based on the heterogeneous graph, so as to screen out an optimal feature combination that is small in number, non-redundant, interpretable and highly predictive from the dense, redundant, ultra-high-dimensional and large-scale candidate features of multiple feature domains, effectively remedying the defects of the prior art and achieving positive technical effects. Using the feature screening method, an optimal feature combination for MSI class label prediction was screened from a large number of methylation site candidate features from the methylation group and gene candidate features from the transcriptome; the optimal feature combination comprises 4 methylation sites (cg14598950, cg27331401, cg05428436 and cg15048832) and 2 genes (RPL22L1 and MSH4), and cross-validation and independent testing demonstrate that the optimal feature combination and the optimal ensemble learning model using it achieve excellent prediction performance.
The above description of the embodiments is only intended to help understand the method of the present invention and its core ideas. It should be noted that those skilled in the art can make several improvements and modifications to the present invention without departing from the principles of the invention, and these improvements and modifications also fall within the scope of the claims of the invention.

Claims (21)

1. A method of feature screening, the method comprising:
constructing a training set, wherein the training set comprises a plurality of samples, each sample in the plurality of samples comprises a category label, a candidate feature and candidate feature data corresponding to the candidate feature, and the candidate feature data corresponding to the candidate feature form a feature domain;
Constructing a corresponding iso-graph for each sample, wherein the class label of the iso-graph is the class label contained in each sample; the heterogeneous graph comprises nodes and node data with a plurality of node types and edges and edge data with a plurality of edge types; each node type corresponds to a feature domain, each node corresponds to a candidate feature, and each node data comprises candidate feature data corresponding to the candidate feature and the importance of the candidate feature in the feature domain; each edge type represents a relationship between two node types, each edge represents a relationship between the two nodes, and each edge data comprises a weight between the two nodes;
constructing a graph neural network model for predicting the category labels, inputting the heterogeneous graph corresponding to each sample into the graph neural network model, training the graph neural network model by using the loss function, and obtaining a trained graph neural network model;
inputting the different composition corresponding to each sample and the trained graph neural network model into a graph neural network interpreter to obtain the importance among feature domains of each candidate feature;
Constructing a machine learning model for predicting the category label, training the machine learning model by utilizing the importance among the feature domains of each candidate feature, the category label of each sample, the candidate feature and the candidate feature data, and screening to obtain an optimal feature combination and an optimal machine learning model by utilizing the optimal feature combination;
The class label is an MSI class label, and the MSI class label comprises MSI-H, MSI-L and MSS;
the characteristic fields comprise methylation site characteristic fields and gene characteristic fields, wherein the candidate characteristics comprise methylation site candidate characteristics and gene candidate characteristics, the methylation site candidate characteristics belong to a methylation group characteristic field, the gene candidate characteristics belong to a transcriptome characteristic field, candidate characteristic data of the methylation site candidate characteristics are methylation degree values, and candidate characteristic data of the gene candidate characteristics are gene expression values;
The heterogram node type comprises a methylation site node type and a gene node type, the methylation site node type represents the methylation site feature domain, the gene node type represents the gene feature domain, the methylation site node type comprises a methylation site candidate feature node, the gene node type comprises the gene candidate feature node, the methylation site candidate feature node represents the methylation site candidate feature, the gene candidate feature node represents the gene candidate feature, the node data of the methylation site candidate feature node comprises candidate feature data of the methylation site candidate feature and the importance of the methylation site candidate feature in the feature domain, and the node data of the gene candidate feature node comprises candidate feature data of the gene candidate feature and the importance of the gene candidate feature in the feature domain;
The edge types of the heterograms include methylation site node type-methylation site node type edge types, gene node type-gene node type edge types, and methylation site node type-gene node type edge types; the methylation site node type-methylation site node type edge type comprises methylation site candidate feature nodes-methylation site candidate feature node edges, and represents the relationship between two methylation site types; the gene node type-gene node type edge type comprises a gene candidate characteristic node-gene candidate characteristic node edge, and represents the relationship between two gene node types; the methylation site node type-gene node type edge type comprises methylation site candidate feature node-gene candidate feature node edges, and represents the relation between the methylation site type and the gene node type;
The calculation process of the importance of the candidate feature in the feature domain is as follows:
Aiming at a feature domain, each sample, a category label of the sample, candidate features of the feature domain and candidate feature data are obtained;
Constructing a classifier model;
Based on a meta heuristic algorithm, training the classifier model by using each sample and a class label of the sample, candidate features of the feature domain and candidate feature data for multiple iterations to obtain importance of each candidate feature of the feature domain during each iteration;
And adding the importance of each candidate feature of the feature domain during each iteration, and ordering in descending order to obtain the importance ordering position of each candidate feature of the feature domain, namely the importance of the feature domain of the candidate feature.
2. The feature screening method of claim 1, wherein the importance of the candidate feature in the feature domain is further normalized and updated.
3. The feature screening method of claim 2, wherein the normalization process comprises a Min-Max process.
4. The feature screening method of claim 1, wherein said constructing a graph neural network model for predicting the class labels comprises: the system comprises U GCN layers with different depths and cascaded, V GAT layers with different depths and cascaded, a splicing layer, a global pooling layer, a plurality of full-connection layers and a Softmax layer; the 1 st GCN layer is used for inputting the iso-composition corresponding to each sample and calculating to obtain the output of the 1 st GCN layer, and the i th GCN layer is used for receiving the output of the i-1 th GCN layer and calculating to obtain the output of the i th GCN layer; the 1 st GAT layer is used for inputting the iso-composition corresponding to each sample and calculating to obtain the output of the 1 st GAT layer, and the j-th GAT layer is used for receiving the output of the j-1 st GAT layer and calculating to obtain the output of the j-th GAT layer; i is i=2 to U, j is j=2 to V, U and V are integers not less than 2; the splicing layer receives the output of the U GCN layers and the output of the V GAT layers respectively and splices the output; the global pooling layer receives the output of the splicing layer, performs global pooling operation and outputs the output; the plurality of full-connection layers receive the output of the global pooling layer, perform nonlinear fusion and output; the Softmax layer receives the outputs of the plurality of fully connected layers and outputs after calculation for calculation of the loss function.
5. The feature screening method of claim 4, wherein the outputs of the GCN layers and the GAT layers are further subjected to an activation operation.
6. The feature screening method of claim 5, wherein the activation operation comprises a ReLU activation operation.
7. The feature screening method of claim 4, wherein the global pooling operation comprises operating using the Global Add Pooling method.
8. The feature screening method of claim 4, wherein the output of the global pooling layer is further connected to a Dropout layer with a set probability for a discard operation, followed by an activation operation.
9. The feature screening method of claim 8, wherein the set probability is 0.2.
10. The feature screening method of claim 8, wherein the activation operation comprises a ReLU activation operation.
11. The feature screening method of claim 4, wherein the loss function comprises a cross entropy loss function.
12. The feature screening method of claim 4, wherein the Softmax layer further performs a logarithmic conversion on its output.
13. The feature screening method according to claim 1, wherein inputting the iso-graph corresponding to each sample and the trained graph neural network model into a graph neural network interpreter, the implementation of obtaining the feature inter-domain importance of each candidate feature comprises:
The graph neural network interpreter is GNNExplainer, the abnormal graph corresponding to each sample and the trained graph neural network model are input into GNNExplainer, the importance of each node in the abnormal graph is calculated and ordered in a descending order, and further the importance ordering position of the candidate feature corresponding to each node is obtained, namely the feature inter-domain importance of each candidate feature.
14. The feature screening method of claim 13, wherein the feature inter-domain importance of the candidate features is further normalized and updated.
15. The feature screening method according to claim 1, wherein constructing a machine learning model for predicting the class label, training the machine learning model using the inter-feature importance of each candidate feature, the class label of each sample, the candidate feature, and the candidate feature data, and screening to obtain an optimal feature combination and using the optimal machine learning model of the optimal feature combination comprises:
Ordering the importance among the feature domains of each candidate feature in descending order to obtain all non-empty subsets of the first K candidate features, namely K²-1 candidate feature combinations, wherein K is an integer;
Training the machine learning model by using the class label of each sample, the candidate feature of the candidate feature combination and the candidate feature data aiming at each candidate feature combination, evaluating the trained machine learning model and calculating a performance index;
Selecting the trained machine learning model with the optimal performance index as the optimal machine learning model, wherein the candidate feature combination used by the optimal machine learning model is the optimal feature combination;
The performance index can be selected as AUC-RMSE+SPE, wherein AUC represents the area under the ROC curve, RMSE represents the root mean square error, and SPE represents the specificity; the performance index is optimal when its value is maximum.
16. The feature screening method of claim 15, wherein the machine learning model comprises an ensemble learning model.
17. The feature screening method of claim 16, wherein the ensemble learning model comprises CatBoost.
18. The feature screening method of claim 1, wherein the optimal feature combination comprises 4 methylation sites and 2 genes, the 4 methylation sites being cg14598950, cg27331401, cg05428436 and cg15048832, the 2 genes being RPL22L1 and MSH4, the optimal machine learning model being used for prediction of the MSI category label.
19. A feature screening system, the system comprising:
the training set construction module is used for constructing a training set, wherein the training set comprises a plurality of samples, each sample in the plurality of samples comprises a category label, a candidate feature and candidate feature data corresponding to the candidate feature, and the candidate feature data corresponding to the candidate feature form a feature domain;
The heterogram construction module is used for constructing a corresponding heterogram for each sample, and the class label of the heterogram is the class label contained in each sample; the heterogeneous graph comprises nodes and node data with a plurality of node types and edges and edge data with a plurality of edge types; each node type corresponds to a feature domain, each node corresponds to a candidate feature, and each node data comprises candidate feature data corresponding to the candidate feature and the importance of the candidate feature in the feature domain; each edge type represents a relationship between two node types, each edge represents a relationship between the two nodes, and each edge data comprises a weight between the two nodes;
The graph neural network model module is used for constructing a graph neural network model for predicting the category labels, inputting the heterogeneous graph corresponding to each sample into the graph neural network model, training the graph neural network model by using the loss function, and obtaining a trained graph neural network model;
the graph neural network interpreter module is used for inputting the heterogram corresponding to each sample and the trained graph neural network model into the graph neural network interpreter to obtain the importance among feature domains of each candidate feature;
The optimal feature combination screening module is used for constructing a machine learning model for predicting the category labels, training the machine learning model by utilizing the importance among feature domains of each candidate feature, the category label of each sample, the candidate feature and the candidate feature data, and screening to obtain an optimal feature combination and an optimal machine learning model by utilizing the optimal feature combination;
The class label is an MSI class label, and the MSI class label comprises MSI-H, MSI-L and MSS;
the characteristic fields comprise methylation site characteristic fields and gene characteristic fields, wherein the candidate characteristics comprise methylation site candidate characteristics and gene candidate characteristics, the methylation site candidate characteristics belong to a methylation group characteristic field, the gene candidate characteristics belong to a transcriptome characteristic field, candidate characteristic data of the methylation site candidate characteristics are methylation degree values, and candidate characteristic data of the gene candidate characteristics are gene expression values;
The heterogram node type comprises a methylation site node type and a gene node type, the methylation site node type represents the methylation site feature domain, the gene node type represents the gene feature domain, the methylation site node type comprises a methylation site candidate feature node, the gene node type comprises the gene candidate feature node, the methylation site candidate feature node represents the methylation site candidate feature, the gene candidate feature node represents the gene candidate feature, the node data of the methylation site candidate feature node comprises candidate feature data of the methylation site candidate feature and the importance of the methylation site candidate feature in the feature domain, and the node data of the gene candidate feature node comprises candidate feature data of the gene candidate feature and the importance of the gene candidate feature in the feature domain;
The edge types of the heterograms include methylation site node type-methylation site node type edge types, gene node type-gene node type edge types, and methylation site node type-gene node type edge types; the methylation site node type-methylation site node type edge type comprises methylation site candidate feature nodes-methylation site candidate feature node edges, and represents the relationship between two methylation site types; the gene node type-gene node type edge type comprises a gene candidate characteristic node-gene candidate characteristic node edge, and represents the relationship between two gene node types; the methylation site node type-gene node type edge type comprises methylation site candidate feature node-gene candidate feature node edges, and represents the relation between the methylation site type and the gene node type;
The calculation process of the importance of the candidate feature in the feature domain is as follows:
Aiming at a feature domain, each sample, a category label of the sample, candidate features of the feature domain and candidate feature data are obtained;
Constructing a classifier model;
Based on a meta heuristic algorithm, training the classifier model by using each sample and a class label of the sample, candidate features of the feature domain and candidate feature data for multiple iterations to obtain importance of each candidate feature of the feature domain during each iteration;
And adding the importance of each candidate feature of the feature domain during each iteration, and ordering in descending order to obtain the importance ordering position of each candidate feature of the feature domain, namely the importance of the feature domain of the candidate feature.
20. An electronic device, the electronic device comprising:
A memory: for storing program instructions;
A processor: for executing program instructions which, when executed, implement the feature screening method of any one of claims 1 to 18 or implement the optimal feature combination obtained by the feature screening method of any one of claims 1 to 18 or implement the feature screening method of any one of claims 1 to 18 to obtain the optimal machine learning model using the optimal feature combination or implement the feature screening system of claim 19.
21. A computer readable storage medium having stored thereon program instructions, which when executed by a processor, implement the feature screening method of any one of claims 1 to 18 or the optimal feature combination obtained by the feature screening method of any one of claims 1 to 18 or the feature screening method of any one of claims 1 to 18, to obtain the optimal machine learning model using the optimal feature combination or the feature screening system of claim 19.
CN202311222677.5A 2023-09-21 2023-09-21 Feature screening method, system, electronic equipment and medium Active CN117198406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311222677.5A CN117198406B (en) 2023-09-21 2023-09-21 Feature screening method, system, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311222677.5A CN117198406B (en) 2023-09-21 2023-09-21 Feature screening method, system, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN117198406A CN117198406A (en) 2023-12-08
CN117198406B true CN117198406B (en) 2024-06-11

Family

ID=88990349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311222677.5A Active CN117198406B (en) 2023-09-21 2023-09-21 Feature screening method, system, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117198406B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220301658A1 (en) * 2021-03-19 2022-09-22 X Development Llc Machine learning driven gene discovery and gene editing in plants

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159395A (en) * 2019-11-22 2020-05-15 国家计算机网络与信息安全管理中心 Graph neural network-based rumor stance detection method and device, and electronic equipment
CN112397153A (en) * 2020-11-18 2021-02-23 河南科技大学第一附属医院 Method for screening biomarker for predicting esophageal squamous cell carcinoma prognosis
WO2022141775A1 (en) * 2021-01-04 2022-07-07 江苏先声医疗器械有限公司 Construction method for tumor immune checkpoint inhibitor therapy effectiveness evaluation model based on dna methylation spectrum
CN113705679A (en) * 2021-08-30 2021-11-26 北京工业大学 Student score prediction method based on hypergraph neural network
CN116310466A (en) * 2022-09-08 2023-06-23 电子科技大学 Small sample image classification method based on local irrelevant area screening graph neural network
CN115544239A (en) * 2022-09-30 2022-12-30 杭州电子科技大学 Deep learning model-based layout preference prediction method
CN115331732A (en) * 2022-10-11 2022-11-11 之江实验室 Gene phenotype training and predicting method and device based on graph neural network
CN115730248A (en) * 2022-11-29 2023-03-03 中国科学技术大学 Machine account detection method, system, equipment and storage medium
CN116345469A (en) * 2023-04-12 2023-06-27 哈尔滨工业大学 Power grid power flow adjustment method based on graph neural network
CN116506181A (en) * 2023-04-28 2023-07-28 北京航空航天大学 Internet of Vehicles intrusion detection method based on heterogeneous graph attention network
CN116680594A (en) * 2023-05-05 2023-09-01 齐鲁工业大学(山东省科学院) Method for improving thyroid cancer classification accuracy on multi-omics data by using a deep feature selection algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Identifying cancer driver genes based on multi-view heterogeneous graph convolutional network and self-attention mechanism; Wei Peng et al.; BMC Bioinformatics; 2023-01-13; 1-22 *
Semi-supervised learning method for cancer clinical outcome prediction based on graph convolutional networks; 宁世琦 et al.; 智能计算机与应用; 2018-12-31; 8(06); 44-48, 53 *

Also Published As

Publication number Publication date
CN117198406A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
US11651860B2 (en) Drug efficacy prediction for treatment of genetic disease
Tsamardinos et al. Just Add Data: automated predictive modeling for knowledge discovery and feature selection
EP3520006B1 (en) Phenotype/disease specific gene ranking using curated, gene library and network based data structures
Lee et al. Review of statistical methods for survival analysis using genomic data
Zhao et al. DeepOmix: a scalable and interpretable multi-omics deep learning framework and application in cancer survival analysis
Behravan et al. Predicting breast cancer risk using interacting genetic and demographic factors and machine learning
Simon et al. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data
Glaab Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification
Dhillon et al. A systematic review on biomarker identification for cancer diagnosis and prognosis in multi-omics: from computational needs to machine learning and deep learning
Han et al. Integration of molecular features with clinical information for predicting outcomes for neuroblastoma patients
Hong et al. Integrated powered density: Screening ultrahigh dimensional covariates with survival outcomes
Dlamini et al. AI and precision oncology in clinical cancer genomics: From prevention to targeted cancer therapies-an outcomes based patient care
Stanislas et al. Eigen-Epistasis for detecting gene-gene interactions
Chen et al. Improved interpretability of machine learning model using unsupervised clustering: predicting time to first treatment in chronic lymphocytic leukemia
Chen et al. A gene profiling deconvolution approach to estimating immune cell composition from complex tissues
Wang et al. Identifying biomarkers for breast cancer by gene regulatory network rewiring
Bhattacharyya et al. Personalized network modeling of the pan-cancer patient and cell line interactome
Lemant et al. Robust, universal tree balance indices
Acharyya et al. SpaceX: gene co-expression network estimation for spatial transcriptomics
Tian et al. Weighted-SAMGSR: combining significance analysis of microarray-gene set reduction algorithm with pathway topology-based weights to select relevant genes
Kalyakulina et al. Disease classification for whole-blood DNA methylation: meta-analysis, missing values imputation, and XAI
Dessie et al. A nine-gene signature identification and prognostic risk prediction for patients with lung adenocarcinoma using novel machine learning approach
Lim et al. Model-based feature selection and clustering of RNA-seq data for unsupervised subtype discovery
CN117198406B (en) Feature screening method, system, electronic equipment and medium
Wadapurkar et al. Machine learning approaches for prediction of ovarian cancer driver genes from mutational and network analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant