CN111180009B - Cancer stage prediction system based on genome analysis - Google Patents


Info

Publication number
CN111180009B
Authority
CN
China
Prior art keywords: samples, cancer, feature, stage, value
Prior art date
Legal status: Active
Application number
CN202010003411.1A
Other languages: Chinese (zh)
Other versions: CN111180009A
Inventor
张海霞
李芳君
袁东风
Current Assignee: Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010003411.1A priority Critical patent/CN111180009B/en
Publication of CN111180009A publication Critical patent/CN111180009A/en
Application granted granted Critical
Publication of CN111180009B publication Critical patent/CN111180009B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a cancer stage prediction system based on genome analysis, comprising an original data acquisition unit, a combined feature preprocessing unit, a joint gene selection unit, a classification model creation unit and a prediction unit which are connected in sequence. The original data acquisition unit acquires the RNAseq expression data and clinical information of samples of the corresponding cancer subtype from The Cancer Genome Atlas (TCGA) project, and obtains the RSEM values of gene expression in the RNAseq expression data. The combined feature preprocessing unit discretizes the gene features; alternatively, after adding 1.0, the RSEM values are log2-transformed and the log2-transformed values are normalized. The joint gene selection unit sequentially performs FCBF search, combined statistical feature extraction and logistic-regression-based feature selection. The classification model creation unit generates a classification model and optimizes its performance. The prediction performance of the invention is more stable and more accurate.

Description

Cancer stage prediction system based on genome analysis
Technical Field
The invention relates to the technical field of biological information and machine learning, in particular to a cancer stage prediction system based on genome analysis.
Background
Cancer is strongly associated with genes. When a tumor is found at an advanced stage, survival rates are very low, whereas early detection and effective treatment can improve survival. Therefore, effective strategies to stratify patients according to cancer stage and to understand the intrinsic mechanisms driving cancer development and progression are critical for early prevention and treatment of cancer. Cancer is often asymptomatic at an early stage, and many patients already have metastases when diagnosed. Patients who undergo surgical resection remain at high risk of metastatic recurrence, and early detection helps in the prevention and treatment of early-stage cancers. Furthermore, understanding the key gene drivers of disease progression helps to develop new therapeutic approaches.
Since conventional imaging techniques, such as ultrasound and computed tomography (CT), and guided biopsies are not sufficiently reliable for detecting primary cancers, new diagnostic methods need to be developed. Gene expression profiles play an important role in tumorigenesis and metastasis and thus have potential classification value. Machine-learning-based methods that use gene expression profiles to identify the stage of various cancers have recently shown tremendous potential. Existing researchers have used classification models to distinguish between early- and late-stage samples, see Rahimi, Arezou, and Mehmet Gönen, "Discriminating early- and late-stage cancers using multiple kernel learning on gene sets," Bioinformatics 34.13 (2018): i412-i421, and Bhalla, Sherry, et al., "Gene expression-based biomarkers for discriminating early and late stage of clear cell renal cancer," Scientific Reports 7 (2017): 44997. These papers improved the models and performed classified prediction for multiple cancers, but their results spread over a broad range in 100 random experiments, i.e. the performance was unstable. In summary, the stability of the classification models in the prior art cannot be guaranteed, and there is still room for improvement in model performance.
Chinese patent document CN 109994151A discloses a tumor driver gene prediction system based on a complex network and a machine learning method. It predicts potential tumor driver genes, deepens the understanding of cancer to a certain extent, and thereby promotes the development of cancer treatment. The system comprises a data acquisition and preprocessing module, a feature engineering module, a model algorithm design module and a result evaluation module. The data acquisition and preprocessing module covers data acquisition, tumor gene network construction and maximum connected subgraph screening, providing the data basis for driver gene prediction. The feature engineering module comprises feature extraction and feature arrangement. The model algorithm design module constructs training samples and designs the prediction models. The result evaluation module verifies the prediction effect of the model with a confusion matrix and an ROC curve. However, this patent has the following drawback: it adopts a gene-network feature selection method, and with tens of thousands of gene features the constructed network becomes very complex.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cancer stage prediction system based on genome analysis;
term interpretation:
1. RSEM value: a form of gene expression RNAseq data in TCGA, downloadable from https://xenabrowser.net/datapages/.
2. ChiMerge binning (chi-square binning): a binning method based on the chi-square test. Its basic idea is to judge whether two adjacent intervals differ in distribution by computing the chi-square statistic, and to merge intervals bottom-up based on that statistic until the stopping condition of the binning is satisfied. ChiMerge binning is implemented as follows:
Step 0: preset a chi-square threshold;
Step 1, initialization:
sort the instances according to the attribute to be discretized, with each instance initially forming its own interval;
Step 2, interval merging:
(1) compute the chi-square value of each pair of adjacent intervals;
(2) merge the pair of adjacent intervals with the smallest chi-square value;
χ² = Σ_{i=1}^{2} Σ_{j=1}^{k} (A_ij − E_ij)² / E_ij

A_ij: the number of instances of class j in interval i; E_ij: the expected frequency of A_ij;
the chi-square statistic measures the difference between the frequency distribution of the samples in an interval and the frequency distribution of the samples as a whole. Two stopping conditions can be used for the binning:
(1) Number of bins: limit the final number of bins, and at each step merge the pair of adjacent intervals with the smallest chi-square value until the number of bins reaches the limit.
(2) Chi-square threshold: obtain the chi-square threshold from the degrees of freedom and the significance level; if the smallest chi-square value among adjacent intervals is below the threshold, continue merging until the smallest chi-square value exceeds the set threshold.
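The merging loop above can be sketched in plain Python. This is a minimal illustration under the bin-count stopping condition, not the patent's implementation; the function name, data and chosen max_bins are hypothetical.

```python
def chimerge(values, labels, max_bins=4):
    """Bottom-up ChiMerge sketch: start with one interval per unique value,
    then repeatedly merge the adjacent pair with the smallest chi-square
    statistic until only max_bins intervals remain."""
    classes = sorted(set(labels))
    data = sorted(zip(values, labels))

    # Initial intervals: [left_edge, per-class instance counts A_ij].
    intervals = []
    for v, y in data:
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][classes.index(y)] += 1
        else:
            counts = [0] * len(classes)
            counts[classes.index(y)] = 1
            intervals.append([v, counts])

    def chi2_pair(c1, c2):
        # Pearson chi-square for the 2 x k table formed by two intervals.
        total = sum(c1) + sum(c2)
        chi = 0.0
        for j in range(len(classes)):
            col = c1[j] + c2[j]
            for counts in (c1, c2):
                e = sum(counts) * col / total   # expected frequency E_ij
                if e > 0:
                    chi += (counts[j] - e) ** 2 / e
        return chi

    while len(intervals) > max_bins:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = chis.index(min(chis))
        intervals[i][1] = [a + b for a, b in zip(intervals[i][1],
                                                 intervals[i + 1][1])]
        del intervals[i + 1]

    return [iv[0] for iv in intervals]   # left edges of the final bins

cuts = chimerge([0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 5.0, 5.1],
                [0, 0, 0, 1, 1, 1, 0, 0], max_bins=3)
```

Runs of values with the same label have pairwise chi-square 0, so they merge first, leaving three pure intervals.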
3. FCBF search, a feature selection algorithm based on information theory, which considers both feature correlation and feature redundancy. It uses dominant correlation to identify relevant features in a high-dimensional dataset in a reduced feature space.
4. Information value IV for evaluating the overall predictive power of the feature, i.e. the separation power of the feature for early and late samples.
5. Weak predictive variables, IV values below 0.1, cull such features during feature selection.
6. Strong predictive variable, IV value no less than 0.1.
7. Variance expansion factor (Variance Inflation Factor, VIF) refers to the ratio of the variance in the presence of multiple collinearity to the variance in the absence of multiple collinearity between the interpretation variables.
8. Logistic regression model: a generalized linear regression analysis model, used here for what is essentially a two-class classification problem. Logistic regression uses a sigmoid function to map the predicted value to a probability on (0, 1), which helps determine the result; the predicted value has a specific meaning, namely the probability that the outcome equals 1.
9. The basic model of the Support Vector Machine (SVM) is to find the best separation hyperplane in the feature space so that the positive and negative sample interval on the training set is maximum. The classifier with sparsity and robustness is obtained by calculating an empirical risk (empirical risk) using a hinge loss function (hinge loss) and adding regularization terms to the solution system to optimize the structural risk (structural risk). The SVM can perform nonlinear classification by a kernel method (kernel method), which is one of the common kernel learning methods.
10. Multi-layer perceptron (MLP): a feedforward artificial neural network model; besides the input and output layers, it can have several hidden layers in between, with full connections between adjacent layers.
11. Random forest (RF): a classifier containing multiple decision trees, built by integrating the trees through the idea of ensemble learning; its output class is the mode of the classes output by the individual trees.
12. Naive Bayes (NB): a probability-based classification algorithm that predicts the class by considering feature probabilities.
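The sigmoid mapping mentioned in term 8 can be written out directly; this is a generic illustration, not the patent's code.

```python
import math

def sigmoid(t):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

p_mid = sigmoid(0.0)     # a score of 0 maps to probability 0.5
p_low = sigmoid(-5.0)    # large negative scores approach 0
p_high = sigmoid(5.0)    # large positive scores approach 1
```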
The technical scheme of the invention is as follows:
a cancer stage prediction system based on genome analysis comprises an original data acquisition unit, a combined characteristic preprocessing unit, a combined gene selection unit, a classification model creation unit and a prediction unit which are connected in sequence;
the original data acquisition unit is used for: acquiring the RNAseq expression data and clinical information of samples of the corresponding cancer subtype from The Cancer Genome Atlas (TCGA) project, and acquiring the RSEM values of gene expression in the RNAseq expression data; samples annotated as stage I and stage II are regarded as early-stage cancer, and samples annotated as stage III and stage IV as late-stage cancer;
the combined feature preprocessing unit is used for: discretizing the gene features, i.e. the RNAseq expression data, through ChiMerge binning and WOE encoding; the ChiMerge binning and WOE encoding improve the stability of the data and the robustness of the classification model.
Or,
the combined characteristic preprocessing unit is used for: after adding 1.0, the RSEM values were transformed using log2, and the log2 transformed RSEM values were normalized;
the combined gene selection unit is used for: performing FCBF searching, combined statistical feature extraction and logistic regression model feature selection in sequence;
the invention combines FCBF search with information value, linear correlation coefficient and variance expansion factor, removes uncorrelated/redundant features, and finds out key genes by using a feature selection method based on logistic regression;
the classification model creation unit is used for: generating classification models with five machine learning methods, namely support vector machine (SVM), logistic regression (LR), multi-layer perceptron (MLP), random forest (RF) and naive Bayes (NB), and optimizing their performance.
Using the Python sklearn package, the relevant algorithm model is called; the processed features are taken as input and fed to the model to train the model parameters;
the classification model can be built on the discretized data or on the standardized data; both preprocessing modes yield good results.
The prediction unit is used for: after training, the classification model is stored; at prediction time, the preprocessed RNAseq expression data of the sample to be tested is input, and the model directly returns a prediction of 0 or 1, where 0 predicts early-stage cancer and 1 predicts late-stage cancer.
According to the present invention, it is preferable to convert the RSEM value using log2 and normalize the log 2-converted RSEM value, which means:
the RSEM values were converted by formula (I) using log 2:
x = log2(RSEM + 1) (I)
normalizing the log2-transformed RSEM values by formula (II) yields z:

z = (x − x̄) / s (II)

in formula (II), x is the log2-transformed RSEM value, x̄ is the mean of x, and s is the standard deviation.
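Formulas (I) and (II) can be sketched with NumPy as follows; the RSEM values are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np

# Hypothetical RSEM expression values for one gene across five samples.
rsem = np.array([0.0, 3.0, 7.0, 15.0, 31.0])

x = np.log2(rsem + 1.0)              # formula (I): x = log2(RSEM + 1)
z = (x - x.mean()) / x.std(ddof=1)   # formula (II): z = (x - x̄) / s
```

After standardization, z has mean 0 and (sample) standard deviation 1.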
According to the present invention, the FCBF search is preferably performed on raw training data, which refers to RSEM values, including:
(1) Selecting 80% of the original training data as the training data set by random sampling; in ten repetitions of ten-fold cross-validation, the training data are randomly divided into ten folds each time, FCBF search is performed on the training data set, and each round of FCBF search with ten-fold cross-sampling yields 10 sub-feature sets;
(2) Selecting the features with the overlapping number larger than 6 to perform data preprocessing and joint feature selection;
step (1) carrying out ten times of ten-fold cross validation, generating a feature set each time, combining the ten feature sets, and selecting gene features from the feature sets, wherein the gene features are RNA;
data preprocessing refers to discretizing the RNA features by ChiMerge binning and WOE encoding;
the joint feature selection refers to feature selection based on logistic regression by merging the FCBF algorithm, the IV and the VIF.
According to a preferred embodiment of the present invention, the combined statistical feature extraction uses various statistics to evaluate the importance and relevance of features and to filter out redundant features; the data processed here are the discretized data. It comprises the following steps:
A. Univariate analysis: eliminate variables with information value IV < 0.1;
the information value IV of each gene after ChiMerge binning and WOE encoding is calculated according to formula (III):

IV = Σ_i (G_i − B_i) × ln(G_i / B_i) (III)

in formula (III), G_i is the proportion of the samples annotated as stage I and stage II in the i-th bin relative to all early-stage samples, and B_i is the proportion of the samples annotated as stage III and stage IV in the i-th bin relative to all late-stage samples; early-stage samples correspond to patients with early-stage cancer, and late-stage samples to patients with late-stage cancer;
IV < 0.02: no predictive power; 0.02 ≤ IV < 0.10: weak predictive variable; 0.10 ≤ IV < 0.30: moderate predictive variable; IV ≥ 0.30: strong predictive variable;
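Formula (III) can be illustrated with hypothetical bin counts (the numbers below are made up for the sketch):

```python
import math

# Per-bin sample counts after ChiMerge binning (hypothetical numbers):
# "early" = stage I/II samples, "late" = stage III/IV samples.
bins = [
    {"early": 40, "late": 10},
    {"early": 30, "late": 20},
    {"early": 10, "late": 50},
]
total_early = sum(b["early"] for b in bins)
total_late = sum(b["late"] for b in bins)

iv = 0.0
for b in bins:
    g = b["early"] / total_early   # G_i: share of all early samples in bin i
    bl = b["late"] / total_late    # B_i: share of all late samples in bin i
    iv += (g - bl) * math.log(g / bl)   # formula (III)
```

With these counts the gene would fall in the "strong predictive variable" band (IV ≥ 0.30).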
B. multivariate analysis: measuring the linear correlation between the multiple variables by using a variance expansion factor;
the variance expansion factor VIF is adopted to evaluate the multi-element linear correlation, when the calculated variance expansion factor VIF is smaller than 10, the problem of collinearity does not exist, otherwise, the problem of collinearity exists;
VIF_i = 1 / (1 − R_i²)

where x_i is the i-th feature in the feature set {x_1, x_2, ..., x_N}, VIF_i is the variance inflation factor of x_i, and R_i² is the R² value obtained by regressing x_i on the remaining features {x_1, ..., x_{i−1}, x_{i+1}, ..., x_N};
R²: the proportion of the fluctuation of the target variable y over the samples that can be explained by a linear combination of the independent variables x_i.
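A VIF computation consistent with the formula above can be sketched with NumPy (hypothetical data; the third column is deliberately built to be nearly collinear with the first):

```python
import numpy as np

def vif(X):
    """VIF_i = 1/(1 - R_i^2), where R_i^2 comes from regressing the i-th
    column of X on all remaining columns by ordinary least squares."""
    n, p = X.shape
    out = []
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])  # col 2 ≈ col 0
vifs = vif(X)
```

Columns 0 and 2 get very large VIF values (collinearity present, VIF ≥ 10), while column 1 stays near 1.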
To ensure the correctness and significance of the variables fed into the logistic regression model, the coefficients and p-values of the input variables must be checked; the coefficient represents the influence of the independent variable on the dependent variable, and the p-value indicates whether there is a significant difference between early- and late-stage gene expression. Variables with p-values greater than 0.1 show no obvious association between the two.
According to the invention, feature selection with the logistic regression model means: filtering out variables whose p-value exceeds the threshold of 0.1 or whose coefficient is positive, thereby eliminating the insignificant and invalid variables, respectively.
Not significant: judged by the p-value.
Here the p-value is the value labeled P(>|z|). An independent variable with a p-value below the significance level is considered significant, that is, there is statistical evidence that this variable affects the probability that the dependent variable is 1 (i.e. that the sample is late-stage cancer). In general, for a given significance level α, if the p-value is less than α, the variable is significant at level α. Here α is set to 0.1.
Invalid: judged by the regression coefficient.
In a linear model, with the other variables held constant, the coefficient represents the change in the response for each unit increase of X1; in the logistic regression model, combined with the odds mentioned earlier, it represents the change in the log odds (Log Odds) for each unit increase of X1. By the expression of logistic regression, p(X) is no longer linear in X1, and the change of p(X) also depends on the current value of X1. In general, the sign of the regression coefficient matters most: when it is positive, p(X) increases as X1 increases; when it is negative, p(X) decreases as X1 increases.
According to the invention, RBF kernels under different parameters are preferably used with the support vector machine: γ ∈ [10^-9, 10^-7, 10^-5, 10^-3, 10^-1, 10, 10^3], c ∈ [-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15], to optimize the performance of the classification model. The RBF kernel is a radial basis function, i.e. a scalar function that is radially symmetric; C is the penalty factor and γ is the kernel parameter.
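A reduced version of this parameter sweep can be sketched with scikit-learn. The data are synthetic, the grid is trimmed for brevity, and reading the c values as exponents of 2 (C = 2^c) is an assumption, not stated in the text.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic early/late labels

# Trimmed grid for illustration; the full grid in the text sweeps
# gamma over 10^-9 ... 10^3 and (assumed) C over 2^-5 ... 2^15.
param_grid = {"gamma": [1e-3, 1e-1, 10], "C": [2**-1, 2**3, 2**7]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
best = search.best_params_   # best (gamma, C) pair by cross-validated accuracy
```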
The beneficial effects of the invention are as follows:
1. compared with the prior art, the invention selects fewer genes through the combined gene characteristic selection scheme, but is more stable and accurate in prediction performance. The detection result obtained by the method is superior to the most advanced model. In addition, through gene function analysis, molecular mechanisms that may affect cancer progression can be further studied. The classification model of the invention can extract more prognosis information, and is worthy of further research and verification to know the cancer progress mechanism;
2. the invention performs combined feature preprocessing through binning and encoding, improving the stability of the data and the robustness of the classification model;
3. the invention utilizes a machine learning algorithm to build a classification model. Furthermore, based on feature selection, genetic functional analysis can be performed to investigate molecular mechanisms that may affect cancer progression. The result shows that the classification model can extract more prognosis information, and the method deserves further research and verification to know the progress mechanism.
4. The scheme of the invention has low complexity and high execution speed.
Drawings
FIG. 1 is a schematic workflow diagram of a genomic analysis-based cancer stage prediction system of the present invention;
FIG. 2 (a) is a schematic diagram of the RXRE gene and RXRE_WOE obtained by discretizing RXR by ChiMerge binning and WOE encoding in example 2;
FIG. 2 (B) is a schematic diagram of HUS1B gene and HUS1B_WOE obtained by discretizing HUS1B by ChiMerge binning and WOE encoding in example 2.
Detailed Description
The invention is further described, but not limited, by the following drawings and examples.
Example 1
A cancer stage prediction system based on genome analysis comprises an original data acquisition unit, a combined characteristic preprocessing unit, a combined gene selection unit, a classification model creation unit and a prediction unit which are connected in sequence;
the original data acquisition unit is used for: acquiring the RNAseq expression data and clinical information of samples of the corresponding cancer subtype from The Cancer Genome Atlas (TCGA) project, and acquiring the RSEM values of gene expression in the RNAseq expression data; samples annotated as stage I and stage II are regarded as early-stage cancer, and samples annotated as stage III and stage IV as late-stage cancer;
the combined feature preprocessing unit is used for: discretizing the gene features, i.e. the RNAseq expression data, through ChiMerge binning and WOE encoding; the ChiMerge binning and WOE encoding improve the stability of the data and the robustness of the classification model.
ChiMerge binning: ChiMerge achieves bottom-up (merge-based) data discretization based on the chi-square test, i.e. adjacent intervals with the smallest chi-square value are merged until a stopping criterion is met. For example, in FIG. 2(B), after ChiMerge binning, RNAseq expression levels from 2.2901 to 6.2093 are combined into Bin 0, expression levels from 0.7561 to 2.2893 into Bin 1, expression levels from 0 to 0.7443 into Bin 2, and expression levels in the range 0 to 6.2093 into Bin 3, thereby discretizing the continuous variable.
For an accurate discretization, the relative class frequencies should be consistent within an interval. For example, after grouping, the distribution of bad samples (late-stage cancer samples) within the same group should be as uniform as possible, while the distributions of different groups should differ as much as possible. The chi-square value is used to measure the within-group and between-group differences.
WOE encoding: after binning, the numeric values no longer carry meaning by themselves; only the intervals are visible, and the variable has become an ordinal one. Since machine learning algorithms such as logistic regression only accept numeric variables, encoding is required.
WOE encoding computes, for each bin, the proportion of good samples relative to all good samples and the proportion of bad samples relative to all bad samples, based on the numbers of good and bad samples in each bin and in the whole data set; the concentration of the prediction category (target variable) in the bin then serves as the encoded value. WOE encoding normalizes the feature values to similar scales, with values usually ranging between 0.1 and 3. For example, in FIG. 2(B), for gene HUS1B after WOE encoding, all variables in Bin 0 are encoded as -0.4463, Bin 2 as -0.00984, Bin 1 as -0.20982, and Bin 3 as 1.072603.
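The WOE value per bin can be sketched as follows, using WOE_i = ln(G_i / B_i) consistently with formula (III); the counts are hypothetical, with "good" = early-stage and "bad" = late-stage samples as in the text.

```python
import math

# Hypothetical per-bin counts for one gene.
bins = {
    "Bin0": {"early": 50, "late": 20},
    "Bin1": {"early": 30, "late": 30},
    "Bin2": {"early": 20, "late": 50},
}
tot_e = sum(b["early"] for b in bins.values())
tot_l = sum(b["late"] for b in bins.values())

# WOE_i = ln( (early_i / total_early) / (late_i / total_late) )
woe = {name: math.log((b["early"] / tot_e) / (b["late"] / tot_l))
       for name, b in bins.items()}
```

A bin dominated by early-stage samples gets a positive WOE, a balanced bin gets 0, and a late-stage-dominated bin gets a negative WOE.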
Or,
the combined characteristic preprocessing unit is used for: after adding 1.0, the RSEM values were transformed using log2, and the log2 transformed RSEM values were normalized;
the joint gene selection unit is used for: performing FCBF searching, combined statistical feature extraction and logistic regression model feature selection in sequence;
the invention combines FCBF search with information value, linear correlation coefficient and variance expansion factor, removes uncorrelated/redundant features, and finds out key genes by using a feature selection method based on logistic regression;
the classification model creation unit is used for: generating classification models with five machine learning methods, namely support vector machine (SVM), logistic regression (LR), multi-layer perceptron (MLP), random forest (RF) and naive Bayes (NB), and optimizing their performance.
The correlation function is called from the sklearn package of python and model parameters are trained using training set data.
SVM: from sklearn.svm import SVC;
LR: from sklearn.linear_model import LogisticRegressionCV;
MLP: from sklearn.neural_network import MLPClassifier;
RF: from sklearn.ensemble import RandomForestClassifier;
NB: from sklearn import naive_bayes;
Using the Python sklearn package, the relevant algorithm model is called; the processed features are taken as input and fed to the model to train the model parameters;
the classification model can be built on the discretized data or on the standardized data; both preprocessing modes yield good results.
The prediction unit is used for: after training, the classification model is stored; at prediction time, the preprocessed RNAseq expression data of the sample to be tested is input, and the model directly returns a prediction of 0 or 1, where 0 predicts early-stage cancer and 1 predicts late-stage cancer.
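The store-then-predict flow of the prediction unit can be sketched with scikit-learn and the standard-library pickle module; the synthetic features below stand in for the preprocessed RNAseq data, and the choice of pickle for persistence is an assumption.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train on hypothetical preprocessed features (e.g. WOE-encoded values),
# persist the model, then reload it to predict 0 (early) or 1 (late).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)
blob = pickle.dumps(model)               # "store the classification model"

restored = pickle.loads(blob)            # called directly at prediction time
pred = restored.predict(rng.normal(size=(3, 4)))   # each entry is 0 or 1
```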
Example 2
The cancer stage prediction system based on genomic analysis according to example 1, uses log2 transformed RSEM values and normalizes the log2 transformed RSEM values, which means:
the RSEM values were converted by formula (I) using log 2:
x = log2(RSEM + 1) (I)
normalizing the log2-transformed RSEM values by formula (II) yields z:

z = (x − x̄) / s (II)

in formula (II), x is the log2-transformed RSEM value, x̄ is the mean of x, and s is the standard deviation.
Performing FCBF search on original training data, wherein the original training data refer to RSEM values, and the method comprises the following steps:
(1) Selecting 80% of data in original training data as a training data set by adopting a random sampling method, randomly dividing the training data into ten folds each time in ten times of ten-fold cross-validation experiments, performing FCBF searching on the training data set, and performing ten-fold cross sampling on each time of FCBF searching to obtain 10 sub-feature sets;
In step (1), using the Explorer of the Weka software (version 3.8), the evaluation strategy SymmetricalUncertAttributeSetEval is selected under Select Attributes, and the search method FCBFSearch is chosen. FCBF selection can pick the most relevant features from high-dimensional features. To select robust features, ten-fold cross-validation is performed on the training set, each fold corresponding to one selection result on the 10-fold validation set, and genes that appear more than 8 times out of 10 are selected.
The evaluator SymmetricalUncertAttributeSetEval evaluates each attribute by its symmetrical uncertainty with respect to the other attribute sets; the search strategy FCBFSearch is a feature selection method based on correlation analysis whose result eliminates irrelevant attributes. Taking the RNA sequences of kidney renal clear cell carcinoma (KIRC) as an example, there were 20530 gene features before FCBFSearch selection, and the number of features was reduced to 101 after selection.
(2) Selecting the features with the overlapping number larger than 6 to perform data preprocessing and joint feature selection;
step (1) performs ten rounds of ten-fold cross-validation, each round generating a feature set; the ten feature sets are combined, and the gene features, namely RNAs, are selected from them;
data preprocessing refers to discretizing the RNAs by ChiMerge binning and WOE encoding;
the joint feature selection refers to feature selection based on logistic regression combined with the FCBF algorithm, the IV and the VIF.
The joint feature selection specifically means: the preprocessed data are screened in series. First, the FCBF algorithm retains the variables that satisfy its criterion; then the information value IV of each retained gene variable is calculated, and the variables with IV smaller than 0.1 are eliminated; then the variance inflation factor VIF of the remaining variables is calculated, and the variables with VIF larger than 10 are eliminated; finally, the remaining variables undergo feature selection based on logistic regression, and the variables that satisfy significance and validity are selected.
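The serial screening above can be sketched as a chain of filters. The per-feature statistics are assumed to be precomputed upstream, the thresholds follow the text (IV < 0.1 out, VIF > 10 out, p > 0.1 or positive coefficient out), and the gene names and numbers are invented:

```python
def joint_feature_selection(candidates):
    """Serial screening: IV filter -> VIF filter -> logistic-regression
    significance/validity filter. `candidates` maps a feature name to its
    precomputed statistics (hypothetical input format)."""
    kept = {}
    for name, s in candidates.items():
        if s["iv"] < 0.1:                  # univariate: weak information value
            continue
        if s["vif"] > 10.0:                # multivariate: collinearity
            continue
        if s["p"] > 0.1 or s["coef"] > 0:  # insignificant or invalid sign
            continue
        kept[name] = s
    return kept

# Invented statistics for four hypothetical genes.
stats = {
    "GENE_A": {"iv": 0.35, "vif": 2.1, "p": 0.01, "coef": -0.8},   # passes all
    "GENE_B": {"iv": 0.05, "vif": 1.5, "p": 0.02, "coef": -0.4},   # weak IV
    "GENE_C": {"iv": 0.20, "vif": 15.0, "p": 0.03, "coef": -0.2},  # collinear
    "GENE_D": {"iv": 0.25, "vif": 3.0, "p": 0.30, "coef": -0.1},   # not significant
}
print(sorted(joint_feature_selection(stats)))  # ['GENE_A']
```

Each stage only sees the survivors of the previous one, which is what makes the screening serial rather than a union of independent filters.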
Combined statistical feature extraction evaluates the importance and relevance of features with several statistics and filters out redundant features. The data processed here are the discretized data. It comprises the following steps:
C. Univariate analysis: eliminating the variables whose information value IV is less than or equal to 0.1;
the information value IV of each gene after ChiMerge binning and WOE encoding is calculated according to formula (III):

IV = Σ_i (G_i − B_i) · ln(G_i / B_i)    (III)

in formula (III), G_i is the proportion of the samples annotated stage I and stage II in the i-th bin among all early-stage samples, and B_i is the proportion of the samples annotated stage III and stage IV in the i-th bin among all late-stage samples; the early-stage samples correspond to patients with early cancer, and the late-stage samples correspond to patients with late cancer;
IV < 0.02 indicates no predictive power, 0.02 ≤ IV < 0.10 a weak predictor, 0.10 ≤ IV ≤ 0.30 a moderate predictor, and IV > 0.30 a strong predictor;
The variables with IV ≤ 0.1 are eliminated, so that all remaining variables have an information value greater than 0.1. Taking the RNA-seq data of kidney renal clear cell carcinoma (KIRC) as an example, 101 features remained after FCBFSearch selection; 6 genes were deleted because more than 90% of samples had zero expression, and of the remaining 95 genes 30 were deleted in the univariate analysis because their IV did not exceed 0.1, leaving 65 gene features.
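Formula (III) can be computed directly from per-bin early/late counts. The counts below are illustrative, not KIRC data:

```python
import math

def information_value(early_counts, late_counts):
    """Formula (III): IV = sum_i (G_i - B_i) * ln(G_i / B_i), where G_i and
    B_i are the per-bin shares of early- and late-stage samples. Bins with
    zero counts would need smoothing; these toy bins avoid that case."""
    total_e, total_l = sum(early_counts), sum(late_counts)
    iv = 0.0
    for e, l in zip(early_counts, late_counts):
        g, b = e / total_e, l / total_l
        iv += (g - b) * math.log(g / b)
    return iv

# Bins that separate early from late samples give a high IV; identical
# distributions give IV = 0.
separating = information_value([30, 20, 10], [10, 20, 30])
uninformative = information_value([20, 20, 20], [20, 20, 20])
print(round(separating, 3), round(uninformative, 3))
```

Note that ln(G_i / B_i) is exactly the WOE of bin i, so IV is a WOE-weighted divergence between the early and late distributions.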
D. Multivariate analysis: the variance inflation factor is used to measure the linear correlation among multiple variables. Multivariate correlation means that the pairwise correlation between the variables may be low while their joint correlation is high; for example, x7 may be expressible as a linear combination of the remaining variables. In that case sharp collinearity appears among this group of variables, and R² → 1. R² is the statistical index that reflects how reliably a regression model explains the variation of the dependent variable; in this example, it is the proportion of the fluctuation of x7 over the samples that can be explained by a linear expression in x1-x6. That is, if the target variable x7 can be written as a combination of the independent variables x1-x6, the proportion of the fluctuation of x7 over the samples explained by that linear expression is R².
The variance inflation factor VIF is adopted to evaluate the multivariate linear correlation: when the calculated VIF is smaller than 10, there is no collinearity problem, otherwise a collinearity problem exists;
VIF_i = 1 / (1 − R_i²)

R_i² is the R² value obtained by regressing x_i on {x_1, x_2, ..., x_{i−1}, x_{i+1}, x_{i+2}, ..., x_N}; x_1, x_2, ..., x_N are the N features in the feature set, x_i is the i-th feature, and VIF_i is the variance inflation factor of the i-th feature x_i.

R²: the original target variable y may be a combination of the independent variables x_i; if such a combination can be established, the proportion of the fluctuation of y over the samples that is explained by the linear expression is R².
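For the special case of just two predictors, R_i² reduces to the squared Pearson correlation, which gives a compact illustration of the VIF formula; the data vectors below are made up:

```python
import math

def vif_two_variables(x, y):
    """With only two predictors, R_i^2 is the squared Pearson correlation
    between them, so VIF = 1 / (1 - r^2) for both (special case of the
    general regress-on-the-rest definition)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    r = cov / (sx * sy)
    return 1.0 / (1.0 - r * r)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1]    # nearly collinear with x1
x3 = [3.0, 1.0, 4.0, 1.0, 5.0]    # only weakly related to x1
high = vif_two_variables(x1, x2)  # far above 10 -> collinearity problem
low = vif_two_variables(x1, x3)   # close to 1 -> no problem
print(round(high, 1), round(low, 3))
```

In the N-feature case each R_i² comes from a full multiple regression of x_i on the remaining features, but the 1 / (1 − R²) relationship is the same.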
To ensure the validity and significance of the variables fed into the logistic regression model, the coefficients and p-values of the input variables must be checked: the coefficient represents the influence of the independent variable on the dependent variable, and the p-value indicates whether early- and late-stage gene expression differ significantly. Before this check, some variables had p-values greater than 0.1, indicating no obvious correlation between the variable and the stage label.
Feature selection by the logistic regression model means: filtering out the variables whose p-value exceeds the threshold of 0.1 and the variables with positive coefficient values, thereby eliminating the insignificant and the invalid variables respectively.
Not significant: judged by the p-value.
Here the p-value is the value labeled P(>|z|). An independent variable with a p-value below the significance level is considered significant, that is, there is statistical evidence that this variable affects the probability that the dependent variable equals 1 (i.e. that the sample is late-stage cancer). In general, for a given significance level α, if the p-value is less than α, the variable is significant at level α. Here α is set to 0.1.
Invalid: judged by the regression coefficient.
In a linear model, with the other variables held fixed, the coefficient represents the change in the response for each unit increase of X1; in a logistic regression model, combined with the odds mentioned earlier, it represents the change in the log odds (Log Odds) for each unit increase of X1. From the expression of logistic regression, p(X) is no longer linear in X1, and the change of p(X) also depends on the current value of X1. In general, we care most about the sign of the regression coefficient: when it is positive, p(X) increases as X1 increases; when it is negative, p(X) decreases as X1 increases.
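The log-odds interpretation can be checked numerically with a one-variable model; the intercept and coefficient below are hypothetical, not fitted values from the study:

```python
import math

def p_late(b0, b1, x):
    """p(X) = 1 / (1 + exp(-(b0 + b1*x))): probability that the sample is
    late-stage cancer under a one-variable logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_odds(p):
    """log(p / (1 - p)), the logit of a probability."""
    return math.log(p / (1.0 - p))

b0, b1 = -1.0, 0.7  # hypothetical intercept and coefficient

# Each unit increase in X1 adds b1 to the log odds (multiplies the odds by
# exp(b1)); since b1 > 0 here, p(X) rises with X1, matching the sign rule.
for x in (0.0, 1.0, 2.0):
    print(x, round(p_late(b0, b1, x), 4), round(log_odds(p_late(b0, b1, x)), 4))
```

The probability column is visibly nonlinear in X1 while the log-odds column climbs in equal steps of b1, which is exactly the distinction the paragraph above draws.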
An RBF-kernel support vector machine is used under different parameters, γ ∈ [10^-9, 10^-7, 10^-5, 10^-3, 10^-1, 10, 10^3] and C ∈ [-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15], to optimize the performance of the classification model; the RBF kernel refers to a radial basis function, namely a scalar function that is radially symmetric, C is the penalty factor, and γ is the kernel parameter.
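A sketch of the RBF kernel and the parameter grid as listed. The C values are transcribed literally from the text; they may be intended as log2 exponents, but the source does not say so:

```python
import math
from itertools import product

def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2): depends only on the distance
    between the two points, i.e. it is radially symmetric."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Parameter grid from the text, transcribed literally.
gammas = [1e-9, 1e-7, 1e-5, 1e-3, 1e-1, 10.0, 1e3]
cs = [-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15]
grid = list(product(gammas, cs))
print(len(grid))  # 77 (gamma, C) pairs to evaluate
print(round(rbf_kernel([0.0, 0.0], [1.0, 1.0], 0.5), 4))
```

In a grid search each of the 77 pairs would be scored by cross-validated accuracy and the best pair kept for the final SVM.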
FIG. 1 is a workflow of a genomic analysis-based cancer stage prediction system.
The data of this example are the RNA-seq expression data of 604 kidney renal clear cell carcinoma (KIRC) samples from The Cancer Genome Atlas (TCGA) project, together with their clinical information, used to distinguish early- and late-stage KIRC. The RSEM gene expression values and clinical annotations for KIRC come from UCSC Xena; the data sets can be downloaded from https://xenabrowser.net/datapages/. Samples with stage I and stage II annotations are regarded as early-stage cancers, samples with stage III and stage IV annotations are labeled as late-stage cancers, and samples without tumor stage information are excluded. After processing, 604 early- and late-stage samples were retained. In this study, 80% (482 samples) were randomly selected as the training set and the remaining 20% (122 samples) served as an independent test set.
Feature selection is performed before classification, and only on the training set. The classification task can be abstracted as a binary classification problem, and five supervised machine learning algorithms are used to predict the pathological stage from the selected gene set.
FIG. 2 (a) is a schematic diagram of the RXRE gene and RXRE_WOE obtained after discretizing RXRE by ChiMerge binning and WOE encoding;
FIG. 2 (B) is a schematic diagram of HUS1B gene and HUS1B_WOE obtained by discretizing HUS1B by ChiMerge binning and WOE encoding;
RXRE, HUS1B and CTSG are the names of three genes; RXRE_WOE, HUS1B_WOE and CTSG_WOE denote the gene expression after ChiMerge binning and WOE encoding. There are 482 training-set samples in total; the abscissa is the sample number (1-482) and the ordinate indicates the expression level of the gene.
For the gene RXRE, after ChiMerge binning, the RNA-seq sample expression levels in the range 12.9592-11.3425 are combined into Bin 0 (WOE code 0.789819), those in 11.3294-11.0885 into Bin 1 (WOE code 0.651676), the continuous values in 11.0861-10.7110 into Bin 2 (WOE code -0.758620), those in 10.7023-10.3522 into Bin 3 (WOE code -0.23381), and those in 10.3452-6.2093 into Bin 4 (WOE code -0.75862), realizing the discretization of the continuous variable.
For the gene HUS1B, after ChiMerge binning, the RNA-seq expression levels 2.2901-6.2093 are combined into Bin 0, 0.7561-2.2893 into Bin 1, 0-0.7443 into Bin 2 and 0-6.2093 into Bin 3, realizing the discretization of the continuous variable. After WOE encoding, all values in Bin 0 are coded -0.4463, all values in Bin 2 are coded -0.00984, all values in Bin 1 are coded -0.20982, and all values in Bin 3 are coded 1.072603.
The bin coding can normalize the variables to similar scales, reducing the impact of data distribution.
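WOE encoding after binning amounts to a lookup from bin interval to bin code. A sketch loosely modeled on the HUS1B bins above, with simplified edges rather than the patent's exact ChiMerge cut points:

```python
import bisect

def woe_encode(value, edges, woe_codes):
    """Replace a continuous expression value by the WOE code of the bin it
    falls into. `edges` are ascending interior boundaries, so
    len(woe_codes) == len(edges) + 1 (bin layout is illustrative)."""
    return woe_codes[bisect.bisect_right(edges, value)]

# Simplified 3-bin layout: (-inf, 0.75], (0.75, 2.29], (2.29, inf),
# with WOE codes borrowed from the HUS1B example for flavor.
edges = [0.75, 2.29]
woe_codes = [-0.00984, -0.20982, -0.4463]
print(woe_encode(0.5, edges, woe_codes))
print(woe_encode(1.0, edges, woe_codes))
print(woe_encode(3.0, edges, woe_codes))
```

After this mapping every gene takes one of a handful of WOE values, which is why the encoded features end up on a similar scale regardless of the original expression range.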
Table 1 shows the training results of the classification models on the validation set and the test results of the test set passed through the prediction unit.
TABLE 1
[Table 1: classifier performance metrics, rendered as an image in the original document]
To compare the performance of the classifiers and evaluate the resulting predictive models, Table 1 reports general evaluation metrics: sensitivity, specificity, accuracy, the Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (AUC).
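All four scalar metrics follow from the binary confusion matrix; a sketch with an invented confusion matrix, not the study's actual test-set counts:

```python
import math

def evaluation_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation
    coefficient from a binary confusion matrix (late stage = positive)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, mcc

# Invented counts over a 122-sample test set (illustrative only).
sens, spec, acc, mcc = evaluation_metrics(tp=40, tn=60, fp=12, fn=10)
print(round(sens, 3), round(spec, 3), round(acc, 3), round(mcc, 3))
```

Unlike accuracy, MCC stays near 0 for a classifier that ignores a minority class, which is why it is reported alongside accuracy here.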
Based on the accuracy and AUC, the SVM-based predictive model performs better on this dataset than the other four machine learning algorithms. The MCC of the models developed in this study ranged from 0.496 to 0.603. Notably, among the evaluated predictive models, the MLP-based model had the highest sensitivity (0.776), the SVM-based model had the highest specificity, accuracy and MCC, and the AUC values of the logistic regression and SVM models were both 0.860.

Claims (6)

1. A cancer stage prediction system based on genome analysis, characterized by comprising an original data acquisition unit, a combined feature preprocessing unit, a combined gene selection unit, a classification model creation unit and a prediction unit which are connected in sequence;
the original data acquisition unit is used for: acquiring the RNAseq expression data and clinical information of The Cancer Genome Atlas (TCGA) project corresponding to cancer subtype samples, and obtaining the RSEM values of gene expression from the RNAseq expression data and clinical information, wherein the samples with stage I and stage II annotations are regarded as early-stage cancer and the samples with stage III and stage IV annotations are late-stage cancer;
the combined characteristic preprocessing unit is used for: discretizing genetic characteristics, namely RNAseq expression data, by ChiMerge binning and WOE encoding; alternatively, the combined feature preprocessing unit is configured to: using log2 to convert the RSEM value, and normalizing the log2 converted RSEM value;
the combined gene selection unit is used for: performing FCBF searching, combined statistical feature extraction and logistic regression model feature selection in sequence;
the classification model creation unit is configured to: generate classification models using five machine learning methods, namely support vector machine, logistic regression, multilayer perceptron, random forest and naive Bayes, and optimize the performance of the classification models;
the prediction unit is used for: storing the classification model after training; during prediction, the preprocessed RNAseq expression data of the sample to be tested are input and a prediction result of 0 or 1 is directly output, where 0 predicts early-stage cancer and 1 predicts late-stage cancer.
2. The cancer stage prediction system based on genome analysis according to claim 1, wherein converting the RSEM values with log2 and normalizing the log2-transformed RSEM values means that:
the RSEM values are converted by formula (I) using log2:

x = log2(RSEM + 1)    (I)

normalizing the log2-transformed RSEM values by formula (II) yields z:

z = (x − x̄) / s    (II)

in formula (II), x is the log2-transformed RSEM value, x̄ is the mean of x, and s is the standard deviation.
3. The cancer stage prediction system based on genome analysis according to claim 1, wherein FCBF search is performed on the original training data, the original training data being the RSEM values, comprising:
(1) Selecting 80% of the original training data as the training data set by random sampling; in each of ten ten-fold cross-validation experiments the training data are randomly divided into ten folds, FCBF search is performed on the training data set, and each FCBF search run over the ten-fold split yields 10 sub-feature sets;
(2) Selecting the features with the overlapping number larger than 6 to perform data preprocessing and joint feature selection;
step (1) performs ten rounds of ten-fold cross-validation, each round generating a feature set; the ten feature sets are combined, and the gene features, namely RNAs, are selected from them;
data preprocessing refers to discretizing the RNAs by ChiMerge binning and WOE encoding;
the joint feature selection refers to feature selection based on logistic regression combined with the FCBF algorithm, the IV and the VIF.
4. A genome analysis-based cancer stage prediction system according to claim 3, characterized in that the combined statistical feature extraction comprises:
A. Univariate analysis: eliminating the variables whose information value IV is less than or equal to 0.1;
the information value IV of each gene after ChiMerge binning and WOE encoding is calculated according to formula (III):

IV = Σ_i (G_i − B_i) · ln(G_i / B_i)    (III)

in formula (III), G_i is the proportion of the samples annotated stage I and stage II in the i-th bin among all early-stage samples, and B_i is the proportion of the samples annotated stage III and stage IV in the i-th bin among all late-stage samples; the early-stage samples correspond to patients with early cancer, and the late-stage samples correspond to patients with late cancer;
B. multivariate analysis: measuring the linear correlation between the multiple variables by using a variance expansion factor;
the variance inflation factor VIF is adopted to evaluate the multivariate linear correlation: when the calculated VIF is smaller than 10, there is no collinearity problem, otherwise a collinearity problem exists;

VIF_i = 1 / (1 − R_i²)

R_i² is the R² value obtained by regressing x_i on {x_1, x_2, ..., x_{i−1}, x_{i+1}, x_{i+2}, ..., x_N}; x_1, x_2, ..., x_N are the N features in the feature set, x_i is the i-th feature, and VIF_i is the variance inflation factor of the i-th feature x_i.
5. The cancer stage prediction system based on genome analysis according to claim 1, wherein the logistic regression model feature selection is: filtering out the variables whose p-value exceeds the threshold of 0.1 and the variables with positive coefficient values, thereby eliminating the insignificant and the invalid variables respectively.
6. The cancer stage prediction system based on genome analysis according to any one of claims 1-5, characterized in that an RBF-kernel support vector machine is used under different parameters, γ ∈ [10^-9, 10^-7, 10^-5, 10^-3, 10^-1, 10, 10^3] and C ∈ [-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15], to optimize the performance of the classification model; the RBF kernel refers to a radial basis function, namely a scalar function that is radially symmetric, C is the penalty factor, and γ is the kernel parameter.
CN202010003411.1A 2020-01-03 2020-01-03 Cancer stage prediction system based on genome analysis Active CN111180009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010003411.1A CN111180009B (en) 2020-01-03 2020-01-03 Cancer stage prediction system based on genome analysis


Publications (2)

Publication Number Publication Date
CN111180009A CN111180009A (en) 2020-05-19
CN111180009B true CN111180009B (en) 2023-04-28

Family

ID=70652592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010003411.1A Active CN111180009B (en) 2020-01-03 2020-01-03 Cancer stage prediction system based on genome analysis

Country Status (1)

Country Link
CN (1) CN111180009B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112336310B (en) * 2020-11-04 2024-03-08 吾征智能技术(北京)有限公司 FCBF and SVM fusion-based heart disease diagnosis system
CN112926640B (en) * 2021-02-22 2023-02-28 齐鲁工业大学 Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium
CN113393935A (en) * 2021-06-30 2021-09-14 吾征智能技术(北京)有限公司 Method and system for evaluating early cancer risk based on physical examination data
CN114219184A (en) * 2022-01-24 2022-03-22 中国工商银行股份有限公司 Product transaction data prediction method, device, equipment, medium and program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202988A (en) * 2016-10-11 2016-12-07 哈尔滨工业大学深圳研究生院 The Stepwise multiple-regression model of a kind of predictive disease life cycle and application
CN106407689A (en) * 2016-09-27 2017-02-15 牟合(上海)生物科技有限公司 Stomach cancer prognostic marker screening and classifying method based on gene expression profile
CN107292127A (en) * 2017-06-08 2017-10-24 南京高新生物医药公共服务平台有限公司 Predict the gene expression classification device and its construction method of lung cancer patient prognosis
WO2017197335A1 (en) * 2016-05-12 2017-11-16 Trustees Of Boston University Nasal epithelium gene expression signature and classifier for the prediction of lung cancer
CN110246577A (en) * 2019-05-31 2019-09-17 深圳江行联加智能科技有限公司 A method of based on artificial intelligence auxiliary gestational diabetes genetic risk prediction




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant