CN116680594B

CN116680594B - Method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm

Info

Publication number: CN116680594B
Application number: CN202310496632.0A
Authority: CN
Inventors: 赵龙; 刘娇; 司呈坤
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2023-05-05
Filing date: 2023-05-05
Publication date: 2024-07-05
Anticipated expiration: 2043-05-05
Also published as: CN116680594A

Abstract

The invention relates to a method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, and belongs to the technical field of biomedicine. The method comprises the following steps: preprocessing the data; screening a feature subset based on a weight-defined feature correlation; and inputting the screened feature subset into a neural network for learning and classification to obtain the final multi-omics cancer classification result. The invention proposes a new feature correlation defined by weights, in which the weights carry more comprehensive information about dynamically changing features, and a new evaluation criterion for assessing the relevance and redundancy of features. Finally, the screened feature subset is input into a DNN with four hidden layers for training and prediction, yielding a prognosis prediction for thyroid cancer based on multi-omics data and greatly improving the classification accuracy.

Description

Method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm

Technical Field

The invention relates to a method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm. The method extracts important relevance and redundancy information, makes further progress on high-dimensional, high-noise data, markedly improves cancer classification accuracy by introducing multi-omics data, and plays a key role in clinical prognosis prediction. It finally improves the classification performance on thyroid cancer multi-omics data by training a four-layer DNN. The invention belongs to the technical field of biomedicine.

Background

Thyroid cancer is an indolent cancer with a high survival rate that can exceed 95%. However, its incidence has fluctuated slightly over the last 30 years, and although the death rate has fallen over the last 10 years, survival has not improved noticeably. Most thyroid cancer patients need to have the thyroid removed and take medication for life, and the clinical misdiagnosis rate is high. Designing an efficient algorithm is therefore critical to the timely clinical prediction of thyroid cancer. In recent years, genomic data technology has become an important tool for cancer prediction, but most existing thyroid cancer classification algorithms are based on single-omics data. Multi-omics data can compensate for the incomplete information of a single omics layer, supports more accurate analysis of the pathogenesis of the cancer, and provides necessary data support for the diagnosis and prediction of thyroid cancer. Using multi-omics feature-correlation redundancy weights to reduce dimensionality while improving the prognosis prediction of thyroid cancer is therefore of great significance.

Thyroid cancer is a common cancer, but it has received little study in the field of deep learning. Mourad et al. improved classification accuracy through feature extraction from the clinical information of thyroid cancer patients; see: M. Mourad, S. Moubayed, A. Dezube, Y. Mourad, K. Park, A. Torreblanca-Zanca, J. S. Torrecilla, J. C. Cancilla, and J. Wang, "Machine learning and feature selection applied to SEER data to reliably assess thyroid cancer prognosis," Scientific Reports, vol. 10, no. 1, p. 5176, 2020. Raweh et al. improved the prediction of various cancers, including thyroid cancer, using a hybrid feature selection algorithm; see: A. A. Raweh, M. Nassef, and A. Badr, "A hybridized feature selection and extraction approach for enhancing cancer prediction based on DNA methylation," IEEE Access, vol. 6, pp. 15212-15223, 2018. Lang et al. improved the prediction of thyroid cancer risk using deep learning for medical image segmentation; see: S. Lang, Y. Xu, L. Li, B. Wang, Y. Yang, Y. Xue, and K. Shi, "Joint detection of Tap and CEA based on deep learning medical image segmentation: risk prediction of thyroid cancer," Journal of Healthcare Engineering, vol. 2021, pp. 1-9, 2021.

All of the above studies perform classification prediction on single-omics data, and their classification performance remains insufficient.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm. It improves the prediction accuracy of thyroid cancer by integrating transcriptome data, copy number variation data and DNA methylation data, and improves the classification performance by maximizing feature relevance and minimizing feature redundancy.

Term interpretation:

1. Omics data: mainly comprises transcriptomics, lipidomics, immunomics, RNA omics, radiomics, ultrasound omics and the like.

2. Multi-omics data: refers to the integrated analysis of two or more kinds of omics data.

3. Expression data (Exp): reflects the abundance of gene transcript mRNA in the cell, measured directly or indirectly. These data can be used to analyze which genes have altered expression, what correlations exist between genes, and how gene activity is affected under different conditions.

4. Copy Number Variation (CNV): is caused by rearrangement of the genome, and generally refers to a gain or loss in the copy number of genomic segments 1 kb or more in length.

5. Methylation data (DNA Methylation): DNA methylation is a form of chemical modification of DNA that can alter gene expression without altering the DNA sequence.

6. Data integration: refers to combining several kinds of omics data into a single dataset through operations such as preprocessing.
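The data integration described in item 6 can be pictured as joining per-sample feature tables from each omics layer. The following is a minimal sketch, not the patent's procedure: the function name `integrate_omics`, the nested-dictionary layout, the `layer:feature` naming and the sample and gene identifiers are all hypothetical, and it assumes that only samples present in every layer are kept.

```python
def integrate_omics(layers):
    """Concatenate per-sample feature dicts from several omics layers.

    layers: {layer_name: {sample_id: {feature_name: value}}}
    Returns {sample_id: {"layer:feature": value}} for samples found in every layer.
    """
    # Assumption: a sample is usable only if all layers measured it.
    shared = set.intersection(*(set(samples) for samples in layers.values()))
    merged = {}
    for sample in sorted(shared):
        row = {}
        for layer_name, samples in layers.items():
            for feature, value in samples[sample].items():
                # Prefix with the layer name so same-named genes stay distinct.
                row[f"{layer_name}:{feature}"] = value
        merged[sample] = row
    return merged


# Hypothetical toy data: three layers with partly overlapping samples.
layers = {
    "exp": {"s1": {"GENE1": 1.2}, "s2": {"GENE1": 0.4}, "s3": {"GENE1": 2.0}},
    "cnv": {"s1": {"GENE1": 0}, "s2": {"GENE1": 1}, "s4": {"GENE1": -1}},
    "meth": {"s1": {"cg0001": 0.1}, "s2": {"cg0001": 0.8}},
}
merged = integrate_omics(layers)
```

Prefixing each feature with its layer name keeps an expression value and a copy-number value for the same gene as separate columns of the integrated data.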

The invention mainly solves the following problems:

(1) Data interference caused by problems such as data redundancy is reduced.

(2) To measure the degree of association between features, a new feature correlation defined by weights is proposed; the weights carry more comprehensive information about dynamically changing features.

(3) To assess the relevance and redundancy of features, a new evaluation criterion is proposed.

(4) To address the low accuracy obtained on multi-omics data, the invention proposes a multi-omics deep feature selection algorithm based on feature correlation and redundancy weights to improve the classification accuracy of thyroid cancer.

The invention adopts the following technical scheme:

A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, comprising:

Step 1: preprocessing the data;

Step 2: screening a feature subset based on the feature correlation defined by the weights;

Step 3: inputting the screened feature subset into a neural network for learning and classification to obtain the final multi-omics cancer classification result.

Preferably, in step 1, the preprocessing is as follows. For the expression data, important genes are obtained in R by differential analysis using the adjusted p-value threshold adjPvalue < 0.5. For the copy number variation data, the metadata file is matched to the samples in R, tumor samples and normal samples are selected, and the data are then analyzed on the GISTIC2.0 platform to obtain sample and gene data. For the methylation data, differentially expressed genes and differentially methylated CpG sites are analyzed with the limma R package, and differentially methylated genes are screened with FDRFILTER and logFCfilter. This completes the data preprocessing.
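The threshold-based screening in step 1 amounts to keeping the rows of a differential-analysis result table that pass an adjusted p-value cutoff and a fold-change cutoff. The sketch below illustrates that filter in Python rather than R. The function name, the field names `adj_p` and `logFC`, the gene labels and the fold-change cutoff of 1.0 are assumptions for illustration; only the 0.5 adjusted p-value threshold comes from the text.

```python
def filter_differential(rows, max_adj_p, min_abs_logfc):
    """Keep features whose adjusted p-value and |logFC| pass the cutoffs."""
    return [
        r["feature"]
        for r in rows
        if r["adj_p"] < max_adj_p and abs(r["logFC"]) >= min_abs_logfc
    ]


# Hypothetical differential-analysis output (one dict per gene).
rows = [
    {"feature": "GENE1", "adj_p": 0.01, "logFC": 2.3},
    {"feature": "GENE2", "adj_p": 0.70, "logFC": 3.1},   # fails the p-value cutoff
    {"feature": "GENE3", "adj_p": 0.02, "logFC": -1.8},  # down-regulated, still kept
    {"feature": "GENE4", "adj_p": 0.03, "logFC": 0.2},   # fold change too small
]
kept = filter_differential(rows, max_adj_p=0.5, min_abs_logfc=1.0)
```

Taking the absolute value of the log fold change keeps both up-regulated and down-regulated features.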

Preferably, in step 2, a feature-related redundancy weight FRRW is defined and used to distinguish feature subsets with similar features, as shown in formula (1):

where I(f_k,f_i;C) represents the joint mutual information of the candidate feature subset, the best feature subset and the class, and also captures the relevance and interaction as the selected subset changes dynamically; p(f_i|C) represents the probability that the i-th best feature occurs in class C; p(f_k|C) represents the probability that the k-th candidate feature occurs in class C; and p(f_i,f_k,C) represents the joint probability of the k-th candidate feature, the i-th best feature and class C;

H(f_k) represents the information entropy of the candidate feature subset, computed from p(f_k), the probability that the k-th candidate feature occurs in the current subset;

H(f_i) represents the information entropy of the best feature subset, computed from p(f_i), the probability that the i-th best feature occurs in the current subset;

H(f_k,f_i) represents the joint entropy of the candidate feature subset and the best feature subset, computed from p(f_k,f_i), the probability that the k-th candidate feature and the i-th best feature occur together in the current feature subset;

I(f_i;f_k|C) represents the conditional mutual information, i.e. the information about the candidate feature subset obtained from the class when the best feature subset is determined;
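For reference, the entropy and conditional mutual information terms named above have the following textbook definitions. This is a sketch assuming the standard information-theoretic forms, with sums taken over the values of each variable and an unspecified log base. It does not reconstruct formula (1) itself.

```latex
\begin{align}
H(f_k) &= -\sum_{f_k} p(f_k)\,\log p(f_k) \\
H(f_i) &= -\sum_{f_i} p(f_i)\,\log p(f_i) \\
H(f_k, f_i) &= -\sum_{f_k}\sum_{f_i} p(f_k, f_i)\,\log p(f_k, f_i) \\
I(f_i; f_k \mid C) &= \sum_{f_i}\sum_{f_k}\sum_{C} p(f_i, f_k, C)\,
  \log\frac{p(f_i, f_k \mid C)}{p(f_i \mid C)\,p(f_k \mid C)}
\end{align}
```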

A feature correlation FR is defined, which measures the correlation between two evaluated features, as shown in formula (2):

FR = FRRW(f_k,f_i) * I(f_k;C|f_i) (2)

where I(f_k;C|f_i) represents the conditional mutual information, i.e. the class information obtained from the best feature subset when the candidate feature subset is determined, and also represents the redundancy of the features; p(f_k,C,f_i) represents the joint probability of the k-th candidate feature, class C and the i-th best feature; p(f_k|f_i) represents the probability of the k-th candidate feature given the i-th best feature; and p(C|f_i) represents the probability of class C given the i-th best feature;
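The conditional mutual information in formula (2) can be estimated from discrete data by counting value combinations. The sketch below assumes already-discretized feature columns and uses the identity I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), measured in bits. The function names are illustrative, and the FRRW weight is passed in as a plain number because its formula is not reproduced in this text.

```python
import math
from collections import Counter


def entropy(*columns):
    """Shannon entropy (bits) of one column, or joint entropy of several."""
    n = len(columns[0])
    counts = Counter(zip(*columns))  # empirical counts of each value tuple
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_mutual_information(x, y, z):
    """I(x; y | z) = H(x, z) + H(y, z) - H(x, y, z) - H(z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def feature_relevance(weight, f_k, f_i, labels):
    """FR = weight * I(f_k; C | f_i); `weight` stands in for FRRW(f_k, f_i)."""
    return weight * conditional_mutual_information(f_k, labels, f_i)


labels = [0, 0, 1, 1]
f_i = [0, 1, 0, 1]  # already-selected feature, independent of the label
f_k = [0, 0, 1, 1]  # candidate feature, identical to the label
cmi = conditional_mutual_information(f_k, labels, f_i)
fr = feature_relevance(0.5, f_k, f_i, labels)  # 0.5 is an arbitrary example weight
```

In this toy case the candidate carries one full bit of class information that the selected feature does not, so the conditional mutual information is 1 bit.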

A feature evaluation criterion is defined, as shown in formula (3):

where I(f_k;f_i) represents the mutual information between the best feature subset and the candidate feature subset; S represents the selected best feature subset; F = {f_1, f_2, f_3, ..., f_n} represents the candidate feature subset; and C represents the class;

First, the mutual information between every candidate feature and the class is calculated, and the feature f_i with the largest value is selected and merged into S, after which F denotes the candidate feature subset with that feature removed. The number K of features to select is set as required. Then, in each iteration of a loop, the feature in the remaining candidate feature subset F with the largest J(f_k) value in formula (3) is computed and merged into S, until the loop ends.
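The selection procedure just described is a greedy forward search: seed S with the feature most informative about the class, then repeatedly add the candidate that maximizes the criterion J. A minimal sketch follows. Because formula (3) is not reproduced in this text, `stand_in_criterion` is a placeholder of the mRMR type (relevance minus mean redundancy), not the patent's J(f_k). All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(*columns):
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)


def stand_in_criterion(f_k, selected, labels):
    """Placeholder for J(f_k): relevance minus mean redundancy (mRMR-style)."""
    redundancy = sum(mutual_information(f_k, f_i) for f_i in selected) / len(selected)
    return mutual_information(f_k, labels) - redundancy


def greedy_select(features, labels, k, criterion=stand_in_criterion):
    """features: {name: column}. Returns the names of up to k selected features."""
    remaining = dict(features)
    # First pick: the feature with the largest mutual information with the class.
    first = max(remaining, key=lambda name: mutual_information(remaining[name], labels))
    selected = [first]
    del remaining[first]
    # Later picks: the remaining feature that maximizes the criterion.
    while remaining and len(selected) < k:
        chosen = [features[name] for name in selected]
        best = max(remaining, key=lambda name: criterion(remaining[name], chosen, labels))
        selected.append(best)
        del remaining[best]
    return selected


labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],       # informative
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of "a"
    "c": [0, 0, 1, 0, 1, 1, 0, 1],       # weaker, but adds new information
}
picked = greedy_select(features, labels, k=2)
```

With this stand-in criterion the duplicate column is penalized for its redundancy with the first pick, so the weaker but complementary feature is chosen second.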

Preferably, in step 3, the neural network is a DNN comprising an input layer, four hidden layers and an output layer. The screened feature subset is input into the DNN, and the classification accuracy of thyroid cancer from multi-omics data is improved over multiple iterations. Here X = (X_1, X_2, X_3, ..., X_n)^T represents the multi-omics feature subset matrix of thyroid cancer; z represents the sample label, with normal samples set to z = 0 and cancer samples set to z = 1; W represents the feature weights in the neural network; σ(·) represents the activation function used in the hidden layers; and g(·) represents the classification function, which converts the output value into a probability prediction.
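The forward pass of such a network can be sketched in plain Python. The layer widths, the weight initialization and the choice of ReLU for σ(·) are assumptions made for illustration, since the text does not state them. The sigmoid for g(·) follows the output-layer classifier named elsewhere in the description.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, bias, v):
    """One fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_features, hidden_sizes, seed=0):
    """Input layer -> hidden layers -> single-unit output layer."""
    rng = random.Random(seed)
    sizes = [n_features, *hidden_sizes, 1]
    return [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]


def predict_proba(network, x):
    *hidden, output = network
    for weights, bias in hidden:
        x = relu(dense(weights, bias, x))   # sigma(.) in each hidden layer (ReLU assumed)
    return sigmoid(dense(*output, x)[0])    # g(.) maps the output to a probability


# Hypothetical sizes: 5 selected features and four hidden layers.
network = build_network(n_features=5, hidden_sizes=[16, 8, 8, 4])
p = predict_proba(network, [0.2, -1.0, 0.5, 0.0, 1.3])
```

A probability above 0.5 would be read as the cancer class (z = 1) under the labelling given in the text.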

Preferably, in step 3, Adam is used as the optimizer, and the cross-entropy loss is used to calculate the training error of each layer:

where n represents the number of features; y_i represents the true sample label for feature i; the remaining symbols denote the fitted value of p_i and the predicted probability value; and the loss measures the difference between the true sample label and the predicted probability. Finally, a Sigmoid function is used as the classifier at the output layer, and the classification prediction accuracy for thyroid cancer is output.
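As an illustration of the loss described above, the following sketch computes the standard mean binary cross-entropy between 0/1 labels and predicted probabilities. It assumes the textbook form of the loss, which may differ in detail from the formula in the original document. The function name and the clipping constant `eps` are illustrative.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


loss_good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])  # confident and right
loss_bad = binary_cross_entropy([1, 0, 1], [0.1, 0.9, 0.2])   # confident and wrong
```

The loss shrinks as the predicted probabilities move toward the true labels, which is the quantity an optimizer such as Adam would minimize.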

The invention constructs a feature-correlation redundancy weight to jointly extract important relevance and redundancy information, proposes a new feature correlation defined by that weight, in which the weight carries more comprehensive information about dynamically changing features, and finally proposes a new feature evaluation criterion. The method is applied to multi-omics data of thyroid cancer, and accuracies for the three single-omics datasets and for the multi-omics data are obtained through neural network classification.

Matters not described in detail in the present invention follow the prior art.

The beneficial effects of the invention are as follows:

The method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm proposes a new feature correlation defined by weights, in which the weights carry more comprehensive information about dynamically changing features, and a new evaluation criterion for assessing the relevance and redundancy of features. Finally, the screened feature subset is input into a DNN with four hidden layers for training and prediction, yielding a prognosis prediction for thyroid cancer based on multi-omics data and greatly improving the classification accuracy.

Drawings

FIG. 1 is a flow chart of the method of the invention for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm;

FIG. 2 is a comparison of single-omics data and multi-omics data according to the present invention;

FIG. 3 is a comparison of the present invention with existing algorithms;

FIG. 4 is a comparison of the present invention with other deep feature selection algorithms.

Detailed Description

In order to make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments, without being limited thereto. Matters not described in detail in the present invention follow conventional technology in the art.

Example 1

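A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm is, as shown in FIG. 1, divided into three main parts: data, method and performance evaluation. The data comprise transcriptomic data, copy number variation data and DNA methylation data. The method comprises:

Step 1: preprocessing the data;

Step 2: screening a feature subset based on the feature correlation defined by the weights;

Step 3: inputting the screened feature subset into a neural network for learning and classification to obtain the final multi-omics cancer classification result.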

Example 2

A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, as in Example 1, except that in step 1 the preprocessing is as follows:

For the expression data, important genes are obtained in R by differential analysis using the adjusted p-value threshold adjPvalue < 0.5. For the copy number variation data, the metadata file is matched to the samples in R, tumor samples and normal samples are selected, and the data are then analyzed on the GISTIC2.0 platform to obtain sample and gene data. For the methylation data, differentially expressed genes and differentially methylated CpG sites are analyzed with the limma R package, and differentially methylated genes are screened with FDRFILTER and logFCfilter. This completes the data preprocessing.

The method uses the multi-omics deep feature selection algorithm based on feature correlation and redundancy weights, and performance is evaluated mainly by Accuracy, Precision, Recall and F-measure.
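The four evaluation measures named above can be computed from the counts of true and false positives and negatives. The sketch below uses the usual binary-classification definitions, taking cancer (label 1) as the positive class and F-measure as the harmonic mean of precision and recall. The function name and the toy labels are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 1, 0],
    y_pred=[1, 1, 0, 0, 0, 1, 1, 0],
)
```

In this toy case there are three true positives, three true negatives, one false positive and one false negative, so all four measures equal 0.75.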

Example 3

A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, as in Example 1, except that in step 2 the feature selection is based on feature correlation and redundancy weights. A new feature correlation defined by weights is proposed, in which the weights carry more comprehensive information about dynamically changing features. To assess the relevance and redundancy of features, a new evaluation criterion is proposed.

A feature-related redundancy weight FRRW is defined and used to distinguish feature subsets with similar features, as shown in formula (1):

FR = FRRW(f_k,f_i) * I(f_k;C|f_i) (2)

A feature evaluation criterion is defined:

Example 4

In step 3, the neural network is a DNN comprising an input layer, four hidden layers and an output layer. The screened feature subset is input into the DNN, and the classification accuracy of thyroid cancer from multi-omics data is improved over multiple iterations. Here X = (X_1, X_2, X_3, ..., X_n)^T represents the multi-omics feature subset matrix of thyroid cancer; z represents the sample label, with normal samples set to z = 0 and cancer samples set to z = 1; W represents the feature weights in the neural network; σ(·) represents the activation function used in the hidden layers; and g(·) represents the classification function, which converts the output value into a probability prediction.

In this example, the details of the four-layer neural network used as the DNN are shown in Table 1.

Table 1: Neural network parameter information

The DNN uses a four-layer neural network, and the number of neurons in its hidden layers is changed according to the number of features. Extensive experiments show that the invention performs well when trained for 60 iterations. Finally, the batch size is set to 15 features.
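The training schedule above, 60 passes with a batch size of 15, can be pictured as a mini-batch iterator. This sketch assumes that each pass reshuffles the items and that the last batch of a pass may be smaller. The function name, the item count of 100 and the seed are hypothetical.

```python
import random


def iterate_minibatches(n_items, batch_size, epochs, seed=0):
    """Yield (epoch, index_batch) pairs, reshuffling the indices every epoch."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    for epoch in range(epochs):
        rng.shuffle(indices)
        for start in range(0, n_items, batch_size):
            # Slicing copies, so each batch is unaffected by later shuffles.
            yield epoch, indices[start:start + batch_size]


# Hypothetical run: 100 items, batch size 15, 60 training passes.
batches = list(iterate_minibatches(n_items=100, batch_size=15, epochs=60))
```

With 100 items, each pass yields six full batches of 15 and one final batch of 10, so 60 passes give 420 parameter updates.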

FIG. 2 compares single-omics data and multi-omics data according to the present invention, where the abscissa represents the number of features and the ordinate represents the accuracy obtained when that number of features is retained. Exp, CNV and DNA methylation represent gene expression data, copy number variation data and DNA methylation data, respectively. RWDFS represents the multi-omics data integrating the three omics datasets. Each omics dataset has a corresponding accuracy for each number of retained features.

FIG. 3 compares the present invention with existing algorithms, where CWJR represents the conditional weight joint relevance algorithm, DCSF represents the algorithm based on the dynamic change of selected features with the class, MRI represents the feature selection algorithm that maximizes independent classification information, mRMR represents the minimum redundancy maximum relevance criterion algorithm, and RWDFS represents the algorithm of this example.

FIG. 4 shows the comparison of the present invention with other deep feature selection algorithms, where forgeNet represents a graph deep neural network algorithm, RDFS represents a gastric cancer classification algorithm, fDNN represents a feature extraction algorithm, and RWDFS represents the algorithm of this example. As can be seen from FIG. 3 and FIG. 4, the algorithm of this example has the highest Accuracy.

The foregoing describes embodiments of the present invention. Those skilled in the art will appreciate that various modifications and changes can be made without departing from the principles of the present invention, and such modifications and changes are intended to be covered by the appended claims.

Claims

1. A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, characterized by comprising the following steps:

Step 1: preprocessing the data;

Step 2: screening a feature subset based on the feature correlation defined by the weights;

Step 3: inputting the screened feature subset into a neural network for learning and classification to obtain the final multi-omics cancer classification result;

In step 1, for the expression data, important genes are obtained in R by differential analysis using the adjusted p-value threshold adjPvalue < 0.5; for the copy number variation data, the metadata file is matched to the samples in R, tumor samples and normal samples are selected, and the data are then analyzed on the GISTIC2.0 platform to obtain sample and gene data; for the methylation data, differentially expressed genes and differentially methylated CpG sites are analyzed with the limma R package, and differentially methylated genes are screened with FDRFILTER and logFCfilter, completing the data preprocessing;

In step 2, a feature-related redundancy weight FRRW is defined and used to distinguish feature subsets with similar features, as shown in formula (1):

FR = FRRW(f_k,f_i) * I(f_k;C|f_i) (2)

A feature evaluation criterion is defined:

2. The method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm according to claim 1, characterized in that in step 3 the neural network is a DNN comprising an input layer, four hidden layers and an output layer; X = (X_1, X_2, X_3, ..., X_n)^T represents the multi-omics feature subset matrix of thyroid cancer; z represents the sample label, with normal samples set to z = 0 and cancer samples set to z = 1; W represents the feature weights in the neural network; σ(·) represents the activation function used in the hidden layers; and g(·) represents the classification function, which converts the output value into a probability prediction.

3. The method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm according to claim 2, characterized in that in step 3 Adam is used as the optimizer, and the cross-entropy loss is used to calculate the training error of each layer:

where n represents the number of features; y_i represents the true sample label for feature i; and the remaining symbols denote the fitted value of p_i and the predicted probability value. Finally, a Sigmoid function is used as the classifier at the output layer, and the classification prediction accuracy for thyroid cancer is output.