CN116680594B - Method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm - Google Patents

Method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm

Info

Publication number
CN116680594B
Authority
CN
China
Prior art keywords
feature
data
subset
probability
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310496632.0A
Other languages
Chinese (zh)
Other versions
CN116680594A (en
Inventor
赵龙
刘娇
司呈坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN202310496632.0A
Publication of CN116680594A
Application granted
Publication of CN116680594B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, and belongs to the technical field of biomedicine. The method comprises the following steps: preprocessing the data; screening feature subsets based on a feature correlation defined by weights; and inputting the screened feature subsets into a neural network for learning and classification to obtain the final multi-omics cancer classification result. The invention provides a new feature correlation defined on the basis of weights, where the weights carry more comprehensive information about dynamically changing features, and provides a new evaluation criterion for assessing the relevance and redundancy of features. Finally, the screened feature subsets are input into a DNN with four hidden layers for training and prediction, yielding a prognosis prediction based on thyroid multi-omics data and thereby greatly improving classification accuracy.

Description

Method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm
Technical Field
The invention relates to a method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm. The method extracts important correlation and redundancy information, makes further progress in handling high-dimensional, high-noise data, markedly improves cancer classification accuracy by introducing multi-omics data, and plays a key role in clinical prognosis prediction; finally, training with a four-layer DNN improves the classification performance on thyroid cancer multi-omics data. The invention belongs to the technical field of biomedicine.
Background
Thyroid cancer is an indolent cancer with a high survival rate that can exceed 95%. Its incidence has fluctuated slightly over the last 30 years and its mortality has fallen over the last 10 years, yet survival has not improved markedly; most thyroid cancer patients must have the thyroid removed and take medication for life, and the clinical misdiagnosis rate is high. Designing an efficient algorithm is therefore critical for the timely clinical prediction of thyroid cancer. In recent years, omics data technology has become an important tool for cancer prediction, but most existing thyroid cancer classification algorithms are based on single-omics data. Multi-omics data can compensate for the incomplete information of any single omics layer, support a more accurate analysis of cancer pathogenesis, and provide the data needed for the diagnosis and prediction of thyroid cancer. Reducing dimensionality with multi-omics feature-correlation redundancy weights while improving the prognosis prediction of thyroid cancer is therefore of great significance.
Thyroid cancer is a common cancer, but it has been studied relatively little in the field of deep learning. Mourad et al. improved classification accuracy by feature extraction from the clinical information of thyroid cancer patients; see M. Mourad, S. Moubayed, A. Dezube, Y. Mourad, K. Park, A. Torreblanca-Zanca, J. S. Torrecilla, J. C. Cancilla, and J. Wang, "Machine learning and feature selection applied to SEER data to reliably assess thyroid cancer prognosis," Scientific Reports, vol. 10, no. 1, p. 5176, 2020. Raweh et al. improved the prediction of various cancers, including thyroid cancer, using a hybrid feature selection algorithm; see A. A. Raweh, M. Nassef, and A. Badr, "A hybridized feature selection and extraction approach for enhancing cancer prediction based on DNA methylation," IEEE Access, vol. 6, pp. 15212-15223, 2018. Lang et al. improved the prediction of thyroid cancer risk using deep learning for medical image segmentation; see S. Lang, Y. Xu, L. Li, B. Wang, Y. Yang, Y. Xue, and K. Shi, "Joint detection of Tap and CEA based on deep learning medical image segmentation: risk prediction of thyroid cancer," Journal of Healthcare Engineering, vol. 2021, pp. 1-9, 2021.
The above studies all use single-omics data for classification prediction, and their classification performance remains insufficient.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention provides a method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm. The method improves the prediction accuracy of thyroid cancer by integrating transcriptome data, copy number variation data and DNA methylation data, and improves the classification performance by maximizing feature correlation and minimizing feature redundancy.
Term interpretation:
1. Omics data: mainly includes transcriptomics, lipidomics, immunomics, RNA omics, radiomics, ultrasound omics and the like.
2. Multi-omics data: refers to the integrated analysis of two or more kinds of omics data.
3. Expression data (Exp): reflects the abundance of gene transcript mRNA in the cell, measured directly or indirectly; these data can be used to analyze which genes have altered expression, what correlations exist between genes, and how gene activity is affected under different conditions.
4. Copy Number Variation (CNV): is caused by rearrangement of the genome, and generally refers to an increase or decrease in the copy number of genomic segments of 1 kb or longer.
5. Methylation data (DNA Methylation): DNA methylation is a form of chemical modification of DNA that can alter gene expression without altering the DNA sequence.
6. Data integration: refers to combining multiple kinds of omics data into a single dataset through operations such as preprocessing.
The invention mainly solves the following problems:
(1) Data interference caused by problems such as data redundancy is resolved.
(2) To capture the degree of association between features, a new feature correlation defined on the basis of weights is proposed; the weights carry more comprehensive information about dynamically changing features.
(3) To assess the relevance and redundancy of features, a new evaluation criterion is proposed.
(4) To address the low accuracy obtained on multi-omics data, the invention proposes a multi-omics deep feature selection algorithm based on feature correlation and redundancy weights to improve the classification accuracy of thyroid cancer.
The invention adopts the following technical scheme:
A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, comprising:
Step 1: preprocessing the data;
Step 2: screening feature subsets based on the feature correlation defined by the weights;
Step 3: inputting the screened feature subsets into a neural network for learning and classification to obtain the final multi-omics cancer classification result.
Preferably, in step 1: for the expression data, important genes are obtained in the R language by differential analysis with an adjusted p-value (adjPvalue) < 0.5; for the copy number variation data, the metadata file is matched with the samples in the R language, tumor samples and normal samples are selected, and the data are then analyzed on the GISTIC2.0 platform to obtain the sample and gene data; for the methylation data, differentially expressed genes and differentially methylated CpG sites are analyzed with the limma R package, and differentially methylated genes are screened using FDRFILTER and logFCfilter, which completes the data preprocessing.
Preferably, in step 2, feature-related redundancy weights FRRW are defined, and feature-related redundancy weights are used to distinguish feature subsets having similar features, as shown in formula (1):
where $I(f_k, f_i; C)$ is the joint mutual information of the candidate feature, the best feature and the class, and also reflects the correlation and interaction that arise as the selected subset changes dynamically; $p(f_i \mid C)$ is the probability of the i-th best feature within class C, $p(f_k \mid C)$ is the probability of the k-th candidate feature within class C, and $p(f_i, f_k, C)$ is the joint probability of the k-th candidate feature, the i-th best feature and class C;
$H(f_k)$ is the information entropy of the candidate feature subset, obtained as $H(f_k) = -\sum_{f_k} p(f_k) \log p(f_k)$, where $p(f_k)$ is the probability of the k-th candidate feature in the current subset;
$H(f_i)$ is the information entropy of the best feature subset, obtained as $H(f_i) = -\sum_{f_i} p(f_i) \log p(f_i)$, where $p(f_i)$ is the probability of the i-th best feature in the current subset;
$H(f_k, f_i)$ is the joint entropy of the candidate feature subset and the best feature subset, obtained as $H(f_k, f_i) = -\sum_{f_k} \sum_{f_i} p(f_k, f_i) \log p(f_k, f_i)$, where $p(f_k, f_i)$ is the joint probability of the k-th candidate feature and the i-th best feature in the current feature subset;
$I(f_i; f_k \mid C)$ is the conditional mutual information, that is, the information shared by the best feature and the candidate feature once the class is known, obtained as $I(f_i; f_k \mid C) = \sum_{f_i} \sum_{f_k} \sum_{C} p(f_i, f_k, C) \log \frac{p(f_i, f_k \mid C)}{p(f_i \mid C)\, p(f_k \mid C)}$;
A feature correlation FR is defined, which measures the correlation between the two evaluated features, as shown in formula (2):
$FR = FRRW(f_k, f_i) \cdot I(f_k; C \mid f_i)$ (2)
where $I(f_k; C \mid f_i)$ is the conditional mutual information between the candidate feature and the class given the best feature, which also expresses the redundancy of the features: $I(f_k; C \mid f_i) = \sum_{f_k} \sum_{C} \sum_{f_i} p(f_k, C, f_i) \log \frac{p(f_k, C \mid f_i)}{p(f_k \mid f_i)\, p(C \mid f_i)}$, where $p(f_k, C, f_i)$ is the joint probability of the k-th candidate feature, class C and the i-th best feature; $p(f_k \mid f_i)$ is the probability of the k-th candidate feature given the i-th best feature; and $p(C \mid f_i)$ is the probability of class C given the i-th best feature;
A feature evaluation criterion $J(f_k)$ is defined, as shown in formula (3):
where $I(f_k; f_i)$ is the mutual information between the best feature subset and the candidate feature subset, obtained as $I(f_k; f_i) = \sum_{f_k} \sum_{f_i} p(f_k, f_i) \log \frac{p(f_k, f_i)}{p(f_k)\, p(f_i)}$; S denotes the selected best feature subset, $F = \{f_1, f_2, f_3, \dots, f_n\}$ denotes the candidate feature subset, and C denotes the class;
First, the mutual information between every candidate feature and the class is calculated, and the feature $f_i$ with the largest value is screened out and merged into S, after which F denotes the candidate feature subset with that feature removed. The number K of features to select is set as required. A loop is then run: in each iteration, the feature in the remaining candidate subset F with the largest $J(f_k)$ value under formula (3) is found and merged into S, until the loop ends.
Preferably, in step 3, the neural network is a DNN comprising an input layer, four hidden layers and an output layer; the screened feature subset is input into the DNN, and the multi-omics classification accuracy of thyroid cancer is improved over multiple iterations. Here $X = (X_1, X_2, X_3, \dots, X_n)^T$ is the multi-omics feature subset matrix of thyroid cancer; z is the sample label, with z = 0 for normal samples and z = 1 for cancer samples; W denotes the feature weights in the neural network; σ(·) is the activation function of the neural network, used in the hidden layers; and g(·) is the classification function, which converts the output value into a probability prediction.
Preferably, in step 3, Adam is used as the optimizer, and the cross-entropy loss is used to calculate the training error of each layer:
where n is the number of features, $y_i$ is the true sample label for feature i, and the fitted value of $p_i$ is the predicted probability; the loss measures the difference between the true sample label and the predicted probability. Finally, a Sigmoid function is used as the classifier at the output layer, which outputs the classification prediction accuracy for thyroid cancer.
The invention constructs a feature correlation and a redundancy weight that together extract important correlation and redundancy information, proposes a new feature correlation defined on the basis of weights, where the weights carry more comprehensive information about dynamically changing features, and finally proposes a new criterion for feature evaluation. The method is applied to thyroid cancer multi-omics data, and the accuracies for the three single-omics datasets and for the multi-omics dataset are obtained by neural network classification.
Where the present invention is not described in detail, the prior art applies.
The beneficial effects of the invention are as follows:
The method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm provides a new feature correlation defined on the basis of weights, where the weights carry more comprehensive information about dynamically changing features, and provides a new evaluation criterion for assessing the relevance and redundancy of features. Finally, the screened feature subsets are input into a DNN with four hidden layers for training and prediction, yielding a prognosis prediction based on thyroid multi-omics data and thereby greatly improving classification accuracy.
Drawings
FIG. 1 is a flow chart of the method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm;
FIG. 2 is a comparison of single-omics data and multi-omics data according to the present invention;
FIG. 3 is a comparison of the present invention with existing algorithms;
FIG. 4 is a comparison of the present invention with other depth feature selection algorithms.
Detailed Description
To make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments, without being limited thereto; matters not described in detail follow conventional technology in the art.
Example 1
A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm is, as shown in FIG. 1, divided mainly into data, method and performance evaluation. The data comprise transcriptomic data, copy number variation data and DNA methylation data, and the method comprises:
Step 1: preprocessing the data;
Step 2: screening feature subsets based on the feature correlation defined by the weights;
Step 3: inputting the screened feature subsets into a neural network for learning and classification to obtain the final multi-omics cancer classification result.
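As an illustration of the data side of this embodiment, the following minimal sketch shows one way the three omics layers could be combined into a single feature matrix by keeping only the samples present in every layer. The function name `integrate_omics`, the layer keys and the toy values are assumptions made for this sketch and do not come from the patent.

```python
def integrate_omics(layers):
    """Concatenate the features of samples that appear in every omics layer.

    `layers` maps a layer name to {sample_id: {feature_name: value}}.
    Returns (sample_ids, feature_names, matrix) with one row per sample.
    """
    # Keep only samples measured in all layers.
    shared = sorted(set.intersection(*(set(table) for table in layers.values())))
    names = []
    for layer, table in layers.items():
        feats = sorted({f for s in shared for f in table[s]})
        names.extend((layer, f) for f in feats)
    # Prefix each feature with its layer so names stay unique after merging.
    matrix = [[layers[layer][s][f] for layer, f in names] for s in shared]
    return shared, [f"{layer}:{f}" for layer, f in names], matrix


# Hypothetical toy input: expression, copy number variation, methylation.
toy_layers = {
    "exp": {"s1": {"g1": 2.0}, "s2": {"g1": 3.0}},
    "cnv": {"s1": {"g1": -1.0}, "s3": {"g1": 1.0}},
    "meth": {"s1": {"cg01": 0.4}, "s2": {"cg01": 0.6}},
}
samples, feature_names, X = integrate_omics(toy_layers)
```

Only sample `s1` is present in all three toy layers, so the merged matrix has a single row. Prefixing feature names with the layer keeps a gene that appears in both the expression and CNV tables as two distinct columns.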
Example 2
A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, as in Example 1, except that in step 1 the preprocessing is as follows:
For the expression data, important genes are obtained in the R language by differential analysis with an adjusted p-value (adjPvalue) < 0.5; for the copy number variation data, the metadata file is matched with the samples in the R language, tumor samples and normal samples are selected, and the data are then analyzed on the GISTIC2.0 platform to obtain the sample and gene data; for the methylation data, differentially expressed genes and differentially methylated CpG sites are obtained with the limma R package, and differentially methylated genes are screened using FDRFILTER and logFCfilter, which completes the data preprocessing.
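The threshold-based screening described above can be sketched as a simple filter over a differential analysis table. This is a Python illustration only, not the R/limma workflow the embodiment uses; the row fields `id`, `adj_p` and `logFC`, the function name and the log fold-change cutoff of 1.0 are assumptions for the example (the text gives only the adjusted p-value threshold of 0.5 for expression data).

```python
def filter_differential(rows, fdr_filter, logfc_filter):
    """Return IDs whose adjusted p-value is below fdr_filter
    and whose absolute log fold change exceeds logfc_filter."""
    return [
        row["id"]
        for row in rows
        if row["adj_p"] < fdr_filter and abs(row["logFC"]) > logfc_filter
    ]


# Hypothetical differential analysis results (values invented for illustration).
toy_rows = [
    {"id": "geneA", "adj_p": 0.01, "logFC": 2.3},
    {"id": "geneB", "adj_p": 0.70, "logFC": 3.0},   # fails the p-value filter
    {"id": "geneC", "adj_p": 0.20, "logFC": 0.2},   # fails the logFC filter
    {"id": "geneD", "adj_p": 0.03, "logFC": -1.8},  # strong down-regulation
]
kept = filter_differential(toy_rows, fdr_filter=0.5, logfc_filter=1.0)
```

Taking the absolute value of the log fold change keeps both up-regulated and down-regulated features, which is the usual intent of a logFC filter.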
The method follows the working procedure of the multi-omics deep feature selection algorithm based on feature correlation and redundancy weights, and performance is evaluated mainly by Accuracy, Precision, Recall and F-measure.
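For reference, the four evaluation measures named above can be computed from a binary confusion matrix as in the sketch below. It assumes the label convention used elsewhere in this document (1 for cancer, 0 for normal) and the standard F1 form of the F-measure; the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for 0/1 labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```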
Example 3
A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, as in Example 1, except that in step 2 the feature selection part is based on feature correlation and redundancy weights. A new feature correlation defined on the basis of weights is proposed, where the weights carry more comprehensive information about dynamically changing features. To evaluate the relevance and redundancy of features, a new evaluation criterion is proposed.
Feature-dependent redundancy weights FRRW are defined, which are used to distinguish feature subsets with similar features, as shown in equation (1):
where $I(f_k, f_i; C)$ is the joint mutual information of the candidate feature, the best feature and the class, and also reflects the correlation and interaction that arise as the selected subset changes dynamically; $p(f_i \mid C)$ is the probability of the i-th best feature within class C, $p(f_k \mid C)$ is the probability of the k-th candidate feature within class C, and $p(f_i, f_k, C)$ is the joint probability of the k-th candidate feature, the i-th best feature and class C;
$H(f_k)$ is the information entropy of the candidate feature subset, obtained as $H(f_k) = -\sum_{f_k} p(f_k) \log p(f_k)$, where $p(f_k)$ is the probability of the k-th candidate feature in the current subset;
$H(f_i)$ is the information entropy of the best feature subset, obtained as $H(f_i) = -\sum_{f_i} p(f_i) \log p(f_i)$, where $p(f_i)$ is the probability of the i-th best feature in the current subset;
$H(f_k, f_i)$ is the joint entropy of the candidate feature subset and the best feature subset, obtained as $H(f_k, f_i) = -\sum_{f_k} \sum_{f_i} p(f_k, f_i) \log p(f_k, f_i)$, where $p(f_k, f_i)$ is the joint probability of the k-th candidate feature and the i-th best feature in the current feature subset;
$I(f_i; f_k \mid C)$ is the conditional mutual information, that is, the information shared by the best feature and the candidate feature once the class is known, obtained as $I(f_i; f_k \mid C) = \sum_{f_i} \sum_{f_k} \sum_{C} p(f_i, f_k, C) \log \frac{p(f_i, f_k \mid C)}{p(f_i \mid C)\, p(f_k \mid C)}$;
A feature correlation FR is defined, which measures the correlation between the two evaluated features, as shown in formula (2):
$FR = FRRW(f_k, f_i) \cdot I(f_k; C \mid f_i)$ (2)
where $I(f_k; C \mid f_i)$ is the conditional mutual information between the candidate feature and the class given the best feature, which also expresses the redundancy of the features: $I(f_k; C \mid f_i) = \sum_{f_k} \sum_{C} \sum_{f_i} p(f_k, C, f_i) \log \frac{p(f_k, C \mid f_i)}{p(f_k \mid f_i)\, p(C \mid f_i)}$, where $p(f_k, C, f_i)$ is the joint probability of the k-th candidate feature, class C and the i-th best feature; $p(f_k \mid f_i)$ is the probability of the k-th candidate feature given the i-th best feature; and $p(C \mid f_i)$ is the probability of class C given the i-th best feature;
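The information-theoretic quantities named above (entropy, joint entropy, mutual information, conditional mutual information and joint mutual information) can be estimated from discrete feature columns by counting. The sketch below uses the standard entropy identities and base-2 logarithms, both of which are assumptions since this text does not state the log base; it does not compute FRRW itself, because formula (1) is not reproduced here. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def joint_mutual_information(x, y, c):
    """I(X,Y;C) = H(X,Y) + H(C) - H(X,Y,C)."""
    return entropy(x, y) + entropy(c) - entropy(x, y, c)


# Toy example: the class is the XOR of a candidate and a selected feature.
f_k = [0, 0, 1, 1]          # candidate feature
f_i = [0, 1, 0, 1]          # already selected ("best") feature
cls = [0, 1, 1, 0]          # class label, f_k XOR f_i

alone = mutual_information(f_k, cls)                      # f_k alone says nothing
given = conditional_mutual_information(f_k, cls, f_i)     # f_k given f_i decides the class
```

In the XOR toy case the candidate feature carries 0 bits about the class on its own but 1 bit once the selected feature is known, which is the kind of feature interaction that a conditional term such as $I(f_k; C \mid f_i)$ can capture and plain mutual information cannot.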
A feature evaluation criterion $J(f_k)$ is defined, as shown in formula (3):
where $I(f_k; f_i)$ is the mutual information between the best feature subset and the candidate feature subset, obtained as $I(f_k; f_i) = \sum_{f_k} \sum_{f_i} p(f_k, f_i) \log \frac{p(f_k, f_i)}{p(f_k)\, p(f_i)}$; S denotes the selected best feature subset, $F = \{f_1, f_2, f_3, \dots, f_n\}$ denotes the candidate feature subset, and C denotes the class;
First, the mutual information between every candidate feature and the class is calculated, and the feature $f_i$ with the largest value is screened out and merged into S, after which F denotes the candidate feature subset with that feature removed. The number K of features to select is set as required. A loop is then run: in each iteration, the feature in the remaining candidate subset F with the largest $J(f_k)$ value under formula (3) is found and merged into S, until the loop ends.
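The greedy loop described above can be sketched as follows. Because formula (3) is not reproduced in this text, the scoring function here is a plainly labeled stand-in: an mRMR-style score (relevance minus mean redundancy), not the patent's criterion $J(f_k)$. The scorer is passed as a parameter so that a different criterion could be substituted. All names and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)


def mrmr_style_score(candidate, selected_columns, labels):
    """Stand-in criterion (NOT formula (3)): relevance minus mean redundancy."""
    relevance = mutual_information(candidate, labels)
    redundancy = sum(
        mutual_information(candidate, s) for s in selected_columns
    ) / len(selected_columns)
    return relevance - redundancy


def greedy_select(features, labels, k, score=mrmr_style_score):
    """Pick k feature names: first by I(f;C), then by the scoring criterion."""
    remaining = dict(features)  # candidate set F
    first = max(remaining, key=lambda name: mutual_information(remaining[name], labels))
    selected = [first]          # selected set S
    del remaining[first]
    while len(selected) < k and remaining:
        chosen = [features[name] for name in selected]
        best = max(remaining, key=lambda name: score(remaining[name], chosen, labels))
        selected.append(best)
        del remaining[best]
    return selected


# Toy data: "a_dup" duplicates "a"; "b" is weaker but adds new information.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "a_dup": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 1, 0, 0, 1, 1, 0, 1],
}
picked = greedy_select(toy_features, toy_labels, k=2)
```

With the toy data, the first pick is `a` (highest mutual information with the class) and the second is `b`, because the exact duplicate `a_dup` is fully redundant with what is already in S.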
Example 4
In step 3, the neural network is a DNN comprising an input layer, four hidden layers and an output layer; the screened feature subset is input into the DNN, and the multi-omics classification accuracy of thyroid cancer is improved over multiple iterations. Here $X = (X_1, X_2, X_3, \dots, X_n)^T$ is the multi-omics feature subset matrix of thyroid cancer; z is the sample label, with z = 0 for normal samples and z = 1 for cancer samples; W denotes the feature weights in the neural network; σ(·) is the activation function of the neural network, used in the hidden layers; and g(·) is the classification function, which converts the output value into a probability prediction.
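The layer structure described above (input layer, four hidden layers, sigmoid output) can be illustrated with a small forward pass in plain Python. The hidden layer widths, the ReLU hidden activation and the random initialization are assumptions made for this sketch: the text does not name the hidden activation σ(·), and the parameter table is not reproduced here.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, bias, x):
    """One fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_dnn(n_features, hidden=(32, 16, 8, 4), seed=0):
    """Four hidden layers (hypothetical widths) followed by one output unit."""
    rng = random.Random(seed)
    sizes = [n_features, *hidden, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict_proba(layers, x):
    """Return the predicted probability that sample x is a cancer sample (z = 1)."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(dense(weights, bias, h))      # sigma(.) in the hidden layers
    weights, bias = layers[-1]
    return sigmoid(dense(weights, bias, h)[0])  # g(.) at the output layer


model = build_dnn(n_features=10)
probability = predict_proba(model, [0.5] * 10)
```

The model here is untrained, so the output is only a shape check; a real run would fit the weights with an optimizer before predicting.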
In this embodiment, the details of the four-layer neural network used for the DNN are shown in Table 1.
Table 1: Neural network parameter information
The DNN uses a four-layer neural network, and the number of hidden-layer neurons is varied according to the number of features. Extensive experiments show that the invention performs well when trained for 60 iterations. Finally, the batch size is set to 15 features.
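To make the training schedule concrete, the sketch below enumerates shuffled mini-batches for 60 passes over the data with a batch size of 15. It assumes that the "60 times" above means 60 epochs and that the batch size counts samples per update; both readings are assumptions, as is the function name.

```python
import random


def iter_minibatches(n_samples, batch_size=15, epochs=60, seed=0):
    """Yield (epoch, sample_indices) pairs, reshuffling the data every epoch."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        order = list(range(n_samples))
        rng.shuffle(order)
        for start in range(0, n_samples, batch_size):
            yield epoch, order[start:start + batch_size]


# With 100 hypothetical samples, each epoch has 6 full batches and one of 10.
schedule = list(iter_minibatches(n_samples=100))
```

Each yielded batch would correspond to one optimizer update in a training loop.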
Preferably, in step 3, Adam is used as the optimizer, and the cross-entropy loss is used to calculate the training error of each layer:
where n is the number of features, $y_i$ is the true sample label for feature i, and the fitted value of $p_i$ is the predicted probability; the loss measures the difference between the true sample label and the predicted probability. Finally, a Sigmoid function is used as the classifier at the output layer, which outputs the classification prediction accuracy for thyroid cancer.
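Since the loss formula itself is not reproduced in this text, the sketch below assumes the standard binary cross-entropy, averaged over the labeled items, applied to sigmoid outputs. The clipping constant `eps` and the function name are choices made for this example, and the Adam update is not shown.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Confident, correct predictions give a lower loss than uninformative ones.
loss_confident = binary_cross_entropy([1, 0], [0.9, 0.1])
loss_uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
```

A prediction of 0.5 for every item gives a loss of ln 2, which serves as a convenient baseline when reading training curves.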
FIG. 2 is a comparison of single-omics data and multi-omics data according to the present invention, where the abscissa is the number of features and the ordinate is the corresponding accuracy when different numbers of features are retained. Exp, CNV and DNA methylation denote the gene expression data, copy number variation data and DNA methylation data, respectively. RWDFS denotes the multi-omics data obtained by integrating the three omics datasets; each omics dataset has a corresponding accuracy for each number of retained features.
FIG. 3 is a comparison of the present invention with existing algorithms, where CWJR denotes the conditional weight joint relevance algorithm, DCSF denotes the algorithm based on the dynamic change of selected features with the class, MRI denotes a feature selection algorithm that maximizes independent classification information, mRMR denotes the minimum-redundancy maximum-relevance criterion algorithm, and RWDFS denotes the algorithm of the present embodiment.
FIG. 4 is a comparison of the present invention with other deep feature selection algorithms, where forgeNet denotes the graph deep neural network algorithm, RDFS denotes the gastric cancer classification algorithm, fDNN denotes the feature extraction algorithm, and RWDFS denotes the algorithm of the present embodiment. As can be seen from FIG. 3 and FIG. 4, the algorithm of the present embodiment has the highest Accuracy.
While the foregoing is directed to embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and changes can be made without departing from the principles of the present invention, and it is intended to cover the modifications and changes as defined in the appended claims.

Claims (3)

1. A method for improving the classification accuracy of thyroid cancer from multi-omics data using a deep feature selection algorithm, characterized by comprising the following steps:
Step 1: preprocessing the data;
Step 2: screening feature subsets based on the feature correlation defined by the weights;
Step 3: inputting the screened feature subsets into a neural network for learning and classification to obtain the final multi-omics cancer classification result;
in step 1, for the expression data, important genes are obtained in the R language by differential analysis with an adjusted p-value (adjPvalue) < 0.5; for the copy number variation data, the metadata file is matched with the samples in the R language, tumor samples and normal samples are selected, and the data are then analyzed on the GISTIC2.0 platform to obtain the sample and gene data; for the methylation data, differentially expressed genes and differentially methylated CpG sites are analyzed with the limma R package, and differentially methylated genes are screened using FDRFILTER and logFCfilter, which completes the data preprocessing;
in step 2, feature-dependent redundancy weights FRRW are defined, and feature subsets with similar features are distinguished using the feature-dependent redundancy weights, as shown in equation (1):
Where I (f k,fi; C) represents the joint mutual information of candidate feature subsets, best feature subsets and classes, and also represents the correlation and interaction when dynamically changing the selected subset is taken into consideration, P (f i |C) represents the probability that the ith best feature occurs in category C, p (f k |C) represents the probability that the kth candidate feature occurs in category C, and p (f i,fk, C) represents the probability that the kth candidate feature occurs with the ith best feature and category C;
H(f_k) represents the information entropy of the candidate feature subset, obtained as $H(f_k) = -\sum_{f_k} p(f_k)\log p(f_k)$, where p(f_k) represents the probability that the kth candidate feature occurs in the current subset;
H(f_i) represents the information entropy of the best feature subset, obtained as $H(f_i) = -\sum_{f_i} p(f_i)\log p(f_i)$, where p(f_i) represents the probability that the ith best feature occurs in the current subset;
H(f_k, f_i) represents the joint entropy of the candidate feature subset and the best feature subset, obtained as $H(f_k, f_i) = -\sum_{f_k}\sum_{f_i} p(f_k, f_i)\log p(f_k, f_i)$, where p(f_k, f_i) represents the probability that the kth candidate feature and the ith best feature occur together in the current feature subset;
I(f_i; f_k | C) represents the conditional mutual information between the best feature subset and the candidate feature subset given the class, obtained as $I(f_i; f_k \mid C) = \sum_{f_i}\sum_{f_k}\sum_{C} p(f_i, f_k, C)\log\frac{p(f_i, f_k \mid C)}{p(f_i \mid C)\,p(f_k \mid C)}$;
a feature correlation FR is defined, which measures the correlation between the two evaluated features, as shown in equation (2):
$$FR = FRRW(f_k, f_i)\cdot I(f_k; C \mid f_i) \qquad (2)$$
where I(f_k; C | f_i) represents the conditional mutual information between the candidate feature subset and the class given the best feature subset, and also reflects the redundancy of the features, obtained as $I(f_k; C \mid f_i) = \sum_{f_k}\sum_{C}\sum_{f_i} p(f_k, C, f_i)\log\frac{p(f_k, C \mid f_i)}{p(f_k \mid f_i)\,p(C \mid f_i)}$; p(f_k, C, f_i) represents the probability that the kth candidate feature occurs together with class C and the ith best feature; p(f_k|f_i) represents the conditional probability of the kth candidate feature given the ith best feature; and p(C|f_i) represents the conditional probability of class C given the ith best feature;
a feature evaluation criterion is defined:
I(f_k; f_i) represents the mutual information between the best feature subset and the candidate feature subset, obtained as $I(f_k; f_i) = \sum_{f_k}\sum_{f_i} p(f_k, f_i)\log\frac{p(f_k, f_i)}{p(f_k)\,p(f_i)}$, where S represents the selected best feature subset, F = {f_1, f_2, f_3, …, f_n} represents the candidate feature subset, and C represents the class;
first, the mutual information between every candidate feature and the class is calculated, and the feature f_i with the largest value is selected and merged into S, after which F denotes the candidate feature subset with that feature removed; the number K of features to be selected is set as required; in each iteration of a loop, the feature with the largest J(f_k) value according to equation (3) is found in the remaining candidate feature subset F and merged into S, until the loop ends.
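The greedy loop described at the end of claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: features are taken to be already discretized, entropies are estimated from value counts, and because equations (1) and (3) are not legible in this text, the scoring function `stand_in_score` is a hypothetical placeholder (the mean of I(f_k; C | f_i) over the selected features), not the claimed FRRW-based criterion J(f_k). All function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy, in bits, of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_info(x, y):
    """I(x; y) = H(x) + H(y) - H(x, y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cond_mutual_info(x, y, z):
    """I(x; y | z) = H(x, z) + H(y, z) - H(x, y, z) - H(z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def stand_in_score(fk, selected, c):
    """Hypothetical stand-in for J(f_k): mean of I(f_k; C | f_i) over S."""
    return sum(cond_mutual_info(fk, c, fi) for fi in selected) / len(selected)


def greedy_select(features, c, k, score=stand_in_score):
    """Return the indices of k features chosen by the greedy loop."""
    remaining = list(range(len(features)))
    # First pick: the feature with the largest mutual information with C.
    first = max(remaining, key=lambda j: mutual_info(features[j], c))
    selected = [first]
    remaining.remove(first)
    # Later picks: the remaining feature with the largest score against S.
    while len(selected) < k and remaining:
        chosen = [features[i] for i in selected]
        best = max(remaining, key=lambda j: score(features[j], chosen, c))
        selected.append(best)
        remaining.remove(best)
    return selected


f0 = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = list(f0)                              # exact copy of f0, fully redundant
f2 = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [a | b for a, b in zip(f0, f2)]   # class depends on f0 and f2
print(greedy_select([f0, f1, f2], labels, k=2))
```

In this toy input the duplicate feature adds no class information once f0 is selected, so the loop prefers the complementary feature f2. A different criterion can be supplied through the `score` argument.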
2. The method for improving the classification accuracy of thyroid cancer using a deep feature selection algorithm according to claim 1, wherein in step 3 the neural network is a DNN comprising an input layer, four hidden layers and an output layer; X = (X_1, X_2, X_3, …, X_n)^T represents the multi-omics feature subset matrix of thyroid cancer; z represents the sample label, with normal samples set to z = 0 and cancer samples set to z = 1; W represents the feature weights in the neural network; σ(·) represents the activation function used in the hidden layers; and g(·) represents the classification function, which converts the output value into a probability prediction.
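The network shape in claim 2 (input layer, four hidden layers, one output) can be sketched as a plain-Python forward pass. The hidden layer widths, the weight initialization and the choice of ReLU for σ(·) are assumptions made for illustration; the claim does not specify them. The Sigmoid output for g(·) follows claim 3.

```python
import math
import random


def init_params(n_features, hidden=(16, 8, 8, 4), seed=0):
    """Random weights and zero biases for input -> 4 hidden -> 1 output."""
    rng = random.Random(seed)
    sizes = [n_features, *hidden, 1]       # hidden widths are hypothetical
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        W = [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]
        params.append((W, [0.0] * n_out))
    return params


def dense(x, W, b):
    """Affine map x @ W + b for a single sample x."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def forward(x, params):
    """Predicted probability that sample x is a cancer sample (z = 1)."""
    a = x
    for W, b in params[:-1]:
        a = [max(0.0, v) for v in dense(a, W, b)]   # sigma: ReLU (assumed)
    W, b = params[-1]
    (logit,) = dense(a, W, b)
    return 1.0 / (1.0 + math.exp(-logit))           # g: Sigmoid


params = init_params(n_features=5)
p = forward([0.2, -1.0, 0.5, 0.0, 1.3], params)
print(round(p, 3), "-> predicted z =", int(p >= 0.5))
```

The weights here are untrained, so the printed probability is meaningless as a prediction; the sketch only shows how X, W, σ(·) and g(·) fit together.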
3. The method for improving the classification accuracy of thyroid cancer using a deep feature selection algorithm as claimed in claim 2, wherein in step 3, Adam is used as the optimizer, and the cross-entropy loss is used to calculate the training error of each layer:
wherein n represents the number of features; y_i represents the true sample label for feature i; the remaining two symbols of the formula represent the fit of p_i and the predicted probability value, respectively; finally, a Sigmoid function is used as the classifier at the output layer, and the classification prediction accuracy for thyroid cancer is output.
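The training ingredients named in claim 3 can be illustrated with a minimal sketch. It assumes the standard binary cross-entropy, $-\frac{1}{n}\sum_i [y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)]$, since the claim's own formula is not legible in this text, and it applies the usual Adam update to a single Sigmoid unit rather than to the full four-hidden-layer network. The learning rate, step count and toy data are hypothetical.

```python
import math


def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def adam_step(w, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]
    w = [wi - lr * (mi / (1 - b1 ** t)) / (math.sqrt(vi / (1 - b2 ** t)) + eps)
         for wi, mi, vi in zip(w, m, v)]
    return w, m, v


def train(X, y, steps=200):
    """Fit a single Sigmoid unit with Adam; returns weights and loss history."""
    d = len(X[0])
    w = [0.0] * (d + 1)                            # d weights plus one bias
    m, v = [0.0] * (d + 1), [0.0] * (d + 1)
    losses = []
    for t in range(1, steps + 1):
        logits = [sum(wj * xj for wj, xj in zip(w, x)) + w[-1] for x in X]
        probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
        losses.append(bce(y, probs))
        err = [p - yi for p, yi in zip(probs, y)]  # dLoss/dlogit per sample
        grad = [sum(e * x[j] for e, x in zip(err, X)) / len(y) for j in range(d)]
        grad.append(sum(err) / len(y))             # bias gradient
        w, m, v = adam_step(w, grad, m, v, t)
    return w, losses


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, losses = train(X, y)
print(losses[0], losses[-1])
```

With all-zero initial weights every prediction starts at 0.5, so the first recorded loss equals ln 2; the loss then falls as Adam moves the decision threshold between the two classes.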

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310496632.0A CN116680594B (en) 2023-05-05 2023-05-05 Method for improving classification accuracy of thyroid cancer of multiple groups of chemical data by using depth feature selection algorithm

Publications (2)

Publication Number Publication Date
CN116680594A CN116680594A (en) 2023-09-01
CN116680594B CN116680594B (en) 2024-07-05

Family

ID=87779910

Country Status (1)

Country Link
CN (1) CN116680594B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant