CN117497198A - High-dimensional medical data feature subset screening method - Google Patents

High-dimensional medical data feature subset screening method

Info

Publication number
CN117497198A
Authority
CN
China
Prior art keywords
feature
subset
feature subset
candidate feature
candidate
Prior art date
Legal status
Granted
Application number
CN202311824917.9A
Other languages
Chinese (zh)
Other versions
CN117497198B (en)
Inventor
柯朝甫
吴陆颖
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202311824917.9A priority Critical patent/CN117497198B/en
Publication of CN117497198A publication Critical patent/CN117497198A/en
Application granted granted Critical
Publication of CN117497198B publication Critical patent/CN117497198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211: Selection of the most significant subset of features
    • G06F 18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F 18/2115: Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/217: Validation; Performance evaluation; Active pattern learning techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of high-dimensional medical data processing and discloses a high-dimensional medical data feature subset screening method. Building on the BeSS algorithm, the method integrates a sampling strategy, consistency scores and prediction performance to determine both the number and the composition of the feature variables in the optimal feature subset. When K is unknown, the number of feature variables is initialized to 1 and incremented (K = K + 1) iteratively, obtaining the prediction performance of the feature subset for each candidate number of feature variables until the prediction performance converges, which yields the current number of feature variables K and the corresponding optimal feature subset. When K is fixed and known, the candidate feature subset with the highest consistency score is taken as the optimal feature subset. The high-dimensional medical data feature subset screening method can automatically identify the number of feature variables of the optimal feature subset and can also obtain the optimal feature subset when that number is fixed.

Description

High-dimensional medical data feature subset screening method
Technical Field
The invention relates to the technical field of high-dimensional medical data processing, in particular to a method for screening a feature subset of high-dimensional medical data.
Background
With the rapid development of high-throughput detection technology and information technology, a large amount of high-dimensional data has emerged in the medical field, such as various kinds of omics data (genomics, metabolomics, transcriptomics, etc.). Such high-dimensional data contain rich information and therefore offer a great opportunity for accurate disease prediction, but they also pose great challenges for data analysis. High-dimensional data typically have high dimensionality, small sample sizes and high noise. How to screen the optimal feature subset from such data, so that the constructed prediction model has strong interpretability and prediction accuracy, is a major difficulty in statistical analysis.
Currently, feature subset screening methods can be divided into three major categories: filter methods, wrapper methods and embedded methods. Filter methods screen the feature subset through feature evaluation, feature ranking, feature selection and filtering; although they are highly general, fast and computationally cheap, they cannot effectively identify complex interactions among predictor variables and therefore do not always screen out the optimal feature subset. Wrapper methods mainly comprise two steps: (1) constructing a prediction model, and (2) obtaining an optimal feature subset based on some algorithm during the construction process; however, their computation time is long, and in high-dimensional data, especially with relatively large samples, the time cost is hard to bear. Embedded methods combine the advantages of filter and wrapper methods: they perform well on the feature subset screening problem while keeping the computational cost low.
In low-dimensional data the number of features is small, and the computational cost of feature subset screening is far lower than in high-dimensional data. Many scholars have proposed nearly exhaustive approaches, such as genetic algorithms, to better address the optimal feature subset screening problem. As the dimensionality increases rapidly, however, the computational cost of a nearly exhaustive algorithm becomes prohibitive, and different solutions have been proposed. The most classical is the LASSO algorithm, which uses a regularization strategy to construct an L1 penalty that forces the coefficients of variables with small effects to zero, so that the estimated regression parameters shrink to zero more easily and variable screening is achieved. Its low computational cost has made LASSO a popular method for high-dimensional feature subset screening. Because LASSO focuses on searching for the best combination, the individual variable coefficients of its model are less interpretable. Meanwhile, the LASSO algorithm requires the regularization parameter lambda to be tuned. Selecting an appropriate lambda value is a challenge and requires tuning in combination with cross-validation or cyclic coordinate descent. If the parameter is not chosen properly, the selected feature subset may be inaccurate, and a valid feature subset may not be screened out at all. The LASSO algorithm can also be unstable during feature screening, being sensitive to small changes in the data. In addition, some non-parametric algorithms are widely used on high-dimensional data, such as backward elimination based on random forests. Compared with traditional linear models, their interpretability is relatively poor, and as the sample size and variable dimensionality increase, the computational cost rises rapidly, so they are rather limited in many application scenarios.
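For illustration only, the following minimal scikit-learn sketch shows the kind of L1-penalized (LASSO-type) screening described above for a binary outcome; the function name and the value of the inverse regularization strength C are illustrative assumptions, and in practice C (playing the role of 1/lambda) would be tuned, e.g. by cross-validation, as noted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_screen(X, y, C=0.1):
    """L1-penalized logistic regression: coefficients of weakly contributing
    variables are shrunk exactly to zero, so the indices of the nonzero
    coefficients define the selected feature subset."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    model.fit(X, y)
    return np.flatnonzero(model.coef_.ravel())   # indices of selected variables
```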
Recently, scholars have proposed the BeSS algorithm, which solves the variable screening problem for the logistic model and the Cox proportional hazards model based on a primal-dual active set (PDAS) strategy. Based on the loss function of the model, the method quantitatively compares and scores the importance of each variable to the model, and performs variable screening and model fitting through the complementary information between the primal and dual variables. In high-dimensional space, the BeSS algorithm can quickly search for the best variable combination, and its sequential and golden-section search strategies can quickly determine the model size. The BeSS algorithm has a low computational cost, and its strategy of determining the optimal subset by ranking the feature contributions within the model has a certain advantage in interpretability over the regularization strategy of LASSO.
However, BeSS is a newly proposed embedded algorithm; although it runs fast and is well interpretable, when applied to high-dimensional medical data (especially small samples), the stability of the variable screening result under a fixed model size is poor and overfitting easily occurs, so the prediction effect is not ideal.
Meanwhile, when the BeSS algorithm determines the best model size, its judgment criterion is not sensitive enough, so the best size cannot always be found. There are two main ways of searching for the best model size: (1) determining the model size based on the loss function (BeSS.gs); the best feature subset identified in this manner tends to be large, so it contains too many false-positive variables, which reduces predictive power. (2) Using a metric based on the goodness of fit of the statistical model (BeSS.seq); this strategy is more stringent and tends to make the feature subset lose some variables that have real predictive power.
Disclosure of Invention
Therefore, the invention aims to solve the technical problems of the prior art: the searched feature subset is unstable, overfitting easily occurs, and the number of optimal variables is identified inaccurately, so that the prediction ability of the subset is reduced and the prediction effect is not ideal.
In order to solve the above technical problem, the invention provides a high-dimensional medical data feature subset screening method; when the number K of feature variables in the optimal feature subset is unknown, the method comprises the following steps:
s1: acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a testing set based on a random sampling strategy; repeatedly dividing to obtain a plurality of pairs of training sets and corresponding test sets;
s2: selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; based on a plurality of training sets, acquiring a plurality of candidate feature subsets to form a candidate feature set; wherein the initial value of K is 1;
s3: calculating the prediction performance of each candidate feature subset in the corresponding test set by using a preset regression prediction model;
s4: based on the predicted performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets in the candidate feature sets, constructing a consistency scoring model, and acquiring the candidate feature subset with the highest consistency score as a preferable K feature subset when the number of the feature variables is K;
s5: calculating the truncated average of the predicted performances of all candidate feature subsets in the candidate feature set as the predicted performances of the preferred K feature subset;
s6: judging whether the predicted performance of the preferred K feature subset converges or not:
if converged, outputting the current preferred K feature subset as the optimal feature subset, and determining the number of feature variables of the optimal feature subset to be K;
if not converged, updating K=K+1 and returning to step S1 to obtain a plurality of pairs of training sets and corresponding test sets and select a new preferred K feature subset, until the prediction performance of the preferred K feature subset converges; the current preferred K feature subset is then taken as the optimal feature subset, and the number of feature variables of the optimal feature subset is determined to be K.
In one embodiment of the present invention, the consistency score model is constructed based on the predicted performance of the candidate feature subset and the occurrence frequency of each feature variable in the candidate feature subset, and is expressed as:
S_j = \sum_{i=1}^{K} f(x_{ji}) + w \cdot P_j, with f(x_{ji}) = (1/M) \sum_{m=1}^{M} I(x_{ji} \in A_m),
wherein S_j denotes the consistency score of the j-th group candidate feature subset, the j-th group candidate feature subset being obtained from the j-th group training set; x_{ji} denotes the i-th feature variable in the j-th group candidate feature subset, i = 1, ..., K, and K denotes the total number of feature variables in the candidate feature subset; A_m denotes the m-th group candidate feature subset; f(x_{ji}) denotes the occurrence frequency of the feature variable x_{ji} in the candidate feature set, where the indicator I(x_{ji} \in A_m) is 1 if A_m contains x_{ji} and 0 otherwise; M denotes the number of candidate feature subsets in the candidate feature set, j = 1, ..., M, m = 1, ..., M; P_j denotes the prediction performance of the j-th group candidate feature subset on its corresponding test set; and w denotes the weight.
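As a purely illustrative sketch of how such a consistency score could be computed (the function name, the data structures and the default weight are assumptions and are not part of the description above):

```python
from collections import Counter

def consistency_scores(candidate_subsets, performances, weight=1.0):
    """Score each candidate feature subset by the summed occurrence frequency
    of its variables across all candidate subsets plus a weighted prediction
    performance term.

    candidate_subsets : list of sets of feature identifiers, one per training set
    performances      : prediction performance of each subset on its test set
    weight            : weight w of the prediction-performance term
    """
    m = len(candidate_subsets)
    counts = Counter(x for subset in candidate_subsets for x in subset)
    freq = {x: c / m for x, c in counts.items()}          # f(x): occurrence frequency
    return [sum(freq[x] for x in subset) + weight * perf  # stability + weighted performance
            for subset, perf in zip(candidate_subsets, performances)]
```

The candidate feature subset with the highest returned score would then be taken as the preferred K feature subset.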
In one embodiment of the present invention, acquiring a high-dimensional medical data set, dividing the high-dimensional medical data set into a training set and a test set based on a random sampling strategy, and repeating the division to obtain a plurality of pairs of training sets and their corresponding test sets comprises:
randomly drawing a preset number of samples (observations) from the high-dimensional medical data set to form a training set, with the remaining samples forming the corresponding test set;
and repeating the random drawing a plurality of times to obtain a plurality of new training sets and their corresponding test sets.
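A minimal sketch of such repeated random splitting, assuming the data are held in NumPy arrays; the number of repetitions and the training fraction are illustrative defaults (the real-data examples later use roughly two-thirds of the observations for training):

```python
import numpy as np

def repeated_random_splits(X, y, n_splits=100, train_fraction=2/3, seed=0):
    """Repeatedly draw observations at random (without replacement) to form a
    training set; the remaining observations form the matching test set."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(round(train_fraction * n))
    splits = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        train_idx, test_idx = idx[:n_train], idx[n_train:]
        splits.append(((X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])))
    return splits
```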
In one embodiment of the present invention, the selecting K feature variables in each training set to form a candidate feature subset by using a BeSS algorithm includes:
based on the BeSS algorithm, using a Taylor expansion to assign a contribution value to each feature variable in the training set; then, sorting the contribution values from largest to smallest, taking the feature variables corresponding to the first K contribution values to form a candidate feature subset.
In an embodiment of the present invention, if the outcome of the high-dimensional medical data is a classification outcome, the preset regression prediction model is a logistic model.
In one embodiment of the invention, the computing the predicted performance of each candidate feature subset in its corresponding test set includes:
training a logistic model by using a training set corresponding to the candidate feature subset;
evaluating the trained logistic model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the accuracy (Acc) of the logistic model and the area under the receiver operating characteristic curve (AUC).
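A minimal sketch of this evaluation for a classification outcome, assuming scikit-learn and NumPy arrays; the helper name and the plain logistic regression settings are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def logistic_performance(X_train, y_train, X_test, y_test, features):
    """Fit a logistic model on a candidate feature subset and report its
    accuracy (Acc) and AUC on the corresponding test set."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, features], y_train)
    pred = model.predict(X_test[:, features])               # class labels for Acc
    prob = model.predict_proba(X_test[:, features])[:, 1]   # scores for AUC
    return accuracy_score(y_test, pred), roc_auc_score(y_test, prob)
```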
In an embodiment of the present invention, if the high-dimensional medical data is a survival outcome, the preset regression prediction model is a Cox model.
In one embodiment of the invention, the computing the predicted performance of each candidate feature subset in its corresponding test set includes:
training a Cox model by utilizing a training set corresponding to the candidate feature subset;
evaluating the trained Cox model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the concordance index (C-index) of the Cox model and the area under the receiver operating characteristic curve (AUC) at the median survival time.
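A minimal sketch of the corresponding evaluation for a survival outcome, assuming the lifelines package and pandas DataFrames with time and event columns; the column names are assumptions, and the AUC at the median survival time is omitted for brevity:

```python
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def cox_performance(train_df, test_df, features, time_col="time", event_col="event"):
    """Fit a Cox model on a candidate feature subset and report the
    concordance index (C-index) on the corresponding test set."""
    cph = CoxPHFitter()
    cph.fit(train_df[features + [time_col, event_col]],
            duration_col=time_col, event_col=event_col)
    risk = cph.predict_partial_hazard(test_df[features])
    # higher partial hazard means shorter expected survival, hence the minus sign
    return concordance_index(test_df[time_col], -risk, test_df[event_col])
```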
An embodiment of the invention also provides a high-dimensional medical data feature subset screening method; when the number K of feature variables in the optimal feature subset is a fixed value, the method comprises the following steps:
acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a testing set based on a random sampling strategy; repeatedly dividing to obtain a plurality of pairs of training sets and corresponding test sets;
selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; based on a plurality of training sets, acquiring a plurality of candidate feature subsets to form a candidate feature set;
calculating the prediction performance of each candidate feature subset in the corresponding test set by using a preset regression prediction model;
and constructing a consistency scoring model based on the predicted performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets in the candidate feature sets, and acquiring the candidate feature subset with the highest consistency score as an optimal feature subset.
Compared with the prior art, the technical scheme of the invention has the following advantages:
according to the high-dimensional medical data feature subset screening method, when the number of feature variables in the optimal feature subset is unknown, a sampling strategy, consistency scores and prediction performances are fused on the basis of a BeSS algorithm to determine the number and the composition of the feature variables of the optimal feature subset; initializing the number K=1 of the feature variables of the optimal feature subset, enabling the number K=K+1 to be iterative, obtaining the prediction performance of the feature subset under different feature variable numbers until the prediction performance converges, and obtaining the current feature variable number K and the optimal feature subset corresponding to the current feature variable number K; the number of the feature variables in the optimal feature subset is iteratively identified, so that the identification result is not easily influenced by overfitting, and the method has good extrapolation capability, high calculation speed and high interpretation. The invention builds a consistency scoring model based on the occurrence frequency of the characteristic variable and the prediction performance, wherein the occurrence frequency of the characteristic variable refers to the occurrence frequency of each variable in all variable combinations by statistics, the variable is assigned, and the sum of scores of different variables in each combination is calculated, so that the group of prediction stability is ensured to be strong, and meanwhile, the group of prediction capability is ensured to be good when the prediction performance is integrated. The invention can also fix the number of the feature variables in the optimal feature subset by presetting the K value, thereby realizing the acquisition of the optimal feature subset under the condition of fixing the feature variables.
The high-dimensional medical data feature subset screening method can automatically identify the number of feature variables of the optimal feature subset and can also obtain the optimal feature subset when that number is fixed, which broadens its applicable scenarios and gives it good application prospects.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings, in which
FIG. 1 is a flow chart of the steps of a method for screening feature subsets of high-dimensional medical data provided by the invention when the number of feature variables in an optimal feature subset is unknown;
FIG. 2 is a flow chart of the steps of the method for screening feature subsets of high-dimensional medical data provided by the invention when the number of feature variables in the optimal feature subset is fixed;
fig. 3 is a flowchart of an implementation of the mBeSS method provided by the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
Referring to fig. 1, a flowchart of steps of a high-dimensional medical data feature subset screening method of the present invention, when the number K of feature variables in an optimal feature subset is unknown, specific steps include:
s1: acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a testing set based on a random sampling strategy; repeatedly dividing to obtain a plurality of pairs of training sets and corresponding test sets;
the random sampling strategy is to randomly draw a preset number of samples (observations) from the high-dimensional medical data set to form a training set, with the remaining samples forming the corresponding test set; the random drawing is repeated a plurality of times to obtain a plurality of new training sets and their corresponding test sets;
s2: selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; based on a plurality of training sets, acquiring a plurality of candidate feature subsets to form a candidate feature set; wherein the initial value of K is 1;
s3: calculating the prediction performance of each candidate feature subset in the corresponding test set by using a preset regression prediction model;
s4: based on the predicted performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets in the candidate feature sets, constructing a consistency scoring model, and acquiring the candidate feature subset with the highest consistency score as a preferable K feature subset when the number of the feature variables is K;
s5: calculating the truncated average of the predicted performances of all candidate feature subsets in the candidate feature set as the predicted performances of the preferred K feature subset;
s6: judging whether the predicted performance of the preferred K feature subset converges or not:
if converged, outputting the current preferred K feature subset as the optimal feature subset, and determining the number of feature variables of the optimal feature subset to be K;
if not converged, updating K=K+1 and returning to step S1 to obtain a plurality of pairs of training sets and corresponding test sets and select a new preferred K feature subset, until the prediction performance of the preferred K feature subset converges; the current preferred K feature subset is then taken as the optimal feature subset, and the number of feature variables of the optimal feature subset is determined to be K.
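For orientation only, the following sketch strings steps S1-S6 together; it reuses the repeated_random_splits and consistency_scores helpers sketched above, while select_top_k (standing in for the BeSS contribution-value ranking) and evaluate (the preset regression prediction model) are assumed callables, and the convergence tolerance is an illustrative assumption:

```python
def screen_optimal_subset(X, y, select_top_k, evaluate,
                          n_splits=100, weight=1.0, tol=0.0, max_k=50):
    """Iteratively increase K (steps S1-S6) until the trimmed-mean prediction
    performance of the preferred K feature subset stops improving."""
    prev_perf, best_subset, k = None, None, 0
    while k < max_k:
        k += 1
        splits = repeated_random_splits(X, y, n_splits)             # S1
        subsets, perfs = [], []
        for (X_tr, y_tr), (X_te, y_te) in splits:
            feats = frozenset(select_top_k(X_tr, y_tr, k))          # S2
            subsets.append(feats)
            perfs.append(evaluate(X_tr, y_tr, X_te, y_te, sorted(feats)))  # S3
        scores = consistency_scores(subsets, perfs, weight)         # S4
        best_subset = sorted(subsets[scores.index(max(scores))])    # preferred K subset
        trimmed = sorted(perfs)[1:-1]                               # S5: drop min and max
        perf_k = sum(trimmed) / len(trimmed)
        if prev_perf is not None and perf_k - prev_perf <= tol:     # S6: converged
            break
        prev_perf = perf_k
    return best_subset, k
```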
Specifically, in the present embodiment, a consistency scoring model is constructed based on the prediction performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets within the candidate feature set, expressed as:
S_j = \sum_{i=1}^{K} f(x_{ji}) + w \cdot P_j, with f(x_{ji}) = (1/M) \sum_{m=1}^{M} I(x_{ji} \in A_m),
wherein S_j denotes the consistency score of the j-th group candidate feature subset, the j-th group candidate feature subset being obtained from the j-th group training set; x_{ji} denotes the i-th feature variable in the j-th group candidate feature subset, i = 1, ..., K, and K denotes the total number of feature variables in the candidate feature subset; A_m denotes the m-th group candidate feature subset; f(x_{ji}) denotes the occurrence frequency of the feature variable x_{ji} in the candidate feature set, where the indicator I(x_{ji} \in A_m) is 1 if A_m contains x_{ji} and 0 otherwise; M denotes the number of candidate feature subsets in the candidate feature set, j = 1, ..., M, m = 1, ..., M; P_j denotes the prediction performance of the j-th group candidate feature subset on its corresponding test set; and w denotes the weight.
The consistency scoring model of this embodiment counts the frequency of occurrence of each feature variable in all candidate feature subsets (i.e., variable combinations), assigns a score to each feature variable accordingly, and calculates the sum of the scores of the different variables in each candidate feature subset. To prevent the selected candidate feature subset from being distorted, the result is fine-tuned and controlled by adding the weighted prediction-performance score.
Specifically, in this embodiment, in step S2, K feature variables are selected in each training set by using the BeSS algorithm to construct a candidate feature subset, and a plurality of candidate feature subsets are obtained based on the plurality of training sets to construct the candidate feature set, including:
s2-1: based on a BeSS algorithm, utilizing Taylor expansion to endow contribution values for each characteristic variable in a training set;
s2-2: according to the order of the contribution values from big to small, obtaining feature variables corresponding to the first K contribution values to form a candidate feature subset
S2-3: a candidate feature subset is obtained based on each training set, and all the candidate feature subsets form a candidate feature set.
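A minimal sketch of steps S2-1 to S2-3, in which the per-feature contribution values are simply taken as input (in mBeSS they would come from the Taylor expansion of the BeSS loss, which is not reproduced here); the names and structure are illustrative:

```python
import numpy as np

def top_k_by_contribution(contributions, k):
    """Keep the K features with the largest contribution values (S2-2)."""
    order = np.argsort(np.asarray(contributions))[::-1]   # descending by contribution
    return order[:k].tolist()

def build_candidate_feature_set(training_sets, contribution_fn, k):
    """One candidate feature subset per training set; together they form the
    candidate feature set (S2-1 to S2-3)."""
    return [frozenset(top_k_by_contribution(contribution_fn(X, y), k))
            for X, y in training_sets]
```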
Specifically, in step S3, an appropriate preset regression prediction model is selected according to the type of the high-dimensional medical data: when the outcome of the high-dimensional medical data is a classification outcome, the preset regression prediction model is a logistic model; when the outcome is a survival outcome, the preset regression prediction model is a Cox model.
Based on the logistic model, calculating the predicted performance of each candidate feature subset in the corresponding test set, wherein the method specifically comprises the following steps:
training a logistic model by using a training set corresponding to the candidate feature subset;
evaluating the trained logistic model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the accuracy (Acc) of the logistic model and the area under the receiver operating characteristic curve (AUC).
Based on the Cox model, calculating the predicted performance of each candidate feature subset in the corresponding test set, wherein the method specifically comprises the following steps:
training a Cox model by utilizing a training set corresponding to the candidate feature subset;
evaluating the trained Cox model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the concordance index (C-index) of the Cox model and the area under the receiver operating characteristic curve (AUC) at the median survival time.
In this embodiment, following the measures of predictive accuracy commonly used for medical data, the area under the receiver operating characteristic curve (AUC) is used for the logistic model, and the concordance index (C-index) is used as the measure of predictive ability for the Cox proportional hazards model. In addition, prediction performance is not a perfectly stable indicator. To search for the optimal feature subset accurately, the first highest peak found is adopted and subsequent peaks are treated as alternatives; if an alternative peak is lower than the previous one, the search is stopped, and the model size of the highest peak and its corresponding subset are output. Meanwhile, a parameter is set so that a larger feature subset is selected only after a certain minimum improvement has been achieved, ensuring that enlarging the model is practically meaningful.
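One possible reading of this stopping rule, as an illustrative sketch (the minimum-improvement value and the function name are assumptions):

```python
def select_model_size(perf_by_k, min_gain=0.01):
    """Walk through increasing model sizes, keep the first performance peak,
    accept a larger subset only after a sufficient improvement, and stop once
    a later candidate peak falls below the best one seen so far."""
    best_k, best_perf = None, float("-inf")
    for k in sorted(perf_by_k):
        perf = perf_by_k[k]
        if best_k is None or perf >= best_perf + min_gain:
            best_k, best_perf = k, perf      # new (or first) peak worth keeping
        elif perf < best_perf:
            break                            # later peak is lower: stop the search
    return best_k
```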
Specifically, in an embodiment of the present invention, the method for calculating the predicted performance of the preferred K feature subset includes obtaining a truncated average of the predicted performance of all candidate feature subsets in the candidate feature set.
Computing the truncated average of the prediction performance of all candidate feature subsets in the candidate feature set comprises: obtaining the prediction performance of all candidate feature subsets in the candidate feature set; deleting the maximum value and the minimum value of the prediction performance; and calculating the average of the remaining prediction performance values as the truncated average of the prediction performance of all candidate feature subsets in the candidate feature set.
Based on the above embodiment, the present embodiment provides a method for screening a feature subset of high-dimensional medical data under a fixed number of feature variables, with reference to fig. 2, including the specific steps of:
s201: acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a testing set based on a random sampling strategy; repeatedly dividing to obtain a plurality of pairs of training sets and corresponding test sets;
the random sampling strategy is to randomly draw a preset number of samples (observations) from the high-dimensional medical data set to form a training set, with the remaining samples forming the corresponding test set; the random drawing is repeated a plurality of times to obtain a plurality of new training sets and their corresponding test sets;
s202: selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; based on a plurality of training sets, acquiring a plurality of candidate feature subsets to form a candidate feature set;
s203: calculating the prediction performance of each candidate feature subset in the corresponding test set by using a preset regression prediction model;
s204: and constructing a consistency scoring model based on the predicted performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets in the candidate feature sets, and acquiring the candidate feature subset with the highest consistency score as an optimal feature subset.
Specifically, in the present embodiment, a consistency scoring model is constructed based on the prediction performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets within the candidate feature set, expressed as:
S_j = \sum_{i=1}^{K} f(x_{ji}) + w \cdot P_j, with f(x_{ji}) = (1/M) \sum_{m=1}^{M} I(x_{ji} \in A_m),
wherein S_j denotes the consistency score of the j-th group candidate feature subset, the j-th group candidate feature subset being obtained from the j-th group training set; x_{ji} denotes the i-th feature variable in the j-th group candidate feature subset, i = 1, ..., K, and K denotes the total number of feature variables in the candidate feature subset; A_m denotes the m-th group candidate feature subset; f(x_{ji}) denotes the occurrence frequency of the feature variable x_{ji} in the candidate feature set, where the indicator I(x_{ji} \in A_m) is 1 if A_m contains x_{ji} and 0 otherwise; M denotes the number of candidate feature subsets in the candidate feature set, j = 1, ..., M, m = 1, ..., M; P_j denotes the prediction performance of the j-th group candidate feature subset on its corresponding test set; and w denotes the weight.
The consistency scoring model of this embodiment counts the frequency of occurrence of each feature variable in all candidate feature subsets (i.e., variable combinations), assigns a score to each feature variable accordingly, and calculates the sum of the scores of the different variables in each candidate feature subset. To prevent the selected candidate feature subset from being distorted, the result is fine-tuned and controlled by adding the weighted prediction-performance score.
Specifically, in this embodiment, in step S202, a preset number of feature variables are selected from each training set by using the BeSS algorithm to form a candidate feature subset, and a plurality of candidate feature subsets are obtained based on the plurality of training sets to form the candidate feature set, which specifically includes:
s202-1: based on the BeSS algorithm, using a Taylor expansion within the primal-dual active set framework to assign a contribution value to each feature variable in the training set;
s202-2: sorting the contribution values from largest to smallest and taking the preset number of feature variables with the largest contribution values to form a candidate feature subset;
s202-3: a candidate feature subset is obtained based on each training set, and all the candidate feature subsets form a candidate feature set.
Specifically, in step S203, an appropriate preset regression prediction model is selected according to the type of the high-dimensional medical data: when the outcome of the high-dimensional medical data is a classification outcome, the preset regression prediction model is a logistic model; when the outcome is a survival outcome, the preset regression prediction model is a Cox model.
Based on the logistic model, calculating the predicted performance of each candidate feature subset in the corresponding test set, wherein the method specifically comprises the following steps:
training a logistic model by using a training set corresponding to the candidate feature subset;
evaluating the trained logistic model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the accuracy (Acc) of the logistic model and the area under the receiver operating characteristic curve (AUC).
Based on the Cox model, calculating the predicted performance of each candidate feature subset in the corresponding test set, wherein the method specifically comprises the following steps:
training a Cox model by utilizing a training set corresponding to the candidate feature subset;
evaluating the trained Cox model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the concordance index (C-index) of the Cox model and the area under the receiver operating characteristic curve (AUC) at the median survival time.
Specifically, the method for calculating the prediction performance of the candidate feature subsets, i.e. obtaining the truncated average of the prediction performance of all candidate feature subsets in the candidate feature set, includes: obtaining the prediction performance of all candidate feature subsets in the candidate feature set; deleting the maximum value and the minimum value of the prediction performance; and calculating the average of the remaining prediction performance values as the truncated average of the prediction performance of all candidate feature subsets in the candidate feature set.
In summary, considering that the BeSS method easily overfits and that the stability of the screened variable combinations is poor in high-dimensional data (especially small samples), the invention creatively fuses a sampling strategy, a consistency score and a prediction-evaluation idea on the basis of the BeSS algorithm, and provides an improved optimal feature subset screening method, called mBeSS (modified best subset selection). The method can not only automatically identify the optimal size of the feature subset in high-dimensional medical data and give the optimal feature subset, but can also screen the corresponding optimal feature subset when the number of variables is fixed. In addition, the screening strategy has better resistance to overfitting, and the screened feature subset has better extrapolation ability and interpretability.
Based on the above embodiments, this embodiment sets up a simulation experiment to verify the prediction effect of the optimal feature subset screened by the high-dimensional medical data feature subset screening method mBeSS provided by the invention; the specific implementation steps are shown in fig. 3. The sample size of the simulation experiment is set to 100 or 200. The number of independent variables is set to three levels, namely 1000, 5000 and 10000, mainly to simulate high-dimensional medical and small-sample data. Following the simulation design of the BeSS package, the original independent variables Z are drawn from a standard normal distribution with mean 0 and variance 1, i.e. Z ~ N(0, 1). The independent variables Z are then transformed to generate the independent variables X, with the boundary terms set to 0. The coefficients of 10 independent variables are nonzero, with values drawn from a uniform distribution, and a random term following a normal distribution with mean 1 and variance 5 is added in the simulation. Depending on how the method is used, this embodiment simulates the following two cases:
setting the sample size to 200 in the case where the optimal model size is unknown;
in the case of a fixed model size, the sample size is set to 100, and the random error term in the logistic model is adjusted to ensure a certain predictive power.
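A sketch of a data-generating routine along these lines is given below; the neighbour-correlation transform of Z and the range of the uniform distribution for the nonzero coefficients are not specified above and are therefore assumptions, and only the linear predictor is produced (a binary or survival outcome would be derived from it):

```python
import numpy as np

def simulate_covariates(n=200, p=1000, n_signal=10, rho=0.5, seed=0):
    """Simulated high-dimensional design: Z ~ N(0, 1), X obtained by mixing
    each Z_j with its neighbours (boundary terms set to 0), 10 nonzero
    coefficients drawn from a uniform distribution, error ~ N(mean 1, var 5)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, p))
    Zpad = np.pad(Z, ((0, 0), (1, 1)))                     # zero boundary columns
    X = Z + rho * (Zpad[:, :-2] + Zpad[:, 2:])             # assumed transform of Z
    beta = np.zeros(p)
    signal = rng.choice(p, size=n_signal, replace=False)   # truly informative variables
    beta[signal] = rng.uniform(0.5, 1.5, size=n_signal)    # assumed uniform range
    eps = rng.normal(loc=1.0, scale=np.sqrt(5.0), size=n)  # random term, mean 1, variance 5
    return X, X @ beta + eps, signal
```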
In this example, the model size (MS), the numbers of true positives (TP) and false positives (FP), and the prediction performance are used as evaluation indices. The model size is the number of independent variables used to construct the prediction model; for a given prediction effect, the smaller the model size, the better. The numbers of true and false positives are, respectively, the numbers of variables with and without a true prediction effect among the independent variables used to construct the prediction model. The prediction performance is the index of predictive ability of the prediction model constructed from the screened variable combination: the area under the receiver operating characteristic curve (AUC) and the accuracy (Acc) are used for the logistic model, and the concordance index (C-index) and the AUC at the median survival time are used for the Cox model. The prediction performance indices range over [0, 1], and larger values indicate better prediction.
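For completeness, counting the model size and the true/false positives of a screened subset against the known informative variables is straightforward (a sketch):

```python
def evaluation_counts(selected, true_signal):
    """Model size (MS), true positives (TP) and false positives (FP)."""
    selected, true_signal = set(selected), set(true_signal)
    ms = len(selected)
    tp = len(selected & true_signal)   # kept variables with a true effect
    return ms, tp, ms - tp             # FP = kept variables without a true effect
```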
Table 1 shows the logistic regression simulation results at the model size automatically identified by the high-dimensional medical data feature subset screening method provided by the invention, and Table 2 shows the corresponding Cox regression simulation results. According to Tables 1 and 2, whether based on the logistic model or on the Cox model, the feature subset screening method provided by the invention achieves better prediction performance than the other methods, with a relatively small model size.
TABLE 1 logistic regression simulation results at automatically identified model sizes
N=200 Method p=1000 p=5000 p=10000
AUC glmnet 0.797(0.051) 0.769(0.067) 0.762(0.057)
BeSS.gs 0.710(0.051) 0.703(0.072) 0.691(0.064)
BeSS.seq 0.786(0.071) 0.773(0.086) 0.762(0.085)
mBeSS 0.834(0.044) 0.815(0.064) 0.798(0.067)
Acc glmnet 0.724(0.036) 0.702(0.051) 0.699(0.044)
BeSS.gs 0.686(0.035) 0.674(0.058) 0.666(0.052)
BeSS.seq 0.711(0.057) 0.701(0.071) 0.692(0.068)
mBeSS 0.750(0.038) 0.735(0.054) 0.721(0.054)
MS glmnet 18.43(11.889) 18.38(12.405) 21.91(16.358)
BeSS.gs 19.00(2.934) 15.74(2.163) 14.99(1.691)
BeSS.seq 3.26(1.528) 2.79(1.559) 2.50(1.360)
mBeSS 5.79(2.124) 5.75(2.618) 5.12(2.724)
TP glmnet 5.34(0.742) 4.76(1.288) 4.70(1.150)
BeSS.gs 4.66(1.037) 4.05(1.41) 3.70(1.307)
BeSS.seq 3.14(1.518) 2.75(1.579) 2.41(1.371)
mBeSS 4.58(1.156) 4.03(1.337) 3.45(1.282)
FP glmnet 13.09(11.544) 13.62(11.787) 17.21(15.752)
BeSS.gs 14.34(3.232) 11.69(3.228) 11.29(2.571)
BeSS.seq 0.12(0.356) 0.04(0.197) 0.09(0.288)
mBeSS 1.21(1.725) 1.72(2.248) 1.67(2.270)
TABLE 2 Cox regression simulation results with automatically identified model sizes
N=200 Method p=1000 p=5000 p=10000
C-index glmnet 0.746(0.058) 0.743(0.062) 0.748(0.06)
BeSS.gs 0.750(0.045) 0.713(0.047) 0.711(0.040)
BeSS.seq 0.755(0.049) 0.747(0.059) 0.760(0.057)
mBeSS 0.766(0.035) 0.765(0.045) 0.776(0.043)
AUC glmnet 0.776(0.065) 0.773(0.071) 0.779(0.069)
BeSS.gs 0.780(0.051) 0.738(0.055) 0.735(0.050)
BeSS.seq 0.785(0.055) 0.777(0.067) 0.792(0.065)
mBeSS 0.797(0.041) 0.797(0.052) 0.809(0.049)
MS glmnet 9.240(5.725) 8.71(7.016) 10.37(6.934)
BeSS.gs 10.31(8.204) 23.12(9.540) 29.17(6.852)
BeSS.seq 4.20(1.310) 3.80(1.717) 4.03(1.586)
mBeSS 6.31(2.718) 6.00(2.947) 6.34(3.023)
TP glmnet 4.65(1.507) 4.21(1.641) 4.49(1.547)
BeSS.gs 4.95(0.968) 4.98(1.015) 5.04(0.909)
BeSS.seq 4.02(1.333) 3.60(1.563) 3.82(1.513)
mBeSS 4.83(1.006) 4.49(1.299) 4.67(1.155)
FP glmnet 4.59(4.905) 4.50(6.056) 5.88(5.899)
BeSS.gs 5.36(8.021) 18.14(9.666) 24.13(6.922)
BeSS.seq 0.18(0.411) 0.20(0.471) 0.21(0.433)
mBeSS 1.48(2.363) 1.51(2.513) 1.67(2.659)
Referring to table 3, the logistic regression simulation result of the high-dimensional medical data feature subset screening method provided by the invention under the fixed model size is shown; referring to table 4, the Cox regression simulation result of the high-dimensional medical data feature subset screening method provided by the invention under the fixed model size is shown; according to tables 3 and 4, the high-dimensional medical data feature subset screening method provided by the invention has a better prediction effect under the condition of different fixed variable numbers.
TABLE 3 logistic regression simulation results at fixed model size
N=100 Method k=2 k=4 k=6
p=1000 AUC BeSS 0.711(0.073) 0.757(0.084) 0.760(0.084)
mBeSS 0.726(0.064) 0.761(0.077) 0.767(0.076)
Acc BeSS 0.652(0.056) 0.689(0.066) 0.691(0.067)
mBeSS 0.663(0.048) 0.690(0.061) 0.698(0.060)
TP BeSS 1.54(0.610) 2.54(0.989) 2.97(1.150)
mBeSS 1.65(0.539) 2.61(0.909) 3.03(1.105)
FP BeSS 0.46(0.610) 1.46(0.989) 3.03(1.150)
mBeSS 0.35(0.539) 1.39(0.909) 2.97(1.105)
p=5000 AUC BeSS 0.676(0.090) 0.684(0.100) 0.674(0.093)
mBeSS 0.672(0.089) 0.691(0.100) 0.690(0.089)
Acc BeSS 0.625(0.070) 0.632(0.077) 0.626(0.071)
mBeSS 0.623(0.068) 0.638(0.077) 0.637(0.071)
TP BeSS 1.25(0.702) 1.69(1.032) 1.85(1.048)
mBeSS 1.21(0.701) 1.73(1.053) 1.96(0.994)
FP BeSS 0.75(0.702) 2.31(1.032) 4.15(1.048)
mBeSS 0.79(0.701) 2.27(1.053) 4.04(0.994)
p=10000 AUC BeSS 0.644(0.094) 0.636(0.103) 0.631(0.099)
mBeSS 0.644(0.099) 0.647(0.099) 0.663(0.092)
Acc BeSS 0.601(0.073) 0.596(0.078) 0.594(0.074)
mBeSS 0.600(0.078) 0.606(0.073) 0.615(0.071)
TP BeSS 0.97(0.717) 1.17(1.045) 1.34(1.112)
mBeSS 0.97(0.745) 1.25(0.978) 1.63(0.981)
FP BeSS 1.03(0.717) 3.83(1.045) 4.66(1.112)
mBeSS 1.03(0.745) 3.75(0.978) 4.37(0.981)
TABLE 4 Cox regression simulation results at fixed model size
N=100 Method k=2 k=4 k=6
p=1000 C-index BeSS 0.680(0.052) 0.734(0.072) 0.766(0.068)
mBeSS 0.685(0.053) 0.747(0.062) 0.766(0.069)
AUC BeSS 0.701(0.06) 0.761(0.082) 0.795(0.075)
mBeSS 0.705(0.061) 0.775(0.071) 0.795(0.076)
TP BeSS 1.59(0.570) 2.94(0.993) 4.03(1.235)
mBeSS 1.69(0.506) 3.07(0.924) 4.02(1.310)
FP BeSS 0.41(0.570) 1.06(0.993) 1.97(1.235)
mBeSS 0.31(0.506) 0.93(0.924) 1.98(1.310)
p=5000 C-index BeSS 0.669(0.066) 0.711(0.08) 0.714(0.091)
mBeSS 0.686(0.061) 0.717(0.083) 0.739(0.082)
AUC BeSS 0.688(0.076) 0.733(0.088) 0.737(0.100)
mBeSS 0.708(0.069) 0.739(0.094) 0.763(0.094)
TP BeSS 1.44(0.656) 2.47(1.141) 2.92(1.555)
mBeSS 1.61(0.618) 2.59(1.164) 3.31(1.433)
FP BeSS 0.56(0.656) 1.53(1.141) 3.08(1.555)
mBeSS 0.39(0.618) 1.41(1.164) 2.79(1.433)
p=10000 C-index BeSS 0.651(0.083) 0.680(0.088) 0.698(0.096)
mBeSS 0.654(0.086) 0.698(0.092) 0.710(0.100)
AUC BeSS 0.669(0.096) 0.700(0.100) 0.718(0.107)
mBeSS 0.671(0.096) 0.719(0.104) 0.733(0.113)
TP BeSS 1.29(0.701) 2.07(1.103) 2.57(1.486)
mBeSS 1.33(0.711) 2.36(1.159) 2.83(1.551)
FP BeSS 0.71(0.701) 1.93(1.103) 3.43(1.486)
mBeSS 0.67(0.711) 1.64(1.159) 3.17(1.551)
Referring to table 5, there are 4 real data examples utilized in the present embodiment; in this embodiment, the data is randomly divided into a training set and a test set, the training set containing two-thirds of the observations and the test set containing the remaining observations. The same evaluation index as the simulation experiment was used to compare the true data results.
TABLE 5 brief introduction to real data
Data name Number of observations Number of independent variables Outcome type Data source
gravier 168 2905 Two categories https://github.com/ramhiser/datamicroarray/wiki/Gravier-(2010)
psoriasis 170 18482 Two categories https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30999
comorbid 1467 344 Survival data UK Biobank (UKB)
10846 412 54677 Survival data https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10846
Referring to Table 6, which gives the results at the automatically identified model size on the real data: the results show that on high-dimensional omics data, compared with BeSS and glmnet (the R package implementing LASSO), the model prediction performance of the high-dimensional medical data feature subset screening method provided by the invention is usually better; and where the prediction performance is similar, the method provided by the invention is able to find a smaller model.
TABLE 6 automatic identification of model size results for real data
Data Method glmnet BeSS.gs BeSS.seq mBeSS
gravier MS 11.28(7.027) 10.66(2.392) 8.97(1.087) 5.63(3.472)
AUC 0.738(0.09) 0.695(0.069) 0.681(0.07) 0.740(0.074)
Acc 0.715(0.074) 0.711(0.057) 0.71(0.061) 0.722(0.055)
psoriasis MS 8.77(3.33) 3.03(1.93) 4.36(2.439) 5.75(2.928)
AUC 0.968(0.028) 0.882(0.156) 0.966(0.028) 0.973(0.028)
Acc 0.951(0.035) 0.85(0.173) 0.95(0.029) 0.956(0.03)
comorbid MS 9.3(5.668) 14.24(15.992) 2.3(0.81) 7.66(4.207)
C-index 0.641(0.023) 0.578(0.041) 0.62(0.019) 0.636(0.024)
AUC 0.624(0.029) 0.568(0.037) 0.595(0.023) 0.614(0.029)
10846 MS 3.94(5.626) 34.12(10.829) 1.24(0.495) 4.39(4.087)
C-index 0.586(0.073) 0.611(0.041) 0.624(0.039) 0.631(0.039)
AUC 0.589(0.078) 0.618(0.053) 0.626(0.052) 0.63(0.048)
Referring to table 7, the results for the real data at a fixed number are shown; the result shows that under the condition of fixed number, the feature subset identified by the high-dimensional medical data feature subset screening method provided by the invention has better prediction performance.
TABLE 7 results of real data at fixed model size
Data Method k=2 k=4 k=6
gravier AUC BeSS 0.673(0.094) 0.713(0.097) 0.740(0.082)
mBeSS 0.709(0.089) 0.729(0.082) 0.736(0.072)
Acc BeSS 0.692(0.066) 0.707(0.069) 0.720(0.058)
mBeSS 0.709(0.058) 0.721(0.059) 0.721(0.057)
k=2 k=4 k=6
psoriasis AUC BeSS 0.698(0.184) 0.78(0.192) 0.882(0.147)
mBeSS 0.919(0.149) 0.971(0.056) 0.968(0.034)
Acc BeSS 0.673(0.189) 0.753(0.204) 0.862(0.160)
mBeSS 0.901(0.148) 0.96(0.059) 0.955(0.033)
k=3 k=5 k=8
comorbid C-index BeSS 0.625(0.022) 0.634(0.02) 0.644(0.021)
mBeSS 0.627(0.023) 0.637(0.021) 0.644(0.02)
AUC BeSS 0.603(0.028) 0.612(0.026) 0.624(0.026)
mBeSS 0.605(0.029) 0.615(0.028) 0.625(0.026)
k=2 k=4 k=5
10846 C-index BeSS 0.607(0.05) 0.621(0.053) 0.626(0.054)
mBeSS 0.616(0.046) 0.625(0.043) 0.630(0.053)
AUC BeSS 0.608(0.057) 0.627(0.06) 0.628(0.060)
mBeSS 0.619(0.056) 0.628(0.052) 0.633(0.058)
According to the above experimental data, the high-dimensional medical data feature subset screening method mBeSS provided by the invention can not only efficiently search for variable combinations of a given size in high-dimensional medical data (especially small samples), but can also automatically identify the size of the optimal feature subset; the identified feature subset is not easily affected by overfitting and has good extrapolation ability. Its predictive power and model size are usually better than those of the common methods BeSS and LASSO. High-dimensional data are extremely common in medical research, and the method provided by the invention has high computation speed, good prediction effect, strong interpretability and broad application prospects.
Based on the above embodiments, the high-dimensional medical data feature subset screening method provided by the invention constructs a consistency scoring model based on the prediction performance and the occurrence frequency of the feature variables. When the number K of feature variables in the optimal feature subset is a fixed value, the candidate feature subset with the highest consistency score is obtained as the optimal feature subset. When the number K of feature variables in the optimal feature subset is unknown, the candidate feature subset with the highest consistency score is obtained as the preferred K feature subset for the current number of feature variables K, and a truncated mean is calculated as the prediction performance of the preferred K feature subset. If the prediction performance of the preferred K feature subset converges, the current preferred K feature subset is output as the optimal feature subset, and the number of feature variables of the optimal feature subset is determined to be K; if it does not converge, K=K+1 is updated, a plurality of pairs of training sets and corresponding test sets are obtained again, and a new preferred K feature subset is selected, until the prediction performance of the preferred K feature subset converges.
According to the high-dimensional medical data feature subset screening method, when the number of feature variables in the optimal feature subset is unknown, a sampling strategy, consistency scores and prediction performance are fused on the basis of the BeSS algorithm to determine the number and the composition of the feature variables of the optimal feature subset. The number of feature variables is initialized to K = 1 and incremented (K = K + 1) iteratively, obtaining the prediction performance of the feature subset for each number of feature variables until the prediction performance converges, which yields the current number of feature variables K and the corresponding optimal feature subset. Because the number of feature variables in the optimal feature subset is identified iteratively, the identification result is not easily affected by overfitting, and the method has good extrapolation ability, high computation speed and strong interpretability. The invention builds a consistency scoring model based on the occurrence frequency of the feature variables and the prediction performance: the occurrence frequency of each variable across all variable combinations is counted, each variable is assigned a score, and the sum of the scores of the different variables in each combination is calculated, which ensures that the selected combination is stable; integrating the prediction performance at the same time ensures that the selected combination also has good predictive ability. The invention can also fix the number of feature variables in the optimal feature subset by presetting the value of K, so that the optimal feature subset can be obtained with a fixed number of feature variables. The high-dimensional medical data feature subset screening method can automatically identify the number of feature variables of the optimal feature subset and can also obtain the optimal feature subset when that number is fixed, which broadens its applicable scenarios and gives it good application prospects.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and are not intended to limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to exhaustively enumerate all embodiments here; obvious variations or modifications derived therefrom by those skilled in the art remain within the scope of the invention.

Claims (9)

1. A method for screening a feature subset of high-dimensional medical data, comprising, when the number K of feature variables in an optimal feature subset is unknown:
S1: acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a test set based on a random sampling strategy; repeating the division to obtain a plurality of pairs of training sets and corresponding test sets;
S2: selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; acquiring, based on the plurality of training sets, a plurality of candidate feature subsets to form a candidate feature set; wherein the initial value of K is 1;
S3: calculating the prediction performance of each candidate feature subset on its corresponding test set by using a preset regression prediction model;
S4: constructing a consistency scoring model based on the prediction performance of the candidate feature subsets and the occurrence frequency, in the candidate feature set, of each feature variable of the candidate feature subsets, and acquiring the candidate feature subset with the highest consistency score as the preferred K-feature subset for the current number K of feature variables;
S5: calculating the trimmed mean of the prediction performances of all candidate feature subsets in the candidate feature set as the prediction performance of the preferred K-feature subset;
S6: judging whether the prediction performance of the preferred K-feature subset has converged:
if it has converged, outputting the current preferred K-feature subset as the optimal feature subset, and determining the number of feature variables of the optimal feature subset to be K;
if it has not converged, updating K=K+1 and returning to step S1 to obtain a plurality of pairs of training sets and corresponding test sets and to select a new preferred K-feature subset, until the prediction performance of the preferred K-feature subset converges, whereupon the current preferred K-feature subset is taken as the optimal feature subset and the number of feature variables of the optimal feature subset is determined to be K.
2. The method of claim 1, wherein the consistency scoring model constructed based on the prediction performance of the candidate feature subsets and the occurrence frequency, in the candidate feature set, of each feature variable of the candidate feature subsets is expressed as:
$$S_j \;=\; \omega\sum_{i=1}^{K} f\!\left(X_i^{j}\right) \;+\; \left(1-\omega\right)P_j, \qquad f\!\left(X_i^{j}\right) \;=\; \frac{1}{M}\sum_{m=1}^{M} I\!\left(X_i^{j}\in F_m\right)$$
wherein $S_j$ denotes the consistency score of the $j$-th candidate feature subset, the $j$-th candidate feature subset being acquired from the $j$-th training set; $X_i^{j}$ denotes the $i$-th feature variable in the $j$-th candidate feature subset, $i=1,\dots,K$, and $K$ denotes the total number of feature variables in the candidate feature subset; $F_m$ denotes the $m$-th candidate feature subset in the candidate feature set; $f(X_i^{j})$ denotes the occurrence frequency of the feature variable $X_i^{j}$ in the candidate feature set, the indicator $I(X_i^{j}\in F_m)$ being 1 if $F_m$ comprises $X_i^{j}$ and 0 otherwise; $M$ denotes the number of candidate feature subsets in the candidate feature set, $m=1,\dots,M$, $j=1,\dots,M$; $P_j$ denotes the prediction performance of the $j$-th candidate feature subset; and $\omega$ denotes the weight.
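Purely as a worked numerical illustration of this scoring rule, and not as part of the claim, the following Python fragment evaluates the score for a handful of invented candidate subsets; the subsets, prediction performances and weight are made-up example values.

import numpy as np

candidate_subsets = [(0, 3, 7), (0, 3, 9), (0, 3, 7), (1, 3, 7)]   # M = 4 subsets, K = 3
performances = [0.81, 0.78, 0.83, 0.74]                            # P_j on the paired test sets
omega = 0.5                                                        # weight (assumed value)

M = len(candidate_subsets)
# f(v): proportion of candidate subsets that contain feature variable v
freq = {v: sum(v in s for s in candidate_subsets) / M
        for s in candidate_subsets for v in s}

def consistency_score(subset, perf):
    # frequency term summed over the subset's variables, plus weighted prediction performance
    return omega * sum(freq[v] for v in subset) + (1 - omega) * perf

scores = [consistency_score(s, p) for s, p in zip(candidate_subsets, performances)]
best = candidate_subsets[int(np.argmax(scores))]                   # subset with the highest score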
3. The method of claim 1, wherein the acquiring a high-dimensional medical data set, dividing the high-dimensional medical data set into a training set and a test set based on a random sampling strategy, and repeating the division to obtain a plurality of pairs of training sets and corresponding test sets comprises:
randomly extracting a preset number of samples from the high-dimensional medical data set to form a training set, the remaining samples forming the corresponding test set;
and repeating the random extraction a plurality of times to obtain a plurality of training sets and a plurality of corresponding test sets.
4. The method for screening a feature subset of high-dimensional medical data according to claim 1, wherein the selecting, by using a BeSS algorithm, K feature variables from each training set to form a candidate feature subset comprises:
assigning, based on the BeSS algorithm and by means of a Taylor expansion, a contribution value to each feature variable in the training set; and selecting the feature variables corresponding to the K largest contribution values, in descending order of contribution value, to form the candidate feature subset.
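As a non-limiting sketch of the contribution-value idea (not the full BeSS primal-dual active-set algorithm), the fragment below scores each feature variable by a second-order Taylor expansion of the logistic log-likelihood at the null model and keeps the K highest-scoring variables; the data and the specific scoring formula g_i^2/(2*h_i) are illustrative assumptions.

import numpy as np

def taylor_contributions(X, y):
    # Contribution of each feature ~ g_i^2 / (2 h_i): squared gradient over curvature of
    # the logistic log-likelihood with respect to beta_i, evaluated at the null model.
    p0 = np.full(len(y), y.mean())           # null-model predicted probability
    g = X.T @ (y - p0)                       # gradient of the log-likelihood
    h = (X ** 2).T @ (p0 * (1 - p0))         # diagonal of the negative Hessian
    return g ** 2 / (2 * h)

def select_top_k(X, y, k):
    scores = taylor_contributions(X, y)
    return np.sort(np.argsort(-scores)[:k])  # indices of the K largest contributions

# Illustrative use on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))
y = (X[:, 3] - X[:, 7] + rng.standard_normal(100) > 0).astype(float)
print(select_top_k(X, y, k=2))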
5. The method of claim 1, wherein the preset regression prediction model is a logistic model if the outcome of the high-dimensional medical data is a classification outcome.
6. The method of claim 5, wherein the calculating the prediction performance of each candidate feature subset on its corresponding test set comprises:
training a logistic model by using the training set corresponding to the candidate feature subset;
making predictions with the trained logistic model on the test set corresponding to the candidate feature subset to obtain the prediction performance;
wherein the prediction performance includes the accuracy (Acc) of the logistic model and the area under the receiver operating characteristic curve (AUC).
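A minimal sketch of this evaluation step for a classification outcome, assuming scikit-learn's LogisticRegression as the logistic model and placeholder variable names, might look as follows.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_classification_subset(X_train, y_train, X_test, y_test, subset):
    # Fit the logistic model on the training set restricted to the candidate subset,
    # then score it on the paired test set.
    model = LogisticRegression(max_iter=1000).fit(X_train[:, subset], y_train)
    acc = accuracy_score(y_test, model.predict(X_test[:, subset]))
    auc = roc_auc_score(y_test, model.predict_proba(X_test[:, subset])[:, 1])
    return acc, auc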
7. The method of claim 1, wherein the preset regression prediction model is a Cox model if the outcome of the high-dimensional medical data is a survival outcome.
8. The method of claim 7, wherein the calculating the prediction performance of each candidate feature subset on its corresponding test set comprises:
training a Cox model by using the training set corresponding to the candidate feature subset;
making predictions with the trained Cox model on the test set corresponding to the candidate feature subset to obtain the prediction performance;
wherein the prediction performance includes the concordance index (C-index) of the Cox model and the area under the receiver operating characteristic curve (AUC) with respect to the median survival time.
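A minimal sketch of this evaluation step for a survival outcome, assuming the lifelines package for the Cox model and assumed column names for the survival time and event indicator, might look as follows; only the C-index part of the stated prediction performance is shown.

from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def evaluate_survival_subset(train_df, test_df, subset_cols,
                             duration_col="time", event_col="event"):
    # Fit the Cox model on the training set restricted to the candidate subset,
    # then compute the C-index on the paired test set.
    cols = list(subset_cols) + [duration_col, event_col]
    cph = CoxPHFitter().fit(train_df[cols], duration_col=duration_col, event_col=event_col)
    # Higher partial hazard implies shorter expected survival, hence the minus sign.
    risk = cph.predict_partial_hazard(test_df[list(subset_cols)])
    return concordance_index(test_df[duration_col], -risk, test_df[event_col])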
9. A method for screening a feature subset of high-dimensional medical data, characterized in that, when the number K of feature variables in the optimal feature subset is a fixed value, the method comprises:
acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a test set based on a random sampling strategy; repeating the division to obtain a plurality of pairs of training sets and corresponding test sets;
selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; acquiring, based on the plurality of training sets, a plurality of candidate feature subsets to form a candidate feature set;
calculating the prediction performance of each candidate feature subset on its corresponding test set by using a preset regression prediction model;
and constructing a consistency scoring model based on the prediction performance of the candidate feature subsets and the occurrence frequency, in the candidate feature set, of each feature variable of the candidate feature subsets, and acquiring the candidate feature subset with the highest consistency score as the optimal feature subset.
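Under the same assumptions as the earlier illustrative sketches (a scikit-learn stand-in for the BeSS step, AUC as the prediction performance, invented split count and weight), the fixed-K variant reduces to a single pass with no convergence loop, for example:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def screen_fixed_k(X, y, k, n_splits=50, weight=0.5):
    subsets, perfs = [], []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y)
        coef = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).coef_.ravel()
        s = list(np.sort(np.argsort(-np.abs(coef))[:k]))    # stand-in for the BeSS step
        model = LogisticRegression(max_iter=1000).fit(X_tr[:, s], y_tr)
        subsets.append(s)
        perfs.append(roc_auc_score(y_te, model.predict_proba(X_te[:, s])[:, 1]))
    freq = {v: sum(v in t for t in subsets) / n_splits
            for t in subsets for v in t}                     # occurrence frequencies
    scores = [weight * sum(freq[v] for v in s) + (1 - weight) * p
              for s, p in zip(subsets, perfs)]               # consistency scores
    return subsets[int(np.argmax(scores))]                   # optimal feature subset for fixed K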
CN202311824917.9A 2023-12-28 2023-12-28 High-dimensional medical data feature subset screening method Active CN117497198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311824917.9A CN117497198B (en) 2023-12-28 2023-12-28 High-dimensional medical data feature subset screening method

Publications (2)

Publication Number Publication Date
CN117497198A true CN117497198A (en) 2024-02-02
CN117497198B CN117497198B (en) 2024-03-01

Family

ID=89680375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311824917.9A Active CN117497198B (en) 2023-12-28 2023-12-28 High-dimensional medical data feature subset screening method

Country Status (1)

Country Link
CN (1) CN117497198B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971240A (en) * 2017-03-16 2017-07-21 河海大学 The short-term load forecasting method that a kind of variables choice is returned with Gaussian process
CN110765418A (en) * 2019-10-09 2020-02-07 清华大学 Intelligent set evaluation method and system for basin water and sand research model
CN114334033A (en) * 2021-12-31 2022-04-12 广东海洋大学 Screening method, system and terminal for molecular descriptors of anti-breast cancer candidate drugs
CN114724715A (en) * 2022-04-12 2022-07-08 南京邮电大学 Multi-machine learning model feature selection method based on optimal AUC

Also Published As

Publication number Publication date
CN117497198B (en) 2024-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant