CN117497198A - High-dimensional medical data feature subset screening method - Google Patents

High-dimensional medical data feature subset screening method

Info

Publication number
CN117497198A
Authority
CN
China
Prior art keywords
feature
subset
feature subset
candidate feature
candidate
Prior art date
Legal status
Granted
Application number
CN202311824917.9A
Other languages
Chinese (zh)
Other versions
CN117497198B (en)
Inventor
柯朝甫
吴陆颖
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202311824917.9A priority Critical patent/CN117497198B/en
Publication of CN117497198A publication Critical patent/CN117497198A/en
Application granted granted Critical
Publication of CN117497198B publication Critical patent/CN117497198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211: Selection of the most significant subset of features
    • G06F 18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F 18/2115: Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/217: Validation; Performance evaluation; Active pattern learning techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of high-dimensional medical data processing and discloses a high-dimensional medical data feature subset screening method. Building on the BeSS algorithm, the method integrates a sampling strategy, consistency scores and prediction performance to determine both the number and the composition of the feature variables in the optimal feature subset. When K is unknown, the number of feature variables is initialized to 1 and incremented (K = K + 1) iteratively, obtaining the prediction performance of the feature subset for each candidate number of feature variables until the prediction performance converges, which yields the current number of feature variables K and the corresponding optimal feature subset. When K is fixed and known, the candidate feature subset with the highest consistency score is taken as the optimal feature subset. The high-dimensional medical data feature subset screening method can automatically identify the number of feature variables of the optimal feature subset and can also obtain the optimal feature subset when that number is fixed.

Description

High-dimensional medical data feature subset screening method
Technical Field
The invention relates to the technical field of high-dimensional medical data processing, in particular to a method for screening a feature subset of high-dimensional medical data.
Background
With the rapid development of high-throughput detection technology and information technology, a large amount of high-dimensional data has emerged in the medical field, such as various kinds of omics data (genomics, metabolomics, transcriptomics, etc.). Such high-dimensional data contain rich information and therefore offer a great opportunity for accurate disease prediction, but they also pose great challenges for data analysis. High-dimensional data typically have high dimensionality, small sample sizes and high noise. How to screen the optimal feature subset from such data, so that the constructed prediction model has strong interpretability and prediction accuracy, is a major difficulty in statistical analysis.
Currently, feature subset screening methods can be divided into three major categories: filter methods, wrapper methods and embedded methods. Filter methods screen the feature subset through feature evaluation, feature ranking, feature selection and filtering; although they are highly general, fast and computationally cheap, they cannot effectively identify complex interactions among predictor variables and therefore do not always screen out the optimal feature subset. Wrapper methods mainly comprise two steps: (1) constructing a prediction model, and (2) obtaining an optimal feature subset based on some algorithm during the construction process; however, their computation time is long, and in high-dimensional data, especially with relatively large samples, the time cost is hard to bear. Embedded methods combine the advantages of filter and wrapper methods: they perform well on the feature subset screening problem while keeping the computational cost low.
In low-dimensional data the number of features is small, and the computational cost of feature subset screening is far lower than in high-dimensional data. Many scholars have proposed nearly exhaustive approaches, such as genetic algorithms, to better address the optimal feature subset screening problem. As the dimensionality increases rapidly, however, the computational cost of a nearly exhaustive algorithm becomes prohibitive, and different solutions have been proposed. The most classical is the LASSO algorithm, which uses a regularization strategy to construct an L1 penalty that forces the coefficients of variables with small effects to zero, so that the estimated regression parameters shrink to zero more easily and variable screening is achieved. Its low computational cost has made LASSO a popular method for high-dimensional feature subset screening. Because LASSO focuses on searching for the best combination, the individual variable coefficients of its model are less interpretable. Meanwhile, the LASSO algorithm requires the regularization parameter lambda to be tuned. Selecting an appropriate lambda value is a challenge and requires tuning in combination with cross-validation or cyclic coordinate descent. If the parameter is not chosen properly, the selected feature subset may be inaccurate, and a valid feature subset may not be screened out at all. The LASSO algorithm can also be unstable during feature screening, being sensitive to small changes in the data. In addition, some non-parametric algorithms are widely used on high-dimensional data, such as backward elimination based on random forests. Compared with traditional linear models, their interpretability is relatively poor, and as the sample size and variable dimensionality increase, the computational cost rises rapidly, so they are rather limited in many application scenarios.
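For illustration only, the following minimal scikit-learn sketch shows the kind of L1-penalized (LASSO-type) screening described above for a binary outcome; the function name and the value of the inverse regularization strength C are illustrative assumptions, and in practice C (playing the role of 1/lambda) would be tuned, e.g. by cross-validation, as noted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_screen(X, y, C=0.1):
    """L1-penalized logistic regression: coefficients of weakly contributing
    variables are shrunk exactly to zero, so the indices of the nonzero
    coefficients define the selected feature subset."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    model.fit(X, y)
    return np.flatnonzero(model.coef_.ravel())   # indices of selected variables
```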
Recently, scholars have proposed the BeSS algorithm, which solves the variable screening problem for the logistic model and the Cox proportional hazards model based on a primal-dual active set (PDAS) strategy. Based on the loss function of the model, the method quantitatively compares and scores the importance of each variable to the model, and performs variable screening and model fitting through the complementary information between the primal and dual variables. In high-dimensional space, the BeSS algorithm can quickly search for the best variable combination, and its sequential and golden-section search strategies can quickly determine the model size. The BeSS algorithm has a low computational cost, and its strategy of determining the optimal subset by ranking the feature contributions within the model has a certain advantage in interpretability over the regularization strategy of LASSO.
However, BeSS is a newly proposed embedded algorithm; although it runs fast and is well interpretable, when applied to high-dimensional medical data (especially small samples), the stability of the variable screening result under a fixed model size is poor and overfitting easily occurs, so the prediction effect is not ideal.
Meanwhile, when the BeSS algorithm determines the best model size, its judgment criterion is not sensitive enough, so the best size cannot always be found. There are two main ways of searching for the best model size: (1) determining the model size based on the loss function (BeSS.gs); the best feature subset identified in this manner tends to be large, so it contains too many false-positive variables, which reduces predictive power. (2) Using a metric based on the goodness of fit of the statistical model (BeSS.seq); this strategy is more stringent and tends to make the feature subset lose some variables that have real predictive power.
Disclosure of Invention
Therefore, the invention aims to solve the technical problems of the prior art: the searched feature subset is unstable, overfitting easily occurs, and the number of optimal variables is identified inaccurately, so that the prediction ability of the subset is reduced and the prediction effect is not ideal.
In order to solve the above technical problem, the invention provides a high-dimensional medical data feature subset screening method; when the number K of feature variables in the optimal feature subset is unknown, the method comprises the following steps:
s1: acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a testing set based on a random sampling strategy; repeatedly dividing to obtain a plurality of pairs of training sets and corresponding test sets;
s2: selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; based on a plurality of training sets, acquiring a plurality of candidate feature subsets to form a candidate feature set; wherein the initial value of K is 1;
s3: calculating the prediction performance of each candidate feature subset in the corresponding test set by using a preset regression prediction model;
s4: based on the predicted performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets in the candidate feature sets, constructing a consistency scoring model, and acquiring the candidate feature subset with the highest consistency score as a preferable K feature subset when the number of the feature variables is K;
s5: calculating the truncated average of the predicted performances of all candidate feature subsets in the candidate feature set as the predicted performances of the preferred K feature subset;
s6: judging whether the predicted performance of the preferred K feature subset converges or not:
if converged, outputting the current preferred K feature subset as the optimal feature subset, and determining the number of feature variables of the optimal feature subset to be K;
if not converged, updating K=K+1 and returning to step S1 to obtain a plurality of pairs of training sets and corresponding test sets and select a new preferred K feature subset, until the prediction performance of the preferred K feature subset converges; the current preferred K feature subset is then taken as the optimal feature subset, and the number of feature variables of the optimal feature subset is determined to be K.
In one embodiment of the present invention, the consistency score model is constructed based on the predicted performance of the candidate feature subset and the occurrence frequency of each feature variable in the candidate feature subset, and is expressed as:
S_j = \sum_{i=1}^{K} f(x_{ji}) + w \cdot P_j, with f(x_{ji}) = (1/M) \sum_{m=1}^{M} I(x_{ji} \in A_m),
wherein S_j denotes the consistency score of the j-th group candidate feature subset, the j-th group candidate feature subset being obtained from the j-th group training set; x_{ji} denotes the i-th feature variable in the j-th group candidate feature subset, i = 1, ..., K, and K denotes the total number of feature variables in the candidate feature subset; A_m denotes the m-th group candidate feature subset; f(x_{ji}) denotes the occurrence frequency of the feature variable x_{ji} in the candidate feature set, where the indicator I(x_{ji} \in A_m) is 1 if A_m contains x_{ji} and 0 otherwise; M denotes the number of candidate feature subsets in the candidate feature set, j = 1, ..., M, m = 1, ..., M; P_j denotes the prediction performance of the j-th group candidate feature subset on its corresponding test set; and w denotes the weight.
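As a purely illustrative sketch of how such a consistency score could be computed (the function name, the data structures and the default weight are assumptions and are not part of the description above):

```python
from collections import Counter

def consistency_scores(candidate_subsets, performances, weight=1.0):
    """Score each candidate feature subset by the summed occurrence frequency
    of its variables across all candidate subsets plus a weighted prediction
    performance term.

    candidate_subsets : list of sets of feature identifiers, one per training set
    performances      : prediction performance of each subset on its test set
    weight            : weight w of the prediction-performance term
    """
    m = len(candidate_subsets)
    counts = Counter(x for subset in candidate_subsets for x in subset)
    freq = {x: c / m for x, c in counts.items()}          # f(x): occurrence frequency
    return [sum(freq[x] for x in subset) + weight * perf  # stability + weighted performance
            for subset, perf in zip(candidate_subsets, performances)]
```

The candidate feature subset with the highest returned score would then be taken as the preferred K feature subset.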
In one embodiment of the present invention, acquiring a high-dimensional medical data set, dividing the high-dimensional medical data set into a training set and a test set based on a random sampling strategy, and repeating the division to obtain a plurality of pairs of training sets and their corresponding test sets comprises:
randomly drawing a preset number of samples (observations) from the high-dimensional medical data set to form a training set, with the remaining samples forming the corresponding test set;
and repeating the random drawing a plurality of times to obtain a plurality of new training sets and their corresponding test sets.
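A minimal sketch of such repeated random splitting, assuming the data are held in NumPy arrays; the number of repetitions and the training fraction are illustrative defaults (the real-data examples later use roughly two-thirds of the observations for training):

```python
import numpy as np

def repeated_random_splits(X, y, n_splits=100, train_fraction=2/3, seed=0):
    """Repeatedly draw observations at random (without replacement) to form a
    training set; the remaining observations form the matching test set."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(round(train_fraction * n))
    splits = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        train_idx, test_idx = idx[:n_train], idx[n_train:]
        splits.append(((X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])))
    return splits
```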
In one embodiment of the present invention, the selecting K feature variables in each training set to form a candidate feature subset by using a BeSS algorithm includes:
based on the BeSS algorithm, using a Taylor expansion to assign a contribution value to each feature variable in the training set; then, sorting the contribution values from largest to smallest, taking the feature variables corresponding to the first K contribution values to form a candidate feature subset.
In an embodiment of the present invention, if the outcome of the high-dimensional medical data is a classification outcome, the preset regression prediction model is a logistic model.
In one embodiment of the invention, the computing the predicted performance of each candidate feature subset in its corresponding test set includes:
training a logistic model by using a training set corresponding to the candidate feature subset;
evaluating the trained logistic model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the accuracy (Acc) of the logistic model and the area under the receiver operating characteristic curve (AUC).
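A minimal sketch of this evaluation for a classification outcome, assuming scikit-learn and NumPy arrays; the helper name and the plain logistic regression settings are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def logistic_performance(X_train, y_train, X_test, y_test, features):
    """Fit a logistic model on a candidate feature subset and report its
    accuracy (Acc) and AUC on the corresponding test set."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, features], y_train)
    pred = model.predict(X_test[:, features])               # class labels for Acc
    prob = model.predict_proba(X_test[:, features])[:, 1]   # scores for AUC
    return accuracy_score(y_test, pred), roc_auc_score(y_test, prob)
```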
In an embodiment of the present invention, if the high-dimensional medical data is a survival outcome, the preset regression prediction model is a Cox model.
In one embodiment of the invention, the computing the predicted performance of each candidate feature subset in its corresponding test set includes:
training a Cox model by utilizing a training set corresponding to the candidate feature subset;
evaluating the trained Cox model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the concordance index (C-index) of the Cox model and the area under the receiver operating characteristic curve (AUC) at the median survival time.
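A minimal sketch of the corresponding evaluation for a survival outcome, assuming the lifelines package and pandas DataFrames with time and event columns; the column names are assumptions, and the AUC at the median survival time is omitted for brevity:

```python
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def cox_performance(train_df, test_df, features, time_col="time", event_col="event"):
    """Fit a Cox model on a candidate feature subset and report the
    concordance index (C-index) on the corresponding test set."""
    cph = CoxPHFitter()
    cph.fit(train_df[features + [time_col, event_col]],
            duration_col=time_col, event_col=event_col)
    risk = cph.predict_partial_hazard(test_df[features])
    # higher partial hazard means shorter expected survival, hence the minus sign
    return concordance_index(test_df[time_col], -risk, test_df[event_col])
```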
An embodiment of the invention also provides a high-dimensional medical data feature subset screening method; when the number K of feature variables in the optimal feature subset is a fixed value, the method comprises the following steps:
acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a testing set based on a random sampling strategy; repeatedly dividing to obtain a plurality of pairs of training sets and corresponding test sets;
selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; based on a plurality of training sets, acquiring a plurality of candidate feature subsets to form a candidate feature set;
calculating the prediction performance of each candidate feature subset in the corresponding test set by using a preset regression prediction model;
and constructing a consistency scoring model based on the predicted performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets in the candidate feature sets, and acquiring the candidate feature subset with the highest consistency score as an optimal feature subset.
Compared with the prior art, the technical scheme of the invention has the following advantages:
according to the high-dimensional medical data feature subset screening method, when the number of feature variables in the optimal feature subset is unknown, a sampling strategy, consistency scores and prediction performances are fused on the basis of a BeSS algorithm to determine the number and the composition of the feature variables of the optimal feature subset; initializing the number K=1 of the feature variables of the optimal feature subset, enabling the number K=K+1 to be iterative, obtaining the prediction performance of the feature subset under different feature variable numbers until the prediction performance converges, and obtaining the current feature variable number K and the optimal feature subset corresponding to the current feature variable number K; the number of the feature variables in the optimal feature subset is iteratively identified, so that the identification result is not easily influenced by overfitting, and the method has good extrapolation capability, high calculation speed and high interpretation. The invention builds a consistency scoring model based on the occurrence frequency of the characteristic variable and the prediction performance, wherein the occurrence frequency of the characteristic variable refers to the occurrence frequency of each variable in all variable combinations by statistics, the variable is assigned, and the sum of scores of different variables in each combination is calculated, so that the group of prediction stability is ensured to be strong, and meanwhile, the group of prediction capability is ensured to be good when the prediction performance is integrated. The invention can also fix the number of the feature variables in the optimal feature subset by presetting the K value, thereby realizing the acquisition of the optimal feature subset under the condition of fixing the feature variables.
The high-dimensional medical data feature subset screening method can automatically identify the number of feature variables of the optimal feature subset and can also obtain the optimal feature subset when that number is fixed, which broadens its applicable scenarios and gives it good application prospects.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings, in which
FIG. 1 is a flow chart of the steps of a method for screening feature subsets of high-dimensional medical data provided by the invention when the number of feature variables in an optimal feature subset is unknown;
FIG. 2 is a flow chart of the steps of the method for screening feature subsets of high-dimensional medical data provided by the invention when the number of feature variables in the optimal feature subset is fixed;
fig. 3 is a flowchart of an implementation of the mBeSS method provided by the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
Referring to fig. 1, a flowchart of steps of a high-dimensional medical data feature subset screening method of the present invention, when the number K of feature variables in an optimal feature subset is unknown, specific steps include:
s1: acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a testing set based on a random sampling strategy; repeatedly dividing to obtain a plurality of pairs of training sets and corresponding test sets;
the random sampling strategy is to randomly draw a preset number of samples (observations) from the high-dimensional medical data set to form a training set, with the remaining samples forming the corresponding test set; the random drawing is repeated a plurality of times to obtain a plurality of new training sets and their corresponding test sets;
s2: selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; based on a plurality of training sets, acquiring a plurality of candidate feature subsets to form a candidate feature set; wherein the initial value of K is 1;
s3: calculating the prediction performance of each candidate feature subset in the corresponding test set by using a preset regression prediction model;
s4: based on the predicted performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets in the candidate feature sets, constructing a consistency scoring model, and acquiring the candidate feature subset with the highest consistency score as a preferable K feature subset when the number of the feature variables is K;
s5: calculating the truncated average of the predicted performances of all candidate feature subsets in the candidate feature set as the predicted performances of the preferred K feature subset;
s6: judging whether the predicted performance of the preferred K feature subset converges or not:
if converged, outputting the current preferred K feature subset as the optimal feature subset, and determining the number of feature variables of the optimal feature subset to be K;
if not converged, updating K=K+1 and returning to step S1 to obtain a plurality of pairs of training sets and corresponding test sets and select a new preferred K feature subset, until the prediction performance of the preferred K feature subset converges; the current preferred K feature subset is then taken as the optimal feature subset, and the number of feature variables of the optimal feature subset is determined to be K.
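For orientation only, the following sketch strings steps S1-S6 together; it reuses the repeated_random_splits and consistency_scores helpers sketched above, while select_top_k (standing in for the BeSS contribution-value ranking) and evaluate (the preset regression prediction model) are assumed callables, and the convergence tolerance is an illustrative assumption:

```python
def screen_optimal_subset(X, y, select_top_k, evaluate,
                          n_splits=100, weight=1.0, tol=0.0, max_k=50):
    """Iteratively increase K (steps S1-S6) until the trimmed-mean prediction
    performance of the preferred K feature subset stops improving."""
    prev_perf, best_subset, k = None, None, 0
    while k < max_k:
        k += 1
        splits = repeated_random_splits(X, y, n_splits)             # S1
        subsets, perfs = [], []
        for (X_tr, y_tr), (X_te, y_te) in splits:
            feats = frozenset(select_top_k(X_tr, y_tr, k))          # S2
            subsets.append(feats)
            perfs.append(evaluate(X_tr, y_tr, X_te, y_te, sorted(feats)))  # S3
        scores = consistency_scores(subsets, perfs, weight)         # S4
        best_subset = sorted(subsets[scores.index(max(scores))])    # preferred K subset
        trimmed = sorted(perfs)[1:-1]                               # S5: drop min and max
        perf_k = sum(trimmed) / len(trimmed)
        if prev_perf is not None and perf_k - prev_perf <= tol:     # S6: converged
            break
        prev_perf = perf_k
    return best_subset, k
```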
Specifically, in the present embodiment, a consistency scoring model is constructed based on the prediction performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets within the candidate feature set, expressed as:
S_j = \sum_{i=1}^{K} f(x_{ji}) + w \cdot P_j, with f(x_{ji}) = (1/M) \sum_{m=1}^{M} I(x_{ji} \in A_m),
wherein S_j denotes the consistency score of the j-th group candidate feature subset, the j-th group candidate feature subset being obtained from the j-th group training set; x_{ji} denotes the i-th feature variable in the j-th group candidate feature subset, i = 1, ..., K, and K denotes the total number of feature variables in the candidate feature subset; A_m denotes the m-th group candidate feature subset; f(x_{ji}) denotes the occurrence frequency of the feature variable x_{ji} in the candidate feature set, where the indicator I(x_{ji} \in A_m) is 1 if A_m contains x_{ji} and 0 otherwise; M denotes the number of candidate feature subsets in the candidate feature set, j = 1, ..., M, m = 1, ..., M; P_j denotes the prediction performance of the j-th group candidate feature subset on its corresponding test set; and w denotes the weight.
The consistency scoring model of this embodiment counts the frequency of occurrence of each feature variable in all candidate feature subsets (i.e., variable combinations), assigns a score to each feature variable accordingly, and calculates the sum of the scores of the different variables in each candidate feature subset. To prevent the selected candidate feature subset from being distorted, the result is fine-tuned and controlled by adding the weighted prediction-performance score.
Specifically, in this embodiment, in step S2, K feature variables are selected in each training set by using the BeSS algorithm to construct a candidate feature subset, and a plurality of candidate feature subsets are obtained based on the plurality of training sets to construct the candidate feature set, including:
s2-1: based on a BeSS algorithm, utilizing Taylor expansion to endow contribution values for each characteristic variable in a training set;
s2-2: according to the order of the contribution values from big to small, obtaining feature variables corresponding to the first K contribution values to form a candidate feature subset
S2-3: a candidate feature subset is obtained based on each training set, and all the candidate feature subsets form a candidate feature set.
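A minimal sketch of steps S2-1 to S2-3, in which the per-feature contribution values are simply taken as input (in mBeSS they would come from the Taylor expansion of the BeSS loss, which is not reproduced here); the names and structure are illustrative:

```python
import numpy as np

def top_k_by_contribution(contributions, k):
    """Keep the K features with the largest contribution values (S2-2)."""
    order = np.argsort(np.asarray(contributions))[::-1]   # descending by contribution
    return order[:k].tolist()

def build_candidate_feature_set(training_sets, contribution_fn, k):
    """One candidate feature subset per training set; together they form the
    candidate feature set (S2-1 to S2-3)."""
    return [frozenset(top_k_by_contribution(contribution_fn(X, y), k))
            for X, y in training_sets]
```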
Specifically, in step S3, an appropriate preset regression prediction model is selected according to the type of the high-dimensional medical data: when the outcome of the high-dimensional medical data is a classification outcome, the preset regression prediction model is a logistic model; when the outcome is a survival outcome, the preset regression prediction model is a Cox model.
Based on the logistic model, calculating the predicted performance of each candidate feature subset in the corresponding test set, wherein the method specifically comprises the following steps:
training a logistic model by using a training set corresponding to the candidate feature subset;
evaluating the trained logistic model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the accuracy (Acc) of the logistic model and the area under the receiver operating characteristic curve (AUC).
Based on the Cox model, calculating the predicted performance of each candidate feature subset in the corresponding test set, wherein the method specifically comprises the following steps:
training a Cox model by utilizing a training set corresponding to the candidate feature subset;
evaluating the trained Cox model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the concordance index (C-index) of the Cox model and the area under the receiver operating characteristic curve (AUC) at the median survival time.
In this embodiment, following the measures of predictive accuracy commonly used for medical data, the area under the receiver operating characteristic curve (AUC) is used for the logistic model, and the concordance index (C-index) is used as the measure of predictive ability for the Cox proportional hazards model. In addition, prediction performance is not a perfectly stable indicator. To search for the optimal feature subset accurately, the first highest peak found is adopted and subsequent peaks are treated as alternatives; if an alternative peak is lower than the previous one, the search is stopped, and the model size of the highest peak and its corresponding subset are output. Meanwhile, a parameter is set so that a larger feature subset is selected only after a certain minimum improvement has been achieved, ensuring that enlarging the model is practically meaningful.
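One possible reading of this stopping rule, as an illustrative sketch (the minimum-improvement value and the function name are assumptions):

```python
def select_model_size(perf_by_k, min_gain=0.01):
    """Walk through increasing model sizes, keep the first performance peak,
    accept a larger subset only after a sufficient improvement, and stop once
    a later candidate peak falls below the best one seen so far."""
    best_k, best_perf = None, float("-inf")
    for k in sorted(perf_by_k):
        perf = perf_by_k[k]
        if best_k is None or perf >= best_perf + min_gain:
            best_k, best_perf = k, perf      # new (or first) peak worth keeping
        elif perf < best_perf:
            break                            # later peak is lower: stop the search
    return best_k
```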
Specifically, in an embodiment of the present invention, the method for calculating the predicted performance of the preferred K feature subset includes obtaining a truncated average of the predicted performance of all candidate feature subsets in the candidate feature set.
Computing the truncated average of the prediction performance of all candidate feature subsets in the candidate feature set comprises: obtaining the prediction performance of all candidate feature subsets in the candidate feature set; deleting the maximum value and the minimum value of the prediction performance; and calculating the average of the remaining prediction performance values as the truncated average of the prediction performance of all candidate feature subsets in the candidate feature set.
Based on the above embodiment, the present embodiment provides a method for screening a feature subset of high-dimensional medical data under a fixed number of feature variables, with reference to fig. 2, including the specific steps of:
s201: acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a testing set based on a random sampling strategy; repeatedly dividing to obtain a plurality of pairs of training sets and corresponding test sets;
the random sampling strategy is to randomly draw a preset number of samples (observations) from the high-dimensional medical data set to form a training set, with the remaining samples forming the corresponding test set; the random drawing is repeated a plurality of times to obtain a plurality of new training sets and their corresponding test sets;
s202: selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; based on a plurality of training sets, acquiring a plurality of candidate feature subsets to form a candidate feature set;
s203: calculating the prediction performance of each candidate feature subset in the corresponding test set by using a preset regression prediction model;
s204: and constructing a consistency scoring model based on the predicted performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets in the candidate feature sets, and acquiring the candidate feature subset with the highest consistency score as an optimal feature subset.
Specifically, in the present embodiment, a consistency scoring model is constructed based on the prediction performance of the candidate feature subsets and the occurrence frequency of each feature variable in the candidate feature subsets within the candidate feature set, expressed as:
S_j = \sum_{i=1}^{K} f(x_{ji}) + w \cdot P_j, with f(x_{ji}) = (1/M) \sum_{m=1}^{M} I(x_{ji} \in A_m),
wherein S_j denotes the consistency score of the j-th group candidate feature subset, the j-th group candidate feature subset being obtained from the j-th group training set; x_{ji} denotes the i-th feature variable in the j-th group candidate feature subset, i = 1, ..., K, and K denotes the total number of feature variables in the candidate feature subset; A_m denotes the m-th group candidate feature subset; f(x_{ji}) denotes the occurrence frequency of the feature variable x_{ji} in the candidate feature set, where the indicator I(x_{ji} \in A_m) is 1 if A_m contains x_{ji} and 0 otherwise; M denotes the number of candidate feature subsets in the candidate feature set, j = 1, ..., M, m = 1, ..., M; P_j denotes the prediction performance of the j-th group candidate feature subset on its corresponding test set; and w denotes the weight.
The consistency scoring model of this embodiment counts the frequency of occurrence of each feature variable in all candidate feature subsets (i.e., variable combinations), assigns a score to each feature variable accordingly, and calculates the sum of the scores of the different variables in each candidate feature subset. To prevent the selected candidate feature subset from being distorted, the result is fine-tuned and controlled by adding the weighted prediction-performance score.
Specifically, in this embodiment, in step S202, a preset number of feature variables are selected from each training set by using the BeSS algorithm to form a candidate feature subset, and a plurality of candidate feature subsets are obtained based on the plurality of training sets to form the candidate feature set, which specifically includes:
s202-1: based on the BeSS algorithm, using a Taylor expansion within the primal-dual active set framework to assign a contribution value to each feature variable in the training set;
s202-2: sorting the contribution values from largest to smallest and taking the preset number of feature variables with the largest contribution values to form a candidate feature subset;
s202-3: a candidate feature subset is obtained based on each training set, and all the candidate feature subsets form a candidate feature set.
Specifically, in step S203, an appropriate preset regression prediction model is selected according to the type of the high-dimensional medical data: when the outcome of the high-dimensional medical data is a classification outcome, the preset regression prediction model is a logistic model; when the outcome is a survival outcome, the preset regression prediction model is a Cox model.
Based on the logistic model, calculating the predicted performance of each candidate feature subset in the corresponding test set, wherein the method specifically comprises the following steps:
training a logistic model by using a training set corresponding to the candidate feature subset;
evaluating the trained logistic model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the accuracy (Acc) of the logistic model and the area under the receiver operating characteristic curve (AUC).
Based on the Cox model, calculating the predicted performance of each candidate feature subset in the corresponding test set, wherein the method specifically comprises the following steps:
training a Cox model by utilizing a training set corresponding to the candidate feature subset;
evaluating the trained Cox model on the test set corresponding to the candidate feature subset to obtain its prediction performance;
the prediction performance includes the concordance index (C-index) of the Cox model and the area under the receiver operating characteristic curve (AUC) at the median survival time.
Specifically, the method for calculating the prediction performance of the candidate feature subsets, i.e. obtaining the truncated average of the prediction performance of all candidate feature subsets in the candidate feature set, includes: obtaining the prediction performance of all candidate feature subsets in the candidate feature set; deleting the maximum value and the minimum value of the prediction performance; and calculating the average of the remaining prediction performance values as the truncated average of the prediction performance of all candidate feature subsets in the candidate feature set.
In summary, considering that the BeSS method easily overfits and that the stability of the screened variable combinations is poor in high-dimensional data (especially small samples), the invention creatively fuses a sampling strategy, a consistency score and a prediction-evaluation idea on the basis of the BeSS algorithm, and provides an improved optimal feature subset screening method, called mBeSS (modified best subset selection). The method can not only automatically identify the optimal size of the feature subset in high-dimensional medical data and give the optimal feature subset, but can also screen the corresponding optimal feature subset when the number of variables is fixed. In addition, the screening strategy has better resistance to overfitting, and the screened feature subset has better extrapolation ability and interpretability.
Based on the above embodiments, this embodiment sets up a simulation experiment to verify the prediction effect of the optimal feature subset screened by the high-dimensional medical data feature subset screening method mBeSS provided by the invention; the specific implementation steps are shown in fig. 3. The sample size of the simulation experiment is set to 100 or 200. The number of independent variables is set to three levels, namely 1000, 5000 and 10000, mainly to simulate high-dimensional medical and small-sample data. Following the simulation design of the BeSS package, the original independent variables Z are drawn from a standard normal distribution with mean 0 and variance 1, i.e. Z ~ N(0, 1). The independent variables Z are then transformed to generate the independent variables X, with the boundary terms set to 0. The coefficients of 10 independent variables are nonzero, with values drawn from a uniform distribution, and a random term following a normal distribution with mean 1 and variance 5 is added in the simulation. Depending on how the method is used, this embodiment simulates the following two cases:
setting the sample size to 200 in the case where the optimal model size is unknown;
in the case of a fixed model size, the sample size is set to 100, and the random error term in the logistic model is adjusted to ensure a certain predictive power.
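A sketch of a data-generating routine along these lines is given below; the neighbour-correlation transform of Z and the range of the uniform distribution for the nonzero coefficients are not specified above and are therefore assumptions, and only the linear predictor is produced (a binary or survival outcome would be derived from it):

```python
import numpy as np

def simulate_covariates(n=200, p=1000, n_signal=10, rho=0.5, seed=0):
    """Simulated high-dimensional design: Z ~ N(0, 1), X obtained by mixing
    each Z_j with its neighbours (boundary terms set to 0), 10 nonzero
    coefficients drawn from a uniform distribution, error ~ N(mean 1, var 5)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, p))
    Zpad = np.pad(Z, ((0, 0), (1, 1)))                     # zero boundary columns
    X = Z + rho * (Zpad[:, :-2] + Zpad[:, 2:])             # assumed transform of Z
    beta = np.zeros(p)
    signal = rng.choice(p, size=n_signal, replace=False)   # truly informative variables
    beta[signal] = rng.uniform(0.5, 1.5, size=n_signal)    # assumed uniform range
    eps = rng.normal(loc=1.0, scale=np.sqrt(5.0), size=n)  # random term, mean 1, variance 5
    return X, X @ beta + eps, signal
```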
In this example, the model size (MS), the numbers of true positives (TP) and false positives (FP), and the prediction performance are used as evaluation indices. The model size is the number of independent variables used to construct the prediction model; for a given prediction effect, the smaller the model size, the better. The numbers of true and false positives are, respectively, the numbers of variables with and without a true prediction effect among the independent variables used to construct the prediction model. The prediction performance is the index of predictive ability of the prediction model constructed from the screened variable combination: the area under the receiver operating characteristic curve (AUC) and the accuracy (Acc) are used for the logistic model, and the concordance index (C-index) and the AUC at the median survival time are used for the Cox model. The prediction performance indices range over [0, 1], and larger values indicate better prediction.
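For completeness, counting the model size and the true/false positives of a screened subset against the known informative variables is straightforward (a sketch):

```python
def evaluation_counts(selected, true_signal):
    """Model size (MS), true positives (TP) and false positives (FP)."""
    selected, true_signal = set(selected), set(true_signal)
    ms = len(selected)
    tp = len(selected & true_signal)   # kept variables with a true effect
    return ms, tp, ms - tp             # FP = kept variables without a true effect
```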
Table 1 shows the logistic regression simulation results at the model size automatically identified by the high-dimensional medical data feature subset screening method provided by the invention, and Table 2 shows the corresponding Cox regression simulation results. According to Tables 1 and 2, whether based on the logistic model or on the Cox model, the feature subset screening method provided by the invention achieves better prediction performance than the other methods, with a relatively small model size.
TABLE 1 logistic regression simulation results at automatically identified model sizes
N=200 Method p=1000 p=5000 p=10000
AUC glmnet 0.797(0.051) 0.769(0.067) 0.762(0.057)
BeSS.gs 0.710(0.051) 0.703(0.072) 0.691(0.064)
BeSS.seq 0.786(0.071) 0.773(0.086) 0.762(0.085)
mBeSS 0.834(0.044) 0.815(0.064) 0.798(0.067)
Acc glmnet 0.724(0.036) 0.702(0.051) 0.699(0.044)
BeSS.gs 0.686(0.035) 0.674(0.058) 0.666(0.052)
BeSS.seq 0.711(0.057) 0.701(0.071) 0.692(0.068)
mBeSS 0.750(0.038) 0.735(0.054) 0.721(0.054)
MS glmnet 18.43(11.889) 18.38(12.405) 21.91(16.358)
BeSS.gs 19.00(2.934) 15.74(2.163) 14.99(1.691)
BeSS.seq 3.26(1.528) 2.79(1.559) 2.50(1.360)
mBeSS 5.79(2.124) 5.75(2.618) 5.12(2.724)
TP glmnet 5.34(0.742) 4.76(1.288) 4.70(1.150)
BeSS.gs 4.66(1.037) 4.05(1.41) 3.70(1.307)
BeSS.seq 3.14(1.518) 2.75(1.579) 2.41(1.371)
mBeSS 4.58(1.156) 4.03(1.337) 3.45(1.282)
FP glmnet 13.09(11.544) 13.62(11.787) 17.21(15.752)
BeSS.gs 14.34(3.232) 11.69(3.228) 11.29(2.571)
BeSS.seq 0.12(0.356) 0.04(0.197) 0.09(0.288)
mBeSS 1.21(1.725) 1.72(2.248) 1.67(2.270)
TABLE 2 Cox regression simulation results with automatically identified model sizes
N=200 Method p=1000 p=5000 p=10000
C-index glmnet 0.746(0.058) 0.743(0.062) 0.748(0.06)
BeSS.gs 0.750(0.045) 0.713(0.047) 0.711(0.040)
BeSS.seq 0.755(0.049) 0.747(0.059) 0.760(0.057)
mBeSS 0.766(0.035) 0.765(0.045) 0.776(0.043)
AUC glmnet 0.776(0.065) 0.773(0.071) 0.779(0.069)
BeSS.gs 0.780(0.051) 0.738(0.055) 0.735(0.050)
BeSS.seq 0.785(0.055) 0.777(0.067) 0.792(0.065)
mBeSS 0.797(0.041) 0.797(0.052) 0.809(0.049)
MS glmnet 9.240(5.725) 8.71(7.016) 10.37(6.934)
BeSS.gs 10.31(8.204) 23.12(9.540) 29.17(6.852)
BeSS.seq 4.20(1.310) 3.80(1.717) 4.03(1.586)
mBeSS 6.31(2.718) 6.00(2.947) 6.34(3.023)
TP glmnet 4.65(1.507) 4.21(1.641) 4.49(1.547)
BeSS.gs 4.95(0.968) 4.98(1.015) 5.04(0.909)
BeSS.seq 4.02(1.333) 3.60(1.563) 3.82(1.513)
mBeSS 4.83(1.006) 4.49(1.299) 4.67(1.155)
FP glmnet 4.59(4.905) 4.50(6.056) 5.88(5.899)
BeSS.gs 5.36(8.021) 18.14(9.666) 24.13(6.922)
BeSS.seq 0.18(0.411) 0.20(0.471) 0.21(0.433)
mBeSS 1.48(2.363) 1.51(2.513) 1.67(2.659)
Referring to table 3, the logistic regression simulation result of the high-dimensional medical data feature subset screening method provided by the invention under the fixed model size is shown; referring to table 4, the Cox regression simulation result of the high-dimensional medical data feature subset screening method provided by the invention under the fixed model size is shown; according to tables 3 and 4, the high-dimensional medical data feature subset screening method provided by the invention has a better prediction effect under the condition of different fixed variable numbers.
TABLE 3 logistic regression simulation results at fixed model size
N=100 Method k=2 k=4 k=6
p=1000 AUC BeSS 0.711(0.073) 0.757(0.084) 0.760(0.084)
mBeSS 0.726(0.064) 0.761(0.077) 0.767(0.076)
Acc BeSS 0.652(0.056) 0.689(0.066) 0.691(0.067)
mBeSS 0.663(0.048) 0.690(0.061) 0.698(0.060)
TP BeSS 1.54(0.610) 2.54(0.989) 2.97(1.150)
mBeSS 1.65(0.539) 2.61(0.909) 3.03(1.105)
FP BeSS 0.46(0.610) 1.46(0.989) 3.03(1.150)
mBeSS 0.35(0.539) 1.39(0.909) 2.97(1.105)
p=5000 AUC BeSS 0.676(0.090) 0.684(0.100) 0.674(0.093)
mBeSS 0.672(0.089) 0.691(0.100) 0.690(0.089)
Acc BeSS 0.625(0.070) 0.632(0.077) 0.626(0.071)
mBeSS 0.623(0.068) 0.638(0.077) 0.637(0.071)
TP BeSS 1.25(0.702) 1.69(1.032) 1.85(1.048)
mBeSS 1.21(0.701) 1.73(1.053) 1.96(0.994)
FP BeSS 0.75(0.702) 2.31(1.032) 4.15(1.048)
mBeSS 0.79(0.701) 2.27(1.053) 4.04(0.994)
p=10000 AUC BeSS 0.644(0.094) 0.636(0.103) 0.631(0.099)
mBeSS 0.644(0.099) 0.647(0.099) 0.663(0.092)
Acc BeSS 0.601(0.073) 0.596(0.078) 0.594(0.074)
mBeSS 0.600(0.078) 0.606(0.073) 0.615(0.071)
TP BeSS 0.97(0.717) 1.17(1.045) 1.34(1.112)
mBeSS 0.97(0.745) 1.25(0.978) 1.63(0.981)
FP BeSS 1.03(0.717) 3.83(1.045) 4.66(1.112)
mBeSS 1.03(0.745) 3.75(0.978) 4.37(0.981)
TABLE 4 Cox regression simulation results at fixed model size
N=100 Method k=2 k=4 k=6
p=1000 C-index BeSS 0.680(0.052) 0.734(0.072) 0.766(0.068)
mBeSS 0.685(0.053) 0.747(0.062) 0.766(0.069)
AUC BeSS 0.701(0.06) 0.761(0.082) 0.795(0.075)
mBeSS 0.705(0.061) 0.775(0.071) 0.795(0.076)
TP BeSS 1.59(0.570) 2.94(0.993) 4.03(1.235)
mBeSS 1.69(0.506) 3.07(0.924) 4.02(1.310)
FP BeSS 0.41(0.570) 1.06(0.993) 1.97(1.235)
mBeSS 0.31(0.506) 0.93(0.924) 1.98(1.310)
p=5000 C-index BeSS 0.669(0.066) 0.711(0.08) 0.714(0.091)
mBeSS 0.686(0.061) 0.717(0.083) 0.739(0.082)
AUC BeSS 0.688(0.076) 0.733(0.088) 0.737(0.100)
mBeSS 0.708(0.069) 0.739(0.094) 0.763(0.094)
TP BeSS 1.44(0.656) 2.47(1.141) 2.92(1.555)
mBeSS 1.61(0.618) 2.59(1.164) 3.31(1.433)
FP BeSS 0.56(0.656) 1.53(1.141) 3.08(1.555)
mBeSS 0.39(0.618) 1.41(1.164) 2.79(1.433)
p=10000 C-index BeSS 0.651(0.083) 0.680(0.088) 0.698(0.096)
mBeSS 0.654(0.086) 0.698(0.092) 0.710(0.100)
AUC BeSS 0.669(0.096) 0.700(0.100) 0.718(0.107)
mBeSS 0.671(0.096) 0.719(0.104) 0.733(0.113)
TP BeSS 1.29(0.701) 2.07(1.103) 2.57(1.486)
mBeSS 1.33(0.711) 2.36(1.159) 2.83(1.551)
FP BeSS 0.71(0.701) 1.93(1.103) 3.43(1.486)
mBeSS 0.67(0.711) 1.64(1.159) 3.17(1.551)
Referring to table 5, there are 4 real data examples utilized in the present embodiment; in this embodiment, the data is randomly divided into a training set and a test set, the training set containing two-thirds of the observations and the test set containing the remaining observations. The same evaluation index as the simulation experiment was used to compare the true data results.
TABLE 5 brief introduction to real data
Data name Number of observations Number of independent variables Outcome type Data source
gravier 168 2905 Two categories https://github.com/ramhiser/datamicroarray/wiki/Gravier-(2010)
psoriasis 170 18482 Two categories https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30999
comorbid 1467 344 Survival data UK Biobank (UKB)
10846 412 54677 Survival data https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10846
Referring to Table 6, which gives the results at the automatically identified model size on the real data: the results show that on high-dimensional omics data, compared with BeSS and glmnet (the R package implementing LASSO), the model prediction performance of the high-dimensional medical data feature subset screening method provided by the invention is usually better; and where the prediction performance is similar, the method provided by the invention is able to find a smaller model.
TABLE 6 automatic identification of model size results for real data
Data Method glmnet BeSS.gs BeSS.seq mBeSS
gravier MS 11.28(7.027) 10.66(2.392) 8.97(1.087) 5.63(3.472)
AUC 0.738(0.09) 0.695(0.069) 0.681(0.07) 0.740(0.074)
Acc 0.715(0.074) 0.711(0.057) 0.71(0.061) 0.722(0.055)
psoriasis MS 8.77(3.33) 3.03(1.93) 4.36(2.439) 5.75(2.928)
AUC 0.968(0.028) 0.882(0.156) 0.966(0.028) 0.973(0.028)
Acc 0.951(0.035) 0.85(0.173) 0.95(0.029) 0.956(0.03)
comorbid MS 9.3(5.668) 14.24(15.992) 2.3(0.81) 7.66(4.207)
C-index 0.641(0.023) 0.578(0.041) 0.62(0.019) 0.636(0.024)
AUC 0.624(0.029) 0.568(0.037) 0.595(0.023) 0.614(0.029)
10846 MS 3.94(5.626) 34.12(10.829) 1.24(0.495) 4.39(4.087)
C-index 0.586(0.073) 0.611(0.041) 0.624(0.039) 0.631(0.039)
AUC 0.589(0.078) 0.618(0.053) 0.626(0.052) 0.63(0.048)
Referring to table 7, the results for the real data at a fixed number are shown; the result shows that under the condition of fixed number, the feature subset identified by the high-dimensional medical data feature subset screening method provided by the invention has better prediction performance.
TABLE 7 results of real data at fixed model size
Data Method k=2 k=4 k=6
gravier AUC BeSS 0.673(0.094) 0.713(0.097) 0.740(0.082)
mBeSS 0.709(0.089) 0.729(0.082) 0.736(0.072)
Acc BeSS 0.692(0.066) 0.707(0.069) 0.720(0.058)
mBeSS 0.709(0.058) 0.721(0.059) 0.721(0.057)
k=2 k=4 k=6
psoriasis AUC BeSS 0.698(0.184) 0.78(0.192) 0.882(0.147)
mBeSS 0.919(0.149) 0.971(0.056) 0.968(0.034)
Acc BeSS 0.673(0.189) 0.753(0.204) 0.862(0.160)
mBeSS 0.901(0.148) 0.96(0.059) 0.955(0.033)
k=3 k=5 k=8
comorbid C-index BeSS 0.625(0.022) 0.634(0.02) 0.644(0.021)
mBeSS 0.627(0.023) 0.637(0.021) 0.644(0.02)
AUC BeSS 0.603(0.028) 0.612(0.026) 0.624(0.026)
mBeSS 0.605(0.029) 0.615(0.028) 0.625(0.026)
k=2 k=4 k=5
10846 C-index BeSS 0.607(0.05) 0.621(0.053) 0.626(0.054)
mBeSS 0.616(0.046) 0.625(0.043) 0.630(0.053)
AUC BeSS 0.608(0.057) 0.627(0.06) 0.628(0.060)
mBeSS 0.619(0.056) 0.628(0.052) 0.633(0.058)
According to the above experimental data, the high-dimensional medical data feature subset screening method mBeSS provided by the invention can not only efficiently search for variable combinations of a given size in high-dimensional medical data (especially small samples), but can also automatically identify the size of the optimal feature subset; the identified feature subset is not easily affected by overfitting and has good extrapolation ability. Its predictive power and model size are usually better than those of the common methods BeSS and LASSO. High-dimensional data are extremely common in medical research, and the method provided by the invention has high computation speed, good prediction effect, strong interpretability and broad application prospects.
Based on the above embodiments, the high-dimensional medical data feature subset screening method provided by the invention constructs a consistency scoring model based on the prediction performance and the occurrence frequency of the feature variables. When the number K of feature variables in the optimal feature subset is a fixed value, the candidate feature subset with the highest consistency score is obtained as the optimal feature subset. When the number K of feature variables in the optimal feature subset is unknown, the candidate feature subset with the highest consistency score is obtained as the preferred K feature subset for the current number of feature variables K, and a truncated mean is calculated as the prediction performance of the preferred K feature subset. If the prediction performance of the preferred K feature subset converges, the current preferred K feature subset is output as the optimal feature subset, and the number of feature variables of the optimal feature subset is determined to be K; if it does not converge, K=K+1 is updated, a plurality of pairs of training sets and corresponding test sets are obtained again, and a new preferred K feature subset is selected, until the prediction performance of the preferred K feature subset converges.
According to the high-dimensional medical data feature subset screening method, when the number of feature variables in the optimal feature subset is unknown, a sampling strategy, consistency scores and prediction performance are fused on the basis of the BeSS algorithm to determine the number and the composition of the feature variables of the optimal feature subset. The number of feature variables is initialized to K = 1 and incremented (K = K + 1) iteratively, obtaining the prediction performance of the feature subset for each number of feature variables until the prediction performance converges, which yields the current number of feature variables K and the corresponding optimal feature subset. Because the number of feature variables in the optimal feature subset is identified iteratively, the identification result is not easily affected by overfitting, and the method has good extrapolation ability, high computation speed and strong interpretability. The invention builds a consistency scoring model based on the occurrence frequency of the feature variables and the prediction performance: the occurrence frequency of each variable across all variable combinations is counted, each variable is assigned a score, and the sum of the scores of the different variables in each combination is calculated, which ensures that the selected combination is stable; integrating the prediction performance at the same time ensures that the selected combination also has good predictive ability. The invention can also fix the number of feature variables in the optimal feature subset by presetting the value of K, so that the optimal feature subset can be obtained with a fixed number of feature variables. The high-dimensional medical data feature subset screening method can automatically identify the number of feature variables of the optimal feature subset and can also obtain the optimal feature subset when that number is fixed, which broadens its applicable scenarios and gives it good application prospects.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and are not intended to limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to exhaustively enumerate all embodiments here; obvious variations or modifications derived therefrom by those skilled in the art remain within the scope of the invention.

Claims (9)

1. A method for screening a feature subset of high-dimensional medical data, comprising, when the number K of feature variables in an optimal feature subset is unknown:
S1: acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a test set based on a random sampling strategy; repeating the division to obtain a plurality of pairs of training sets and corresponding test sets;
S2: selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; acquiring, based on the plurality of training sets, a plurality of candidate feature subsets to form a candidate feature set; wherein the initial value of K is 1;
S3: calculating the prediction performance of each candidate feature subset on its corresponding test set by using a preset regression prediction model;
S4: constructing a consistency scoring model based on the prediction performance of the candidate feature subsets and the occurrence frequency, in the candidate feature set, of each feature variable of the candidate feature subsets, and acquiring the candidate feature subset with the highest consistency score as the preferred K-feature subset for the current number K of feature variables;
S5: calculating the trimmed mean of the prediction performances of all candidate feature subsets in the candidate feature set as the prediction performance of the preferred K-feature subset;
S6: judging whether the prediction performance of the preferred K-feature subset has converged:
if it has converged, outputting the current preferred K-feature subset as the optimal feature subset, and determining the number of feature variables of the optimal feature subset to be K;
if it has not converged, updating K=K+1 and returning to step S1 to obtain a plurality of pairs of training sets and corresponding test sets and to select a new preferred K-feature subset, until the prediction performance of the preferred K-feature subset converges, whereupon the current preferred K-feature subset is taken as the optimal feature subset and the number of feature variables of the optimal feature subset is determined to be K.
2. The method of claim 1, wherein the consistency scoring model constructed based on the prediction performance of the candidate feature subsets and the occurrence frequency, in the candidate feature set, of each feature variable of the candidate feature subsets is expressed as:
$$S_j \;=\; \omega\sum_{i=1}^{K} f\!\left(X_i^{j}\right) \;+\; \left(1-\omega\right)P_j, \qquad f\!\left(X_i^{j}\right) \;=\; \frac{1}{M}\sum_{m=1}^{M} I\!\left(X_i^{j}\in F_m\right)$$
wherein $S_j$ denotes the consistency score of the $j$-th candidate feature subset, the $j$-th candidate feature subset being acquired from the $j$-th training set; $X_i^{j}$ denotes the $i$-th feature variable in the $j$-th candidate feature subset, $i=1,\dots,K$, and $K$ denotes the total number of feature variables in the candidate feature subset; $F_m$ denotes the $m$-th candidate feature subset in the candidate feature set; $f(X_i^{j})$ denotes the occurrence frequency of the feature variable $X_i^{j}$ in the candidate feature set, the indicator $I(X_i^{j}\in F_m)$ being 1 if $F_m$ comprises $X_i^{j}$ and 0 otherwise; $M$ denotes the number of candidate feature subsets in the candidate feature set, $m=1,\dots,M$, $j=1,\dots,M$; $P_j$ denotes the prediction performance of the $j$-th candidate feature subset; and $\omega$ denotes the weight.
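Purely as a worked numerical illustration of this scoring rule, and not as part of the claim, the following Python fragment evaluates the score for a handful of invented candidate subsets; the subsets, prediction performances and weight are made-up example values.

import numpy as np

candidate_subsets = [(0, 3, 7), (0, 3, 9), (0, 3, 7), (1, 3, 7)]   # M = 4 subsets, K = 3
performances = [0.81, 0.78, 0.83, 0.74]                            # P_j on the paired test sets
omega = 0.5                                                        # weight (assumed value)

M = len(candidate_subsets)
# f(v): proportion of candidate subsets that contain feature variable v
freq = {v: sum(v in s for s in candidate_subsets) / M
        for s in candidate_subsets for v in s}

def consistency_score(subset, perf):
    # frequency term summed over the subset's variables, plus weighted prediction performance
    return omega * sum(freq[v] for v in subset) + (1 - omega) * perf

scores = [consistency_score(s, p) for s, p in zip(candidate_subsets, performances)]
best = candidate_subsets[int(np.argmax(scores))]                   # subset with the highest score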
3. The method of claim 1, wherein the acquiring a high-dimensional medical data set, dividing the high-dimensional medical data set into a training set and a test set based on a random sampling strategy, and repeating the division to obtain a plurality of pairs of training sets and corresponding test sets comprises:
randomly extracting a preset number of samples from the high-dimensional medical data set to form a training set, the remaining samples forming the corresponding test set;
and repeating the random extraction a plurality of times to obtain a plurality of training sets and a plurality of corresponding test sets.
4. The method for screening a feature subset of high-dimensional medical data according to claim 1, wherein the selecting, by using a BeSS algorithm, K feature variables from each training set to form a candidate feature subset comprises:
assigning, based on the BeSS algorithm and by means of a Taylor expansion, a contribution value to each feature variable in the training set; and selecting the feature variables corresponding to the K largest contribution values, in descending order of contribution value, to form the candidate feature subset.
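As a non-limiting sketch of the contribution-value idea (not the full BeSS primal-dual active-set algorithm), the fragment below scores each feature variable by a second-order Taylor expansion of the logistic log-likelihood at the null model and keeps the K highest-scoring variables; the data and the specific scoring formula g_i^2/(2*h_i) are illustrative assumptions.

import numpy as np

def taylor_contributions(X, y):
    # Contribution of each feature ~ g_i^2 / (2 h_i): squared gradient over curvature of
    # the logistic log-likelihood with respect to beta_i, evaluated at the null model.
    p0 = np.full(len(y), y.mean())           # null-model predicted probability
    g = X.T @ (y - p0)                       # gradient of the log-likelihood
    h = (X ** 2).T @ (p0 * (1 - p0))         # diagonal of the negative Hessian
    return g ** 2 / (2 * h)

def select_top_k(X, y, k):
    scores = taylor_contributions(X, y)
    return np.sort(np.argsort(-scores)[:k])  # indices of the K largest contributions

# Illustrative use on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))
y = (X[:, 3] - X[:, 7] + rng.standard_normal(100) > 0).astype(float)
print(select_top_k(X, y, k=2))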
5. The method of claim 1, wherein the preset regression prediction model is a logistic model if the outcome of the high-dimensional medical data is a classification outcome.
6. The method of claim 5, wherein the calculating the prediction performance of each candidate feature subset on its corresponding test set comprises:
training a logistic model by using the training set corresponding to the candidate feature subset;
making predictions with the trained logistic model on the test set corresponding to the candidate feature subset to obtain the prediction performance;
wherein the prediction performance includes the accuracy (Acc) of the logistic model and the area under the receiver operating characteristic curve (AUC).
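A minimal sketch of this evaluation step for a classification outcome, assuming scikit-learn's LogisticRegression as the logistic model and placeholder variable names, might look as follows.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_classification_subset(X_train, y_train, X_test, y_test, subset):
    # Fit the logistic model on the training set restricted to the candidate subset,
    # then score it on the paired test set.
    model = LogisticRegression(max_iter=1000).fit(X_train[:, subset], y_train)
    acc = accuracy_score(y_test, model.predict(X_test[:, subset]))
    auc = roc_auc_score(y_test, model.predict_proba(X_test[:, subset])[:, 1])
    return acc, auc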
7. The method of claim 1, wherein the preset regression prediction model is a Cox model if the outcome of the high-dimensional medical data is a survival outcome.
8. The method of claim 7, wherein the calculating the prediction performance of each candidate feature subset on its corresponding test set comprises:
training a Cox model by using the training set corresponding to the candidate feature subset;
making predictions with the trained Cox model on the test set corresponding to the candidate feature subset to obtain the prediction performance;
wherein the prediction performance includes the concordance index (C-index) of the Cox model and the area under the receiver operating characteristic curve (AUC) with respect to the median survival time.
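A minimal sketch of this evaluation step for a survival outcome, assuming the lifelines package for the Cox model and assumed column names for the survival time and event indicator, might look as follows; only the C-index part of the stated prediction performance is shown.

from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def evaluate_survival_subset(train_df, test_df, subset_cols,
                             duration_col="time", event_col="event"):
    # Fit the Cox model on the training set restricted to the candidate subset,
    # then compute the C-index on the paired test set.
    cols = list(subset_cols) + [duration_col, event_col]
    cph = CoxPHFitter().fit(train_df[cols], duration_col=duration_col, event_col=event_col)
    # Higher partial hazard implies shorter expected survival, hence the minus sign.
    risk = cph.predict_partial_hazard(test_df[list(subset_cols)])
    return concordance_index(test_df[duration_col], -risk, test_df[event_col])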
9. A method for screening a feature subset of high-dimensional medical data, characterized in that, when the number K of feature variables in the optimal feature subset is a fixed value, the method comprises:
acquiring a high-dimensional medical data set, and dividing the high-dimensional medical data set into a training set and a test set based on a random sampling strategy; repeating the division to obtain a plurality of pairs of training sets and corresponding test sets;
selecting K feature variables from each training set by using a BeSS algorithm to form a candidate feature subset; acquiring, based on the plurality of training sets, a plurality of candidate feature subsets to form a candidate feature set;
calculating the prediction performance of each candidate feature subset on its corresponding test set by using a preset regression prediction model;
and constructing a consistency scoring model based on the prediction performance of the candidate feature subsets and the occurrence frequency, in the candidate feature set, of each feature variable of the candidate feature subsets, and acquiring the candidate feature subset with the highest consistency score as the optimal feature subset.
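Under the same assumptions as the earlier illustrative sketches (a scikit-learn stand-in for the BeSS step, AUC as the prediction performance, invented split count and weight), the fixed-K variant reduces to a single pass with no convergence loop, for example:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def screen_fixed_k(X, y, k, n_splits=50, weight=0.5):
    subsets, perfs = [], []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y)
        coef = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).coef_.ravel()
        s = list(np.sort(np.argsort(-np.abs(coef))[:k]))    # stand-in for the BeSS step
        model = LogisticRegression(max_iter=1000).fit(X_tr[:, s], y_tr)
        subsets.append(s)
        perfs.append(roc_auc_score(y_te, model.predict_proba(X_te[:, s])[:, 1]))
    freq = {v: sum(v in t for t in subsets) / n_splits
            for t in subsets for v in t}                     # occurrence frequencies
    scores = [weight * sum(freq[v] for v in s) + (1 - weight) * p
              for s, p in zip(subsets, perfs)]               # consistency scores
    return subsets[int(np.argmax(scores))]                   # optimal feature subset for fixed K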
CN202311824917.9A 2023-12-28 2023-12-28 High-dimensional medical data feature subset screening method Active CN117497198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311824917.9A CN117497198B (en) 2023-12-28 2023-12-28 High-dimensional medical data feature subset screening method

Publications (2)

Publication Number Publication Date
CN117497198A true CN117497198A (en) 2024-02-02
CN117497198B CN117497198B (en) 2024-03-01

Family

ID=89680375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311824917.9A Active CN117497198B (en) 2023-12-28 2023-12-28 High-dimensional medical data feature subset screening method

Country Status (1)

Country Link
CN (1) CN117497198B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971240A (en) * 2017-03-16 2017-07-21 河海大学 The short-term load forecasting method that a kind of variables choice is returned with Gaussian process
CN110765418A (en) * 2019-10-09 2020-02-07 清华大学 Intelligent set evaluation method and system for basin water and sand research model
CN114334033A (en) * 2021-12-31 2022-04-12 广东海洋大学 Screening method, system and terminal for molecular descriptors of anti-breast cancer candidate drugs
CN114724715A (en) * 2022-04-12 2022-07-08 南京邮电大学 Multi-machine learning model feature selection method based on optimal AUC

Also Published As

Publication number Publication date
CN117497198B (en) 2024-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant