CN110110753B - Effective mixed characteristic selection method based on elite flower pollination algorithm and ReliefF - Google Patents


Info

Publication number
CN110110753B
CN110110753B (application CN201910266518.2A)
Authority
CN
China
Prior art keywords
individual
population
solution
elite
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910266518.2A
Other languages
Chinese (zh)
Other versions
CN110110753A (en)
Inventor
阎朝坤
罗慧敏
张戈
马敬敬
王建林
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University
Priority to CN201910266518.2A
Publication of CN110110753A
Application granted
Publication of CN110110753B
Legal status: Active

Links

Classifications

    • G06F18/2111 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing): Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G06F18/2115 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing): Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • G06F18/24 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing): Classification techniques
    • G16H50/70 (Physics; Healthcare informatics): ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients
    • Y02A90/10 (Technologies for adaptation to climate change): Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention provides an effective hybrid feature selection method based on an elite flower pollination algorithm and ReliefF. The method comprises the following steps: step 1, initializing a population of M individuals using a dual-initial-population strategy based on ReliefF feature ranking and randomization; step 2, updating the population with a binary elite flower pollination algorithm and calculating the fitness value of each individual to obtain the global optimal solution in the population; step 3, searching the neighborhood of the global optimal solution with a tabu search algorithm to determine candidate solutions, and updating the tabu list according to the fitness values of the candidate solutions; step 4, selecting the individual with the largest fitness value from the tabu list as the elite individual, and replacing the individual with the smallest fitness value in the population with this elite individual to form a new population; and step 5, taking steps 2 to 4 as one iteration, and repeating them until the current iteration count reaches the set number of iterations. Compared with other feature selection methods, the present invention achieves high classification accuracy.

Description

Effective mixed characteristic selection method based on elite flower pollination algorithm and ReliefF
Technical Field
The invention relates to the technical field of bioinformatics, in particular to an effective hybrid feature selection method based on the elite flower pollination algorithm and ReliefF.
Background
With the rapid development of biomedical technology and key technologies in the health field, a great deal of bioinformatic and clinical medical data, especially molecular-biology experimental data, has accumulated at an unprecedented speed and scale. This medical big data contains a large amount of valuable information; mining it reveals disease-related incidence patterns and risk factors, and the interactions among them, providing references for the clinical diagnosis and treatment of diseases. In recent years, researchers in related fields have analyzed microarray data, and comparative analysis of experimental results has demonstrated the feasibility and effectiveness of such methods, providing substantial theoretical support for research in this field. However, because gene data contain considerable noise and redundant genes, selecting some redundant genes is unavoidable when screening for effective genes; at the same time, the high dimensionality of gene data makes computation complex and time-consuming, so the efficiency of characteristic-gene selection is low. To address these problems, starting from the characteristics of gene microarray data, machine-learning methods have been used to analyze and process tumor gene data sets, several feature selection algorithms with high classification precision have been proposed, and their effectiveness has been verified through comparison of experimental results. Feature selection has therefore been widely applied in bioinformatics, and has important research significance and value for disease diagnosis and clinical treatment.
Feature selection techniques first appeared in the 1960s; their essential aim is to select, from the feature set of the raw data, an optimal feature subset that meets certain criteria for classification or regression tasks. They were mainly used to solve problems in statistics, signal processing and related areas. Because early feature selection algorithms did not consider the relations between features and the class, or among the features themselves, they performed poorly in applications. Since the 1990s, machine learning on large-scale data has come into view and traditional feature selection algorithms have been severely challenged, so an efficient global search technique is urgently needed to better solve the feature selection problem; evolutionary computation, known for its global search capability, has recently received widespread attention from the feature selection community. However, there is still no comprehensive guideline on the advantages and disadvantages of the alternative methods and their most suitable fields of application. Researchers continually try to optimize machine-learning and random-strategy algorithms while introducing new intelligent algorithms to improve computational efficiency and the quality of the selected feature subsets.
At present, feature selection algorithms can be classified by search strategy into three main types: feature selection algorithms based on exhaustive search strategies, feature selection algorithms based on random search strategies, and feature selection algorithms based on metaheuristic search strategies.
(1) Feature selection algorithms based on exhaustive search strategies: exhaustive enumeration and branch-and-bound are the methods mainly employed for global optimization. The exhaustive method, also called exhaustive search, selects a satisfactory optimal feature subset by examining every feature subset, e.g. by backtracking; because it traverses all feature sets, it can find the globally optimal feature subset. However, if the number of original features is large, the search space becomes correspondingly large and the execution efficiency of exhaustive search drops, making it impractical. The branch-and-bound method shortens the search time through pruning, and is currently the only global-search method guaranteed to obtain the optimal result, but it requires the size of the optimal feature subset to be preset before the search starts and the evaluation function to be monotonic. Moreover, when the features to be processed are high-dimensional, it must be executed many times; these requirements limit its application.
(2) Feature selection algorithms based on random search strategies: feature selection is combined in the search process with a Genetic Algorithm (GA), Simulated Annealing (SA), Tabu Search (TS) and the like, with probability and sampling processes as theoretical support. Each candidate feature is assigned a weight according to its usefulness for classification, the importance of the feature is judged against a defined or adaptively acquired threshold, and the features whose weights exceed the threshold are output. Random search methods take classification performance as the judgment criterion and can achieve good application results. However, their time complexity is high, and the output feature set cannot be guaranteed to be the optimal feature subset.
(3) Feature selection algorithms based on metaheuristic search strategies: these are approximation algorithms that trade off computational effort against search optimality. The optimal feature subset is generated through continuous iteration using reasonably designed heuristic rules. According to the initial feature set and the search direction, they can be classified into individual optimal feature selection, sequential forward selection, sequential backward selection, bidirectional selection, etc. Metaheuristic search has low complexity and high execution efficiency, and is very widely applied to practical problems. However, during feature selection, once a feature is deleted it cannot be restored, which may cause the algorithm to fall into a local optimum.
At present, feature selection algorithms can also be classified by evaluation strategy into four main types:
(1) Based on filtering (Filters)
Filter feature selection algorithms are completely independent of the classification algorithm, and are unrelated to its classification performance and other parameters. A filter feature selection algorithm may be regarded as a data preprocessing step. Filter algorithms often use an independent evaluation function, and different filter algorithms can be obtained by changing the evaluation function and the search mode. The versatility of filter algorithms makes them useful for various feature selection problems, but because the filter model easily ignores the relevance of features and the provided feature subsets may contain redundant information, the classification performance of the selected feature subsets is often lower than that of other algorithms. Common filter criteria include information gain, mutual information, the t-test, etc.
(2) Based on encapsulation (Wrappers)
The wrapper feature selection algorithm uses the classification performance of feature subsets to obtain the best feature subset. It combines the feature selection process with a learning algorithm to find the feature subset that optimizes the learning algorithm's classification performance. Unlike filter algorithms, the wrapper model typically relies on a classifier to select feature subsets, which yields higher classification accuracy, but it may overfit the optimal feature subset to the given classification task and learning algorithm, and its run time is long.
(3) Based on a Hybrid Algorithm
Filter and wrapper feature selection algorithms each have advantages and disadvantages, and hybrid feature selection algorithms provide a way to exploit the advantages of both. A typical hybrid feature selection algorithm evaluates feature subsets using both an independent evaluation function and a learning algorithm: a group of candidate optimal subsets is selected with the independent evaluation function, and the final optimal feature subset is then selected from the candidates with the learning algorithm.
(4) Based on embedded (Embedded Solutions)
Some learning algorithms have a fixed structure in which feature selection can be embedded, which can be used to construct embedded feature selection algorithms. The embedded model can combine the characteristics of both the filter and wrapper models. For example, in a decision tree the basic unit, the node, has a selection function: each node selects the feature with the highest classification capability, so the tree-generation process is also a feature selection process. However, constructing a mathematical model of an embedded feature selection classifier is quite complex.
In summary, selecting the most valuable subset of features from the original input data, and improving the classification accuracy as much as possible is the goal that the feature selection algorithm needs to achieve. However, many intelligent algorithms currently do not cover both of these objectives.
Disclosure of Invention
Aiming at the problem that existing feature selection algorithms cannot simultaneously achieve the two goals of selecting the most valuable optimal feature subset of relevant features from the original input data and improving classification accuracy as much as possible, the invention provides an effective hybrid feature selection method based on the elite flower pollination algorithm and ReliefF, which further improves the classification accuracy while selecting the optimal feature subset.
The effective hybrid feature selection method based on the elite flower pollination algorithm and ReliefF provided by the invention comprises the following steps:
step 1, initializing a population consisting of M individuals by adopting a dual-initial-population strategy based on ReliefF feature ranking and randomization;
step 2, updating the population by adopting a binary elite pollination algorithm, and calculating the fitness value of each individual in the population by adopting a set fitness function to obtain a global optimal solution in the population;
step 3, searching the neighborhood of the global optimal solution by adopting a tabu search algorithm according to a set tabu list to determine a candidate solution, and updating the tabu list according to the fitness value of the candidate solution;
step 4, selecting an individual with the largest fitness value from the tabu list as an elite individual, and replacing the individual with the smallest fitness value in the population with the elite individual to form a new population;
and 5, taking the steps 2 to 4 as one iteration, and repeating the steps 2 to 4 until the current iteration number reaches the set iteration number.
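The five-step loop above can be sketched end to end as follows. This is a minimal illustration with stubbed operators: the toy fitness, the bit-flip update standing in for the binary elite flower pollination update, and all function names are hypothetical, not from the patent.

```python
import random

def run_hybrid_selection(n_features=20, pop_size=10, max_iter=5, seed=0):
    """Sketch of the ReliefF-EFPATS main loop (steps 1-5) with stub operators."""
    rng = random.Random(seed)
    # Step 1: dual initialization (stub: purely random 0/1 individuals).
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    fitness = lambda ind: sum(ind[:3]) - 0.01 * sum(ind)  # toy fitness stub

    for _ in range(max_iter):                       # step 5: fixed iteration budget
        # Step 2: population update (stub: flip one random bit per individual).
        for ind in pop:
            ind[rng.randrange(n_features)] ^= 1
        best = max(pop, key=fitness)                # global optimal solution
        # Step 3: tabu-style neighborhood search around the best solution.
        tabu = []
        for _ in range(3):
            cand = best[:]
            cand[rng.randrange(n_features)] ^= 1
            if cand not in tabu:
                tabu.append(cand)
        # Step 4: elite replacement of the worst individual.
        elite = max(tabu, key=fitness)
        worst_idx = min(range(pop_size), key=lambda i: fitness(pop[i]))
        pop[worst_idx] = elite
    return max(pop, key=fitness)

best = run_hybrid_selection()
```

Each real component is expanded in the corresponding step below; the skeleton only shows how the four phases compose into one iteration.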
Further, the step 1 specifically includes:
step 1.1, dividing M individuals into two populations on average: a first population and a second population;
step 1.2, initializing the first population by a randomization process to form the first type of initial solution, specifically: for the feature X_ij at position j of individual i in the first population, randomly generate a random number r ∈ [0, 1]; if r is smaller than the set initialization probability P, the feature X_ij is selected, otherwise X_ij is not selected; for each individual, set the selected features to 1 and the unselected features to 0; the solution formed by the initialized first population is used as the first type of initial solution;
step 1.3, initializing the second population by a weight-ranking process to form the second type of initial solution, specifically: calculate the weight of each feature corresponding to each individual in the second population according to the set ReliefF weight formula, and for each individual randomly select several features from the topN features with the largest weights; the solution formed by the initialized second population is used as the second type of initial solution;
and 1.4, merging the first type initial solution and the second type initial solution to obtain an initial optimal solution of the population.
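Step 1 can be sketched as follows. The ReliefF weights are replaced by a random stand-in; the parameter names P, topN and M follow the text, while everything else is an illustrative assumption.

```python
import random

def dual_initialize(n_features, M, P=0.5, topN=10, weights=None, seed=0):
    """Dual-initial-population strategy (step 1): half random, half weight-ranked."""
    rng = random.Random(seed)
    # Stand-in for ReliefF weights; the real method computes these per feature.
    if weights is None:
        weights = [rng.random() for _ in range(n_features)]
    # Step 1.2: first half of the population, each bit set with probability P.
    first = [[1 if rng.random() < P else 0 for _ in range(n_features)]
             for _ in range(M // 2)]
    # Step 1.3: second half, features drawn only from the topN highest-weighted ones.
    ranked = sorted(range(n_features), key=lambda j: weights[j], reverse=True)[:topN]
    second = []
    for _ in range(M - M // 2):
        ind = [0] * n_features
        for j in rng.sample(ranked, rng.randint(1, topN)):
            ind[j] = 1
        second.append(ind)
    # Step 1.4: merge both halves into the initial population.
    return first + second

pop = dual_initialize(n_features=30, M=8)
```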
Further, the set initialization probability P is calculated according to formulas (3) and (4):

P = S(x_i^j(t)) = 1 / (1 + e^(−A · x_i^j(t)))    (3)

A = C_1 + C_2 · (t / T)    (4)

where x_i^j(t) represents the j-th feature value in individual i at the t-th iteration, A represents the adaptive conversion factor, C_1 and C_2 represent the variation factors, and T represents the set number of iterations.
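Since the images of formulas (3) and (4) are not reproduced in the text, the sketch below assumes a standard sigmoid transfer with a linearly growing adaptive factor A = C1 + C2·t/T; the exact schedule in the patent may differ.

```python
import math

def adaptive_factor(t, T, C1=1.0, C2=2.0):
    """Assumed linear schedule for the adaptive conversion factor A (formula (4))."""
    return C1 + C2 * t / T

def selection_probability(x, t, T, C1=1.0, C2=2.0):
    """Sigmoid transfer of a continuous position into a selection probability (formula (3))."""
    A = adaptive_factor(t, T, C1, C2)
    return 1.0 / (1.0 + math.exp(-A * x))

# The sigmoid sharpens as t grows, strengthening the transformation late in the run.
p_early = selection_probability(0.5, t=1, T=100)
p_late = selection_probability(0.5, t=100, T=100)
```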
Further, in step 2, the updating the population by using the binary elite pollination algorithm specifically includes:
if cross pollination is used, the individuals in the population are updated according to formula (5):

x_i^(t+1) = x_i^t + γ · L(λ) · (F − x_i^t)    (5)

L(λ) ~ (λ · Γ(λ) · sin(πλ/2) / π) · 1 / s^(1+λ),  s ≫ 0    (6)

where x_i^(t+1) and x_i^t respectively represent the positions of individual i at the (t+1)-th and t-th iterations; F is the current global optimal solution; γ is a scale factor; L(λ) is the step size of the Lévy flight; Γ(λ) is the standard gamma function, λ ∈ [1, 2]; s is the moving step.
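Formula (5) is the standard global-pollination step of the flower pollination algorithm. The sketch below draws Lévy-distributed steps with Mantegna's algorithm, which is an assumption: the patent only names the Lévy distribution, not the sampling method.

```python
import math
import random

def levy_step(lam=1.5, rng=random):
    """Draw one Levy-flight step via Mantegna's algorithm (assumed sampler)."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / lam)

def cross_pollinate(x, f_best, gamma=0.1, lam=1.5, rng=random):
    """Global pollination (formula (5)): move x toward the global best F by a Levy step."""
    return [xi + gamma * levy_step(lam, rng) * (fi - xi)
            for xi, fi in zip(x, f_best)]

new_x = cross_pollinate([0.2, 0.8, 0.5], [1.0, 1.0, 0.0])
```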
Further, in step 2, the updating the population by using the binary elite pollination algorithm specifically further includes:
step 2.1, if the self-pollination operation is adopted, select the n best individuals from the population according to fitness value, randomly select an individual m and an individual k among them, and update individual i in the population according to formula (7) to obtain a new individual i:

x_i^(t+1) = x_i^t + A · (x_m^t − x_k^t)    (7)

where A is the adaptive conversion factor, x_m^t and x_k^t represent the positions of individual m and individual k at the t-th iteration; C_1 and C_2 represent the variation factors; T represents the set number of iterations.
step 2.2, calculating the fitness value of the new individual i according to the set fitness function; if the fitness value of the new individual i is larger than that of individual i before the update, replace the old individual i with the new one, otherwise discard the new individual i;
step 2.3, repeating the steps 2.1 to 2.2 until all individuals in the population are updated.
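Steps 2.1 to 2.3 can be sketched as follows, assuming formula (7) has the difference-based form x_i + A·(x_m − x_k); the constant A and the toy fitness are illustrative.

```python
import random

def self_pollinate(pop, fitness, n_best=3, A=0.5, rng=None):
    """Steps 2.1-2.3: difference-based update with greedy acceptance.

    Assumes formula (7) moves x_i by A times the difference of two
    individuals m, k drawn from the n best individuals.
    """
    rng = rng or random.Random(0)
    best = sorted(pop, key=fitness, reverse=True)[:n_best]
    new_pop = []
    for x in pop:
        m, k = rng.choice(best), rng.choice(best)
        cand = [xi + A * (mi - ki) for xi, mi, ki in zip(x, m, k)]
        # Step 2.2: keep the new individual only if its fitness improves.
        new_pop.append(cand if fitness(cand) > fitness(x) else x)
    return new_pop

pop = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.9]]
updated = self_pollinate(pop, fitness=sum)
```

The greedy acceptance in step 2.2 guarantees that no individual's fitness ever decreases during self-pollination.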
Further, in step 2, the set fitness function is specifically:

Acc = num_c / (num_c + num_i)

fitness = α · Acc + β · (N − n) / N

where Acc represents the classification accuracy of the samples, num_c denotes the number of correctly classified samples, num_i the number of misclassified samples, n the number of selected features for the individual whose fitness is being calculated, N the total number of features, α the weight of the classification accuracy, and β the weight of the feature selection, with α + β = 1.
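The fitness function can be transcribed directly; the linear combination of the accuracy term and the feature-reduction term shown here is the standard wrapper form and is assumed to match the patent's formula.

```python
def fitness(num_correct, num_wrong, n_selected, n_total, alpha=0.9):
    """Weighted fitness: classification accuracy plus a reward for fewer features.

    alpha weighs accuracy, beta = 1 - alpha weighs feature reduction
    (alpha = 0.9 is an illustrative choice, not from the patent).
    """
    beta = 1.0 - alpha
    acc = num_correct / (num_correct + num_wrong)
    return alpha * acc + beta * (n_total - n_selected) / n_total

# 90% accuracy with 20 of 200 features selected.
f = fitness(num_correct=90, num_wrong=10, n_selected=20, n_total=200)
```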
Further, the step 3 specifically includes:
step 3.1, setting the initialization parameters: a tabu list of length tabuLength, and the number numNeighbor of neighborhood solutions to generate;
step 3.2, selecting an initial solution, wherein the initial solution is an optimal solution generated by local search in a flower pollination algorithm in the current iteration process;
step 3.3, if the current iteration times are judged to be equal to the maximum iteration times, ending the iteration process, and taking the current optimal solution as a final optimal solution; otherwise, carrying out the step 3.4;
step 3.4, generating a neighborhood solution through the current solution to form a candidate solution;
step 3.5, if the candidate solution is not in the tabu list and its fitness value is larger than that of the initial solution, replace the initial solution with the candidate solution, add the candidate solution to the tabu list, and return to step 3.3; if the candidate solution is already in the tabu list, return to step 3.3 directly.
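Steps 3.1 to 3.5 can be sketched as follows; the single-bit-flip neighborhood and the toy fitness are illustrative assumptions.

```python
import random

def tabu_search(init, fitness, tabu_length=5, num_neighbors=8, max_iter=20, seed=0):
    """Steps 3.1-3.5: neighborhood search around the current best with a tabu list.

    Neighbors are single-bit flips of a binary solution (assumed neighborhood).
    """
    rng = random.Random(seed)
    tabu = []                      # step 3.1: tabu list of bounded length
    current = list(init)           # step 3.2: start from the FPA local-search optimum
    for _ in range(max_iter):      # step 3.3: stop at the maximum iteration count
        # Step 3.4: generate candidate neighborhood solutions.
        for _ in range(num_neighbors):
            cand = current[:]
            cand[rng.randrange(len(cand))] ^= 1
            # Step 3.5: accept an improving, non-tabu candidate.
            if cand not in tabu and fitness(cand) > fitness(current):
                current = cand
                tabu.append(cand)
                if len(tabu) > tabu_length:
                    tabu.pop(0)    # drop the oldest entry when the list is full
                break
    return current, tabu

best, tabu = tabu_search([0, 0, 0, 0, 1], fitness=sum)
```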
Further, the step 4 specifically includes:
step 4.1, sorting all individuals in the tabu list according to the fitness value;
step 4.2, storing the individual with the largest fitness value into elite population;
and 4.3, updating the elite population after the current iteration process is finished, replacing worst individuals in the population with elite individuals in the elite population, and carrying out the next iteration.
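Steps 4.1 to 4.3 amount to replacing the worst population member with the best individual found in the tabu list; a minimal sketch:

```python
def elite_replace(pop, tabu, fitness):
    """Steps 4.1-4.3: move the best tabu-list individual into the population,
    replacing the worst population member."""
    if not tabu:
        return pop
    elite = max(tabu, key=fitness)          # steps 4.1-4.2: best individual in the tabu list
    worst_idx = min(range(len(pop)), key=lambda i: fitness(pop[i]))
    new_pop = [ind[:] for ind in pop]
    new_pop[worst_idx] = elite[:]           # step 4.3: replace the worst individual
    return new_pop

pop = [[1, 0, 0], [0, 0, 0], [1, 1, 0]]
tabu = [[1, 1, 1], [0, 1, 1]]
new_pop = elite_replace(pop, tabu, fitness=sum)
```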
The invention has the beneficial effects that:
according to the elite pollination algorithm and ReliefF-based effective mixed feature selection method provided by the invention, the important feature subset is selected in the initialization process of feature ordering based on the ReliefF algorithm, and the elite strategy improves the convergence speed of the pollination algorithm. Besides the simplified redundancy feature in initialization, the characteristics of weak local searching capability and easy sinking into local optimum of the pollination algorithm are considered, so that the pollination algorithm is improved by adopting tabu searching and self-adaptive Gaussian variation strategies, the diversity of population can be increased, and the local searching performance is improved. The best feature subset searched is brought into a classification algorithm for classification verification in combination with 10-fold intersection. Tests on eight public biomedical datasets verify that the invention can effectively simplify the number of gene expression levels and achieve high classification accuracy compared to other feature selection methods.
Drawings
FIG. 1 is a first schematic flow chart of the effective hybrid feature selection method based on the elite flower pollination algorithm and ReliefF according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a randomly initialized solution X_ij according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of population initialization based on ReliefF ordering according to an embodiment of the present invention;
FIG. 4 is a second schematic flow chart of the effective hybrid feature selection method based on the elite flower pollination algorithm and ReliefF according to an embodiment of the present invention;
FIG. 5 is a schematic diagram comparing the average numbers of features selected by different intelligent algorithms on the same data sets according to an embodiment of the present invention;
FIG. 6 is a schematic diagram comparing the average fitness values of different intelligent algorithms on the same data sets according to an embodiment of the present invention;
FIG. 7 is a schematic diagram comparing the running times of different intelligent algorithms on the same data sets according to an embodiment of the present invention;
FIG. 8 is a graph comparing the average fitness values of ReliefF-EFPATS, EFPATS and BCFPA according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The feature selection method based on the elite flower pollination algorithm (EFPA) and ReliefF is referred to as the ReliefF-EFPATS method for short. As shown in FIG. 1, the ReliefF-EFPATS method provided by the invention comprises the following steps:
s101, initializing a population consisting of M individuals by adopting a dual-initial population strategy based on Relieff feature ordering and randomization;
specifically, the present step comprises the following sub-steps:
s1011, equally dividing M individuals into two populations: a first population and a second population;
s1012, initializing a first group by adopting a randomization process to form a first type of initial solution, wherein the method specifically comprises the following steps: for feature X at position j in individual i in the first population ij Randomly generating a random number r, r.epsilon.0, 1]Feature X if random number r is smaller than set initialization probability P ij Is selected, otherwise X ij Not selected; setting the selected features to 1 and the unselected features to 0 for each individual; the solution formed by the initialized first group is used as a first type initial solution;
s1013, initializing a second group by adopting a weight sorting process to form a second type initial solution, wherein the method specifically comprises the following steps: calculating the weight of each feature corresponding to each individual in the second population according to a set ReliefF weight formula, and randomly selecting a plurality of features from the front topN features with larger weight values for each individual; the solution formed by the initialized second group is used as a second type initial solution;
as an implementation manner, the set ReliefF weight formula specifically includes:
W(f) = W(f) − Σ_{j=1..k} diff(f, X_i, H_j) / (t·k) + Σ_{C ≠ class(X_i)} [ P(C) / (1 − P(class(X_i))) · Σ_{j=1..k} diff(f, X_i, M_j(C)) ] / (t·k)

where X is the training sample set, X_i ∈ {X_1, X_2, …, X_m}; Y is the class label set, Y = {Y_1, Y_2, …, Y_n}. An individual X_i of class Y_i (i ∈ n) is randomly selected from the training set. W(f) is the weight of feature f, t is the number of iterations, and diff(f, X_1, X_2) represents the difference between individuals X_1 and X_2 on feature f. First, k nearest-neighbor individuals H_j are found among the individuals of the same class as X_i; then k nearest-neighbor individuals M_j (j ∈ k) are found among the individuals of each different class. This step is repeated t times.
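A simplified sketch of the ReliefF update follows; it handles the binary-class case only, and the multi-class prior weighting P(C)/(1 − P(class(X_i))) from the formula is omitted for brevity.

```python
def relieff_weights(X, y, k=2, t=None):
    """Simplified ReliefF sketch: reward features that differ across classes
    (nearest misses) and penalize features that differ within a class (nearest hits)."""
    t = t or len(X)
    n_features = len(X[0])
    W = [0.0] * n_features
    diff = lambda f, a, b: abs(a[f] - b[f])
    dist = lambda a, b: sum(abs(ai - bi) for ai, bi in zip(a, b))
    for i in range(t):
        xi, yi = X[i % len(X)], y[i % len(X)]
        hits = sorted((x for x, c in zip(X, y) if c == yi and x is not xi),
                      key=lambda x: dist(x, xi))[:k]
        misses = sorted((x for x, c in zip(X, y) if c != yi),
                        key=lambda x: dist(x, xi))[:k]
        for f in range(n_features):
            W[f] -= sum(diff(f, xi, h) for h in hits) / (t * k)
            W[f] += sum(diff(f, xi, m) for m in misses) / (t * k)
    return W

# Feature 0 separates the classes; feature 1 is pure noise.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
y = [0, 0, 1, 1]
W = relieff_weights(X, y, k=1)
```

The discriminative feature ends up with a positive weight and the noise feature with a negative one, which is what the topN ranking in step 1.3 relies on.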
S1014, combining the first type initial solution and the second type initial solution to obtain an initial optimal solution of the population.
For example, M flowers are randomly selected as the initial population. For the feature selection problem, binary coding is typically used to represent feature subsets, and each flower is modeled as a binary string serving as a candidate solution. One bit in the binary string represents one feature corresponding to the flower, and the length of the string is the total number of features corresponding to the flower. A value of '1' at the j-th bit indicates that the j-th feature is selected, and a value of '0' indicates that it is not selected. M/2 individuals are randomly initialized; the initialization solution is shown in FIG. 2. The remaining M/2 individuals are formed by the weight ranking of the ReliefF algorithm using the weight calculation formula given above: first, the weights of all features are calculated for each individual, the top-ranked topN features are selected, and then M/2 different feature subsets are drawn from them to form M/2 different initial individuals. The rank-based initialized population is shown in FIG. 3.
In modeling each flower as a binary string, the following can be used:
the position of pollen is converted into a binary feature vector using Sigmoid function and a random strategy in the binary pollination algorithm BFPA. The Sigmoid function is a Sigmoid function, also known as an Sigmoid growth curve, which maps all variables between (0, 1). The pollen stud of each flower is then converted to a binary variable 0 or 1 in combination with a random strategy, with a "1" representing this corresponding feature being selected and a "0" representing the feature being discarded. Meanwhile, an adaptive conversion factor A is introduced into the sigmoid conversion function as a calculation formula for calculating the initialization probability P, as shown in the following formulas (3) and (4). The effect of introducing the adaptive transformation factor a is to enhance the uncertainty of converting a linear solution to a discrete solution, while also enhancing the ability to traverse the solution space. And at a later stage of the algorithm implementation, when converging on the optimal value, the transformation is enhanced to improve the search location.
wherein x_ij^t represents the j-th feature value of individual i at the t-th iteration, A represents the adaptive conversion factor, C_1 and C_2 represent variation factors, and T represents the set number of iterations. Whether each feature in the new individual is selected depends on the value of the sigmoid function.
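The sigmoid binarization with an adaptive conversion factor can be sketched as follows; since formulas (3) and (4) are not reproduced in the text, a simple linearly increasing factor A = C_1 + C_2·t/T is assumed purely for illustration:

```python
import numpy as np

def sigmoid_binarize(x, t, T, c1=1.0, c2=2.0, rng=None):
    """Convert a continuous pollen position x into a 0/1 feature vector.
    The adaptive factor A is assumed to grow with the iteration count t;
    the patent's exact formulas (3)-(4) are not reproduced here."""
    rng = rng if rng is not None else np.random.default_rng()
    A = c1 + c2 * t / T                  # illustrative adaptive conversion factor
    p = 1.0 / (1.0 + np.exp(-A * np.asarray(x, dtype=float)))
    # Random strategy: bit j is set to 1 with probability p_j.
    return (rng.random(p.shape) < p).astype(int)
```

As t approaches T the factor A grows, sharpening the sigmoid so that the conversion reacts more strongly near the optimum, matching the behavior described above.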
S102, updating the population by adopting a binary elite pollination algorithm, and calculating the fitness value of each individual in the population by adopting a set fitness function to obtain a global optimal solution in the population;
specifically, binary elite pollination algorithms are divided into cross pollination operations and self pollination operations (also known as adaptive gaussian mutations). The method comprises the following substeps:
if cross pollination is adopted, step S1021 is performed:
s1021, updating the individual i in the population according to formula (5):

x_i^(t+1) = x_i^t + γ·L(λ)·(f − x_i^t)    (5)

wherein x_i^(t+1) and x_i^t respectively represent the positions of individual i at the (t+1)-th and the t-th iterations; f is the current global optimal solution; γ is a scale factor; L(λ) is the step size of the Lévy flight, L(λ) ~ (λ·Γ(λ)·sin(πλ/2)/π)·(1/S^(1+λ)); Γ(λ) is the standard gamma function, λ ∈ [1, 2]; S is the moving step.
Specifically, for the cross pollination operation, in the initialization stage the parameters include the flower population size n, the conversion rate p, and a randomly initialized search space for the pollen. In the pollination stage, the binary flower pollination algorithm BFPA iterates continuously, finding the optimal solution through the global pollination, clonal selection, and local pollination operators until the convergence condition is met. Cross pollination propagates pollen through the flight of bees or insects, which follows a Lévy flight distribution; local and global pollination are controlled by the switching probability p ∈ [0, 1]. Global pollination is performed when the transition probability p > rand, and the current optimal solution is updated according to formula (5).
Among these, the most efficient method of generating the moving step S is the Mantegna algorithm, which computes S from two Gaussian distributions U and V:

S = U / |V|^(1/λ)

wherein U ~ N(0, σ_u²) and V ~ N(0, σ_V²). Because of σ_u and σ_V, the moving step S and the flight direction of the Lévy flight change randomly from one flower to another, and the step can be large or small. This not only increases the diversity of the search space but also improves the global optimization capability of the BFPA algorithm.
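A sketch of the Mantegna step generator together with the global-pollination move of formula (5); the scale factor default, the choice σ_V = 1, and the sigmoid binarization of the resulting continuous position are assumptions added for illustration:

```python
import math
import numpy as np

def levy_step(size, lam=1.5, rng=None):
    """Mantegna's algorithm: S = U / |V|**(1/lam), with U ~ N(0, sigma_u**2)
    and V ~ N(0, 1) (sigma_V = 1 is the common choice, assumed here)."""
    rng = rng if rng is not None else np.random.default_rng()
    sigma_u = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
               / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    return rng.normal(0.0, sigma_u, size) / np.abs(rng.normal(0.0, 1.0, size)) ** (1 / lam)

def global_pollinate(x, best, gamma=0.1, lam=1.5, rng=None):
    """Cross-pollination move x + gamma * L(lam) * (best - x) of formula (5),
    followed by sigmoid binarization back to a 0/1 feature vector."""
    rng = rng if rng is not None else np.random.default_rng()
    cont = x + gamma * levy_step(np.shape(x), lam, rng) * (best - x)
    p = 1.0 / (1.0 + np.exp(-cont))
    return (rng.random(np.shape(x)) < p).astype(int)
```

Because the Lévy step is heavy-tailed, most moves stay near the current solution while occasional large jumps escape local regions, which is the diversity effect described above.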
If self pollination is adopted, steps S1022 to S1024 are performed:
s1022, selecting n optimal individuals from the population according to the fitness value, randomly selecting an individual m and an individual k from the n optimal individuals, and updating an individual i in the population according to a formula (7) to obtain a new individual i:
wherein A is an adaptive conversion factor; x_m^t and x_k^t respectively represent the positions of individual m and individual k at the t-th iteration; C_1 and C_2 represent variation factors; T represents the set number of iterations.
The adaptive conversion factor A strengthens the conversion of a continuous solution into a discrete solution; setting this factor improves an individual's ability to jump out of local optima and increases the convergence speed of the original pollination algorithm.
S1023, calculating the fitness value of the new individual i according to a set fitness function, if the fitness value of the new individual i is larger than the fitness value before updating the individual i, adopting the new individual i to replace the individual i before updating, otherwise, discarding the new individual i;
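Steps S1022 and S1023 can be sketched as follows; since formula (7) is not reproduced in the text, a standard local-pollination move x_i + A·(x_m − x_k) followed by sigmoid binarization is assumed, and the greedy acceptance of S1023 keeps the new individual only when its fitness improves:

```python
import numpy as np

def self_pollinate(pop, fitness_fn, n_best, A, rng=None):
    """Sketch of steps S1022-S1023: each individual takes a local-pollination
    move built from two of the n fittest individuals, is binarized, and
    replaces its parent only if the fitness improves (greedy acceptance).
    The move x_i + A*(x_m - x_k) stands in for formula (7) (an assumption)."""
    rng = rng if rng is not None else np.random.default_rng()
    fits = np.array([fitness_fn(ind) for ind in pop])
    best = np.argsort(fits)[::-1][:n_best]            # indices of the n fittest
    new_pop = pop.copy()
    for i in range(len(pop)):
        m, k = rng.choice(best, size=2, replace=False)
        cont = pop[i] + A * (pop[m] - pop[k])         # continuous trial position
        trial = (rng.random(cont.shape) < 1.0 / (1.0 + np.exp(-cont))).astype(int)
        if fitness_fn(trial) > fits[i]:               # S1023: keep only if better
            new_pop[i] = trial
    return new_pop
```

With the greedy rule, the fitness of every individual is non-decreasing across one pass, which is what S1023 requires.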
feature selection can be regarded as a multi-objective optimization problem requiring setting a suitable objective function (referred to herein as an fitness function) as the optimization objective for the algorithm. The fitness function achieves two contradictory goals; and selecting the minimum feature number and maximally improving the classification accuracy. The smaller the number of feature subsets selected each time, the higher the classification precision, and the better classification effect of the proposed model is proved.
Each solution is evaluated by the proposed fitness function, which depends on the search algorithm, the classifier, the classification accuracy of the solution, and the number of features selected in the solution. To balance the number of features selected in each solution (to be minimized) against the classification accuracy (to be maximized), as one implementation the fitness function is set as follows:
wherein acc = num_c/(num_c + num_i) represents the classification accuracy of the samples, num_c represents the number of correctly classified samples, num_i represents the number of misclassified samples, n represents the number of features selected for the sample whose fitness value is to be calculated, N represents the number of all features for that sample, α is the weight of classification accuracy, β is the weight of feature selection, and α + β = 1.
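A sketch of the fitness computation; the combined form α·acc + β·(1 − n/N) is inferred from the described terms and is an assumption, as the formula itself is not reproduced in the text:

```python
def fitness(num_correct, num_wrong, n_selected, n_total, alpha=0.9):
    """Weighted fitness balancing classification accuracy (maximize) against
    subset size (minimize). The combined form alpha*acc + beta*(1 - n/N),
    with beta = 1 - alpha, is assumed from the described terms."""
    acc = num_correct / (num_correct + num_wrong)
    beta = 1.0 - alpha
    return alpha * acc + beta * (1.0 - n_selected / n_total)
```

For example, with 90 of 100 samples classified correctly and 5 of 100 features selected at α = 0.9, the fitness is 0.9·0.9 + 0.1·0.95 = 0.905; at the same accuracy, a smaller subset always scores higher.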
s1024, repeating steps S1022 to S1023 until all individuals in the population are updated.
S103, searching a neighborhood of the global optimal solution by adopting a tabu search algorithm according to a set tabu table to determine a candidate solution, and updating the tabu table according to the fitness value of the candidate solution;
specifically, the present step comprises the following sub-steps:
s1031, setting initialization parameters: initializing a tabu table of length tabuLength, and setting the number of generated neighborhood solutions to numNeighbor;
s1032, selecting an initial solution, wherein the initial solution is an optimal solution generated by local search in a flower pollination algorithm in the current iteration process;
s1033, if the current iteration times are judged to be equal to the maximum iteration times, ending the iteration process, and taking the current optimal solution as a final optimal solution; otherwise, go to step S1034;
s1034, randomly selecting a feature through the current solution to carry out single-point mutation so as to generate a neighborhood solution and form a candidate solution;
s1035, if the candidate solution is judged to be not in the tabu list and the fitness value of the candidate solution is larger than that of the initial solution, replacing the initial solution by the candidate solution, adding the candidate solution into the tabu list, and repeating the step S1033; if it is determined that the candidate solution is found in the tabu list, step S1033 is repeated.
Tabu search is a neighborhood search algorithm that mimics the human memory function. Its core is the combination of a local search with a tabu mechanism: at each iteration, the algorithm searches the neighborhood of the optimal solution to obtain a new solution with an improved objective value.
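The tabu refinement of steps S1031 to S1035 can be sketched as follows; the acceptance rule, the fixed-length list of visited solutions used as the tabu table, and all parameter defaults are illustrative assumptions:

```python
import numpy as np

def tabu_search(solution, fitness_fn, max_iter=50, tabu_len=10,
                n_neighbors=20, rng=None):
    """Tabu refinement of the global best (sketch of S1031-S1035):
    single-point mutation generates neighbors; a fixed-length tabu list of
    visited solutions blocks cycling back to them."""
    rng = rng if rng is not None else np.random.default_rng()
    current = solution.copy()
    best, best_fit = current.copy(), fitness_fn(current)
    tabu = [current.tobytes()]
    for _ in range(max_iter):
        for _ in range(n_neighbors):
            cand = current.copy()
            j = rng.integers(len(cand))
            cand[j] ^= 1                              # single-point mutation
            key = cand.tobytes()
            if key not in tabu and fitness_fn(cand) > fitness_fn(current):
                current = cand
                tabu.append(key)                      # S1035: record accepted move
                if len(tabu) > tabu_len:
                    tabu.pop(0)                       # keep fixed tabu length
                break
        if fitness_fn(current) > best_fit:
            best, best_fit = current.copy(), fitness_fn(current)
    return best, best_fit
```

The stopping rule here is a fixed iteration count, matching step S1033; in the full method the refined solution is then handed to the elite strategy of step S104.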
S104, selecting an individual with the largest fitness value from the tabu list as an elite individual, and replacing the individual with the smallest fitness value in the population by the elite individual to form a new population;
specifically, the present step comprises the following sub-steps:
s1041, sorting all individuals in the tabu list according to the fitness value;
s1042, storing the individual with the largest fitness value into elite population;
s1043, updating elite population after the current iteration process is finished, replacing worst elite individuals in the population with elite individuals in the elite population, and carrying out the next iteration.
In the improved pollination algorithm provided by this embodiment of the invention, new individuals are continually generated during the search. On the one hand, this maintains population diversity and gives the algorithm better global search capability; on the other hand, it slows the convergence of the algorithm and reduces accuracy when the number of evaluations is limited. To improve the convergence rate, an elite strategy is introduced after each iteration: to keep the population size unchanged, the elite individual replaces the worst solution, so that when an elite individual is added to the new generation, the individual with the minimum fitness value in that generation is eliminated.
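The elite replacement described above (step S104) can be sketched minimally as follows; the function name and array-based representation are illustrative:

```python
import numpy as np

def elite_replace(pop, fitness, elite, elite_fit):
    """Elite strategy after each iteration: the best individual found by the
    tabu search replaces the worst member of the population, keeping the
    population size unchanged."""
    worst = int(np.argmin(fitness))
    if elite_fit > fitness[worst]:
        pop = pop.copy()
        fitness = fitness.copy()
        pop[worst] = elite          # eliminate the minimum-fitness individual
        fitness[worst] = elite_fit
    return pop, fitness
```

The guard condition means the population is only changed when the elite individual actually improves on the current worst solution.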
S105, taking S102 to S104 as one iteration, and repeating S102 to S104 until the current iteration number reaches the set iteration number.
As can be seen from the above embodiments, the search process of the present invention is an effective hybrid approach that combines the binary elite pollination algorithm with tabu search. The initialization process based on ReliefF feature ranking aims to select important feature subsets, and the elite strategy improves the convergence speed of the pollination algorithm. Besides reducing redundant features during initialization, the weak local search capability of the pollination algorithm and its tendency to fall into local optima are addressed by the tabu search and adaptive Gaussian mutation strategies, which increase population diversity and improve local search performance.
To verify the effectiveness of the present method, 10-fold cross-validation was used to test the selection performance of the improved pollination algorithm from the following aspects.
1. Data set and evaluation index
The biological data set used in this experiment is shown in table 1:
table 1: data set description
The feature subset is evaluated by combining 10-fold cross-validation with a KNN classifier: the dataset is randomly divided into ten parts, nine parts are used in turn as the training set, and the remaining part is used as the test set. Each run yields a corresponding accuracy (or error rate); in this experiment, every algorithm is run ten times and the average of the ten results is taken as the estimate of the algorithm's accuracy.
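A minimal stand-in for this evaluation protocol (10-fold cross-validation with a k-NN classifier); the hand-rolled nearest-neighbor classifier and all parameter defaults are illustrative, not the experimental code:

```python
import numpy as np

def cv10_knn_accuracy(X, y, mask, k=1, rng=None):
    """Estimate the accuracy of a feature subset (binary mask) by 10-fold
    cross-validation with a minimal k-NN classifier (an illustrative
    stand-in for the KNN classifier used in the experiments)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Xs = X[:, mask.astype(bool)]                 # keep only selected features
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 10)
    correct = 0
    for f in range(10):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(10) if g != f])
        for i in test:
            d = np.linalg.norm(Xs[train] - Xs[i], axis=1)
            nn = train[np.argsort(d)[:k]]        # k nearest training samples
            votes = np.bincount(y[nn])
            correct += int(np.argmax(votes) == y[i])
    return correct / len(y)
```

Averaging this accuracy over ten independent runs, as described above, gives the reported estimate for each algorithm.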
Feature selection is performed on the M×N microarray dataset according to the flow shown in FIG. 4, and performance tests are carried out on the results of the feature selection.
(1) Average feature subset number (AvgN)
Under the eight biological data sets, the feature-subset selection capability of different algorithms on the same data set can be judged by the number of selected features. As shown in Tables 2 and 3, selecting fewer features means eliminating redundant features and reducing the search space; the ReliefF-EFPATS selects about 6 times fewer features than BCROSAT, IG-GA, and ISFLA, and about 2 times fewer than ABC.
(2) Average accuracy (Acc%)
The average accuracy is also an important indicator. As shown in Tables 2 and 3, the ReliefF-EFPATS achieves the best average accuracy (Acc) on most data sets compared with the other algorithms. On the data sets SRBCT and Lung Cancer, the ReliefF-EFPATS achieves an Acc similar to that of BBHA, which attains higher accuracy there.
(3) Standard deviation (std)
To verify the robustness of the algorithm, each experiment is run 10 times, and the average accuracy of the corresponding indicator and the standard deviation of the average selected feature number are reported. The standard deviation measures the variability of a set of numbers; clearly, the smaller the standard deviation, the more stable the experimental result.
(4) Average fitness value (Avgf%)
The average fitness value balances the two targets of feature selection well: maximum classification accuracy and optimal subset length. As shown in FIG. 6, the ReliefF-EFPATS is superior to the other three algorithms in average fitness on ALL-AML, ColonTumor, MLL, and Lung Cancer. For the dataset CNS, the average fitness of the ReliefF-EFPATS is slightly worse than that of BBHA but significantly better than that of the other four algorithms.
(5) Run Time (Time)
Feature selection reduces the dimensionality of the original data and improves the efficiency of the search mechanism. The time consumed by feature selection on high-dimensional datasets is also considered herein. The runtime of an algorithm depends on its convergence capability and the size of the data set. FIG. 7 shows the average computation time comparison for all algorithms. The ReliefF-EFPATS converges approximately 3 times faster than ISFLA and IG-GA, and its speed is substantially similar to that of BBHA on two data sets. As can be seen from FIG. 7, the proposed ReliefF-EFPATS achieves higher performance on the eight baseline disease datasets in a short time. On SRBCT, the execution time of the ReliefF-EFPATS is slightly longer than that of the BBHA algorithm. The larger the sample size of the dataset, the longer the run time; for example, Lung Cancer takes more time than the other data sets. In general, the proposed ReliefF-EFPATS is more efficient in terms of time cost than BBHA, BCROSAT, ISFLA, IG-GA, and ABC.
2. Comparing with other algorithms
(1) Comparison with other algorithms in this field
To intuitively show the performance of the improved pollination algorithm, different algorithms are introduced for comparison: the black hole algorithm combined with chi-square test (BBHA, Black Hole Algorithm), the genetic algorithm combined with information gain (IG-GA), the improved shuffled frog leaping algorithm (ISFLA, Improved Shuffled Frog Leaping Algorithm), the binary artificial bee colony algorithm (ABC, Binary Artificial Bee Colony), and the binary coral reefs optimization algorithm (BCROSAT, Binary Coral Reefs Optimization algorithm). Experiments were performed on the eight biological datasets ALL-AML, ColonTumor, CNS, MLL, SRBCT, Lymphoma, and Lung Cancer. The experimental results are shown in Table 2.
Table 2: comparing the Relieff-EFPATS with the previous algorithm
(2) Comparison with method EFPATS and algorithm BCFP
To further test the impact of the improvement strategies, the ReliefF-EFPATS algorithm of the present invention was compared with the elite pollination algorithm EFPATS and the binary clone pollination algorithm BCFPA. As can be seen from Table 3, with respect to the two targets of classification accuracy and number of selected attributes, the hybrid algorithm ReliefF-EFPATS is much better than the binary elite pollination algorithm EFPATS, further proving that the ReliefF algorithm helps speed up convergence. In summary, the ReliefF-EFPATS achieves better classification performance on most data sets and has good robustness. Built on the search algorithm FPA, the new hybrid algorithm ReliefF-EFPATS makes the search more efficient.
Table 3: experimental results of ReliefF-EFPATS, EFPATS and BCFPA
(3) Example analysis
The effectiveness of the present invention in feature selection has been described above through the average feature subset selection number and the average fitness value. The classification accuracy and standard deviation of the ReliefF-EFPATS on each dataset are then analyzed to determine its stability. As shown in Table 2, the average accuracy of the ReliefF-EFPATS is more stable, with the minimum standard deviation except on the data sets SRBCT and Lung Cancer. From the results of Table 3 and FIG. 8, it can be seen that, except on the lung cancer microarray dataset, the ReliefF-EFPATS is highly competitive compared with EFPATS and BCFPA. The binary clone pollination algorithm (BCFPA) performs worse on both Acc and Avgf across nearly all data sets.
To reveal the search process of the ReliefF-EFPATS, the optimal solutions obtained by the ReliefF-EFPATS on all data sets are listed in Table 4, which further demonstrates the effectiveness of the relative-advantage-based selection strategy. As can be seen from Table 4, the optimal genes can be found by the ReliefF-EFPATS algorithm of the present invention. For example, for the leukemia data set ALL-AML and for Lymphoma, the gene M23197 (CD33 antigen) has been identified by literature search as playing a key role in ALL-AML, while in the lymphoma dataset the genes GENE639X and GENE1610X have very high correlation with lymphoma. This further demonstrates the effectiveness of the proposed method in finding important features of high-dimensional medical datasets.
Table 4: description of obtaining optimal solution by Relieff-EFPATS
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. The tumor gene characteristic selection method based on elite pollination algorithm and ReliefF is characterized by being applied to a characteristic selection process of a biomedical data set, wherein each sample in the biomedical data set contains a plurality of gene characteristics, and one sample corresponds to one individual, and the method comprises the following steps:
step 1, initializing a population consisting of M individuals by adopting a dual-initial population strategy based on Relieff feature ordering and randomization; the method specifically comprises the following steps:
step 1.1, dividing M individuals into two populations on average: a first tumor gene signature population and a second tumor gene signature population;
step 1.2, initializing the first tumor gene feature population by a randomization process to form a first type of initial solution, specifically comprising: modeling each individual as a binary string, wherein each bit position in the binary string corresponds to one gene feature contained in the sample corresponding to the individual, and the length of the binary string represents the total number of gene features contained in that sample; for the j-th gene feature X_ij of individual i in the first tumor gene feature population, randomly generating a random number r, r ∈ [0, 1]; if the random number r is smaller than the set initialization probability P, the gene feature X_ij is selected, otherwise X_ij is not selected; for each individual, setting the bit corresponding to each selected gene feature to 1 and the bit corresponding to each unselected gene feature to 0, thereby obtaining the binary string corresponding to the individual, namely the initial solution of the individual; the solution formed by the initialized first population is used as the first type of initial solution;
step 1.3, initializing a second tumor gene feature group by adopting a weight sorting process to form a second type of initial solution, wherein the method specifically comprises the following steps: calculating the weight of each gene characteristic contained in each individual corresponding sample in the second tumor gene characteristic group according to the set Relieff weight formula; for each individual, randomly selecting a plurality of gene features from the front TopN gene features with larger weight values, setting the bit corresponding to the selected gene feature as 1, and setting the bit corresponding to the unselected gene feature as 0, thereby obtaining a binary character string corresponding to the individual, namely an initial solution of the individual; the solution formed by the initialized second group is used as a second type initial solution;
the set Relieff weight formula specifically comprises:
wherein X is a training sample set, X i ∈{X 1 ,X 2 ,…X m -a }; y is a class label set, Y= { Y 1 ,Y 2 …Y n Randomly selecting an individual X from a training sample set i The tumor disease category of (C) is Y i W (f) is the weight of the gene signature f, T is the number of iterations, diff (f, T 1 ,T 2 ) Representing the individual T 1 And individual T 2 For the difference in the gene signature f,T 1 =X i ,T 2 =H j or M j ;H j Representing the sum X of the slave i K nearest neighbor individuals found in the same class of individuals, M j Representing the sum X of the slave i K nearest neighbor individuals found among individuals of different classes;
step 1.4, merging the first type initial solution and the second type initial solution to obtain an initial optimal solution of the population;
step 2, updating the population by adopting a binary elite pollination algorithm, and calculating the fitness value of each individual in the population by adopting a set fitness function to obtain a global optimal solution in the population; the set fitness function is specifically:
wherein acc represents the classification accuracy of classifying the samples in the biomedical dataset with a KNN classifier based on the selected gene features, num_c represents the number of correctly classified samples, num_i represents the number of misclassified samples, n represents the number of selected gene features of the sample whose fitness value is to be calculated, N represents the number of all gene features of that sample, α represents the weight of classification accuracy, β represents the weight of feature selection, and α + β = 1;
step 3, searching the neighborhood of the global optimal solution by adopting a tabu search algorithm according to a set tabu table to determine a candidate solution, and updating the tabu table according to the fitness value of the candidate solution;
step 4, selecting an individual with the largest fitness value from the tabu list as an elite individual, and replacing the individual with the smallest fitness value in the population with the elite individual to form a new population;
and 5, taking the steps 2 to 4 as one iteration, repeating the steps 2 to 4 until the current iteration times reach the set iteration times, and outputting a global optimal solution at the moment, wherein the characteristic with the characteristic value of 1 in the global optimal solution is the selected tumor gene characteristic which is favorable for disease classification.
2. The method according to claim 1, wherein the set initialization probability P is calculated according to the formula (3) and the formula (4):
wherein x_ij^t represents the j-th feature value of individual i at the t-th iteration, A represents the adaptive conversion factor, C_1 and C_2 represent variation factors, and T represents the set number of iterations.
3. The method of claim 1, wherein in step 2, said updating said population using a binary elite pollination algorithm comprises:
if cross pollination is used, updating the individual i in the population according to formula (5):

x_i^(t+1) = x_i^t + γ·L(λ)·(f − x_i^t)    (5)

wherein x_i^(t+1) and x_i^t respectively represent the positions of individual i at the (t+1)-th and the t-th iterations; f is the current global optimal solution; γ is a scale factor; L(λ) is the step size of the Lévy flight, L(λ) ~ (λ·Γ(λ)·sin(πλ/2)/π)·(1/S^(1+λ)); Γ(λ) is the standard gamma function, λ ∈ [1, 2]; S is the moving step.
4. A method according to claim 3, wherein in step 2, said updating said population using a binary elite pollination algorithm further comprises:
step 2.1, if self-pollination operation is adopted, selecting n optimal individuals from the population according to the fitness value, randomly selecting an individual m and an individual k from the n selected optimal individuals, and updating an individual i in the population according to a formula (7) to obtain a new individual i:
wherein A is an adaptive conversion factor; x_m^t and x_k^t respectively represent the positions of individual m and individual k at the t-th iteration; C_1 and C_2 represent variation factors; T represents the set number of iterations;
step 2.2, calculating the fitness value of the new individual i according to a set fitness function, if the fitness value of the new individual i is larger than the fitness value before updating the individual i, adopting the new individual i to replace the individual i before updating, otherwise, discarding the new individual i;
step 2.3, repeating the steps 2.1 to 2.2 until all individuals in the population are updated.
5. The method according to claim 1, wherein the step 3 is specifically:
step 3.1, setting initialization parameters: initializing a tabu table of length tabuLength, and setting the number of generated neighborhood solutions to numNeighbor;
step 3.2, selecting an initial solution, wherein the initial solution is an optimal solution generated by local search in a flower pollination algorithm in the current iteration process;
step 3.3, if the current iteration times are judged to be equal to the maximum iteration times, ending the iteration process, and taking the current optimal solution as a final optimal solution; otherwise, carrying out the step 3.4;
step 3.4, randomly selecting a feature through the current solution to carry out single-point mutation so as to generate a neighborhood solution and form a candidate solution;
step 3.5, if it is judged that the candidate solution is not in the tabu list and the fitness value of the candidate solution is greater than that of the initial solution, replacing the initial solution with the candidate solution, adding the candidate solution to the tabu list, and repeating step 3.3; if it is judged that the candidate solution is in the tabu list, repeating step 3.3.
6. The method according to claim 1, wherein the step 4 is specifically:
step 4.1, sorting all individuals in the tabu list according to the fitness value;
step 4.2, storing the individual with the largest fitness value into elite population;
and 4.3, updating the elite population after the current iteration process is finished, replacing worst individuals in the population with elite individuals in the elite population, and carrying out the next iteration.
CN201910266518.2A 2019-04-03 2019-04-03 Effective mixed characteristic selection method based on elite flower pollination algorithm and ReliefF Active CN110110753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910266518.2A CN110110753B (en) 2019-04-03 2019-04-03 Effective mixed characteristic selection method based on elite flower pollination algorithm and ReliefF


Publications (2)

Publication Number Publication Date
CN110110753A CN110110753A (en) 2019-08-09
CN110110753B true CN110110753B (en) 2023-08-25

Family

ID=67485102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910266518.2A Active CN110110753B (en) 2019-04-03 2019-04-03 Effective mixed characteristic selection method based on elite flower pollination algorithm and ReliefF

Country Status (1)

Country Link
CN (1) CN110110753B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837884B (en) * 2019-10-30 2023-08-29 河南大学 Effective mixed characteristic selection method based on improved binary krill swarm algorithm and information gain algorithm
CN112102438B (en) * 2020-08-26 2024-01-26 东南大学 Graph dyeing problem searching method based on elite solution driven multi-level tabu search
CN112786111A (en) * 2021-01-18 2021-05-11 上海理工大学 Characteristic gene selection method based on Relieff and ant colony
CN113113137B (en) * 2021-04-17 2022-10-11 河南大学 Feature selection method based on maximum correlation minimum redundancy and improved flower pollination algorithm
CN114967615A (en) * 2022-05-18 2022-08-30 电子科技大学 Assembly workshop scheduling integrated optimization method based on discrete flower pollination algorithm
CN118051763B (en) * 2024-04-16 2024-07-05 湖南麓川信息科技有限公司 Deep learning-based big data feature extraction method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550033A (en) * 2015-11-17 2016-05-04 北京交通大学 Genetic-tabu hybrid algorithm based resource scheduling policy method in private cloud environment
CN106610640A (en) * 2015-11-23 2017-05-03 四川用联信息技术有限公司 Tabu list containing genetic and local search algorithm for multi-objective flexible job-shop scheduling
CN106658524A (en) * 2016-09-28 2017-05-10 哈尔滨工程大学 Multi-target frequency spectrum allocation method based on quantum flower pollination search mechanism in cognitive heterogeneous network
WO2018072351A1 (en) * 2016-10-20 2018-04-26 北京工业大学 Method for optimizing support vector machine on basis of particle swarm optimization algorithm


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved firefly algorithm for solving the job-shop scheduling problem; Tao Wenhua; Hou Mengmeng; Electronic Design Engineering (Issue 09); full text *

Also Published As

Publication number Publication date
CN110110753A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
CN110110753B (en) Effective mixed characteristic selection method based on elite flower pollination algorithm and ReliefF
Zhang et al. Binary differential evolution with self-learning for multi-objective feature selection
JP6240804B1 (en) Filtered feature selection algorithm based on improved information measurement and GA
CN109637579B (en) Tensor random walk-based key protein identification method
JP2024524795A (en) Gene phenotype prediction based on graph neural networks
CN110837884B (en) Effective mixed characteristic selection method based on improved binary krill swarm algorithm and information gain algorithm
Zhang et al. A new method for species identification via protein-coding and non-coding DNA barcodes by combining machine learning with bioinformatic methods
Martínez-Ballesteros et al. Improving a multi-objective evolutionary algorithm to discover quantitative association rules
CN112215259B (en) Gene selection method and apparatus
CN113642613A (en) Medical disease characteristic selection method based on improved goblet sea squirt group algorithm
CN110956277A (en) Interactive iterative modeling system and method
CN109545372B (en) Patient physiological data feature selection method based on greedy-of-distance strategy
Nouri-Moghaddam et al. A novel filter-wrapper hybrid gene selection approach for microarray data based on multi-objective forest optimization algorithm
Wei et al. Multiobjective optimization algorithm with dynamic operator selection for feature selection in high-dimensional classification
Nelson et al. Higher order interactions: detection of epistasis using machine learning and evolutionary computation
CN112836794A (en) Method, device and equipment for determining image neural architecture and storage medium
Durge et al. Heuristic analysis of genomic sequence processing models for high efficiency prediction: A statistical perspective
CN111755074B (en) Method for predicting DNA replication origin in saccharomyces cerevisiae
CN114334168A (en) Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy
Kihel et al. A novel genetic grey wolf optimizer for global optimization and feature selection
Jungjit New Multi-Label Correlation-Based Feature Selection Methods for Multi-Label Classification and Application in Bioinformatics
Ovhal et al. Improved filter ranking incorporated binary black hole algorithm for feature selection
Shi et al. Factors Affecting Accuracy of Genotype Imputation Using Neural Networks in Deep Learning
Mi et al. Hierarchical neural network with efficient selection inference
CN116992098B (en) Quotation network data processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant