CN109117956B - Method for determining optimal feature subset - Google Patents

Method for determining optimal feature subset

Info

Publication number
CN109117956B
Authority
CN
China
Prior art keywords
feature
subset
feature subset
features
samples
Prior art date
Legal status
Active
Application number
CN201810732008.5A
Other languages
Chinese (zh)
Other versions
CN109117956A (en)
Inventor
杨玲波
黄敬峰
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810732008.5A priority Critical patent/CN109117956B/en
Publication of CN109117956A publication Critical patent/CN109117956A/en
Application granted granted Critical
Publication of CN109117956B publication Critical patent/CN109117956B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines


Abstract

The invention discloses a method for determining an optimal feature subset, comprising the following steps: acquiring a high-resolution image and applying preprocessing and object-oriented segmentation to obtain a ground-object dataset; calculating the features of each ground object, including shape, index, spectral, and texture features; selecting samples, comprising training and test samples, from the original ground-object dataset; using the training samples to compute the importance of each feature with machine learning methods such as random forest, gradient boosting decision tree, and support vector machine together with cross validation, and screening the features with an improved enhanced recursive feature elimination method to obtain a classification accuracy score for the feature subset at each feature count; and determining, for each method, the optimal feature subset for classification by the highest-score principle, discarding the remaining features as redundant. The method is simple, fast, and accurate.

Description

Method for determining optimal feature subset
Technical Field
The invention relates to the technical field of optimal classification feature subset acquisition, and in particular to a method for determining an optimal feature subset.
Background
Feature screening is the process of eliminating redundant features from an original feature set to obtain an optimal feature subset that is effective for classification; it reduces classification time and can improve classification accuracy. Feature subsets are usually evaluated against predefined criteria such as classification accuracy or class separability. Feature screening is an important step in machine learning: too many features can reduce classification accuracy and increase classification time, a phenomenon known as the curse of dimensionality (Pacifici et al. 2009). Feature screening approaches fall into three classes: filter, wrapper, and embedded (Weston et al. 2003). Filter methods evaluate feature subsets independently of any classifier, while embedded and wrapper methods couple feature screening with the classifier. In embedded methods, feature screening is part of the learning algorithm and is bound to a specific machine learning method; wrapper methods wrap a specific learning algorithm to evaluate candidate feature subsets, minimize the error of the classification result, and finally build a classifier.
Recursive feature elimination (RFE) is a widely used feature screening technique: it evaluates and ranks the importance of each feature with a trained model, step by step removes the least important features from the feature set, and evaluates the performance of each feature subset by cross validation to obtain an optimal feature set (Guyon 2001). Because RFE is an embedded method, the feature subsets it selects tend to achieve higher classification accuracy. However, a feature of low individual importance may have a strong effect on classification accuracy in combination with other features, so screening purely by importance ranking can degrade the performance of the best feature subset (Chen and Jeong 2007). To address this, Chen and Jeong (2007) proposed enhanced recursive feature elimination (EnRFE), which improves the performance of the best feature subset by searching among the lower-importance features for one whose removal improves classification accuracy. The method still has two shortcomings: it is inefficient, and when no accuracy-improving feature is found it simply removes the least important feature, which can also sharply degrade the performance of the remaining subset.
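For reference, the standard RFE loop described above is available off the shelf; a minimal scikit-learn sketch, assuming a random-forest ranker, synthetic data, and a target of 5 features chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for an object feature set: 200 samples, 20 features,
# of which only 5 are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# RFE ranks features with the estimator's importances and removes the
# least important one per step until 5 remain.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=5, step=1).fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of the kept features
```

This is the baseline method the patent improves on: plain RFE removes strictly by importance rank, with no search over alternative removals.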
To address these two problems, the invention improves the EnRFE method in both respects, raising the efficiency of feature screening and the performance of the selected optimal feature subset, and builds on this method a complete technical workflow from image preprocessing and feature calculation through feature screening to image classification.
Disclosure of Invention
The invention aims to provide a simple, fast, and accurate method for determining an optimal feature subset, suited to screening large numbers of features and eliminating redundant ones in machine learning. The method is based on an improved enhanced recursive feature elimination method and improves the efficiency of feature screening by limiting the feature search depth and strengthening the parallel computing capability of the search algorithm.
A method for determining an optimal subset of features, comprising the steps of:
step 1, acquiring a high-resolution image, preprocessing and object-oriented segmentation to obtain a surface feature object data set;
step 2, calculating the shape class characteristics, the index class characteristics, the spectrum class characteristics and the texture class characteristics of each object in the surface feature object data set obtained in the step 1 to serve as an initial characteristic set;
step 3, selecting samples from the surface feature object data set obtained in the step 1 to obtain training samples and test samples;
step 4, inputting the training samples obtained in step 3 into a random forest, gradient boosting decision tree, or support vector machine method, calculating the importance of each feature in the initial feature set of step 2, and sorting the features by importance from low to high to obtain a sorted feature set;
step 5, removing the first feature (the least important) from the sorted feature set to obtain a first feature subset and evaluating its score by cross validation; removing the second feature (the second least important) instead to obtain a second feature subset and evaluating its score by cross validation; and so on up to the kth feature subset, each scored by cross validation; then screening out the feature subset with the highest score from the first through the kth feature subsets;
step 6, inputting the training sample obtained in the step 3 into a random forest method, a gradient boosting decision tree method or a support vector machine method, calculating the importance of each type of features in the feature subset with the highest score screened in the step 5, sorting the features according to the importance from low to high to obtain a new sorted feature set, repeating the step 5, and screening out a new feature subset with the highest score;
step 7, repeating step 6 and recording the score of the highest-scoring feature subset at each iteration, until the feature subset is an empty set;
and step 8, selecting the feature subset with the highest score as the optimal feature subset from the scores of the feature subsets at the different feature counts obtained in step 7.
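Steps 4-8 can be sketched in outline with scikit-learn-style estimators that expose `feature_importances_`; the function name `improved_enrfe` and the toy data below are illustrative assumptions, not the patented implementation:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def improved_enrfe(estimator, X, y, k=7, cv=3):
    """Steps 4-8 in outline: at each iteration, among the k least
    important features, drop the one whose removal gives the best
    cross-validation score; repeat until one feature remains."""
    features = list(range(X.shape[1]))
    history = []                                   # (score, subset) per iteration
    while len(features) > 1:
        est = clone(estimator).fit(X[:, features], y)
        order = np.argsort(est.feature_importances_)   # ascending importance
        best_score, best_subset = -np.inf, None
        for pos in order[: min(k, len(features))]:     # candidate removals
            trial = [f for j, f in enumerate(features) if j != pos]
            score = cross_val_score(clone(estimator), X[:, trial], y,
                                    cv=cv).mean()
            if score > best_score:
                best_score, best_subset = score, trial
        features = best_subset
        history.append((best_score, list(features)))
    return max(history)                            # highest-scoring (score, subset)

# Toy data: only the first two of six features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
score, subset = improved_enrfe(
    RandomForestClassifier(n_estimators=20, random_state=0), X, y, k=3)
```

The key difference from plain RFE is in the inner loop: instead of always dropping the single least important feature, the k least important are each tried, and the cross-validation score decides which removal is kept.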
In step 1, the preprocessing comprises geometric correction, radiometric calibration, and atmospheric correction.
In step 2, the shape features include length, area, and similar measures; the index features include the Modified Normalized Difference Water Index (MNDWI), the Normalized Difference Vegetation Index (NDVI), the Enhanced Vegetation Index (EVI), and the like; the spectral features include the mean and variance of the spectrum in each band; and the texture features include textures based on the gray-level co-occurrence matrix.
In step 3, the training samples make up 60%-80% and the test samples 20%-40% of the combined total of training and test samples, selected by stratified random sampling. More preferably, the training samples are 70% and the test samples 30% of the total, again selected by stratified random sampling. The samples, comprising training and test samples, are selected from the ground-object dataset obtained in step 1 by methods such as visual interpretation and ground survey.
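The preferred 70/30 stratified draw can be reproduced with scikit-learn; the class counts below are those of the embodiment, and the feature matrix is a stand-in:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the embodiment's class counts: winter wheat, rape, chive, other.
y = np.repeat([0, 1, 2, 3], [649, 230, 176, 970])
X = np.arange(len(y)).reshape(-1, 1)          # stand-in object features

# stratify=y keeps each class's share at ~70% train / ~30% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```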
In step 5, k is the feature search depth; its value can be set manually according to the situation and must not exceed the total number of features in the initial feature set. The improved enhanced recursive feature elimination method strengthens the algorithm's concurrent search capability by limiting the search depth k, and changes the basis of feature selection from importance alone to the highest cross-validation score, improving the classification capability of the resulting optimal feature subset. The maximum search depth must balance search accuracy and efficiency; it is set equal to the number of CPU cores of the computer but not less than 4. The value of k may be 4-15; further preferably the maximum search depth is 5-10, i.e., k is 5-10; most preferably, k is 7.
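The concurrency enabled by the search-depth limit can be sketched with joblib (which ships with scikit-learn): the k candidate subsets of one iteration are independent, so they can be scored in parallel worker processes. All names and data here are illustrative:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 8))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
est = RandomForestClassifier(n_estimators=15, random_state=0)

k = 4                                            # search depth
order = np.argsort(est.fit(X, y).feature_importances_)
# One candidate subset per least-important feature considered for removal.
candidates = [[f for f in range(X.shape[1]) if f != drop]
              for drop in order[:k]]

def score_subset(subset):
    return cross_val_score(clone(est), X[:, subset], y, cv=3).mean()

# The k candidates are independent, so score them in parallel.
scores = Parallel(n_jobs=2)(delayed(score_subset)(s) for s in candidates)
best = candidates[int(np.argmax(scores))]
```

With k bounded by the CPU core count, every candidate of an iteration can be evaluated simultaneously, which is the efficiency gain the text describes.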
In step 8, after the optimal feature subset is obtained, the original ground-object dataset is classified, based on that subset, with methods such as random forest, gradient boosting decision tree, and support vector machine, and classification accuracy is evaluated with the test samples.
Compared with the prior art, the invention has the following advantages:
the invention relates to an optimal feature subset determination method based on an improved enhanced feature recursive screening method, which reduces the feature screening time and improves the performance of the optimal feature subset, thereby improving the classification precision of a machine learning method. The method is simple, rapid and accurate, the efficiency of feature screening is improved by limiting the depth of feature search and improving the parallel computing capability of a search algorithm, and on the other hand, the evaluation basis of feature selection is modified from the importance level to the cross validation score level, so that the performance of the optimal feature subset is improved.
Drawings
FIG. 1 is a flow chart of an optimal feature subset determination method based on an improved enhanced feature recursive screening method according to the present invention;
FIG. 2 is a diagram of the geographic location and raw image of a test area;
FIG. 3 is a distribution diagram of various types of ground feature samples in a test area;
FIG. 4 shows the results of the enhanced recursive feature elimination method based on the RF, GBDT, and SVM models.
Fig. 5 is a result of identifying regional crops based on the best feature subset obtained by screening, wherein fig. 5(a) is an identification result of the RF method, fig. 5(b) is an identification result of the GBDT method, fig. 5(c) is an identification result of the SVM method, fig. 5(d) is an enlargement of a result of the rape planting area, and fig. 5(e) is an enlargement of a result of the chive planting area.
Detailed Description
The invention is further illustrated with reference to the figures and examples.
As shown in FIG. 1, the flowchart of the optimal feature subset determination method based on the improved enhanced recursive feature elimination method of the present invention proceeds as follows: first, geometric correction, radiometric calibration, and atmospheric correction are applied to the acquired high-resolution satellite imagery; second, the image of the study area is segmented into ground objects by multi-scale segmentation, and these objects serve as the basic units for classification and recognition; then a subset of the ground objects is extracted as samples by visual interpretation and similar means and divided into training and test samples; next, four broad classes of features (spectral, texture, shape, and index) are calculated for each object; since these features are numerous and highly redundant, feature screening is needed to obtain the optimal feature subset. Based on the improved enhanced recursive feature elimination method, the optimal feature subset of each model is computed from the training data for the RF (Random Forest), GBDT (Gradient Boosting Decision Tree), and SVM (Support Vector Machine) models respectively. Finally, with the optimal feature subset obtained, all objects are classified and recognized with the RF, GBDT, and SVM methods, and recognition accuracy is evaluated with the test samples.
An optimal feature subset determination method based on an improved enhanced feature recursive screening method comprises the following steps:
a, acquiring a high-resolution image, preprocessing and carrying out object-oriented segmentation to obtain a ground feature object data set;
Specifically, the acquired high-resolution remote sensing imagery should be cloud-free, clear-sky imagery in which the different ground objects can be clearly distinguished. After acquisition, the imagery must be preprocessed, chiefly by geometric correction, radiometric calibration, and atmospheric correction. Geometric correction can be performed by collecting control points on the ground or selecting control points on another high-resolution base map (such as Google Earth), picking the corresponding tie points on the image to be corrected, and applying polynomial fine correction. Radiometric calibration uses the calibration coefficients of the corresponding satellite; atmospheric correction uses an atmospheric radiative transfer model such as 6S to obtain a surface reflectance image. Multi-scale segmentation of the corrected imagery then yields the ground objects used as the basic units for classification. The test area (shown in FIGS. 2 and 3) used 5 scenes in total, comprising data from three satellites: Sentinel-2A, Landsat-8, and GF-1 WFV. FIG. 2 shows the geographic location and raw imagery of the test area; FIG. 3 shows the distribution of the ground-object samples in the test area.
B, calculating various characteristics of the ground object, including shape, index, spectrum, texture and the like, as an initial characteristic set;
specifically, the number of shape features is 12, which are area, length, width, compact, density, asymmetry, roundness, insulatic, rectangle, main direction, circle index, shape index, and shape index.
The texture calculation first requires a principal component transform of each scene to obtain the first principal component band, which carries the most information; textures are then computed on this band. Each scene has 8 texture features: GLCM (Gray-Level Co-occurrence Matrix) homogeneity, GLCM contrast, GLCM dissimilarity, GLCM entropy, GLCM angular 2nd moment, GLCM mean, GLCM standard deviation, and GLCM correlation. The 5 scenes thus yield 40 texture features in total.
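To make the texture measures concrete, here is a minimal NumPy sketch of a normalized GLCM for one horizontal offset and a few of the eight statistics listed; the 8-level quantization and function name are assumptions, and production code would typically use a library such as scikit-image:

```python
import numpy as np

def glcm_features(img, levels=8):
    """Normalized gray-level co-occurrence matrix for offset (0, 1),
    plus a few of the standard GLCM statistics."""
    scaled = img.astype(float) / (img.max() + 1e-12) * levels
    q = np.minimum(scaled.astype(int), levels - 1)    # quantized gray levels
    P = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        P[a, b] += 1                                  # count horizontal pairs
    P /= P.sum()
    i, j = np.indices(P.shape)
    nz = P[P > 0]
    return {
        "contrast":    float((P * (i - j) ** 2).sum()),
        "homogeneity": float((P / (1.0 + (i - j) ** 2)).sum()),
        "asm":         float((P ** 2).sum()),          # angular 2nd moment
        "entropy":     float(-(nz * np.log(nz)).sum()),
        "mean":        float((P * i).sum()),
    }
```

A perfectly uniform image gives zero contrast and entropy and homogeneity of 1, which is a quick sanity check for any GLCM implementation.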
For the spectral features, the mean and variance of the object spectrum are calculated for every band of the 5 scenes: 2 Sentinel-2A MSI scenes with 10 bands each, 2 Landsat-8 OLI scenes with 7 bands each, and 1 GF-1 WFV scene with 4 bands. The imagery thus has 38 bands in total, giving 76 spectral features.
The index features include the Normalized Difference Vegetation Index (NDVI), the Enhanced Vegetation Index (EVI), the Land Surface Water Index (LSWI), and the Modified Normalized Difference Water Index (MNDWI). NDVI (Rouse et al. 1974) is one of the most widely used vegetation indexes, with broad application in remote sensing monitoring of crop extraction, crop growth, yield, and the like (Fuller 1998; Wardlow et al. 2007). EVI (Huete et al. 1994) addresses the tendency of NDVI to saturate at high vegetation density: by decoupling the vegetation canopy signal from atmospheric impedance, it enhances the vegetation information in remote sensing imagery and improves the sensitivity and detection capability of the vegetation index in densely vegetated areas (Huete et al. 2002). LSWI is more sensitive to changes in vegetation canopy moisture content and less susceptible to atmospheric effects than NDVI (Gao 1996; Jurgens 1997). MNDWI (Xu 2006) effectively distinguishes water bodies, vegetation, and built-up areas (Mansaray et al. 2017). The calculation formulas of the indices are given in formulas 1-4, where NIR denotes the near-infrared band reflectance, Red the red band reflectance, SWIR the short-wave infrared reflectance, Blue the blue band reflectance, and Green the green band reflectance. Since Sentinel-2A has two short-wave infrared bands, the average of the two SWIR bands is substituted into the formulas when the LSWI and MNDWI indices are calculated from Sentinel-2A imagery. Since the GF-1 WFV imagery has no short-wave infrared band, only NDVI and EVI are calculated for it. In total, 18 index features are obtained.
NDVI = (NIR - Red) / (NIR + Red)    (1)
EVI = 2.5 × (NIR - Red) / (NIR + 6 × Red - 7.5 × Blue + 1)    (2)
LSWI = (NIR - SWIR) / (NIR + SWIR)    (3)
MNDWI = (Green - SWIR) / (Green + SWIR)    (4)
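The four index formulas translate directly into band math; a sketch with NumPy reflectance arrays, using the band abbreviations defined in the text (the function name is illustrative):

```python
import numpy as np

def spectral_indices(nir, red, green, blue, swir):
    """Formulas 1-4: NDVI, EVI, LSWI, MNDWI from reflectance bands."""
    nir, red, green, blue, swir = map(np.asarray, (nir, red, green, blue, swir))
    ndvi = (nir - red) / (nir + red)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    lswi = (nir - swir) / (nir + swir)
    mndwi = (green - swir) / (green + swir)
    return ndvi, evi, lswi, mndwi
```

Because the arithmetic broadcasts, the same function works per object (scalar mean reflectances) or per pixel (full band arrays).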
Step C: samples, comprising training and test samples, are selected from the original ground-object dataset by methods such as visual interpretation and ground survey.
Specifically, 2025 objects were randomly selected from the multi-scale segmentation objects as sample data by visual interpretation: 649 winter wheat objects, 230 rape objects, 176 chive objects, and 970 other objects. The other objects are mainly ground-object types such as buildings, water bodies, wasteland, roads, forest, and greenhouses; the sample distribution is shown in FIG. 3. By stratified random sampling, 70% of the winter wheat, rape, chive, and other sample objects (1418 samples) were drawn as training samples to participate in feature screening and machine learning model training, and the remaining 30% (607 samples) were kept as test samples for assessing the accuracy of the final classification result.
Step D: using the training samples, the importance of each feature is calculated with machine learning methods such as random forest, gradient boosting decision tree, or support vector machine together with cross validation, and the features are screened with the improved enhanced recursive feature elimination method to obtain the classification accuracy score of the feature subset at each feature count.
Specifically, the enhanced recursive feature elimination (EnRFE) technique is adopted and improved, and the improved EnRFE method is used for feature screening as follows:
(a) inputting the training samples into a random forest, gradient boosting decision tree, or support vector machine method, calculating the importance of each feature in the initial feature set, and sorting the features by importance from low to high to obtain a sorted feature set;
(b) removing the first feature (the least important) from the sorted feature set to obtain a first feature subset and evaluating its score by cross validation; removing the second feature (the second least important) instead to obtain a second feature subset and evaluating its score by cross validation; and so on up to the kth feature subset, each scored by cross validation; then screening out the feature subset with the highest score from the first through the kth feature subsets;
k is the feature search depth and can be set manually according to the situation; in this embodiment the feature search depth is limited and the maximum search depth is set to 7;
(c) inputting the training samples into a random forest, gradient boosting decision tree, or support vector machine method, calculating the importance of each feature in the highest-scoring feature subset screened in step (b), sorting the features by importance from low to high to obtain a new sorted feature set, repeating step (b), and screening out a new highest-scoring feature subset;
(d) repeating step (c) and recording the score of the highest-scoring feature subset at each iteration, until the feature subset is an empty set;
and E, selecting the feature subset with the highest score as the optimal feature subset according to the obtained score conditions of the feature subsets with different feature quantities. According to the principle of highest score, determining the optimal feature subset of each classification method, and removing the residual features as redundant features;
in particular, the improved EnRFE method is used for optimal feature subset screening. The relationship between the feature quantity and the cross validation accuracy of the RF, GBDT and SVM models is shown in FIG. 4, and FIG. 4 shows the result of the enhanced feature recursive screening method based on the RF, GBDT and SVM models. From fig. 4, it can be seen that the cross validation accuracy of the three classification methods shows the characteristic of rapid increase and slow decrease as the number of features increases. When the number of the features is small (less than 10), the classification precision of the three methods is rapidly increased along with the increase of the number of the selected features; when the number of the features is 10-20, the verification precision slowly rises; when the number of the features reaches 20-40, the verification accuracy of the three methods reaches the highest point, and the variation amplitude is small; when the number of features is gradually increased, the cross-validation accuracy of all 3 methods shows a trend of decreasing. The GBDT method has the advantages that the descending amplitude is the minimum, and the GBDT method has better robustness for characteristic redundancy; the accuracy of the RF method then shows a slow but significant downward trend; the accuracy of the SVM method is greatly reduced, particularly in the process that the number of the features is increased from 50 to 70, the accuracy is sharply reduced from 0.87 to 0.83, after the number of the features is more than 70, the overall accuracy is not obviously reduced, but the accuracy stability is low, the amplitude is large, the SVM method is easily influenced by redundant features, and the robustness is relatively low. The highest accuracy of the cross-validation of the GBDT and RF methods is close, both around 0.90, while the accuracy of the SVM method is relatively lower, around 0.88. 
Finally, 30 features are selected as the optimal feature subset by the highest-score principle.
Step F: based on the obtained optimal feature subset, the original ground-object dataset is classified with methods such as random forest, gradient boosting decision tree, and support vector machine, and classification accuracy is evaluated with the test samples.
Specifically, the training sample sets are used to train the RF, GBDT, and SVM classification models respectively, and the trained models are used to classify the ground objects of the study area, yielding the spatial distribution of winter wheat, oilseed rape, and green onion crops; the results are shown in FIG. 5. FIG. 5 presents the regional crop recognition results based on the optimal feature subset obtained by screening: FIG. 5(a) is the recognition result of the RF method, FIG. 5(b) of the GBDT method, and FIG. 5(c) of the SVM method; FIG. 5(d) is an enlargement of the rape planting area and FIG. 5(e) of the chive planting area. As FIG. 5 shows, the crop recognition results of the three classification methods are substantially similar.
The crop-extraction accuracy of each classification method was verified with the test sample set. The results show that the GBDT method combined with the optimal feature subset obtained by the improved enhanced recursive feature elimination method achieves the highest overall classification accuracy, with an OA (overall accuracy) of 92.5% and a kappa coefficient of 0.882; the RF method follows, with an overall accuracy of 91.7% and a kappa coefficient of 0.867; the SVM method is relatively the lowest, with an OA of 90.5% and a kappa coefficient of 0.853.
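The two reported measures, OA and the kappa coefficient, can be computed from test-set predictions with scikit-learn; the labels below are toy values for illustration, not the embodiment's results:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy ground truth and predictions standing in for the test-sample results.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 1]

oa = accuracy_score(y_true, y_pred)          # overall accuracy (OA)
kappa = cohen_kappa_score(y_true, y_pred)    # chance-corrected agreement
```

Kappa discounts the agreement expected by chance from the class marginals, which is why it is reported alongside OA for imbalanced land-cover samples.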

Claims (3)

1. A method for determining an optimal subset of features, comprising the steps of:
step 1, acquiring a high-resolution image, preprocessing and object-oriented segmentation to obtain a surface feature object data set;
step 2, calculating the shape class characteristics, the index class characteristics, the spectrum class characteristics and the texture class characteristics of each object in the surface feature object data set obtained in the step 1 to serve as an initial characteristic set;
step 3, selecting samples from the surface feature object data set obtained in the step 1 to obtain training samples and test samples;
step 4, inputting the training sample obtained in the step 3 into a random forest method, a gradient boosting decision tree method or a support vector machine method, calculating the importance of each type of features in the initial feature set in the step 2, and sequencing the features from low to high according to the importance to obtain a sequenced feature set;
step 5, removing the first feature in the sorted feature set to obtain a first feature subset, evaluating the score of the feature subset by using a cross validation method, removing the second feature in the sorted feature set to obtain a second feature subset, evaluating the score of the feature subset by using a cross validation method, and repeating the steps to obtain the kth feature subset, and evaluating the score of the feature subset by using the cross validation method; screening out the feature subset with the highest score from the first feature subset, the second feature subset to the kth feature subset;
step 6, inputting the training sample obtained in the step 3 into a random forest method, a gradient boosting decision tree method or a support vector machine method, calculating the importance of each type of features in the feature subset with the highest score screened in the step 5, sorting the features according to the importance from low to high to obtain a new sorted feature set, repeating the step 5, and screening out a new feature subset with the highest score;
step 7, repeating the step 6, and recording the score of the feature subset with the highest score in each iteration until the feature subset is an empty set;
and 8, selecting the feature subset with the highest score as the optimal feature subset according to the score conditions of the feature subsets with different feature quantities obtained in the step 7.
2. The method for determining the optimal subset of features of claim 1, wherein in step 1, the preprocessing comprises: geometric correction, radiometric calibration and atmospheric correction.
3. The method for determining the optimal feature subset of claim 1, wherein in step 3, the training samples account for 60% to 80% of the combined total of training and testing samples, and the testing samples account for 20% to 40% of that total.
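The elimination loop of steps 4 through 8 can be sketched in Python as below. This is a minimal, hedged rendition: the function names (`select_optimal_subset`, `importance_fn`, `score_fn`) and the toy importance/scoring rules are illustrative assumptions, not the patent's own models — in the claimed method, importance would come from a random forest, gradient boosting decision tree, or support vector machine, and the score from cross validation on the training samples.

```python
def select_optimal_subset(features, importance_fn, score_fn):
    """Return the highest-scoring feature subset found by iterative elimination.

    Sketch of steps 4-8: rank features by importance, try removing each one,
    keep the best-scoring candidate subset, and repeat until the set is empty.
    """
    best_subset, best_score = list(features), score_fn(features)
    current = list(features)
    while current:
        # Steps 4/6: rank the current features by importance, low to high.
        ranked = sorted(current, key=lambda f: importance_fn(f, current))
        # Step 5: form k candidate subsets, each omitting one feature, and
        # score every candidate (score_fn stands in for cross validation).
        candidates = [ranked[:i] + ranked[i + 1:] for i in range(len(ranked))]
        iter_score, iter_subset = max(
            ((score_fn(s), s) for s in candidates), key=lambda t: t[0])
        # Steps 7/8: record the best subset seen across all iterations.
        if iter_score > best_score:
            best_score, best_subset = iter_score, iter_subset
        current = iter_subset  # shrink by one feature and iterate until empty
    return best_subset, best_score


# Toy stand-ins (assumptions for illustration only): features 0 and 2 are
# informative, and each irrelevant feature costs a small penalty, mimicking
# a cross-validated accuracy score.
INFORMATIVE = {0, 2}

def toy_importance(f, subset):
    return 1.0 if f in INFORMATIVE else 0.0

def toy_score(subset):
    return len(set(subset) & INFORMATIVE) - 0.1 * len(set(subset) - INFORMATIVE)

best, score = select_optimal_subset([0, 1, 2, 3], toy_importance, toy_score)
print(best, score)  # prints: [0, 2] 2.0 -- the informative features survive
```

Because the best subset over *all* iterations is recorded (step 8), the procedure can return a larger subset than the final iteration produces, which is what distinguishes it from simply eliminating features until a fixed count is reached.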
CN201810732008.5A 2018-07-05 2018-07-05 Method for determining optimal feature subset Active CN109117956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810732008.5A CN109117956B (en) 2018-07-05 2018-07-05 Method for determining optimal feature subset

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810732008.5A CN109117956B (en) 2018-07-05 2018-07-05 Method for determining optimal feature subset

Publications (2)

Publication Number Publication Date
CN109117956A CN109117956A (en) 2019-01-01
CN109117956B true CN109117956B (en) 2021-08-24

Family

ID=64823008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810732008.5A Active CN109117956B (en) 2018-07-05 2018-07-05 Method for determining optimal feature subset

Country Status (1)

Country Link
CN (1) CN109117956B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11151706B2 (en) * 2019-01-16 2021-10-19 Applied Material Israel, Ltd. Method of classifying defects in a semiconductor specimen and system thereof
CN110852475B (en) * 2019-09-24 2020-10-23 广州地理研究所 Extreme gradient lifting algorithm-based vegetation index prediction method, system and equipment
CN110880014B (en) * 2019-10-11 2023-09-05 中国平安财产保险股份有限公司 Data processing method, device, computer equipment and storage medium
CN111028383B (en) * 2019-11-08 2023-03-24 腾讯科技(深圳)有限公司 Vehicle driving data processing method and device
CN111476170A (en) * 2020-04-09 2020-07-31 首都师范大学 Remote sensing image semantic segmentation method combining deep learning and random forest
CN112245728B (en) * 2020-06-03 2022-11-29 北京化工大学 Respirator false positive alarm signal identification method and system based on integrated tree
CN113139578B (en) * 2021-03-23 2022-12-06 广东省科学院智能制造研究所 Deep learning image classification method and system based on optimal training set
CN113413163B (en) * 2021-08-24 2021-11-19 山东大学 Heart sound diagnosis system for mixed deep learning and low-difference forest
CN115399791B (en) * 2022-06-28 2024-06-14 天津大学 Method and system for evaluating functions of lower limbs of stroke based on myoelectric motion multi-data fusion
CN115759446A (en) * 2022-11-25 2023-03-07 南方电网数字电网研究院有限公司 Machine learning feature selection method for new energy high-precision prediction
CN116453000A (en) * 2023-04-21 2023-07-18 成都理工大学 Farmland weed identification method based on visible light image and improved random forest algorithm
CN117079059B (en) * 2023-10-13 2023-12-19 云南师范大学 Tree species automatic classification method based on multi-source satellite image

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification
CN105279520A (en) * 2015-09-25 2016-01-27 天津师范大学 Optimal character subclass selecting method based on classification ability structure vector complementation
CN105469098A (en) * 2015-11-20 2016-04-06 中北大学 Precise LiDAR data ground object classification method based on adaptive characteristic weight synthesis
CN105574363A (en) * 2015-12-14 2016-05-11 大连理工大学 Feature selection method based on SVM-RFE (Support Vector Machine-Recursive Feature Elimination) and overlapping degree
CN106897821A (en) * 2017-01-24 2017-06-27 中国电力科学研究院 A kind of transient state assesses feature selection approach and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045503B (en) * 2016-02-05 2019-03-05 华为技术有限公司 A kind of method and device that feature set determines

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279520A (en) * 2015-09-25 2016-01-27 天津师范大学 Optimal character subclass selecting method based on classification ability structure vector complementation
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification
CN105469098A (en) * 2015-11-20 2016-04-06 中北大学 Precise LiDAR data ground object classification method based on adaptive characteristic weight synthesis
CN105574363A (en) * 2015-12-14 2016-05-11 大连理工大学 Feature selection method based on SVM-RFE (Support Vector Machine-Recursive Feature Elimination) and overlapping degree
CN106897821A (en) * 2017-01-24 2017-06-27 中国电力科学研究院 A kind of transient state assesses feature selection approach and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Enhanced recursive feature elimination; Xue-wen Chen et al.; Sixth International Conference on Machine Learning and Applications (ICMLA 2007); 2008-02-25; pp. 429-435 *
The optimal feature subset selection problem; Chen Bin et al.; Chinese Journal of Computers (《计算机学报》); 1997-02-28; Vol. 20, No. 2; pp. 133-138 *
Research on feature selection algorithms in machine learning; Jiang Baining; China Masters' Theses Full-text Database (Information Science and Technology); 2009-11-15; p. I140-20 *

Also Published As

Publication number Publication date
CN109117956A (en) 2019-01-01

Similar Documents

Publication Publication Date Title
CN109117956B (en) Method for determining optimal feature subset
CN109857889B (en) Image retrieval method, device and equipment and readable storage medium
Nandi et al. A machine vision-based maturity prediction system for sorting of harvested mangoes
CN112541921B (en) Urban green land vegetation information data accurate determination method
CN108280396B (en) Hyperspectral image classification method based on depth multi-feature active migration network
CN109146889A (en) A kind of field boundary extracting method based on high-resolution remote sensing image
CN104182767B (en) The hyperspectral image classification method that Active Learning and neighborhood information are combined
CN113936214B (en) Karst wetland vegetation community classification method based on fusion of aerospace remote sensing images
US6990410B2 (en) Cloud cover assessment: VNIR-SWIR
CN109815357A (en) A kind of remote sensing image retrieval method based on Nonlinear Dimension Reduction and rarefaction representation
CN112861810B (en) Artificial forest planting time automatic detection method based on time sequence remote sensing observation data
CN114266961A (en) Method for integrating, learning and classifying marsh vegetation stacks by integrating hyperspectral and multiband fully-polarized SAR images
CN116310510A (en) Hyperspectral image classification method based on small sample deep learning
CN113723254A (en) Method, device, equipment and storage medium for identifying moso bamboo forest distribution
Jónsson RGB and Multispectral UAV image classification of agricultural fields using a machine learning algorithm
CN114022782B (en) Sea fog detection method based on MODIS satellite data
CN116912578A (en) Crop classification method, system and electronic equipment
Bortolotti et al. A computer vision system for in-field quality evaluation: Preliminary results on peach fruit
CN111882573B (en) Cultivated land block extraction method and system based on high-resolution image data
CN111751295A (en) Modeling method and application of wheat powdery mildew severity detection model based on imaging hyperspectral data
CN113111794B (en) High-resolution annual city green space remote sensing information extraction method for pattern spots
McCann et al. Novel histogram based unsupervised classification technique to determine natural classes from biophysically relevant fit parameters to hyperspectral data
CN112991425B (en) Water area water level extraction method and system and storage medium
CN112651295A (en) Urban green land tree identification system and method
CN112949607A (en) Wetland vegetation feature optimization and fusion method based on JM Relief F

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant