CN116662859A

CN116662859A - Intangible cultural heritage data feature selection method

Info

Publication number: CN116662859A
Application number: CN202310636101.7A
Authority: CN
Inventors: 赵雪青; 杨晗; 师昕; 刘浩; 吴祯鴻
Original assignee: Xian Polytechnic University
Current assignee: Xian Polytechnic University
Priority date: 2023-05-31
Filing date: 2023-05-31
Publication date: 2023-08-29
Anticipated expiration: 2043-05-31
Also published as: CN116662859B

Abstract

The invention discloses a feature selection method for intangible cultural heritage data, comprising the following steps: acquiring an intangible cultural heritage data set and constructing a feature selection model for the data set based on the firefly algorithm; calculating the fitness of the individuals in the firefly population using the neighborhood granularity rough entropy and the attribute set importance; moving firefly individuals with lower fitness toward firefly individuals with higher fitness, updating the firefly positions, and recalculating the fitness of the firefly individuals; and outputting the optimal feature subset of the intangible cultural heritage data set corresponding to the globally optimal firefly individual. Compared with the original data, the data after feature selection has lower dimensionality; when the processed data is classified, it has lower redundancy while retaining good information completeness, which improves the classification of intangible cultural heritage levels and achieves the goals of reducing data redundancy and optimizing resources.

Description

Intangible cultural heritage data feature selection method

Technical Field

The invention belongs to the technical field of data mining methods, and in particular relates to a feature selection method for intangible cultural heritage data.

Background

In recent years, intangible cultural heritage has been increasingly valued by the state and society. With the rapid development of information technology in particular, the digitization of intangible cultural heritage has advanced steadily, and information resources on intangible cultural heritage of all kinds continue to emerge. Classifying and analyzing intangible cultural heritage levels can give the relevant departments a more reasonable basis for grading intangible cultural heritage in the future, so that it is protected more effectively. However, existing intangible cultural heritage data is high-dimensional, which greatly increases the cost of classifying and analyzing it. In addition, the data carries a certain degree of uncertainty, and unimportant features not only increase data redundancy but also prevent predictions about intangible cultural heritage from achieving the desired results. Therefore, to analyze the data more effectively and reduce the cost of data processing, it is necessary to perform feature selection on intangible cultural heritage data so as to reduce the data dimension and eliminate unimportant features.

The firefly algorithm, a swarm intelligence algorithm inspired by the flashing behavior of fireflies, was proposed by Xin-She Yang in 2008. Compared with other intelligent algorithms, the firefly algorithm performs well, but the fitness function of the standard firefly algorithm generally cannot ensure that the selected feature subset incurs little information loss, and the algorithm suffers from low search precision and slow convergence during optimization. The neighborhood rough set, by introducing the concepts of neighborhood granulation and metric space, converts the equivalence relation of rough set theory into a covering relation of information granules in the neighborhood space, and can effectively measure the uncertainty of data. It is therefore necessary to combine the neighborhood rough set with the firefly algorithm, and to improve the fitness function construction, the search and update strategy, and other aspects of the firefly algorithm, so that it can process high-dimensional, complex intangible cultural heritage data.

Disclosure of Invention

The object of the invention is to provide a feature selection method for intangible cultural heritage data that can delete irrelevant or low-importance attributes without affecting the final classification result of the data.

The technical solution adopted by the invention is as follows: the intangible cultural heritage data feature selection method comprises the following steps:

Step 1, acquiring an intangible cultural heritage data set, and constructing a feature selection model for the data set based on the firefly algorithm;

Step 2, calculating the fitness Fit_NGRE of the individuals in the firefly population using the neighborhood granularity rough entropy and the attribute set importance;

Step 3, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, updating the firefly positions, and recalculating the fitness of the firefly individuals;

Step 4, judging whether the current iteration has reached the maximum number of iterations T_max; if not, returning to step 3; otherwise, outputting the optimal feature subset of the intangible cultural heritage data set corresponding to the globally optimal firefly individual.

The invention is further characterized as follows.

Step 1 is specifically as follows: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired intangible cultural heritage data set, wherein the number N of fireflies (i.e., feature subsets of the data set) is 50 and the maximum number of iterations T_max is 30; randomly initializing a firefly population FAG = {S_1, S_2, ..., S_N} of size N, the initial position of each firefly being S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, where d is the number of features; setting the initial attractiveness β_0, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum number of iterations T_max; and, before calculating the fitness of each firefly individual (i.e., each feature subset), encoding each individual with a sigmoid function to convert its values into 0/1 form, the sigmoid function being defined as follows:
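Assuming formula (1) is the standard logistic function sigmoid(x) = 1/(1 + e^(−x)), the 0/1 encoding of a firefly position can be sketched in Python as below. The fixed threshold of 0.5 and the names `binarize`, `position`, and `mask` are illustrative assumptions; the text only states that the values are converted into 0/1 form.

```python
import math
import random

def sigmoid(x):
    """Standard logistic function (assumed to be what formula (1) denotes)."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, threshold=0.5):
    """Map a real-valued firefly position to a 0/1 feature mask.

    The fixed threshold is an assumption made for this sketch.
    """
    return [1 if sigmoid(v) >= threshold else 0 for v in position]

rng = random.Random(0)
d = 6  # number of features (illustrative value)
position = [rng.uniform(-1.0, 1.0) for _ in range(d)]
mask = binarize(position)  # 1 = feature selected, 0 = feature dropped
```

A mask entry of 1 means the corresponding feature belongs to the candidate feature subset represented by that firefly.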

In step 2, the neighborhood granularity rough entropy is calculated as follows:

NGRE(S) = NGK(D|S) × NE_r(D|S)    (2)

In formula (2), NGK(D|S) and NE_r(D|S) are, respectively, the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, and are calculated as follows:

In formulas (3) and (4), δ_S(x_i) is the neighborhood class of sample x_i under the feature subset S, δ_{S∪D}(x_i) is the neighborhood class of sample x_i under the feature subset S together with the decision attribute D, and U is the sample space;

The fitness of the individuals in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance as follows:

In formula (5), λ_1 and λ_2 adjust the degree of influence of the neighborhood granularity rough entropy and the attribute set importance, with λ_1 + λ_2 = 1; for any firefly, i.e., any feature subset S ∈ FAG, |S| is the number of features in S and N is the total number of features; NGRE(S) is the neighborhood granularity rough entropy.
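The neighborhood class δ_S(x_i) used in step 2 can be illustrated with a short Python sketch. It assumes the usual neighborhood rough set construction, in which δ_S(x_i) collects the samples of U whose distance to x_i, measured only on the features in S, does not exceed a radius. The Euclidean metric, the `radius` parameter, and the function name are assumptions of this sketch, not details given in the text.

```python
import math

def neighborhood_class(U, i, features, radius):
    """Indices of the samples within `radius` of sample i on the chosen features."""
    xi = U[i]
    members = []
    for j, xj in enumerate(U):
        # Distance is computed only over the selected feature columns.
        dist = math.sqrt(sum((xi[f] - xj[f]) ** 2 for f in features))
        if dist <= radius:
            members.append(j)
    return members

# Three illustrative samples with two numeric features each.
U = [
    [0.10, 0.90],
    [0.15, 0.20],
    [0.80, 0.85],
]
only_first = neighborhood_class(U, 0, features=[0], radius=0.1)
both = neighborhood_class(U, 0, features=[0, 1], radius=0.1)
```

Adding a feature can only shrink or preserve a neighborhood class, which is why the granularity of the classes reflects how well a feature subset separates the samples.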

Step 3 comprises comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, calculating the mutual attractiveness between each firefly individual and the other firefly individuals according to their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals. Specifically:

Step 3.1, comparing the fitness of each firefly individual in turn with that of the other firefly individuals; determining, on the principle that individuals with lower fitness are attracted by individuals with higher fitness, which individuals in the population attract each firefly individual; and calculating the mutual attractiveness between each firefly individual and the other fireflies according to their spatial distance, the attractiveness being calculated as follows:

In formula (6), β_0 is the attractiveness at r = 0, γ is the light absorption coefficient, and r_ij is the distance between firefly individuals x_i and x_j;

Step 3.2, for any two fireflies S_i, S_j ∈ FAG, if the fitness of S_j is higher than that of S_i, firefly S_i moves toward the position of S_j, the position of the firefly individual being updated as follows:

S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) − S_id(t)) + α(rand − 1/2)    (7)

In formula (7), d is the spatial dimension of the firefly individual, i.e., the feature dimension, α ∈ [0, 1] is the step factor, β(r_ij) is the attractiveness between fireflies x_i and x_j, (rand − 1/2) is a random number in the interval [−0.5, 0.5], and t is the iteration number;

Step 3.3, updating the fitness of firefly individual S_i using formula (5), ranking all fireflies, and finding the firefly individual with the best fitness in the current iteration.

Step 4 further comprises dividing the output optimal feature subset R of the intangible cultural heritage data set into a training set T and a test set V at a ratio of 7:3, and classifying the divided data using a CART decision tree model; during classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets. The Gini index of each feature A in the training set T is calculated as follows:

In formula (8), |T| is the number of intangible cultural heritage data records in the training set T, |C_k| is the number of records of the k-th category in the training set T, and K is the number of intangible cultural heritage levels; supposing that the value of feature A divides the training set T into two parts T_1 and T_2, |T_1| and |T_2| are the numbers of records contained in each part;

For each subset obtained by the division, if all the intangible cultural heritage data in the subset belong to the same category, the subset is marked as that category; otherwise, the procedure returns to the step of calculating the Gini index of the features and applies the above steps recursively to each subset; this process is repeated until the stop condition is satisfied.
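The Gini computation described above can be sketched in Python, assuming formula (8) is the standard CART definition: Gini(T) = 1 − Σ_k (|C_k|/|T|)², and, for a split of T into T_1 and T_2 by feature A, Gini(T, A) = (|T_1|/|T|)·Gini(T_1) + (|T_2|/|T|)·Gini(T_2). Splitting a numeric feature at a threshold, and the function names, are assumptions of this sketch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (standard CART definition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    """Weighted Gini impurity after splitting on `values <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Illustrative two-level labels and one numeric feature.
labels = ["national", "national", "provincial", "provincial"]
feature_a = [1, 2, 3, 4]
impurity_before = gini(labels)
impurity_after = gini_split(feature_a, labels, threshold=2)
```

The feature and split point with the lowest weighted Gini impurity would be chosen for the node; a subset whose impurity is zero contains a single category and becomes a leaf.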

The beneficial effects of the invention are as follows: compared with the original data, intangible cultural heritage data processed by the feature selection method of the invention has lower dimensionality; when the processed data is classified, it has lower redundancy while retaining good information completeness, which improves the classification of intangible cultural heritage levels and achieves the goals of reducing data redundancy and optimizing resources.

Drawings

FIG. 1 is a flow chart of the intangible cultural heritage data feature selection method of the invention;

FIG. 2 compares, for Example 3, the method of the invention with three comparison methods under the AUC, ACC, and F1 evaluation criteria;

FIG. 3 compares, for Example 3, the method of the invention with three comparison methods under the feature subset size evaluation criterion.

Detailed Description

The invention will be described in detail with reference to the accompanying drawings and detailed description.

Example 1

As shown in fig. 1, the method comprises the following steps:

Step 1, acquiring an intangible cultural heritage data set, constructing a feature selection model for the data set based on the firefly algorithm, and initializing parameters such as the firefly population size (i.e., the number of feature subsets), the light absorption coefficient, and the maximum number of iterations according to the data set. This step is carried out as follows:

The parameters of the feature selection model are initialized on the basis of the firefly algorithm according to the acquired intangible cultural heritage data set. The number N of fireflies (i.e., feature subsets) is 50, and the maximum number of iterations T_max is 30. A firefly population FAG = {S_1, S_2, ..., S_N} of size N is randomly initialized, the initial position of each firefly being S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, where d is the number of features. The initial attractiveness β_0, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum number of iterations T_max are set. Before the fitness of each firefly individual (i.e., each feature subset) is calculated, each individual is encoded with a sigmoid function, which converts its values into 0/1 form. The sigmoid function is defined as follows:

Step 2, calculating the fitness of the individuals in the firefly population using the neighborhood granularity rough entropy and the attribute set importance. This step is carried out as follows:

The neighborhood granularity rough entropy is calculated as follows:

NGRE(S) = NGK(D|S) × NE_r(D|S)    (2)

where NGK(D|S) and NE_r(D|S) are, respectively, the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, and are calculated as follows:

where δ_S(x_i) is the neighborhood class of sample x_i under the attribute subset S, δ_{S∪D}(x_i) is the neighborhood class of sample x_i under the attribute subset S together with the decision attribute D, and U is the sample space.

In the fitness formula (5), λ_1 and λ_2 adjust the degree of influence of the neighborhood granularity rough entropy and the attribute set importance, with λ_1 + λ_2 = 1, and Fit_NGRE is the fitness of the individuals in the firefly population. For any firefly (i.e., feature subset) S ∈ FAG, |S| is the number of features in S, and N is the total number of features. NGRE(S) is the neighborhood granularity rough entropy.

Step 3, comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, calculating the mutual attractiveness between each firefly individual and the other firefly individuals according to their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals. This step is carried out as follows:

The fitness of each firefly individual is compared in turn with that of the other firefly individuals; on the principle that individuals with lower fitness are attracted by individuals with higher fitness, it is determined which individuals in the population attract each firefly individual; and the mutual attractiveness between each firefly individual and the other firefly individuals is calculated according to their spatial distance, as follows:

where β_0 is the attractiveness at r = 0, γ is the light absorption coefficient, and r_ij is the distance between firefly individuals x_i and x_j.
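Assuming formula (6) takes the standard firefly-algorithm form β(r) = β_0·e^(−γr²), which is consistent with the symbol definitions above, the attractiveness can be sketched in Python as follows. The Euclidean distance and the default parameter values are assumptions of this sketch.

```python
import math

def distance(si, sj):
    """Euclidean distance r_ij between two firefly positions (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(si, sj)))

def attractiveness(r, beta0=1.0, gamma=1.0):
    """Attractiveness at distance r: equals beta0 at r = 0 and decays with r."""
    return beta0 * math.exp(-gamma * r ** 2)

r_ij = distance([0.0, 0.0], [3.0, 4.0])
beta_ij = attractiveness(r_ij, beta0=1.0, gamma=0.1)
```

A larger γ makes attractiveness fall off faster with distance, so each firefly is influenced mainly by its close neighbors.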

For any two fireflies S_i, S_j ∈ FAG, if the fitness of S_j is higher than that of S_i, firefly S_i moves toward the position of S_j, and the position of the firefly individual is updated as follows:

S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) − S_id(t)) + α(rand − 1/2)    (7)

where d is the spatial dimension (i.e., the feature dimension) of the firefly individual, α ∈ [0, 1] is the step factor, β(r_ij) is the attractiveness between fireflies x_i and x_j, (rand − 1/2) is a random number in the interval [−0.5, 0.5], and t is the iteration number.
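The position update of formula (7) can be sketched in Python as below. The attractiveness term inside the function uses the standard form β_0·e^(−γr²), which is an assumption about formula (6); the function name and default parameter values are likewise illustrative.

```python
import math
import random

def move_towards(si, sj, beta0=1.0, gamma=1.0, alpha=0.2, rng=random):
    """Move firefly si toward the brighter firefly sj, per formula (7)."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(si, sj)))
    beta = beta0 * math.exp(-gamma * r ** 2)  # assumed form of formula (6)
    # Each dimension: attraction toward sj plus a random step in [-alpha/2, alpha/2).
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(si, sj)]

rng = random.Random(42)
s_i = [0.0, 0.0]
s_j = [0.5, -0.5]
s_i_next = move_towards(s_i, s_j, rng=rng)
```

With α = 0 the move is purely deterministic attraction; the random term keeps the swarm from collapsing onto a single point too early.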

The fitness of firefly individual S_i is updated using formula (5), all fireflies are ranked, and the firefly individual with the best fitness in the current iteration is found.

Step 4, judging whether the current iteration has reached the maximum number of iterations T_max (T_max is 30 in the invention); if not, returning to step 3; otherwise, outputting the feature subset corresponding to the globally optimal firefly individual, thereby obtaining the optimal feature subset of the intangible cultural heritage data set based on the firefly algorithm.
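Steps 1 to 4 can be put together in a compact Python sketch. It is only an outline under stated assumptions: the fitness is passed in as a callable because formula (5) is specific to the invention, `toy_fitness` is a hypothetical stand-in used purely for demonstration, and the attractiveness uses the standard form β_0·e^(−γr²).

```python
import math
import random

def to_mask(position):
    """Sigmoid-encode a position into a 0/1 mask (threshold 0.5 is assumed)."""
    return tuple(1 if 1.0 / (1.0 + math.exp(-v)) >= 0.5 else 0 for v in position)

def firefly_select(fitness, d, n=50, t_max=30, beta0=1.0, gamma=1.0,
                   alpha=0.2, seed=0):
    """Return (best_mask, best_fitness) found by a basic firefly search."""
    rng = random.Random(seed)
    swarm = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    fit = [fitness(to_mask(s)) for s in swarm]
    best = max(range(n), key=fit.__getitem__)
    best_mask, best_fit = to_mask(swarm[best]), fit[best]
    for _ in range(t_max):                      # step 4: iterate up to T_max
        for i in range(n):
            for j in range(n):
                if fit[j] > fit[i]:             # step 3: move toward fitter firefly
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    swarm[i] = [a + beta * (b - a) + alpha * (rng.random() - 0.5)
                                for a, b in zip(swarm[i], swarm[j])]
                    fit[i] = fitness(to_mask(swarm[i]))
                    if fit[i] > best_fit:       # track the global best so far
                        best_mask, best_fit = to_mask(swarm[i]), fit[i]
    return best_mask, best_fit

def toy_fitness(mask):
    """Hypothetical stand-in for formula (5): share of bits matching a target."""
    target = (1, 0, 1, 0, 0)
    return sum(m == t for m, t in zip(mask, target)) / len(target)

best_mask, best_fit = firefly_select(toy_fitness, d=5, n=10, t_max=5)
```

In practice the callable would compute Fit_NGRE for the feature subset encoded by the mask; the defaults n=50 and t_max=30 mirror the values stated in the text.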

Example 2

To verify the effectiveness of the intangible cultural heritage data feature selection method, the CART decision tree algorithm is used to classify the processed data set, and the classification results are evaluated. This is carried out as follows:

It is judged whether the optimization result satisfies the termination condition (the maximum number of iterations has been reached); if not, the procedure returns to step 3 for the next round of optimization; if so, the optimal feature subset obtained in step 3 is used for feature selection on the intangible cultural heritage data set. The data set R after feature selection is divided into a training set T and a test set V at a ratio of 7:3, and the divided data set is classified and analyzed using a CART decision tree model. During classification, the initial root node is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets. The Gini index of each feature A in the training set T is calculated as follows:

where |T| is the number of intangible cultural heritage data records in the training set T, |C_k| is the number of records of the k-th category (i.e., national level or provincial level) in the training set T, and K is the number of intangible cultural heritage levels, which is 2 in the invention. Supposing that the value of feature A divides the training set T into two parts T_1 and T_2, |T_1| and |T_2| are the numbers of records contained in each part.

For each subset obtained by the division, if all the intangible cultural heritage data in the subset belong to the same category (e.g., national level), the subset is marked as that category; otherwise, the procedure returns to the step of calculating the Gini index of the features and applies the above steps recursively to each subset. This process is repeated until the stop condition is satisfied. The constructed CART decision tree model can then classify the test set V, assigning the intangible cultural heritage data in the test set to the predefined categories.
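The 7:3 division into a training set T and a test set V can be sketched in Python as follows. Shuffling before the split, the fixed seed, and the function name are assumptions of this sketch; the text states only the ratio.

```python
import random

def split_7_3(rows, seed=0):
    """Shuffle the rows and split them into 70% training and 30% test."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = len(rows) * 7 // 10           # integer arithmetic avoids float rounding
    train = [rows[k] for k in idx[:cut]]
    test = [rows[k] for k in idx[cut:]]
    return train, test

rows = list(range(10))                  # stand-in for 10 data records
T, V = split_7_3(rows)
```

Every record lands in exactly one of the two sets, so the test set V is never seen while the decision tree is being built.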

The classification results are evaluated using AUC, accuracy (hereinafter ACC), F1-score (hereinafter F1), and feature subset size. The AUC value is the area enclosed by the ROC curve and the coordinate axes, and gives a clear picture of the classifier's performance. The closer the AUC value is to 1, the better the classification performance; an AUC value of 0.5 or less indicates poor classification ability. ACC is the accuracy of sample classification, i.e., the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of precision and recall and takes values in [0, 1], where 1 represents the best model output and 0 the worst.
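The three classification metrics can be computed with a few lines of plain Python, sketched below for the two-class case. AUC is computed here as the fraction of positive/negative pairs ranked correctly, which equals the area under the ROC curve; the function names and the sample labels are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores, positive=1):
    """Probability that a positive sample scores above a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
acc, f1, area = accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores)
```

ACC and F1 need hard class predictions, whereas AUC needs a score or probability per sample, so a classifier must expose both to report all three.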

In this way, the intangible cultural heritage data feature selection method of the invention performs feature selection on the collected intangible cultural heritage data A and generates a set of feature subsets, each represented as a vector [x_1, x_2, ..., x_n], where n is the maximum feature dimension of the data set and x_i = 0 or 1 indicates whether the corresponding feature is selected, so that key features in the data are retained and redundant features are rejected. Because the invention generates a set of feature subsets, a decision maker can choose among them according to decision requirements, and new data B is then generated from the chosen feature subset and the data A. Data B has a lower dimension than data A. When the data is classified, data B has lower dimensionality while retaining good classification performance, so that computing resources are optimized.
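Generating data B from data A and a chosen 0/1 vector amounts to keeping only the columns whose entry is 1. A minimal Python sketch, with illustrative names and values:

```python
def apply_mask(rows, mask):
    """Keep only the feature columns whose mask entry is 1."""
    keep = [k for k, bit in enumerate(mask) if bit == 1]
    return [[row[k] for k in keep] for row in rows]

data_a = [
    [1, 2, 3],
    [4, 5, 6],
]
mask = [1, 0, 1]            # select the first and third features
data_b = apply_mask(data_a, mask)
```

The number of rows is unchanged; only the feature dimension drops from n to the number of ones in the vector.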

Example 3

As shown in fig. 1, the method is specifically implemented according to the following steps:

Step 1, acquiring an intangible cultural heritage data set, constructing a feature selection model for the data set based on the firefly algorithm, and initializing parameters such as the firefly population size (i.e., the number of feature subsets), the light absorption coefficient, and the maximum number of iterations according to the data set. Specifically: the parameters of the feature selection model are initialized on the basis of the firefly algorithm according to the acquired intangible cultural heritage data set. The number N of fireflies (i.e., feature subsets) is 50, and the maximum number of iterations T_max is 30. A firefly population FAG = {S_1, S_2, ..., S_N} of size N is randomly initialized, the initial position of each firefly being S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, where d is the number of features. The initial attractiveness β_0, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum number of iterations T_max are set. Before the fitness of each firefly individual (i.e., each feature subset) is calculated, each individual is encoded with a sigmoid function, which converts its values into 0/1 form. The sigmoid function is defined as follows:

Step 2, calculating the fitness of the individuals in the firefly population using the neighborhood granularity rough entropy and the attribute set importance. Specifically, the neighborhood granularity rough entropy is calculated as follows:

NGRE(S) = NGK(D|S) × NE_r(D|S)    (2)

where NGK(D|S) and NE_r(D|S) are, respectively, the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, and are calculated as follows:

where δ_S(x_i) is the neighborhood class of sample x_i under the feature subset S, δ_{S∪D}(x_i) is the neighborhood class of sample x_i under the feature subset S together with the decision attribute D, and U is the sample space.

Step 3, comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, calculating the mutual attractiveness between each firefly individual and the other firefly individuals according to their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals. Specifically: the fitness of each firefly individual is compared in turn with that of the other firefly individuals; on the principle that individuals with lower fitness are attracted by individuals with higher fitness, it is determined which individuals in the population attract each firefly individual; and the mutual attractiveness between each firefly individual and the other firefly individuals is calculated according to their spatial distance, the attractiveness being calculated as follows:

S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) − S_id(t)) + α(rand − 1/2)    (7)

where d is the spatial dimension (i.e., the feature dimension) of the firefly individual, α ∈ [0, 1] is the step factor, β(r_ij) is the attractiveness between fireflies x_i and x_j, (rand − 1/2) is a random number in the interval [−0.5, 0.5], and t is the iteration number.

Step 4, judging whether the current iteration has reached the maximum number of iterations T_max (T_max is 30 in the invention); if not, returning to step 3; otherwise, outputting the feature subset corresponding to the globally optimal firefly individual, thereby obtaining the optimal feature subset of the intangible cultural heritage data set based on the firefly algorithm. To verify the effectiveness of the intangible cultural heritage data feature selection method, the CART decision tree model is used to classify the processed data set, and the classification results are evaluated. This is carried out as follows:

where |T| is the number of intangible cultural heritage data records in the training set T, |C_k| is the number of records of the k-th category (i.e., national level or provincial level) in the training set T, and K is the number of intangible cultural heritage levels, which is 2 in the invention. Supposing that the value of feature A divides the training set T into two parts T_1 and T_2, |T_1| and |T_2| are the numbers of records contained in each part.

In this embodiment, the results are evaluated using AUC, ACC, F1, and feature subset size. The AUC value is the area enclosed by the ROC curve and the coordinate axes, and gives a clear picture of the classifier's performance. The closer the AUC value is to 1, the better the classification performance; an AUC value of 0.5 or less indicates poor classification ability. ACC is the accuracy of sample classification, i.e., the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of precision and recall and takes values in [0, 1], where 1 represents the best model output and 0 the worst. The feature subset size is the number of features remaining after feature selection; the smaller it is, the better.

In this example, the invention was compared with three existing feature selection methods on the four evaluation indicators. The comparison methods are SSA (sparrow search algorithm), HHO (Harris hawks optimization), and RFE (recursive feature elimination); the results are shown in FIGS. 2 and 3. As FIGS. 2 and 3 show, the invention performs best, with significant improvement on all four evaluation indicators. The method can effectively obtain a feature subset of high importance and achieves better classification results.

Claims

1. An intangible cultural heritage data feature selection method, characterized by comprising the following steps:

2. The intangible cultural heritage data feature selection method according to claim 1, characterized in that step 1 is specifically as follows: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired intangible cultural heritage data set, wherein the number N of fireflies (i.e., feature subsets of the data set) is 50 and the maximum number of iterations T_max is 30; randomly initializing a firefly population FAG = {S_1, S_2, ..., S_N} of size N, the initial position of each firefly being S_i = {S_i1, S_i2, ..., S_id}, 1 ≤ i ≤ N, where d is the number of features; setting the initial attractiveness β_0, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α, and the maximum number of iterations T_max; and, before calculating the fitness of each firefly individual (i.e., each feature subset), encoding each individual with a sigmoid function to convert its values into 0/1 form, the sigmoid function being defined as follows:

3. The intangible cultural heritage data feature selection method according to claim 2, characterized in that the neighborhood granularity rough entropy in step 2 is calculated as follows:

NGRE(S) = NGK(D|S) × NE_r(D|S)    (2)

4. The intangible cultural heritage data feature selection method according to claim 3, characterized in that step 3 comprises comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, calculating the mutual attractiveness between each firefly individual and the other firefly individuals according to their spatial distance, updating the firefly positions, and recalculating the fitness of the firefly individuals; specifically:

S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) − S_id(t)) + α(rand − 1/2)    (7)

5. The intangible cultural heritage data feature selection method according to claim 4, characterized in that step 4 further comprises dividing the output optimal feature subset R of the intangible cultural heritage data set into a training set T and a test set V at a ratio of 7:3, and classifying the divided data using a CART decision tree model; during classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets; the Gini index of each feature A in the training set T is calculated as follows: