CN116662859A - Intangible cultural heritage data feature selection method - Google Patents

Publication number
CN116662859A
Authority
CN
China
Prior art keywords
firefly
feature
genetic
fitness
individual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310636101.7A
Other languages
Chinese (zh)
Other versions
CN116662859B (en)
Inventor
赵雪青
杨晗
师昕
刘浩
吴祯鴻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Polytechnic University
Original Assignee
Xian Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Xian Polytechnic University
Priority to CN202310636101.7A
Publication of CN116662859A
Application granted
Publication of CN116662859B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an intangible cultural heritage data feature selection method comprising the following steps: acquiring an intangible cultural heritage data set and constructing a feature selection model for the data set based on the firefly algorithm; calculating the fitness of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance; moving firefly individuals with lower fitness toward firefly individuals with higher fitness, updating the firefly positions, and recalculating the fitness of each individual; and outputting the optimal feature subset of the data set, which corresponds to the globally optimal firefly individual. Compared with the original data, the data after feature selection has lower dimensionality and lower redundancy while retaining good information completeness when it is classified, so that the classification of intangible cultural heritage levels is improved, data redundancy is reduced, and resource use is optimized.

Description

Intangible cultural heritage data feature selection method
Technical Field
The invention belongs to the technical field of data mining methods, and particularly relates to an intangible cultural heritage data feature selection method.
Background
In recent years, the state and society have attached increasing importance to intangible cultural heritage. With the rapid development of information technology, the digitization of intangible cultural heritage has accelerated, and information resources on it keep emerging. Classifying and analyzing intangible cultural heritage levels can give the relevant departments a sounder basis for deciding how to grade intangible cultural heritage in the future, so that it is protected more effectively. However, existing intangible cultural heritage data is high-dimensional, which greatly increases the cost of classifying and analyzing it. In addition, the data carries a certain degree of uncertainty, and unimportant features not only add redundancy but also prevent predictions about intangible cultural heritage from reaching the desired quality. Therefore, to analyze the data more effectively and reduce the cost of data processing, feature selection is needed to reduce the data dimension and eliminate unimportant features.
The firefly algorithm is a swarm intelligence algorithm inspired by the flashing behavior of fireflies, proposed by Xin-She Yang in 2008. It performs well compared with other intelligent algorithms, but the fitness function of the standard firefly algorithm generally cannot guarantee that the selected feature subset loses little information, and the algorithm suffers from low search precision and slow convergence during optimization. The neighborhood rough set introduces the concepts of neighborhood granulation and metric space, converting the equivalence relation of rough set theory into a covering relation of information granules in the neighborhood space, and can effectively measure the uncertainty of data. It is therefore necessary to combine the neighborhood rough set with the firefly algorithm, and to improve the fitness function construction and the search update strategy of the firefly algorithm, in order to process high-dimensional, complex intangible cultural heritage data.
Disclosure of Invention
The invention aims to provide an intangible cultural heritage data feature selection method that can delete irrelevant or low-importance attributes without affecting the final classification result of the data.
The technical scheme adopted by the invention is an intangible cultural heritage data feature selection method comprising the following steps:
Step 1, acquiring an intangible cultural heritage data set, and constructing a feature selection model for the data set based on the firefly algorithm;
Step 2, calculating the fitness $Fit_{NGRE}$ of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance;
Step 3, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, updating the firefly positions, and recalculating the fitness of each firefly individual;
Step 4, judging whether the current iteration has reached the maximum number of iterations $T_{max}$; if not, returning to step 3; otherwise, outputting the optimal feature subset of the intangible cultural heritage data set, which corresponds to the globally optimal firefly individual.
The invention is further characterized as follows.
Step 1 specifically comprises: initializing the parameters of the firefly-algorithm-based feature selection model according to the acquired intangible cultural heritage data set, where the number $N$ of fireflies (that is, feature subsets of the data set) is 50 and the maximum number of iterations $T_{max}$ is 30; randomly initializing a firefly population $FAG = \{S_1, S_2, \ldots, S_N\}$ of size $N$, where the initial position of each firefly is $S_i = \{S_{i1}, S_{i2}, \ldots, S_{id}\}$, $1 \le i \le N$, and $d$ denotes the number of features; setting the initial attraction $\beta_0$, the light absorption coefficient $\gamma$ of the propagation medium, the step-size disturbance factor $\alpha$, and the maximum number of iterations $T_{max}$; and, before the fitness of each firefly individual is calculated, encoding each individual (that is, each feature subset) with a sigmoid function so that its values take the form 0 or 1, the sigmoid function being defined as:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
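The initialization and 0/1 encoding of step 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the text only says that positions are converted to 0/1 form with a sigmoid, so the fixed threshold of 0.5, the initial position range [-1, 1], and the names `binarize` and `init_population` are illustrative choices, not details taken from the patent.

```python
import math
import random

def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, threshold=0.5):
    """Map a real-valued firefly position to a 0/1 feature mask.

    Assumption: a fixed threshold on the sigmoid output decides whether a
    feature is selected (1) or dropped (0).
    """
    return [1 if sigmoid(v) >= threshold else 0 for v in position]

def init_population(n_fireflies, n_features, rng):
    """Randomly initialise real-valued positions (range [-1, 1] is assumed)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
            for _ in range(n_fireflies)]

rng = random.Random(0)
population = init_population(50, 8, rng)  # N = 50 as in the text; 8 features is illustrative
masks = [binarize(p) for p in population]
```

Each entry of `masks` is one candidate feature subset: a 1 at index k means feature k is kept.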
In step 2, the neighborhood granularity rough entropy is calculated as:
$$\mathrm{NGRE}(S) = \mathrm{NGK}(D \mid S) \times \mathrm{NE}_r(D \mid S) \tag{2}$$
In formula (2), $\mathrm{NGK}(D \mid S)$ and $\mathrm{NE}_r(D \mid S)$ are, respectively, the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset $S$ relative to the decision attribute $D$, given by formulas (3) and (4).
In formulas (3) and (4), $\delta_S(x_i)$ is the neighborhood class of sample $x_i$ under feature subset $S$, $\delta_{S \cup D}(x_i)$ is its neighborhood class under feature subset $S$ together with decision attribute $D$, and $U$ is the sample space;
The fitness of each individual in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance according to formula (5).
In formula (5), $\lambda_1$ and $\lambda_2$ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with $\lambda_1 + \lambda_2 = 1$; for any firefly, that is, any feature subset $S \in FAG$, $|S|$ is the number of features in $S$ and $N$ is the total number of features; $\mathrm{NGRE}(S)$ is the neighborhood granularity rough entropy.
Step 3 compares the fitness values of the firefly individuals, moves individuals with lower fitness toward individuals with higher fitness, calculates the attraction between each firefly individual and the others from their spatial distance, updates the firefly positions, and recalculates the fitness. It specifically comprises:
Step 3.1, comparing the fitness of each firefly individual with that of every other individual in turn, determining which individuals in the population attract it according to the principle that a firefly with lower fitness is attracted by a firefly with higher fitness, and calculating the attraction between each firefly individual and the others from their spatial distance, the attraction being calculated as:
$$\beta(r_{ij}) = \beta_0 e^{-\gamma r_{ij}^2} \tag{6}$$
In formula (6), $\beta_0$ is the attraction at $r = 0$, $\gamma$ is the light absorption coefficient, and $r_{ij}$ is the distance between firefly individuals $x_i$ and $x_j$;
Step 3.2, for any two fireflies $S_i, S_j \in FAG$, if the fitness of $S_j$ is higher than that of $S_i$, firefly $S_i$ moves toward the position of $S_j$, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,\bigl(S_{jd}(t) - S_{id}(t)\bigr) + \alpha\,(\mathrm{rand} - 1/2) \tag{7}$$
In formula (7), $d$ denotes the spatial dimension of a firefly individual, that is, the feature dimension; $\alpha \in [0, 1]$ is the step factor; $\beta(r_{ij})$ is the attraction between fireflies $x_i$ and $x_j$; $(\mathrm{rand} - 1/2)$ is a random number in the interval $[-0.5, 0.5]$; and $t$ is the iteration number;
Step 3.3, updating the fitness of firefly individual $S_i$ using formula (5), ranking all fireflies, and finding the firefly individual with the best fitness in the current iteration.
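The attraction and position update of steps 3.1 and 3.2 can be sketched in Python as follows. The sketch assumes the attraction of the standard firefly algorithm, $\beta_0 e^{-\gamma r^2}$, and Euclidean distance between positions; the function names are illustrative and not part of the patent.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two firefly positions (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def attractiveness(beta0, gamma, r):
    """Attraction at distance r, assuming the form beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * r * r)

def move_towards(s_i, s_j, beta0, gamma, alpha, rng):
    """Apply the position update of formula (7) to every dimension d of s_i."""
    beta = attractiveness(beta0, gamma, distance(s_i, s_j))
    return [x_i + beta * (x_j - x_i) + alpha * (rng.random() - 0.5)
            for x_i, x_j in zip(s_i, s_j)]

rng = random.Random(0)
# With gamma = 0 and alpha = 0 the move is deterministic: s_i lands on s_j.
landed = move_towards([0.0, 0.0], [1.0, 1.0], beta0=1.0, gamma=0.0, alpha=0.0, rng=rng)
# With alpha > 0 each coordinate is perturbed by at most alpha / 2.
noisy = move_towards([0.0, 0.0], [1.0, 1.0], beta0=1.0, gamma=0.0, alpha=0.2, rng=rng)
```

A larger $\gamma$ makes attraction fall off faster with distance, so distant fireflies barely influence each other, while $\alpha$ controls how much random exploration is added to each move.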
Step 4 further comprises dividing the output optimal feature subset $R$ of the intangible cultural heritage data set into a training set $T$ and a test set $V$ in the proportion 7:3, and classifying the divided data with a CART decision tree model. During classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set $T$, and the training set is divided into several subsets. The Gini index of each feature $A$ in the training set $T$ is calculated as:
$$\mathrm{Gini}(T) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|T|} \right)^2, \qquad \mathrm{Gini}(T, A) = \frac{|T_1|}{|T|}\,\mathrm{Gini}(T_1) + \frac{|T_2|}{|T|}\,\mathrm{Gini}(T_2) \tag{8}$$
In formula (8), $|T|$ denotes the number of intangible cultural heritage data items in the training set $T$, $|C_k|$ denotes the number of items of the $k$-th category in $T$, and $K$ is the number of intangible cultural heritage levels. Supposing the value of feature $A$ divides the training set $T$ into two parts $T_1$ and $T_2$, $|T_1|$ and $|T_2|$ denote the number of items contained in each part;
For each subset produced by the division, if all the intangible cultural heritage data in the subset belongs to the same category, the subset is marked as that category; otherwise, the procedure returns to the Gini index calculation and applies the above steps recursively to the subset. This is repeated until the stop condition is satisfied.
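The Gini index used to choose CART split points can be illustrated with a short Python sketch. It uses the standard CART definitions (one minus the sum of squared class proportions, and the size-weighted average over the two parts of a binary split); the label strings and function names are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini index of a label list: 1 - sum_k (|C_k| / |T|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(labels_t1, labels_t2):
    """Size-weighted Gini index after splitting T into T1 and T2."""
    n = len(labels_t1) + len(labels_t2)
    return (len(labels_t1) / n) * gini(labels_t1) + (len(labels_t2) / n) * gini(labels_t2)

mixed = ["national", "national", "provincial", "provincial"]
pure_split = gini_split(["national", "national"], ["provincial", "provincial"])
useless_split = gini_split(["national", "provincial"], ["national", "provincial"])
```

A split that separates the two levels perfectly drives the weighted Gini index to 0, while a split that leaves both parts evenly mixed keeps it at 0.5, so CART prefers the feature and threshold with the lowest value.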
The beneficial effects of the invention are as follows: compared with the original data, the intangible cultural heritage data processed by the feature selection method has lower dimensionality and lower redundancy while retaining good information completeness when it is classified, so that the classification of intangible cultural heritage levels is improved, data redundancy is reduced, and resource use is optimized.
Drawings
FIG. 1 is a flow chart of the intangible cultural heritage data feature selection method of the invention;
FIG. 2 shows the comparison results of Example 3 between the intangible cultural heritage data feature selection method of the invention and three comparison methods under the AUC, ACC, and F1 evaluation criteria;
FIG. 3 shows the comparison results of Example 3 between the intangible cultural heritage data feature selection method of the invention and three comparison methods under the feature subset size evaluation criterion.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1
As shown in FIG. 1, the method comprises the following steps:
Step 1, acquiring an intangible cultural heritage data set, constructing a feature selection model for the data set based on the firefly algorithm, and initializing parameters such as the firefly population size (that is, the number of feature subsets), the light absorption coefficient, and the maximum number of iterations according to the data set. This step is implemented as follows:
The parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired intangible cultural heritage data set. The number $N$ of fireflies (that is, feature subsets) is 50, and the maximum number of iterations $T_{max}$ is 30. A firefly population $FAG = \{S_1, S_2, \ldots, S_N\}$ of size $N$ is randomly initialized, where the initial position of each firefly is $S_i = \{S_{i1}, S_{i2}, \ldots, S_{id}\}$, $1 \le i \le N$, and $d$ denotes the number of features. The initial attraction $\beta_0$, the light absorption coefficient $\gamma$ of the propagation medium, the step-size disturbance factor $\alpha$, and the maximum number of iterations $T_{max}$ are set. Before the fitness of each firefly individual (that is, each feature subset) is calculated, each individual is encoded with a sigmoid function so that its values take the form 0 or 1. The sigmoid function is defined as:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
Step 2, calculating the fitness of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance. This step is implemented as follows:
The neighborhood granularity rough entropy is calculated as:
$$\mathrm{NGRE}(S) = \mathrm{NGK}(D \mid S) \times \mathrm{NE}_r(D \mid S) \tag{2}$$
where $\mathrm{NGK}(D \mid S)$ and $\mathrm{NE}_r(D \mid S)$ are, respectively, the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset $S$ relative to the decision attribute $D$, given by formulas (3) and (4).
In these formulas, $\delta_S(x_i)$ is the neighborhood class of sample $x_i$ under attribute subset $S$, $\delta_{S \cup D}(x_i)$ is its neighborhood class under attribute subset $S$ together with decision attribute $D$, and $U$ is the sample space.
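The neighborhood classes that these quantities are built on can be sketched in Python as follows. This sketch makes several assumptions that the text does not state: the neighborhood of $x_i$ is the set of samples within a radius of $x_i$ under Euclidean distance restricted to the features in $S$, and $\delta_{S \cup D}(x_i)$ additionally requires the same decision value as $x_i$. The `radius` parameter and both function names are hypothetical.

```python
import math

def neighborhood(samples, i, features, radius):
    """Indices j whose distance to sample i, measured only on the selected
    features, is at most `radius` (assumed definition of delta_S(x_i))."""
    xi = samples[i]
    result = []
    for j, xj in enumerate(samples):
        dist = math.sqrt(sum((xi[f] - xj[f]) ** 2 for f in features))
        if dist <= radius:
            result.append(j)
    return result

def neighborhood_with_decision(samples, labels, i, features, radius):
    """Neighbours of sample i under S that also share its decision value
    (assumed definition of delta_{S u D}(x_i))."""
    return [j for j in neighborhood(samples, i, features, radius)
            if labels[j] == labels[i]]

samples = [[0.0, 0.0], [0.1, 5.0], [1.0, 0.0]]
labels = [0, 1, 0]
on_first_feature = neighborhood(samples, 0, [0], 0.2)
on_both_features = neighborhood(samples, 0, [0, 1], 0.2)
consistent = neighborhood_with_decision(samples, labels, 0, [0], 0.2)
```

Under these assumptions, the gap between $|\delta_S(x_i)|$ and $|\delta_{S \cup D}(x_i)|$ shows how well the selected features separate the decision classes around $x_i$: here the first feature alone cannot tell sample 0 from sample 1, but both features together can.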
The fitness of each individual in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance according to formula (5),
where $\lambda_1$ and $\lambda_2$ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with $\lambda_1 + \lambda_2 = 1$, and $Fit_{NGRE}$ is the fitness of an individual in the firefly population. For any firefly (that is, any feature subset) $S \in FAG$, $|S|$ is the number of features in $S$ and $N$ is the total number of features. $\mathrm{NGRE}(S)$ is the neighborhood granularity rough entropy.
Step 3, comparing the fitness values of the firefly individuals, moving individuals with lower fitness toward individuals with higher fitness, calculating the attraction between each firefly individual and the others from their spatial distance, updating the firefly positions, and recalculating the fitness. This step is implemented as follows:
The fitness of each firefly individual is compared with that of every other individual in turn, the individuals that attract it are determined according to the principle that a firefly with lower fitness is attracted by a firefly with higher fitness, and the attraction between each firefly individual and the others is calculated from their spatial distance as:
$$\beta(r_{ij}) = \beta_0 e^{-\gamma r_{ij}^2} \tag{6}$$
where $\beta_0$ is the attraction at $r = 0$, $\gamma$ is the light absorption coefficient, and $r_{ij}$ is the distance between firefly individuals $x_i$ and $x_j$.
For any two fireflies $S_i, S_j \in FAG$, if the fitness of $S_j$ is higher than that of $S_i$, firefly $S_i$ moves toward the position of $S_j$, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,\bigl(S_{jd}(t) - S_{id}(t)\bigr) + \alpha\,(\mathrm{rand} - 1/2) \tag{7}$$
where $d$ denotes the spatial dimension (that is, the feature dimension) of a firefly individual, $\alpha \in [0, 1]$ is the step factor, $\beta(r_{ij})$ is the attraction between fireflies $x_i$ and $x_j$, $(\mathrm{rand} - 1/2)$ is a random number in the interval $[-0.5, 0.5]$, and $t$ is the iteration number.
The fitness of firefly individual $S_i$ is updated using formula (5), all fireflies are ranked, and the firefly individual with the best fitness in the current iteration is found.
Step 4, judging whether the current iteration has reached the maximum number of iterations $T_{max}$ (30 in the invention); if not, returning to step 3; otherwise, outputting the feature subset corresponding to the globally optimal firefly individual, which gives the optimal feature subset of the intangible cultural heritage data set based on the firefly algorithm.
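Steps 1 to 4 together form a loop that can be sketched in Python as follows. This is a sketch under stated assumptions rather than the patented implementation: the fitness function is passed in as a parameter because formula (5) is not reproduced here, and the toy fitness below is purely hypothetical. The 0.5 sigmoid threshold, the initial range [-1, 1], the default values of `beta0`, `gamma`, and `alpha`, and the attraction form $\beta_0 e^{-\gamma r^2}$ are likewise assumptions.

```python
import math
import random

def to_mask(position):
    """Sigmoid-encode a position into a 0/1 feature mask (0.5 threshold assumed)."""
    return [1 if 1.0 / (1.0 + math.exp(-v)) >= 0.5 else 0 for v in position]

def firefly_select(fitness, n_features, n_fireflies=50, t_max=30,
                   beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Return (best_mask, best_fitness); higher fitness is treated as better."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
           for _ in range(n_fireflies)]
    fit = [fitness(to_mask(p)) for p in pop]
    best_i = max(range(n_fireflies), key=fit.__getitem__)
    best_mask, best_fit = to_mask(pop[best_i]), fit[best_i]
    for _ in range(t_max):                      # step 4: stop after t_max iterations
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if fit[j] > fit[i]:             # step 3: move i toward fitter j
                    r2 = sum((a - b) ** 2 for a, b in zip(pop[i], pop[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    pop[i] = [a + beta * (b - a) + alpha * (rng.random() - 0.5)
                              for a, b in zip(pop[i], pop[j])]
                    fit[i] = fitness(to_mask(pop[i]))
                    if fit[i] > best_fit:       # remember the global best so far
                        best_mask, best_fit = to_mask(pop[i]), fit[i]
    return best_mask, best_fit

# Hypothetical stand-in for Fit_NGRE: fraction of bits matching a target mask.
target = [1, 0, 1, 0, 0, 1]
def toy_fitness(mask):
    return sum(1 for m, t in zip(mask, target) if m == t) / len(target)

best_mask, best_fit = firefly_select(toy_fitness, n_features=6)
```

Passing the fitness as a callable keeps the search loop independent of how the neighborhood granularity rough entropy is computed, so a real $Fit_{NGRE}$ implementation could be substituted without touching the loop.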
Example 2
To verify the effectiveness of the intangible cultural heritage data feature selection method, the CART decision tree algorithm is used to classify the processed data set, and the classification results are evaluated. This is implemented as follows:
Whether the optimization result satisfies the end condition (the maximum number of iterations has been reached) is judged. If not, the procedure returns to step 3 for the next round of optimization; if so, the optimal feature subset obtained in step 3 is used for feature selection on the intangible cultural heritage data set. The data set $R$ after feature selection is divided into a training set $T$ and a test set $V$ in the proportion 7:3, and the divided data set is classified and analyzed with a CART decision tree model. During classification, the initial root node is selected by calculating the Gini index of each feature in the training set $T$, and the training set is divided into several subsets. The Gini index of each feature $A$ in the training set $T$ is calculated as:
$$\mathrm{Gini}(T) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|T|} \right)^2, \qquad \mathrm{Gini}(T, A) = \frac{|T_1|}{|T|}\,\mathrm{Gini}(T_1) + \frac{|T_2|}{|T|}\,\mathrm{Gini}(T_2) \tag{8}$$
where $|T|$ denotes the number of intangible cultural heritage data items in the training set $T$, $|C_k|$ denotes the number of items of the $k$-th category (national level or provincial level) in $T$, and $K$ is the number of intangible cultural heritage levels, which is 2 in the invention. Supposing the value of feature $A$ divides the training set $T$ into two parts $T_1$ and $T_2$, $|T_1|$ and $|T_2|$ denote the number of items contained in each part.
For each subset produced by the division, if all the intangible cultural heritage data in the subset belongs to the same category (for example, national level), the subset is marked as that category; otherwise, the procedure returns to the Gini index calculation and applies the above steps recursively to the subset. This is repeated until the stop condition is satisfied. The constructed CART decision tree model can then classify the test set $V$, assigning the intangible cultural heritage data in the test set to the predefined categories.
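The 7:3 division into a training set and a test set can be sketched in Python as follows. The text does not say how samples are assigned to the two sets, so the seeded random shuffle and the function name `split_7_3` are assumptions made for illustration.

```python
import random

def split_7_3(rows, labels, seed=0):
    """Shuffle sample indices and split them 70% / 30% (random split assumed)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(0.7 * len(idx)))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

rows = [[i] for i in range(10)]      # ten illustrative one-feature samples
labels = [i % 2 for i in range(10)]  # two levels, as with K = 2
train_x, train_y, test_x, test_y = split_7_3(rows, labels)
```

Shuffling indices rather than the rows themselves keeps each sample paired with its label in both sets.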
The classification results are evaluated with AUC, accuracy (hereinafter ACC), F1-score (hereinafter F1), and feature subset size. The AUC value is the area enclosed by the ROC curve and the coordinate axes, and it shows the classification effect of the classifier directly: the closer the AUC value is to 1, the better the classification performance, and an AUC value of 0.5 or below indicates poor classification ability. ACC is the accuracy of sample classification, that is, the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of Precision and Recall, with a value range of [0, 1]; 1 represents the best output of the model and 0 the worst.
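As a concrete illustration of ACC and F1, the following Python sketch computes both from lists of true and predicted labels. The choice of label 1 as the positive class and the sample labels are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

In this example three of four samples are classified correctly, while one positive sample is missed, which lowers recall and therefore F1 more than it lowers ACC.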
In this way, the method performs feature selection on the collected intangible cultural heritage data A and generates a set of feature subsets, each represented as a vector $[x_1, x_2, \ldots, x_n]$, where $n$ is the full feature dimension of the data set and $x_i = 0$ or $1$ indicates whether the corresponding feature is selected, so that key features are retained and redundant features are rejected. Because the invention can generate a set of feature subsets, a decision maker can choose among them according to decision requirements and then generate new data B from the selected feature subset and data A. Data B has a lower dimension than data A. When the data is classified, data B has lower dimensionality while maintaining good classification performance, which optimizes the use of computing resources.
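Generating data B from data A and a selected 0/1 vector amounts to keeping the columns whose entry is 1, as the following Python sketch shows. The sample values and the name `apply_mask` are illustrative.

```python
def apply_mask(data_a, mask):
    """Keep only the columns of data_a whose mask entry is 1."""
    keep = [idx for idx, bit in enumerate(mask) if bit == 1]
    return [[row[idx] for idx in keep] for row in data_a]

data_a = [[0.2, 5.0, 1.0, 7.5],
          [0.4, 3.0, 0.0, 6.1]]
mask = [1, 0, 0, 1]                 # features 1 and 4 are selected
data_b = apply_mask(data_a, mask)   # two columns remain out of four
```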
Example 3
As shown in FIG. 1, the method is implemented according to the following steps:
Step 1, acquiring an intangible cultural heritage data set, constructing a feature selection model for the data set based on the firefly algorithm, and initializing parameters such as the firefly population size (that is, the number of feature subsets), the light absorption coefficient, and the maximum number of iterations according to the data set. Specifically, the parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired data set. The number $N$ of fireflies (that is, feature subsets) is 50, and the maximum number of iterations $T_{max}$ is 30. A firefly population $FAG = \{S_1, S_2, \ldots, S_N\}$ of size $N$ is randomly initialized, where the initial position of each firefly is $S_i = \{S_{i1}, S_{i2}, \ldots, S_{id}\}$, $1 \le i \le N$, and $d$ denotes the number of features. The initial attraction $\beta_0$, the light absorption coefficient $\gamma$ of the propagation medium, the step-size disturbance factor $\alpha$, and the maximum number of iterations $T_{max}$ are set. Before the fitness of each firefly individual (that is, each feature subset) is calculated, each individual is encoded with a sigmoid function so that its values take the form 0 or 1. The sigmoid function is defined as:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
Step 2, calculating the fitness of each individual in the firefly population from the neighborhood granularity rough entropy and the attribute set importance. Specifically, the neighborhood granularity rough entropy is calculated as:
$$\mathrm{NGRE}(S) = \mathrm{NGK}(D \mid S) \times \mathrm{NE}_r(D \mid S) \tag{2}$$
where $\mathrm{NGK}(D \mid S)$ and $\mathrm{NE}_r(D \mid S)$ are, respectively, the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset $S$ relative to the decision attribute $D$, given by formulas (3) and (4).
In these formulas, $\delta_S(x_i)$ is the neighborhood class of sample $x_i$ under feature subset $S$, $\delta_{S \cup D}(x_i)$ is its neighborhood class under feature subset $S$ together with decision attribute $D$, and $U$ is the sample space.
The fitness of each individual in the firefly population is calculated from the neighborhood granularity rough entropy and the attribute set importance according to formula (5),
where $\lambda_1$ and $\lambda_2$ adjust the relative influence of the neighborhood granularity rough entropy and the attribute set importance, with $\lambda_1 + \lambda_2 = 1$, and $Fit_{NGRE}$ is the fitness of an individual in the firefly population. For any firefly (that is, any feature subset) $S \in FAG$, $|S|$ is the number of features in $S$ and $N$ is the total number of features. $\mathrm{NGRE}(S)$ is the neighborhood granularity rough entropy.
Step 3, comparing the fitness values of the firefly individuals, moving individuals with lower fitness toward individuals with higher fitness, calculating the attraction between each firefly individual and the others from their spatial distance, updating the firefly positions, and recalculating the fitness. Specifically, the fitness of each firefly individual is compared with that of every other individual in turn, the individuals that attract it are determined according to the principle that a firefly with lower fitness is attracted by a firefly with higher fitness, and the attraction between each firefly individual and the others is calculated from their spatial distance as:
$$\beta(r_{ij}) = \beta_0 e^{-\gamma r_{ij}^2} \tag{6}$$
where $\beta_0$ is the attraction at $r = 0$, $\gamma$ is the light absorption coefficient, and $r_{ij}$ is the distance between firefly individuals $x_i$ and $x_j$.
For any two fireflies $S_i, S_j \in FAG$, if the fitness of $S_j$ is higher than that of $S_i$, firefly $S_i$ moves toward the position of $S_j$, and its position is updated as:
$$S_{id}(t+1) = S_{id}(t) + \beta(r_{ij})\,\bigl(S_{jd}(t) - S_{id}(t)\bigr) + \alpha\,(\mathrm{rand} - 1/2) \tag{7}$$
where $d$ denotes the spatial dimension (that is, the feature dimension) of a firefly individual, $\alpha \in [0, 1]$ is the step factor, $\beta(r_{ij})$ is the attraction between fireflies $x_i$ and $x_j$, $(\mathrm{rand} - 1/2)$ is a random number in the interval $[-0.5, 0.5]$, and $t$ is the iteration number.
The fitness of firefly individual $S_i$ is updated using formula (5), all fireflies are ranked, and the firefly individual with the best fitness in the current iteration is found.
Step 4, judging whether the current iteration reaches the maximum iteration number T max (maximum iteration count T in the present invention) max 30), if not, returning to the step 3, otherwise, outputting a feature subset corresponding to the overall optimal firefly individual, and finally obtaining the optimal feature subset of the non-genetic culture data set based on the firefly algorithm. In order to verify the effectiveness of the non-genetic cultural data feature selection method, the CART decision tree model is utilized to perform classification operation on the processed non-genetic cultural data set, and the classification result is evaluated. The method is implemented according to the following steps:
Determine whether the optimization result meets the termination condition (the maximum number of iterations has been reached). If not, go to step 3 and perform the next round of optimization; if so, use the optimal feature subset obtained in step 3 in the feature selection process of the intangible cultural heritage data set. The data set R after feature selection is divided into a training set T and a test set V in a 7:3 ratio, and the divided data set is classified and analyzed with a CART decision tree model. During classification, the initial root node is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into several subsets. The Gini index of each feature A in the training set T is calculated as follows:
where |T| is the number of intangible cultural heritage records in the training set T, |C_k| is the number of records of the k-th category (i.e., national level or provincial level) in the training set T, and K is the number of intangible cultural heritage levels; K = 2 in the present invention. Suppose the value of feature A divides the training set T into two parts T_1 and T_2; then |T_1| and |T_2| are the numbers of records contained in each part.
For each resulting subset, if the records in the subset all belong to the same category (e.g., national level), the subset is marked as that category; otherwise, return to the Gini index calculation step and apply the above steps recursively to the subset. This process is repeated until the stopping condition is satisfied. The constructed CART decision tree model can then classify the test set V, assigning the intangible cultural heritage records in the test set to the predefined categories.
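As an illustration of the split criterion described above, the following is a minimal sketch of a Gini index computation in plain Python. It assumes the textbook CART definitions, Gini(T) = 1 - sum_k (|C_k|/|T|)^2 and Gini(T, A) = |T_1|/|T| * Gini(T_1) + |T_2|/|T| * Gini(T_2), which are consistent with the symbols defined in the text but are not copied from the patent's own formula. The function names, the threshold-style binary split and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a label list: 1 - sum_k (|C_k| / |T|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(rows, labels, feature, threshold):
    """Weighted Gini index after splitting T into T_1 (value <= threshold) and T_2."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Return (gini, feature, threshold) for the binary split with the lowest Gini index."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({x[feature] for x in rows}):
            g = gini_split(rows, labels, feature, threshold)
            if best is None or g < best[0]:
                best = (g, feature, threshold)
    return best


# Toy data (hypothetical): feature 0 separates the two levels perfectly.
rows = [[0, 1], [0, 0], [1, 1], [1, 0]]
labels = ["national", "national", "provincial", "provincial"]
root = best_split(rows, labels)
```

A full CART tree would call `best_split` recursively on each resulting subset until every subset is pure or another stopping condition is met.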
For this embodiment, AUC, ACC, F1 and feature subset size are used for evaluation. The AUC value is the area enclosed by the ROC curve and the coordinate axes, which directly reflects the classification performance of the classifier. The closer the AUC value is to 1, the better the classification performance; an AUC value less than or equal to 0.5 indicates poor classification ability, no better than random guessing. ACC is the classification accuracy, i.e., the ratio of the number of samples correctly classified by the classifier to the total number of samples. F1 is the harmonic mean of precision and recall; its value range is [0, 1], where 1 represents the best model output and 0 the worst. The feature subset size is the number of features in the subset after feature selection; the smaller the feature subset, the better.
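The three classification metrics above can be sketched in a few lines of plain Python. This is an illustrative implementation rather than the evaluation code used in the patent: the function names and the toy labels are hypothetical, and AUC is computed through its pairwise-ranking interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one), which equals the area under the ROC curve.

```python
def accuracy(y_true, y_pred):
    """ACC: fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """AUC via pairwise ranking; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): 1 = national level, 0 = provincial level.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for small test sets; a sort-based version would be preferable for large ones.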
In this example, the present invention was compared with three existing feature selection methods on the four evaluation indicators. The comparison methods are SSA (sparrow search algorithm), HHO (Harris hawks optimization algorithm) and RFE (recursive feature elimination), and the comparison results are shown in Fig. 2 and Fig. 3. As Fig. 2 and Fig. 3 show, the present invention achieves the best results, with all four evaluation indicators significantly improved. The method can effectively obtain a feature subset of high importance and achieve a better classification result.

Claims (5)

1. A feature selection method for intangible cultural heritage data, characterized by comprising the following steps:
step 1, acquiring an intangible cultural heritage data set, and constructing a feature selection model for the intangible cultural heritage data set based on a firefly algorithm;
step 2, calculating the fitness Fit_NGRE of individuals in the firefly population using the neighborhood granularity rough entropy and the attribute set importance;
step 3, moving firefly individuals with low fitness toward firefly individuals with high fitness, updating the positions of the fireflies and recalculating the fitness of the firefly individuals;
step 4, determining whether the current iteration has reached the maximum iteration number T_max; if not, returning to step 3; otherwise, outputting the optimal feature subset of the intangible cultural heritage data set corresponding to the globally optimal firefly individual.
2. The intangible cultural heritage data feature selection method according to claim 1, wherein step 1 is specifically as follows: the parameters of the firefly-algorithm-based feature selection model are initialized according to the acquired intangible cultural heritage data set; the number of feature subsets of the data set, i.e., the number of fireflies N, is 50, and the maximum iteration number T_max is 30; a firefly population FAG = {S_1, S_2, ..., S_N} of size N is randomly initialized, and the initial position of each firefly is S_i = {S_i1, S_i2, ..., S_id}, where 1 ≤ i ≤ N and d is the number of features; the initial attractiveness β_0, the light absorption coefficient γ of the propagation medium, the step-size disturbance factor α and the maximum iteration number T_max are set; before the fitness of each firefly individual, i.e., each feature subset, is calculated, each individual is encoded with a sigmoid function so that its values are converted to the binary form {0, 1}; the sigmoid function is defined as follows:
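The encoding step can be illustrated with a short sketch. The code below assumes the standard logistic sigmoid 1 / (1 + e^(-x)) and a stochastic binarization rule in which bit d becomes 1 when a uniform random draw falls below sigmoid(S_id); that rule is a common choice in binary firefly variants and is an assumption here, not a rule taken from the patent. All function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def encode(position, rng=random):
    """Convert a real-valued firefly position into a 0/1 feature mask.

    Assumed rule: bit d is 1 when a uniform draw in [0, 1) is below sigmoid(S_id).
    """
    return [1 if rng.random() < sigmoid(x) else 0 for x in position]


def selected_features(mask):
    """Indices of the features kept by a 0/1 mask."""
    return [d for d, bit in enumerate(mask) if bit == 1]


# Strongly positive coordinates are kept, strongly negative ones are dropped.
mask = encode([1000.0, -1000.0], random.Random(0))
```

A deterministic alternative would threshold sigmoid(S_id) at 0.5; the stochastic version keeps some exploration in the binary search space.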
3. The intangible cultural heritage data feature selection method according to claim 2, wherein the neighborhood granularity rough entropy in step 2 is calculated as follows:
NGRE(S) = NGK(D|S) × NE_r(D|S) (2)
in formula (2), NGK(D|S) and NE_r(D|S) are respectively the neighborhood knowledge granularity and the neighborhood rough entropy of the candidate feature subset S relative to the decision attribute D, calculated as follows:
in formulas (3) and (4), δ_S(x_i) is the neighborhood class of a sample x_i under the feature subset S, |δ_{S∪D}(x_i)| is the neighborhood class of the sample under the feature subset S together with the decision attribute D, and U is the sample space;
the fitness of individuals in the firefly population is calculated using the neighborhood granularity rough entropy and the attribute set importance, with the following formula:
in formula (5), λ_1 and λ_2 adjust the degree of influence of the neighborhood granularity rough entropy and the attribute set importance, with λ_1 + λ_2 = 1; for any firefly, i.e., feature subset S ∈ FAG, |S| is the number of features in the feature subset S and N is the number of all features; NGRE(S) is the neighborhood granularity rough entropy.
4. The intangible cultural heritage data feature selection method according to claim 3, wherein step 3 comprises comparing the fitness of the firefly individuals, moving firefly individuals with lower fitness toward firefly individuals with higher fitness, calculating the mutual attractiveness between each firefly individual and the other firefly individuals according to their spatial distance, and then updating the positions of the fireflies and recalculating the fitness of the firefly individuals; specifically:
step 3.1, sequentially comparing the fitness of each firefly individual with that of the other firefly individuals, determining which firefly individuals in the population attract each firefly individual according to the principle that firefly individuals with low fitness are attracted by firefly individuals with high fitness, and calculating the mutual attractiveness between each firefly individual and the other fireflies according to their spatial distance, with the following attractiveness formula:
in formula (6), β_0 is the attractiveness at r = 0, γ is the light absorption coefficient, and r_ij is the distance between firefly individuals x_i and x_j;
step 3.2, for any two fireflies S_i, S_j ∈ FAG, if the fitness of S_j is higher than that of S_i, firefly S_i moves toward the position of S_j, and the position of the firefly individual is updated as follows:
S_id(t+1) = S_id(t) + β(r_ij)(S_jd(t) − S_id(t)) + α(rand − 1/2) (7)
in formula (7), d is the spatial dimension of the firefly individual, i.e., the feature dimension, α ∈ [0, 1] is the step factor, β(r_ij) is the attractiveness between fireflies x_i and x_j, (rand − 1/2) is a random number in the interval [−0.5, 0.5], and t is the iteration number;
step 3.3, updating the fitness of firefly individual S_i using formula (5), ranking all fireflies, and finding the firefly individual with the best fitness in the current iteration.
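Steps 3.1 and 3.2 can be sketched as follows. The position update follows formula (7) as written; the attractiveness uses the standard firefly form β_0 · exp(−γ · r²), which matches the description of β_0 and γ but is an assumption because formula (6) itself is not reproduced here. Euclidean distance is likewise assumed for r_ij. The function names, default parameter values and the toy fitness function are hypothetical.

```python
import math
import random


def attractiveness(beta0, gamma, r):
    """Assumed standard form: beta0 * exp(-gamma * r^2), equal to beta0 at r = 0."""
    return beta0 * math.exp(-gamma * r * r)


def move_firefly(s_i, s_j, beta0=1.0, gamma=1.0, alpha=0.2, rng=random):
    """Move s_i toward the fitter firefly s_j, one coordinate at a time (formula (7))."""
    r_ij = math.dist(s_i, s_j)  # Euclidean distance (assumption), Python 3.8+
    beta = attractiveness(beta0, gamma, r_ij)
    return [
        x_i + beta * (x_j - x_i) + alpha * (rng.random() - 0.5)
        for x_i, x_j in zip(s_i, s_j)
    ]


def firefly_step(population, fitness, **params):
    """One iteration: each firefly moves toward every firefly with higher fitness."""
    scores = [fitness(s) for s in population]
    new_population = [list(s) for s in population]
    for i in range(len(population)):
        for j in range(len(population)):
            if scores[j] > scores[i]:
                new_population[i] = move_firefly(
                    new_population[i], population[j], **params
                )
    return new_population


# Toy run (hypothetical fitness): with gamma = 0 and alpha = 0 the less fit
# firefly jumps exactly onto the fitter one.
after = firefly_step([[0.0], [1.0]], fitness=sum, gamma=0.0, alpha=0.0)
```

In the patent's setting the fitness argument would be the function of formula (5) evaluated on the sigmoid-encoded feature mask, and the step would be repeated until T_max iterations have run.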
5. The intangible cultural heritage data feature selection method according to claim 4, wherein step 4 further comprises dividing the output optimal feature subset R of the intangible cultural heritage data set into a training set T and a test set V in a 7:3 ratio, and classifying the divided feature subsets with a CART decision tree model; during classification, the initial root node of the CART decision tree is selected by calculating the Gini index of each feature in the training set T, and the training set T is divided into a plurality of subsets; the Gini index of each feature A in the training set T is calculated as follows:
in formula (8), |T| is the number of intangible cultural heritage records in the training set T, |C_k| is the number of records of the k-th category in the training set T, and K is the number of intangible cultural heritage levels; supposing the value of feature A divides the training set T into two parts T_1 and T_2, |T_1| and |T_2| are the numbers of records contained in each part;
for each resulting subset, if the records in the subset all belong to the same category, the subset is marked as that category; otherwise, returning to the Gini index calculation step and applying the above steps recursively to each subset; this process is repeated until the stopping condition is satisfied.
CN202310636101.7A 2023-05-31 2023-05-31 Non-cultural-heritage data feature selection method Active CN116662859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310636101.7A CN116662859B (en) 2023-05-31 2023-05-31 Non-cultural-heritage data feature selection method

Publications (2)

Publication Number Publication Date
CN116662859A (en) 2023-08-29
CN116662859B (en) 2024-04-19

Family

ID=87720173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310636101.7A Active CN116662859B (en) 2023-05-31 2023-05-31 Non-cultural-heritage data feature selection method

Country Status (1)

Country Link
CN (1) CN116662859B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104076512A (en) * 2013-03-25 2014-10-01 精工爱普生株式会社 Head-mounted display device and method of controlling head-mounted display device
CN105493173A (en) * 2013-06-28 2016-04-13 诺基亚技术有限公司 Supporting activation of function of device
US20170364933A1 (en) * 2014-12-09 2017-12-21 Beijing Didi Infinity Technology And Development Co., Ltd. User maintenance system and method
CN105824937A (en) * 2016-03-17 2016-08-03 合肥工业大学 Attribute selection method based on binary system firefly algorithm
CN106779063A (en) * 2016-11-15 2017-05-31 河南理工大学 A kind of hoist braking system method for diagnosing faults based on RBF networks
CN108417171A (en) * 2017-02-10 2018-08-17 宏碁股份有限公司 Display device and its display parameters method of adjustment
CN107230213A (en) * 2017-05-15 2017-10-03 昆明理工大学 A kind of colored mine belt zoning map of multi thresholds shaking table based on improvement glowworm swarm algorithm is as split plot design
CN110537165A (en) * 2017-10-26 2019-12-03 华为技术有限公司 A kind of display methods and device
CN110162841A (en) * 2019-04-26 2019-08-23 南京航空航天大学 A kind of Milling Process multi-objective method introducing three-dimensional stability constraint
CN110867172A (en) * 2019-11-19 2020-03-06 苹果公司 Electronic device for dynamically controlling standard dynamic range and high dynamic range content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
彭鹏 等: "基于改进二元萤火虫群优化算法和邻域粗糙集的属性约简方法", 《模式识别与人工智能》, vol. 33, no. 2, pages 95 - 105 *

Also Published As

Publication number Publication date
CN116662859B (en) 2024-04-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant