CN114091607B - Semi-supervised multi-label online stream feature selection method based on neighborhood rough set - Google Patents


Publication number: CN114091607B (granted patent)
Application number: CN202111403199.9A
Application publication: CN114091607A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: feature, candidate, label, instance, dependence
Inventors: 梁顺攀 (Liang Shunpan), 刘泽 (Liu Ze), 赵俊杰 (Zhao Junjie)
Assignee (original and current): Yanshan University
Application filed by Yanshan University; application granted; legal status: Active


Classifications

    • G06F18/214 — G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F16/2465 — G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; Information retrieval; Query processing; Special types of queries; Query processing support for facilitating data mining operations in structured databases


Abstract

The invention relates to a semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, comprising the following steps: new features flow into the model one by one in the form of a stream; neighbors of the instances with missing labels are obtained through a defined neighborhood relation, and the missing labels are predicted; the dependency of each new feature is obtained; online feature importance assessment is performed on the new feature; an online redundancy update is performed on the candidate set; these steps are repeated until no unprocessed features remain, and finally an optimal feature subset is obtained. The invention requires no domain knowledge, adapts to various datasets without setting any parameters, and has strong generalization ability. It can handle datasets with missing labels and can select effective features. It maximizes the dependency of the candidate feature set on the labels and selects features with high correlation and low redundancy.

Description

Semi-supervised multi-label online stream feature selection method based on neighborhood rough set
Technical Field
The invention relates to a semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, belongs to the technical field of pattern recognition and data mining, and in particular relates to real-time prediction of missing labels from feature information.
Background
Online streaming feature selection has been accepted by researchers as a branch of feature selection. Features can be generated at any time, so traditional batch-processing methods are inapplicable: feature selection cannot begin until all features have arrived. In practical applications, features are generated dynamically and arrive one by one over time, as in real-time environmental monitoring and analysis, cancer prediction, and remote-sensing image classification. In this setting, how to process streaming features dynamically has attracted great attention.
In the real world, multi-label objects are ubiquitous. For example, in text classification and emotion recognition, it is often necessary to determine the categories to which a news item belongs, such as finance, sports and entertainment, and a news item may belong to two or more categories at the same time. Multi-label data has more labels than single-label data, which undoubtedly increases the difficulty of feature selection. Effectively removing redundant and irrelevant features has become a major problem, while collecting accurate and complete datasets is itself not easy: manual labeling is expensive and labor-intensive. Therefore, when the label data is incomplete, feature selection must still be performed while maintaining good results.
To deal with streaming features, various online feature selection methods have been proposed, such as OSFS, alpha-investing, OS-NRRSARA-SA and OFS-A3M. Alpha-investing can handle large datasets without the runtime growing exponentially; unfortunately, processing each feature only once increases the redundancy of the selected features, leading to inaccurate and unstable results. OSFS made a breakthrough in feature selection by proposing a novel framework that can handle streaming features. Although OSFS achieves higher prediction accuracy than alpha-investing, the parameter α must be specified before feature selection, which affects the independence tests. Rough sets have proven to be an effective tool for feature selection and knowledge discovery, and many studies use them for feature selection. For example, OS-NRRSARA-SA is an online streaming feature selection method based on the classical rough set that requires no parameters to be specified before use. However, classical rough sets cannot handle the many real-valued features found in practice, so the neighborhood rough set was proposed to solve this problem: it supports both continuous and discrete data. OFS-A3M uses the neighborhood rough set for online feature selection and proposes a new neighborhood relation that enables the algorithm to select the best number of neighbors for each instance on different datasets, solving the problem that traditional feature selection approaches require appropriate parameter values to be assigned for each dataset. However, OFS-A3M can only handle single-label, completely labeled datasets.
Disclosure of Invention
Existing online streaming feature selection algorithms require some parameters to be specified, and parameter selection is typically limited by domain knowledge; moreover, many existing online streaming feature selection algorithms can only perform feature selection on completely labeled datasets. To solve these problems, the invention provides a semi-supervised multi-label online streaming feature selection algorithm based on a neighborhood rough set, which addresses the parameter limitation and the label limitation simultaneously. It considers both the importance of features and the redundancy of the feature set, maximizes the dependency of the candidate feature set on the labels, selects features with high correlation and low redundancy, and maximizes the dependency of the candidate feature set while selecting as few features as possible.
Specifically, the invention provides a semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, which comprises the following steps:
S01. Define the label set L, where l_j ∈ L is the j-th label in L;
S02. Define the candidate feature set as S, and initialize S = ∅;
S03. Define the dependency of the candidate feature set S as γ(S);
S04. Define the historical dependency of the candidate feature set S as Dep_S, and initialize Dep_S = 0;
S05. Define the average dependency threshold of the candidate feature set S as mean_Dep, and initialize mean_Dep = 0;
S06. Define the arrival time of the i-th feature as t_i, and initialize t_i = 0;
S07. Define the feature arriving at time t_i as f_i;
S08. Judge whether any labels are missing; if so, perform step S09; if not, proceed to step S10;
S09. Obtain the neighbors of the instances with missing labels according to the mean neighborhood relation, predict the values of the missing labels by analyzing the labels of similar instances, and restore the labels once the required values are obtained;
S10. Calculate the dependency γ(f_i) of the feature f_i;
S11. Judge whether the feature f_i is an important feature, i.e., whether γ(S ∪ f_i) > Dep_S holds; if so, perform step S15; otherwise, perform steps S12-S14;
S12. Judge whether a redundancy update of the candidate set S is needed; if so, perform steps S13-S14; otherwise, proceed to step S16;
S13. Insert the feature f_i into the candidate set S, i.e., let S = S ∪ f_i;
S14. Perform the redundancy update operation on the candidate set S, then proceed to step S16;
S15. Insert the feature f_i into the candidate set S, i.e., let S = S ∪ f_i, and update the value of Dep_S;
S16. Update the average dependency threshold mean_Dep;
S17. Judge whether unprocessed features remain; if so, return to step S08; if not, output the optimal feature set S.
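The control flow of steps S08-S17 can be sketched as a short Python loop. This is an illustrative sketch, not the claimed implementation: the `stream` iterable, the `gamma` callable (which is assumed to return the dependency γ(·) of a feature list with missing labels already handled per steps S08-S10), and the helper names are assumptions introduced here.

```python
import random

def redundancy_update(S, gamma):
    # Step S14: visit each feature once in random order and remove it when
    # its importance delta_S(f_j) = gamma(S) - gamma(S - f_j) equals zero.
    for fj in random.sample(S, len(S)):
        if fj not in S:                        # already removed
            continue
        rest = [f for f in S if f != fj]
        if gamma(S) - gamma(rest) == 0:
            S = rest
    return S

def online_stream_feature_selection(stream, gamma):
    # stream yields feature ids one by one (steps S06-S07).
    S, dep_s, mean_dep = [], 0.0, 0.0          # steps S02, S04, S05
    for fi in stream:
        g_new = gamma(S + [fi])
        if g_new > dep_s:                      # S11 -> S15: important feature
            S, dep_s = S + [fi], g_new
        elif g_new == dep_s and gamma([fi]) > mean_dep:  # S12 -> S13-S14
            S = redundancy_update(S + [fi], gamma)
        # S16: refresh the average dependency threshold over kept features
        mean_dep = sum(gamma([f]) for f in S) / len(S) if S else 0.0
    return S                                   # S17: optimal feature set
```

With a toy dependency function that rewards only features 0 and 2, the loop keeps exactly those two features and discards the rest.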
Preferably, the specific steps for predicting missing labels described in steps S08-S09 are as follows:
① The distance between instance x and instance y on the candidate set S = {s_1, s_2, …, s_|S|} is denoted Δ_S(x, y); the Euclidean distance is used here, calculated as follows:

Δ_S(x, y) = sqrt( Σ_{i=1}^{|S|} (s_ix − s_iy)² )

where s_ix denotes the value of instance x on feature s_i and s_iy denotes the value of instance y on feature s_i (1 ≤ i ≤ |S|).
② Let N_S(x_i) denote the neighbor sequence obtained by sorting the distances between x_i and the other instances under the feature subset S, specifically expressed as:

N_S(x_i) = {x_i^1, x_i^2, …, x_i^n}

where x_i^1 denotes the instance nearest to x_i, i.e., Δ_S(x_i, x_i^1) is the smallest value, and x_i^n denotes the instance farthest from x_i, i.e., Δ_S(x_i, x_i^n) is the largest value.
③ Let S_R(x_i) denote the neighbor set of x_i obtained from N_S(x_i) through the mean neighborhood relation R; an instance x_j ∈ S_R(x_i) must satisfy the following condition:

Δ_S(x_i, x_j) < 0.35 × (Δ_S(x_i, x_i^n) − Δ_S(x_i, x_i^1)) / (n − 1)

In the above formula, the average distance between the instance and its neighbors is obtained by subtracting the minimum distance from the maximum distance and dividing by (n − 1). If the distance between the current instance and another instance is less than 0.35 times this average distance (0.35 being a parameter found to be efficient in testing), that instance is regarded as a neighbor of the current instance.
④ Suppose the missing label of x_i is l_i, and its mean neighborhood set S_R(x_i) contains Pos positive samples, i.e., samples with l_i = 1, and Neg negative samples, i.e., samples with l_i = −1. The missing label is then predicted from the numbers of positive and negative samples among the corresponding labels of the instance's neighbors.
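Steps ①-④ can be sketched in Python as follows. This is an illustrative sketch only: the threshold reading (0.35 times the average gap (d_max − d_min)/(n − 1), taken literally from the text), the tie-breaking rule, and the fallback for an empty neighborhood are assumptions not fixed by the description.

```python
import numpy as np

def mean_neighbors(X, i, ratio=0.35):
    # X: (n, |S|) matrix of instances restricted to candidate feature set S.
    # Euclidean distances from instance i to every other instance (step 1).
    d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    d[i] = np.inf                       # exclude the instance itself
    finite = d[np.isfinite(d)]
    if len(finite) < 2:
        return np.array([], dtype=int)
    # Average distance per step 3: (max - min) / (n - 1).
    avg = (finite.max() - finite.min()) / (len(finite) - 1)
    return np.where(d < ratio * avg)[0]

def predict_missing_label(labels, neighbors):
    # Step 4: majority vote over neighbor labels in {-1, +1}. Ties and the
    # empty neighborhood default to +1 here -- a choice the patent leaves open.
    if len(neighbors) == 0:
        return 1
    pos = int((labels[neighbors] == 1).sum())
    neg = int((labels[neighbors] == -1).sum())
    return 1 if pos >= neg else -1
```

For instance, with four 1-D instances at 0, 0.01, 1 and 2, only the instance at 0.01 falls inside the mean neighborhood of the instance at 0, and its label decides the vote.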
Preferably, the specific steps for calculating the dependency of a feature or feature set in steps S03 and S10 are as follows:
① Given S_R(x_i), i.e., the neighbor set of x_i found through the mean neighborhood relation R, let CARD(S_R(x_i)) denote the CARD value of instance x_i, used to compute the label consistency of the instance with its neighbors.
② After the feature set S is given, traverse each label in the label set L; if missing labels exist, predict them with the label prediction method. Then let X_S denote the set of all instances on the feature set S, traverse all instances in X_S, calculate the CARD value of each instance and accumulate the sum over all instances; after all instances in X_S have been processed, restore the missing labels and continue with the next label. This process is formulated as:

Dep_S = (1 / L_num) Σ_{i=1}^{L_num} [ (1/n) Σ_{j=1}^{n} CARD(S_R(x_j)) ]

In the above formula, L_num denotes the size of the label set L, i.e., the total number of labels; n denotes the size of the instance set X_S, i.e., the number of all instances on the feature set S; and CARD(S_R(x_j)) denotes the CARD value of instance x_j, used to calculate the consistency of x_j with its neighbors on label l_i. The resulting Dep_S denotes the dependency of the current feature set S on the labels, written γ(S). The dependency of a single feature f_i is computed the same way as that of a feature set: substituting S = {f_i} into the feature-set dependency calculation yields the dependency γ(f_i).
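The dependency calculation can be sketched as follows, assuming labels are already complete (missing entries predicted beforehand, per step S09). The patent does not spell out CARD; this sketch reads CARD(S_R(x_j)) as a 0/1 positive-region indicator (1 when every neighbor of x_j agrees with x_j on the current label), a standard neighborhood-rough-set choice and therefore an assumption here.

```python
import numpy as np

def _mean_neighbors(X, i, ratio=0.35):
    # Mean neighborhood relation: neighbors closer than
    # ratio * (d_max - d_min) / (n - 1), per the relation defined above.
    d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    d[i] = np.inf
    finite = d[np.isfinite(d)]
    if len(finite) < 2:
        return np.array([], dtype=int)
    avg = (finite.max() - finite.min()) / (len(finite) - 1)
    return np.where(d < ratio * avg)[0]

def dependency(X, Y):
    # X: (n, |S|) instance matrix on feature set S.
    # Y: (n, L_num) label matrix in {-1, +1}, assumed complete.
    # Returns Dep_S = sum of CARD values, averaged over instances and labels.
    n, l_num = X.shape[0], Y.shape[1]
    total = 0.0
    for i in range(l_num):
        for j in range(n):
            nbrs = _mean_neighbors(X, j)
            # CARD as a consistency indicator (assumed form, see lead-in).
            if len(nbrs) and np.all(Y[nbrs, i] == Y[j, i]):
                total += 1.0
    return total / (n * l_num)
```

On two well-separated clusters whose labels match the clusters, every instance is consistent with its neighbors and the dependency is 1; flipping one label halves it.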
Preferably, the specific step of determining whether the feature f_i is an important feature in step S11 is as follows:
① Judge whether the feature f_i would increase the dependency of the candidate feature set S, i.e., compare γ(S ∪ f_i) with the historical dependency Dep_S of the candidate set. If γ(S ∪ f_i) > Dep_S, the feature f_i is proven important: add f_i to the candidate set S, i.e., let S = S ∪ f_i, and update the historical dependency, i.e., let Dep_S = γ(S ∪ f_i); otherwise, proceed to the online redundancy update judgment.
Preferably, the specific steps of steps S12-S14 for determining whether a redundancy update of the candidate set S is required are as follows:
① Determine whether a redundancy update of the candidate set S is required: when γ(S ∪ f_i) = Dep_S and γ(f_i) > mean_Dep both hold, the redundancy update operation must be performed on the candidate set S.
② Add the feature f_i to the candidate set S, i.e., let S = S ∪ f_i. To treat all features in the candidate feature set fairly, randomly select a feature f_j from the candidate feature set S and calculate its importance δ_S(f_j), repeating until every feature has been evaluated once; the importance is calculated as follows:

δ_S(f_j) = γ(S) − γ(S − f_j)

③ If the importance δ_S(f_j) = 0, remove the feature f_j from the candidate set S, i.e., let S = S − f_j; otherwise, continue evaluating the remaining features of the candidate feature set S.
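The redundancy update of steps ①-③ can be sketched in Python as follows. The `gamma` callable (dependency γ(·) of a feature list) and the function name are assumptions for illustration.

```python
import random

def redundancy_update(S, gamma):
    # S: list of feature ids in the candidate set.
    # Visit each feature once in random order (the "fair" treatment of
    # step 2) and remove it when delta_S(f_j) = gamma(S) - gamma(S - f_j)
    # equals zero (step 3), i.e., the feature is redundant.
    for fj in random.sample(S, len(S)):
        if fj not in S:                  # may already have been removed
            continue
        rest = [f for f in S if f != fj]
        if gamma(S) - gamma(rest) == 0:
            S = rest
    return S
```

With a toy γ that counts only features 1 and 2 as informative, features 3 and 4 have zero importance and are removed regardless of the random visiting order.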
Preferably, the specific steps for calculating the average dependency threshold mean_Dep of step S16 are as follows:
① Traverse every feature f_i in the candidate feature set S, calculate its dependency γ(f_i), and update the average dependency threshold mean_Dep according to the following formula:

mean_Dep = (1 / |S|) Σ_{f_i ∈ S} γ(f_i)

where |S| denotes the size of the candidate set S, i.e., the number of features it contains, γ(f_i) is the dependency of the feature f_i, and f_i ∈ S.
By adopting the technical scheme, the invention has the following technical effects:
Based on the neighborhood rough set, the invention defines a new neighborhood relation that automatically selects an appropriate number of neighbors for every instance. Built on neighborhood rough set theory, the method requires no domain knowledge and, using the newly defined neighborhood relation, automatically selects suitable neighbors during online feature selection without specifying any parameters.
Addressing the problem that many existing online streaming feature selection algorithms can only perform feature selection on completely labeled datasets, the invention improves algorithm performance by predicting missing labels online in real time based on the similarity between an instance and its neighbors.
The invention considers both the importance of features and the size of the candidate feature set, keeping the candidate feature set as small as possible while maintaining its maximum dependency. It ultimately maximizes the dependency of the candidate feature set on the labels and selects features with high correlation and low redundancy.
The invention has wide application and is suitable for various datasets, for example, predicting bird species from audio, judging emotion categories from music data, and classifying protein functions.
Drawings
FIG. 1 is a schematic workflow diagram of the present invention;
FIG. 2 is a graph showing the results of the similarity test of data sets according to the present invention;
FIG. 3 is a diagram of the information of the datasets used in the present invention.
Detailed Description
The working principle and working steps of the invention are further explained below with reference to the drawings and the specific embodiments:
Existing online streaming feature selection algorithms require some parameters to be specified, and parameter selection is typically limited by domain knowledge; moreover, many existing online streaming feature selection algorithms can only perform feature selection on completely labeled datasets. To solve these problems, the invention provides a semi-supervised multi-label online streaming feature selection method based on a neighborhood rough set, which addresses the parameter limitation and the label limitation simultaneously. It considers both the importance of features and the redundancy of the feature set, maximizes the dependency of the candidate feature set on the labels, selects features with high correlation and low redundancy, and maximizes the dependency of the candidate feature set while selecting as few features as possible.
As shown in FIG. 1, the invention provides a semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, which comprises the following steps:
S01. Define the label set L, where l_j ∈ L is the j-th label in L;
S02. Define the candidate feature set as S, and initialize S = ∅;
S03. Define the dependency of the candidate feature set S as γ(S);
S04. Define the historical dependency of the candidate feature set S as Dep_S, and initialize Dep_S = 0;
S05. Define the average dependency threshold of the candidate feature set S as mean_Dep, and initialize mean_Dep = 0;
S06. Define the arrival time of the i-th feature as t_i, and initialize t_i = 0;
S07. Define the feature arriving at time t_i as f_i;
S08. Judge whether any labels are missing; if so, perform step S09; if not, proceed to step S10;
S09. Obtain the neighbors of the instances with missing labels according to the mean neighborhood relation, predict the values of the missing labels by analyzing the labels of similar instances, and restore the labels once the required values are obtained;
① First, the distance between instance x and instance y on the candidate set S = {s_1, s_2, …, s_|S|} is denoted Δ_S(x, y); the Euclidean distance is used here, calculated as follows:

Δ_S(x, y) = sqrt( Σ_{i=1}^{|S|} (s_ix − s_iy)² )

where s_ix denotes the value of instance x on feature s_i and s_iy denotes the value of instance y on feature s_i (1 ≤ i ≤ |S|).
② Let N_S(x_i) denote the neighbor sequence obtained by sorting the distances between x_i and the other instances under the feature subset S, specifically expressed as:

N_S(x_i) = {x_i^1, x_i^2, …, x_i^n}

where x_i^1 denotes the instance nearest to x_i, i.e., Δ_S(x_i, x_i^1) is the smallest value, and x_i^n denotes the instance farthest from x_i, i.e., Δ_S(x_i, x_i^n) is the largest value.
③ Let S_R(x_i) denote the neighbor set of x_i obtained from N_S(x_i) through the mean neighborhood relation R; an instance x_j ∈ S_R(x_i) must satisfy the following condition:

Δ_S(x_i, x_j) < 0.35 × (Δ_S(x_i, x_i^n) − Δ_S(x_i, x_i^1)) / (n − 1)

In the above formula, the average distance between the instance and its neighbors is obtained by subtracting the minimum distance from the maximum distance and dividing by (n − 1). If the distance between the current instance and another instance is less than 0.35 times this average distance (0.35 being a parameter found to be efficient in testing), that instance is regarded as a neighbor of the current instance.
④ Suppose the missing label of x_i is l_i, and its mean neighborhood set S_R(x_i) contains Pos positive samples, i.e., samples with l_i = 1, and Neg negative samples, i.e., samples with l_i = −1. The missing label is then predicted from the numbers of positive and negative samples among the corresponding labels of the instance's neighbors.
S10. Calculate the dependency γ(f_i) of the feature f_i;
① In the invention, missing labels are encountered when calculating feature dependency, and the dependency of the current feature is obtained after those missing labels are predicted. If a label is predicted incorrectly, the algorithm may select features that are highly uncorrelated with the original labels, reducing its efficiency. To verify the accuracy of missing-label prediction, a similarity test is performed after each prediction, verifying the validity of the algorithm through the following formula:

Sim(x) = L_x / L_num

In the above formula, L_x denotes the number of labels on which the neighbors agree with instance x, and L_num is the total number of labels. FIG. 2 shows the results of the similarity test on 12 datasets, and FIG. 3 gives the details of those datasets. Arts, Recreation and Science all come from Yahoo and are widely used for text classification. GPositive and GNegative are used to predict the subcellular locations of proteins based on their sequences. Birds is audio data with 10-second audio clips used to predict bird species. Emotions is music data consisting of 593 songs and 6 emotion categories, such as amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely and angry-aggressive. Yeast is used to predict the functional classes of genes and includes microarray expression data and phylogenetic profiles of 2417 yeast genes. Image and Scene are image data consisting of 2000 and 2407 images, respectively. Enron is text data based on a collection of email messages categorized into 53 topic categories. Genbase is a dataset for the functional classification of proteins: each instance is a protein and each label is a protein class.
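The similarity measure Sim(x) = L_x / L_num can be sketched as follows. The reading of L_x as "the number of labels on which every neighbor matches instance x" is one plausible interpretation of the description, so the aggregation rule here is an assumption.

```python
import numpy as np

def similarity(x_labels, nbr_labels):
    # x_labels: (L_num,) label vector of instance x in {-1, +1}.
    # nbr_labels: (k, L_num) labels of its k neighbors.
    # A label counts toward L_x when every neighbor agrees with x on it
    # (assumed reading, see lead-in); Sim = L_x / L_num.
    agree = np.all(nbr_labels == x_labels, axis=0)
    return agree.sum() / len(x_labels)
```

For an instance with labels (1, −1, 1) and two neighbors agreeing on the first two labels but not the third, the similarity is 2/3.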
As can be seen from FIG. 2(a), the neighbors selected by the neighborhood relation have a high degree of similarity with the current instance. FIG. 2(b) shows that the minimum similarity is 65%, while on datasets with many labels the similarity reaches 90% and above, indicating that the proposed method is effective for label prediction.
② Given S_R(x_i), i.e., the neighbor set of x_i found through the mean neighborhood relation R, let CARD(S_R(x_i)) denote the CARD value of instance x_i, used to compute the label consistency of the instance with its neighbors.
③ After the feature set S is given, traverse each label in the label set L; if missing labels exist, predict them with the label prediction method. Then let X_S denote the set of all instances on the feature set S, traverse all instances in X_S, calculate the CARD value of each instance and accumulate the sum over all instances; after all instances in X_S have been processed, restore the missing labels and continue with the next label. This process is formulated as:

Dep_S = (1 / L_num) Σ_{i=1}^{L_num} [ (1/n) Σ_{j=1}^{n} CARD(S_R(x_j)) ]

In the above formula, L_num denotes the size of the label set L, i.e., the total number of labels; n denotes the size of the instance set X_S, i.e., the number of all instances on the feature set S; and CARD(S_R(x_j)) denotes the CARD value of instance x_j, used to calculate the consistency of x_j with its neighbors on label l_i. The resulting Dep_S denotes the dependency of the current feature set S on the labels, written γ(S). The dependency of a single feature f_i is computed the same way as that of a feature set: substituting S = {f_i} into the feature-set dependency calculation yields the dependency γ(f_i).
S11. Judge whether the feature f_i is an important feature, i.e., whether γ(S ∪ f_i) > Dep_S holds; if so, perform step S15; otherwise, perform steps S12-S14;
S12. Judge whether a redundancy update of the candidate set S is needed; if so, perform steps S13-S14; otherwise, proceed to step S16;
① In the feature selection process, it is impossible to perform an online redundancy update for every feature. To select high-quality features while reducing computational cost, the proposed algorithm introduces a judgment on feature dependency, in which the threshold is adjusted automatically according to the average dependency of the selected feature set.
② Determine whether a redundancy update of the candidate set S is required: when γ(S ∪ f_i) = Dep_S and γ(f_i) > mean_Dep both hold, the redundancy update operation must be performed on the candidate set S.
S13. Insert the feature f_i into the candidate set S, i.e., let S = S ∪ f_i;
S14. Perform the redundancy update operation on the candidate set S, then proceed to step S16;
① After the feature f_i is added to the candidate set S, i.e., S = S ∪ f_i, in order to treat all features in the candidate feature set fairly, one feature f_j is randomly selected from the candidate feature set S and its importance δ_S(f_j) is calculated, repeating until every feature has been evaluated once; the importance is calculated as follows:

δ_S(f_j) = γ(S) − γ(S − f_j)

② If the importance δ_S(f_j) = 0, the feature f_j is removed from the candidate set S, i.e., S = S − f_j; otherwise, the remaining features of the candidate feature set S continue to be evaluated.
③ The purpose of the redundancy update is to ensure that the algorithm maximizes the dependency of the candidate feature set on the labels and selects features with high correlation and low redundancy. If the importance δ_S(f_j) of a feature f_j equals 0, the dependency γ(S) of the candidate feature set is proven unchanged whether or not f_j is included, which shows that f_j is redundant; to keep the candidate feature set minimal, f_j is removed from it.
S15. Insert the feature f_i into the candidate set S, i.e., let S = S ∪ f_i, and update the value of Dep_S;
① The invention judges whether the feature f_i arriving at time t_i is important by maintaining the maximum dependency of the candidate feature set, i.e., features are selected with the goal of minimizing the number of selected features while keeping the dependency of the selected feature set on the label set L as high as possible.
② If adding the feature f_i to the candidate set S increases the dependency of S, i.e., γ(S ∪ f_i) > Dep_S, then to maintain the maximum dependency of the candidate feature set, f_i is added directly to S, i.e., S = S ∪ f_i, and the historical dependency is updated, i.e., Dep_S = γ(S). If γ(S ∪ f_i) = Dep_S, the feature f_i does not affect the maximum dependency of the candidate feature set, so the algorithm continues to judge whether a redundancy update of the candidate feature set is needed. If γ(S ∪ f_i) < Dep_S, the feature f_i is unimportant and requires no further processing.
S16. Update the average dependency threshold mean_Dep;
① Traverse every feature f_i in the candidate feature set S, calculate its dependency γ(f_i), and update the average dependency threshold mean_Dep according to the following formula:

mean_Dep = (1 / |S|) Σ_{f_i ∈ S} γ(f_i)

where |S| denotes the size of the candidate set S, i.e., the number of features it contains, γ(f_i) is the dependency of the feature f_i, and f_i ∈ S.
S17. Judge whether unprocessed features remain; if so, return to step S08; if not, output the optimal feature set S.
The advantages of the invention are as follows:
Based on the neighborhood rough set, the invention defines a new neighborhood relation that automatically selects an appropriate number of neighbors for every instance. Built on neighborhood rough set theory, the method requires no domain knowledge and, using the newly defined neighborhood relation, automatically selects suitable neighbors during online feature selection without specifying any parameters.
Addressing the problem that many existing online streaming feature selection algorithms can only perform feature selection on completely labeled datasets, the invention improves algorithm performance by predicting missing labels online in real time based on the similarity between an instance and its neighbors.
The invention considers both the importance of features and the size of the candidate feature set, keeping the candidate feature set as small as possible while maintaining its maximum dependency. It ultimately maximizes the dependency of the candidate feature set on the labels and selects features with high correlation and low redundancy.
The invention has wide application and is suitable for various datasets, for example, predicting bird species from audio, judging emotion categories from music data, and classifying protein functions. Finally, it should be noted that the embodiments described above are only for illustrating the technical solution of the invention, not limiting it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical schemes described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced with equivalents, without departing from the scope of the embodiments of the invention.

Claims (7)

1. A semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, characterized by comprising the following steps:
S01. Define a label set L, where l_j ∈ L is the j-th label in L;
S02. Define the candidate feature set as S, and initialize S = ∅;
S03. Define the dependency of the candidate feature set S as γ(S);
S04. Define the historical dependency of the candidate feature set S as Deps, and initialize Deps = 0;
S05. Define the average dependency threshold of the candidate feature set S as mean_Dep, and initialize mean_Dep = 0;
S06. Define the arrival time of the i-th feature as t_i, and initialize t_i = 0;
S07. Define the feature arriving at time t_i as f_i;
S08. Judge whether missing labels exist; if so, go to step S09; if not, go to step S10;
S09. Obtain the neighbors of the instances with missing labels according to the mean neighborhood relation, predict the values of the missing labels by analyzing the labels of similar instances, and restore the labels once the required values are obtained;
S10. Calculate the dependency γ(f_i) of feature f_i;
S11. Judge whether f_i is an important feature, i.e., whether γ(S ∪ f_i) > Deps holds; if so, go to step S15; otherwise, perform steps S12-S14;
S12. Judge whether a redundancy update of the candidate set S is needed; if so, perform steps S13-S14; if not, go to step S16;
S13. Insert feature f_i into the candidate set S, i.e., let S = S ∪ f_i;
S14. Perform the redundancy update operation on the candidate set S, then go to step S16;
S15. Insert feature f_i into the candidate set S, i.e., let S = S ∪ f_i, and update the value of Deps;
S16. Update the average dependency threshold mean_Dep;
S17. Judge whether unprocessed features remain; if so, return to step S08; if not, output the most useful feature set S.
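The steps above can be sketched as a short online loop. This is a minimal illustrative sketch, not the patented implementation: `gamma` is an assumed black-box dependency function over feature lists, `recover_missing_labels` stands in for the label-recovery step S08-S09, and all identifiers are assumptions.

```python
# A minimal sketch of the online loop in claim 1 (steps S01-S17).

def select_stream_features(feature_stream, gamma, recover_missing_labels):
    S = []             # candidate feature set, initially empty (S02)
    Deps = 0.0         # historical dependency of S (S04)
    mean_Dep = 0.0     # average dependency threshold (S05)
    for f_i in feature_stream:               # features arrive one by one (S06-S07)
        recover_missing_labels()             # predict and restore missing labels (S08-S09)
        dep_f = gamma([f_i])                 # single-feature dependency (S10)
        if gamma(S + [f_i]) > Deps:          # important feature (S11, S15)
            S.append(f_i)
            Deps = gamma(S)
        elif gamma(S + [f_i]) == Deps and dep_f > mean_Dep:  # redundancy path (S12-S14)
            S.append(f_i)
            S = redundancy_update(S, gamma)
        # otherwise f_i is discarded
        mean_Dep = sum(gamma([f]) for f in S) / len(S) if S else 0.0  # S16
    return S                                 # most useful feature set (S17)

def redundancy_update(S, gamma):
    # drop features whose importance gamma(S) - gamma(S - f) is zero (S14)
    for f in list(S):
        if gamma(S) - gamma([g for g in S if g != f]) == 0:
            S = [g for g in S if g != f]
    return S
```

With a toy `gamma` that sums per-feature weights, a zero-weight feature is never admitted, while each weight-bearing feature raises the dependency and is kept.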
2. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of missing label prediction in steps S08-S09 are as follows:
① The distance between instance x and instance y on the candidate set S = {s_1, s_2, …, s_|S|} is denoted by Δ(x, y); here the Euclidean distance is used, calculated as follows:
Δ(x, y) = √( Σ_{i=1}^{|S|} (s_ix − s_iy)² )
where s_ix denotes the value of instance x on feature s_i (1 ≤ i ≤ |S|), and s_iy denotes the value of instance y on feature s_i;
② Let N_S(x_i) denote the neighbor sequence obtained by sorting, under feature subset S, the distances between x_i and the other instances, expressed as:
N_S(x_i) = { x_i^1, x_i^2, …, x_i^(n−1) }
where x_i^1 denotes the instance nearest to x_i, i.e., Δ(x_i, x_i^1) is the smallest, and x_i^(n−1) denotes the instance farthest from x_i, i.e., Δ(x_i, x_i^(n−1)) is the largest;
③ Let S_R(x_i) denote the neighbor set of x_i found, given N_S(x_i), by means of the mean neighborhood relation R; an instance x_j ∈ S_R(x_i) must satisfy the condition:
Δ(x_i, x_j) < 0.35 × ( Δ(x_i, x_i^(n−1)) − Δ(x_i, x_i^1) ) / (n − 1)
That is, the average distance between the instance and its neighbors is obtained by dividing the maximum distance minus the minimum distance by n − 1; if the distance between another instance and the current instance is less than this average distance multiplied by 0.35, that instance is regarded as a neighbor of the current instance;
④ Suppose the missing label of x_i is l_i, and its mean neighborhood set S_R(x_i) contains Pos positive samples, i.e., samples with l_i = 1, and Neg negative samples, i.e., samples with l_i = −1; the missing label is then predicted from the numbers of positive and negative samples of the corresponding label among the instance's neighbors.
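As an illustration of claims 2 and 3, the distance, the mean neighborhood relation with the 0.35 factor, and the majority-vote label recovery can be sketched as below. This is a hedged sketch: the data layout (instances as coordinate lists, labels in {1, −1, None}) and all function names are assumptions, and ties are broken toward the positive class here, which the claim does not specify.

```python
import math

# Sketch of claims 2-3: Euclidean distance over the candidate features,
# the mean neighborhood relation, and majority-vote missing-label recovery.

def euclidean(x, y):
    # distance between two instances over the candidate feature values
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_neighbors(i, X, factor=0.35):
    others = [j for j in range(len(X)) if j != i]
    dists = {j: euclidean(X[i], X[j]) for j in others}
    ordered = sorted(dists.values())
    n = len(X)
    # "average distance": (max distance - min distance) / (n - 1), per claim 2
    avg = (ordered[-1] - ordered[0]) / (n - 1)
    # x_j is a neighbor when its distance is below 0.35 times that average
    return [j for j in others if dists[j] < factor * avg]

def predict_missing_label(i, X, labels):
    # majority vote over the mean-neighborhood set; ties go to the positive
    # class here (an assumption -- the claim does not specify tie-breaking)
    votes = [labels[j] for j in mean_neighbors(i, X) if labels[j] is not None]
    return 1 if votes.count(1) >= votes.count(-1) else -1
```

On a toy set where two instances lie close to x_0 and one lies far away, only the close instances fall under the 0.35 threshold, and their labels decide the vote.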
3. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 2, characterized in that: the value 0.35 in the above formula is an empirically chosen parameter that yields higher efficiency.
4. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of calculating the dependency of a feature or feature set in steps S03 and S10 are as follows:
① Given S_R(x_i), i.e., the neighbor set of x_i found by means of the mean neighborhood relation R, let CARD(S_R(x_i)) denote the CARD value of instance x_i, used to compute the label consistency between the instance and its neighbors;
② Given the feature set S, traverse each label in the label set L; if missing labels exist, predict them with the label prediction method. Let X_S denote the set of all instances on feature set S; traverse all instances in X_S, compute the CARD value of each instance, and accumulate the CARD values of all instances; after all instances in X_S have been processed, restore the missing labels and continue with the next label. The above process is formulated as:
dep_S = ( Σ_{i=1}^{L_num} Σ_{j=1}^{n} CARD(S_R(x_j)) ) / (L_num × n)
In the above formula, L_num denotes the size of the label set L, i.e., the total number of labels; n denotes the size of the instance set X_S, i.e., the number of all instances on feature set S; and CARD(S_R(x_j)) denotes the CARD value of instance x_j, used to compute the consistency of x_j with its neighbors on label l_i. The resulting dep_S is the dependency of the current feature set S on the labels, denoted γ(S). The dependency of a single feature f_i is computed in the same way as that of a feature set: substituting S = {f_i} into the feature-set dependency computation yields γ(f_i).
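The dependency computation can be sketched as follows. The patent does not give a closed form for CARD, so it is modeled here as the fraction of an instance's neighbors that agree with it on the current label, an assumption made only so the sketch runs; `neighbors_of(j)` stands in for the mean neighborhood set S_R(x_j), and all names are illustrative.

```python
# A hedged sketch of the dependency computation in claim 4.

def card_value(j, labels, neighbors_of):
    nbrs = neighbors_of(j)
    if not nbrs:
        return 0.0
    # label consistency of instance j with its neighbors (assumed CARD model)
    return sum(1 for k in nbrs if labels[k] == labels[j]) / len(nbrs)

def dependency(n, label_matrix, neighbors_of):
    # dep_S: CARD summed over every label row and every instance,
    # then normalized by L_num * n as in the claim's formula
    total = sum(card_value(j, labels, neighbors_of)
                for labels in label_matrix   # one row per label l_i
                for j in range(n))
    return total / (len(label_matrix) * n)
```

For three instances that are all mutual neighbors, a fully consistent label row contributes 1 per instance, while a row with one dissenting instance contributes less, and the result is the normalized sum over both rows.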
5. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of determining whether feature f_i is an important feature in step S11 are as follows:
Judge whether feature f_i can improve the dependency of the candidate feature set S, i.e., compare γ(S ∪ f_i) with the historical dependency Deps of the candidate set S; if γ(S ∪ f_i) > Deps, feature f_i is proved to be an important feature: add f_i to the candidate set S, i.e., let S = S ∪ f_i, and update the historical dependency, i.e., let Deps = γ(S ∪ f_i); otherwise, perform the online redundancy-update judgment.
6. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of determining whether the candidate set S needs a redundancy update in steps S12-S14 are as follows:
① Judge whether the candidate set S needs a redundancy update: when the conditions γ(S ∪ f_i) = Deps and γ(f_i) > mean_Dep are both satisfied, perform the redundancy update operation on S;
② Add feature f_i to the candidate set S, i.e., let S = S ∪ f_i; to treat all features in the candidate feature set fairly, randomly select a feature f_j from the candidate feature set S and calculate its importance δ_S(f_j), repeating until every feature has been evaluated once, where the importance is calculated as follows:
δ_S(f_j) = γ(S) − γ(S − f_j)
③ If the importance δ_S(f_j) = 0, remove f_j from the candidate set S, i.e., let S = S − f_j; otherwise, continue evaluating the remaining features in the candidate feature set S.
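The redundancy update of steps ②-③ can be sketched as below; the random visiting order mirrors the claim's requirement to treat all features fairly. `gamma` is an assumed, externally supplied dependency function over feature lists, and the zero-importance test uses exact equality as in the claim.

```python
import random

# Sketch of the redundancy-update operation in claim 6: visit the features of
# the candidate set S in random order, compute each feature's importance
# delta_S(f_j) = gamma(S) - gamma(S - f_j), and drop zero-importance features.

def prune_redundant(S, gamma, seed=None):
    rng = random.Random(seed)
    order = S[:]
    rng.shuffle(order)                       # random, fair visiting order
    kept = S[:]
    for f in order:
        if f not in kept:                    # may already have been removed
            continue
        importance = gamma(kept) - gamma([g for g in kept if g != f])
        if importance == 0:                  # redundant: contributes nothing
            kept = [g for g in kept if g != f]
    return kept
```

With a weight-sum `gamma`, a zero-weight feature has zero importance in any visiting order and is pruned, while weight-bearing features survive.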
7. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of calculating the average dependency threshold mean_Dep in step S16 are as follows:
Traverse each feature f_i in the candidate feature set S, calculate its dependency γ(f_i), and update the average dependency threshold mean_Dep according to the following formula:
mean_Dep = ( Σ_{f_i ∈ S} γ(f_i) ) / |S|
where |S| denotes the size of the candidate set S, i.e., the number of features it contains, and γ(f_i) is the dependency of feature f_i, f_i ∈ S.
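The threshold update reduces to a simple arithmetic mean over single-feature dependencies; a sketch, with `gamma` an assumed dependency function as in the earlier claims:

```python
# Claim 7 as code: mean_Dep is the mean of gamma(f_i) over the candidate set S.

def mean_dep(S, gamma):
    if not S:
        return 0.0                  # threshold stays at its initial value
    return sum(gamma([f]) for f in S) / len(S)
```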
CN202111403199.9A 2021-11-24 2021-11-24 Semi-supervised multi-label online stream feature selection method based on neighborhood rough set Active CN114091607B (en)


Publications (2)

Publication Number Publication Date
CN114091607A CN114091607A (en) 2022-02-25
CN114091607B true CN114091607B (en) 2024-05-03






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant