CN114091607B - Semi-supervised multi-label online stream feature selection method based on neighborhood rough set - Google Patents


Publication number: CN114091607B (granted patent)
Application number: CN202111403199.9A
Application publication: CN114091607A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: feature, candidate, label, instance, dependence
Inventors: 梁顺攀 (Liang Shunpan), 刘泽 (Liu Ze), 赵俊杰 (Zhao Junjie)
Assignee (original and current): Yanshan University
Application filed by Yanshan University; application granted; legal status: Active


Classifications

    • G06F18/214 — G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F16/2465 — G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; Information retrieval; Query processing; Special types of queries; Query processing support for facilitating data mining operations in structured databases


Abstract

The invention relates to a semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, comprising the following steps: new features flow into the model one by one in the form of a stream; neighbors of the instances with missing labels are obtained through a defined neighborhood relation, and the missing labels are predicted; the dependency of each new feature is obtained; online feature importance assessment is performed on the new feature; an online redundancy update is performed on the candidate set; these steps are repeated until no unprocessed features remain, and finally an optimal feature subset is obtained. The invention requires no domain knowledge, adapts to various datasets without setting any parameters, and has strong generalization ability. It can handle datasets with missing labels and can select effective features. It maximizes the dependency of the candidate feature set on the labels and selects features with high correlation and low redundancy.

Description

Semi-supervised multi-label online stream feature selection method based on neighborhood rough set
Technical Field
The invention relates to a semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, belongs to the technical field of pattern recognition and data mining, and in particular relates to real-time prediction of missing labels from feature information.
Background
Online streaming feature selection has been accepted by researchers as a branch of feature selection. Features can be generated at any time, so traditional batch-processing methods are inapplicable: feature selection cannot begin until all features have arrived. In practical applications, features are generated dynamically and arrive one by one over time, as in real-time environmental monitoring and analysis, cancer prediction, and remote-sensing image classification. In this setting, how to process streaming features dynamically has attracted great attention.
In the real world, multi-label objects are ubiquitous. For example, in text classification and emotion recognition, it is often necessary to determine the categories to which a news item belongs, such as finance, sports and entertainment, and a news item may belong to two or more categories at the same time. Multi-label data has more labels than single-label data, which undoubtedly increases the difficulty of feature selection. Effectively removing redundant and irrelevant features has become a major problem, while collecting accurate and complete datasets is itself not easy: manual labeling is expensive and labor-intensive. Therefore, when the label data is incomplete, feature selection must still be performed while maintaining good results.
To deal with streaming features, various online feature selection methods have been proposed, such as OSFS, alpha-investing, OS-NRRSARA-SA and OFS-A3M. Alpha-investing can handle large datasets without the runtime growing exponentially; unfortunately, processing each feature only once increases the redundancy of the selected features, leading to inaccurate and unstable results. OSFS made a breakthrough in feature selection by proposing a novel framework that can handle streaming features. Although OSFS achieves higher prediction accuracy than alpha-investing, the parameter α must be specified before feature selection, which affects the independence tests. Rough sets have proven to be an effective tool for feature selection and knowledge discovery, and many studies use them for feature selection. For example, OS-NRRSARA-SA is an online streaming feature selection method based on the classical rough set that requires no parameters to be specified before use. However, classical rough sets cannot handle the many real-valued features found in practice, so the neighborhood rough set was proposed to solve this problem: it supports both continuous and discrete data. OFS-A3M uses the neighborhood rough set for online feature selection and proposes a new neighborhood relation that enables the algorithm to select the best number of neighbors for each instance on different datasets, solving the problem that traditional feature selection approaches require appropriate parameter values to be assigned for each dataset. However, OFS-A3M can only handle single-label, completely labeled datasets.
Disclosure of Invention
Existing online streaming feature selection algorithms require some parameters to be specified, and parameter selection is typically limited by domain knowledge; moreover, many existing online streaming feature selection algorithms can only perform feature selection on completely labeled datasets. To solve these problems, the invention provides a semi-supervised multi-label online streaming feature selection algorithm based on a neighborhood rough set, which addresses the parameter limitation and the label limitation simultaneously. It considers both the importance of features and the redundancy of the feature set, maximizes the dependency of the candidate feature set on the labels, selects features with high correlation and low redundancy, and maximizes the dependency of the candidate feature set while selecting as few features as possible.
Specifically, the invention provides a semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, which comprises the following steps:
S01. Define the label set L, where l_j ∈ L is the j-th label in L;
S02. Define the candidate feature set as S, and initialize S = ∅;
S03. Define the dependency of the candidate feature set S as γ(S);
S04. Define the historical dependency of the candidate feature set S as Dep_S, and initialize Dep_S = 0;
S05. Define the average dependency threshold of the candidate feature set S as mean_Dep, and initialize mean_Dep = 0;
S06. Define the arrival time of the i-th feature as t_i, and initialize t_i = 0;
S07. Define the feature arriving at time t_i as f_i;
S08. Judge whether any labels are missing; if so, perform step S09; if not, proceed to step S10;
S09. Obtain the neighbors of the instances with missing labels according to the mean neighborhood relation, predict the values of the missing labels by analyzing the labels of similar instances, and restore the labels once the required values are obtained;
S10. Calculate the dependency γ(f_i) of the feature f_i;
S11. Judge whether the feature f_i is an important feature, i.e., whether γ(S ∪ f_i) > Dep_S holds; if so, perform step S15; otherwise, perform steps S12-S14;
S12. Judge whether a redundancy update of the candidate set S is needed; if so, perform steps S13-S14; otherwise, proceed to step S16;
S13. Insert the feature f_i into the candidate set S, i.e., let S = S ∪ f_i;
S14. Perform the redundancy update operation on the candidate set S, then proceed to step S16;
S15. Insert the feature f_i into the candidate set S, i.e., let S = S ∪ f_i, and update the value of Dep_S;
S16. Update the average dependency threshold mean_Dep;
S17. Judge whether unprocessed features remain; if so, return to step S08; if not, output the optimal feature set S.
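The control flow of steps S08-S17 can be sketched as a short Python loop. This is an illustrative sketch, not the claimed implementation: the `stream` iterable, the `gamma` callable (which is assumed to return the dependency γ(·) of a feature list with missing labels already handled per steps S08-S10), and the helper names are assumptions introduced here.

```python
import random

def redundancy_update(S, gamma):
    # Step S14: visit each feature once in random order and remove it when
    # its importance delta_S(f_j) = gamma(S) - gamma(S - f_j) equals zero.
    for fj in random.sample(S, len(S)):
        if fj not in S:                        # already removed
            continue
        rest = [f for f in S if f != fj]
        if gamma(S) - gamma(rest) == 0:
            S = rest
    return S

def online_stream_feature_selection(stream, gamma):
    # stream yields feature ids one by one (steps S06-S07).
    S, dep_s, mean_dep = [], 0.0, 0.0          # steps S02, S04, S05
    for fi in stream:
        g_new = gamma(S + [fi])
        if g_new > dep_s:                      # S11 -> S15: important feature
            S, dep_s = S + [fi], g_new
        elif g_new == dep_s and gamma([fi]) > mean_dep:  # S12 -> S13-S14
            S = redundancy_update(S + [fi], gamma)
        # S16: refresh the average dependency threshold over kept features
        mean_dep = sum(gamma([f]) for f in S) / len(S) if S else 0.0
    return S                                   # S17: optimal feature set
```

With a toy dependency function that rewards only features 0 and 2, the loop keeps exactly those two features and discards the rest.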
Preferably, the specific steps for predicting missing labels described in steps S08-S09 are as follows:
① The distance between instance x and instance y on the candidate set S = {s_1, s_2, …, s_|S|} is denoted Δ_S(x, y); the Euclidean distance is used here, calculated as follows:

Δ_S(x, y) = sqrt( Σ_{i=1}^{|S|} (s_ix − s_iy)² )

where s_ix denotes the value of instance x on feature s_i and s_iy denotes the value of instance y on feature s_i (1 ≤ i ≤ |S|).
② Let N_S(x_i) denote the neighbor sequence obtained by sorting the distances between x_i and the other instances under the feature subset S, specifically expressed as:

N_S(x_i) = {x_i^1, x_i^2, …, x_i^n}

where x_i^1 denotes the instance nearest to x_i, i.e., Δ_S(x_i, x_i^1) is the smallest value, and x_i^n denotes the instance farthest from x_i, i.e., Δ_S(x_i, x_i^n) is the largest value.
③ Let S_R(x_i) denote the neighbor set of x_i obtained from N_S(x_i) through the mean neighborhood relation R; an instance x_j ∈ S_R(x_i) must satisfy the following condition:

Δ_S(x_i, x_j) < 0.35 × (Δ_S(x_i, x_i^n) − Δ_S(x_i, x_i^1)) / (n − 1)

In the above formula, the average distance between the instance and its neighbors is obtained by subtracting the minimum distance from the maximum distance and dividing by (n − 1). If the distance between the current instance and another instance is less than 0.35 times this average distance (0.35 being a parameter found to be efficient in testing), that instance is regarded as a neighbor of the current instance.
④ Suppose the missing label of x_i is l_i, and its mean neighborhood set S_R(x_i) contains Pos positive samples, i.e., samples with l_i = 1, and Neg negative samples, i.e., samples with l_i = −1. The missing label is then predicted from the numbers of positive and negative samples among the corresponding labels of the instance's neighbors.
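Steps ①-④ can be sketched in Python as follows. This is an illustrative sketch only: the threshold reading (0.35 times the average gap (d_max − d_min)/(n − 1), taken literally from the text), the tie-breaking rule, and the fallback for an empty neighborhood are assumptions not fixed by the description.

```python
import numpy as np

def mean_neighbors(X, i, ratio=0.35):
    # X: (n, |S|) matrix of instances restricted to candidate feature set S.
    # Euclidean distances from instance i to every other instance (step 1).
    d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    d[i] = np.inf                       # exclude the instance itself
    finite = d[np.isfinite(d)]
    if len(finite) < 2:
        return np.array([], dtype=int)
    # Average distance per step 3: (max - min) / (n - 1).
    avg = (finite.max() - finite.min()) / (len(finite) - 1)
    return np.where(d < ratio * avg)[0]

def predict_missing_label(labels, neighbors):
    # Step 4: majority vote over neighbor labels in {-1, +1}. Ties and the
    # empty neighborhood default to +1 here -- a choice the patent leaves open.
    if len(neighbors) == 0:
        return 1
    pos = int((labels[neighbors] == 1).sum())
    neg = int((labels[neighbors] == -1).sum())
    return 1 if pos >= neg else -1
```

For instance, with four 1-D instances at 0, 0.01, 1 and 2, only the instance at 0.01 falls inside the mean neighborhood of the instance at 0, and its label decides the vote.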
Preferably, the specific steps for calculating the dependency of a feature or feature set in steps S03 and S10 are as follows:
① Given S_R(x_i), i.e., the neighbor set of x_i found through the mean neighborhood relation R, let CARD(S_R(x_i)) denote the CARD value of instance x_i, used to compute the label consistency of the instance with its neighbors.
② After the feature set S is given, traverse each label in the label set L; if missing labels exist, predict them with the label prediction method. Then let X_S denote the set of all instances on the feature set S, traverse all instances in X_S, calculate the CARD value of each instance and accumulate the sum over all instances; after all instances in X_S have been processed, restore the missing labels and continue with the next label. This process is formulated as:

Dep_S = (1 / L_num) Σ_{i=1}^{L_num} [ (1/n) Σ_{j=1}^{n} CARD(S_R(x_j)) ]

In the above formula, L_num denotes the size of the label set L, i.e., the total number of labels; n denotes the size of the instance set X_S, i.e., the number of all instances on the feature set S; and CARD(S_R(x_j)) denotes the CARD value of instance x_j, used to calculate the consistency of x_j with its neighbors on label l_i. The resulting Dep_S denotes the dependency of the current feature set S on the labels, written γ(S). The dependency of a single feature f_i is computed the same way as that of a feature set: substituting S = {f_i} into the feature-set dependency calculation yields the dependency γ(f_i).
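The dependency calculation can be sketched as follows, assuming labels are already complete (missing entries predicted beforehand, per step S09). The patent does not spell out CARD; this sketch reads CARD(S_R(x_j)) as a 0/1 positive-region indicator (1 when every neighbor of x_j agrees with x_j on the current label), a standard neighborhood-rough-set choice and therefore an assumption here.

```python
import numpy as np

def _mean_neighbors(X, i, ratio=0.35):
    # Mean neighborhood relation: neighbors closer than
    # ratio * (d_max - d_min) / (n - 1), per the relation defined above.
    d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    d[i] = np.inf
    finite = d[np.isfinite(d)]
    if len(finite) < 2:
        return np.array([], dtype=int)
    avg = (finite.max() - finite.min()) / (len(finite) - 1)
    return np.where(d < ratio * avg)[0]

def dependency(X, Y):
    # X: (n, |S|) instance matrix on feature set S.
    # Y: (n, L_num) label matrix in {-1, +1}, assumed complete.
    # Returns Dep_S = sum of CARD values, averaged over instances and labels.
    n, l_num = X.shape[0], Y.shape[1]
    total = 0.0
    for i in range(l_num):
        for j in range(n):
            nbrs = _mean_neighbors(X, j)
            # CARD as a consistency indicator (assumed form, see lead-in).
            if len(nbrs) and np.all(Y[nbrs, i] == Y[j, i]):
                total += 1.0
    return total / (n * l_num)
```

On two well-separated clusters whose labels match the clusters, every instance is consistent with its neighbors and the dependency is 1; flipping one label halves it.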
Preferably, the specific step of determining whether the feature f_i is an important feature in step S11 is as follows:
① Judge whether the feature f_i would increase the dependency of the candidate feature set S, i.e., compare γ(S ∪ f_i) with the historical dependency Dep_S of the candidate set. If γ(S ∪ f_i) > Dep_S, the feature f_i is proven important: add f_i to the candidate set S, i.e., let S = S ∪ f_i, and update the historical dependency, i.e., let Dep_S = γ(S ∪ f_i); otherwise, proceed to the online redundancy update judgment.
Preferably, the specific steps of steps S12-S14 for determining whether a redundancy update of the candidate set S is required are as follows:
① Determine whether a redundancy update of the candidate set S is required: when γ(S ∪ f_i) = Dep_S and γ(f_i) > mean_Dep both hold, the redundancy update operation must be performed on the candidate set S.
② Add the feature f_i to the candidate set S, i.e., let S = S ∪ f_i. To treat all features in the candidate feature set fairly, randomly select a feature f_j from the candidate feature set S and calculate its importance δ_S(f_j), repeating until every feature has been evaluated once; the importance is calculated as follows:

δ_S(f_j) = γ(S) − γ(S − f_j)

③ If the importance δ_S(f_j) = 0, remove the feature f_j from the candidate set S, i.e., let S = S − f_j; otherwise, continue evaluating the remaining features of the candidate feature set S.
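The redundancy update of steps ①-③ can be sketched in Python as follows. The `gamma` callable (dependency γ(·) of a feature list) and the function name are assumptions for illustration.

```python
import random

def redundancy_update(S, gamma):
    # S: list of feature ids in the candidate set.
    # Visit each feature once in random order (the "fair" treatment of
    # step 2) and remove it when delta_S(f_j) = gamma(S) - gamma(S - f_j)
    # equals zero (step 3), i.e., the feature is redundant.
    for fj in random.sample(S, len(S)):
        if fj not in S:                  # may already have been removed
            continue
        rest = [f for f in S if f != fj]
        if gamma(S) - gamma(rest) == 0:
            S = rest
    return S
```

With a toy γ that counts only features 1 and 2 as informative, features 3 and 4 have zero importance and are removed regardless of the random visiting order.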
Preferably, the specific steps for calculating the average dependency threshold mean_Dep of step S16 are as follows:
① Traverse every feature f_i in the candidate feature set S, calculate its dependency γ(f_i), and update the average dependency threshold mean_Dep according to the following formula:

mean_Dep = (1 / |S|) Σ_{f_i ∈ S} γ(f_i)

where |S| denotes the size of the candidate set S, i.e., the number of features it contains, γ(f_i) is the dependency of the feature f_i, and f_i ∈ S.
By adopting the technical scheme, the invention has the following technical effects:
Based on the neighborhood rough set, the invention defines a new neighborhood relation that automatically selects an appropriate number of neighbors for every instance. Built on neighborhood rough set theory, the method requires no domain knowledge and, using the newly defined neighborhood relation, automatically selects suitable neighbors during online feature selection without specifying any parameters.
Addressing the problem that many existing online streaming feature selection algorithms can only perform feature selection on completely labeled datasets, the invention improves algorithm performance by predicting missing labels online in real time based on the similarity between an instance and its neighbors.
The invention considers both the importance of features and the size of the candidate feature set, keeping the candidate feature set as small as possible while maintaining its maximum dependency. It ultimately maximizes the dependency of the candidate feature set on the labels and selects features with high correlation and low redundancy.
The invention has wide application and is suitable for various datasets, for example, predicting bird species from audio, judging emotion categories from music data, and classifying protein functions.
Drawings
FIG. 1 is a schematic workflow diagram of the present invention;
FIG. 2 is a graph showing the results of the similarity test of data sets according to the present invention;
FIG. 3 is a diagram of the information of the datasets used in the present invention.
Detailed Description
The working principle and working steps of the invention are further explained below with reference to the drawings and the specific embodiments:
Existing online streaming feature selection algorithms require some parameters to be specified, and parameter selection is typically limited by domain knowledge; moreover, many existing online streaming feature selection algorithms can only perform feature selection on completely labeled datasets. To solve these problems, the invention provides a semi-supervised multi-label online streaming feature selection method based on a neighborhood rough set, which addresses the parameter limitation and the label limitation simultaneously. It considers both the importance of features and the redundancy of the feature set, maximizes the dependency of the candidate feature set on the labels, selects features with high correlation and low redundancy, and maximizes the dependency of the candidate feature set while selecting as few features as possible.
As shown in FIG. 1, the invention provides a semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, which comprises the following steps:
S01. Define the label set L, where l_j ∈ L is the j-th label in L;
S02. Define the candidate feature set as S, and initialize S = ∅;
S03. Define the dependency of the candidate feature set S as γ(S);
S04. Define the historical dependency of the candidate feature set S as Dep_S, and initialize Dep_S = 0;
S05. Define the average dependency threshold of the candidate feature set S as mean_Dep, and initialize mean_Dep = 0;
S06. Define the arrival time of the i-th feature as t_i, and initialize t_i = 0;
S07. Define the feature arriving at time t_i as f_i;
S08. Judge whether any labels are missing; if so, perform step S09; if not, proceed to step S10;
S09. Obtain the neighbors of the instances with missing labels according to the mean neighborhood relation, predict the values of the missing labels by analyzing the labels of similar instances, and restore the labels once the required values are obtained;
① First, the distance between instance x and instance y on the candidate set S = {s_1, s_2, …, s_|S|} is denoted Δ_S(x, y); the Euclidean distance is used here, calculated as follows:

Δ_S(x, y) = sqrt( Σ_{i=1}^{|S|} (s_ix − s_iy)² )

where s_ix denotes the value of instance x on feature s_i and s_iy denotes the value of instance y on feature s_i (1 ≤ i ≤ |S|).
② Let N_S(x_i) denote the neighbor sequence obtained by sorting the distances between x_i and the other instances under the feature subset S, specifically expressed as:

N_S(x_i) = {x_i^1, x_i^2, …, x_i^n}

where x_i^1 denotes the instance nearest to x_i, i.e., Δ_S(x_i, x_i^1) is the smallest value, and x_i^n denotes the instance farthest from x_i, i.e., Δ_S(x_i, x_i^n) is the largest value.
③ Let S_R(x_i) denote the neighbor set of x_i obtained from N_S(x_i) through the mean neighborhood relation R; an instance x_j ∈ S_R(x_i) must satisfy the following condition:

Δ_S(x_i, x_j) < 0.35 × (Δ_S(x_i, x_i^n) − Δ_S(x_i, x_i^1)) / (n − 1)

In the above formula, the average distance between the instance and its neighbors is obtained by subtracting the minimum distance from the maximum distance and dividing by (n − 1). If the distance between the current instance and another instance is less than 0.35 times this average distance (0.35 being a parameter found to be efficient in testing), that instance is regarded as a neighbor of the current instance.
④ Suppose the missing label of x_i is l_i, and its mean neighborhood set S_R(x_i) contains Pos positive samples, i.e., samples with l_i = 1, and Neg negative samples, i.e., samples with l_i = −1. The missing label is then predicted from the numbers of positive and negative samples among the corresponding labels of the instance's neighbors.
S10. Calculate the dependency γ(f_i) of the feature f_i;
① In the invention, missing labels are encountered when calculating feature dependency, and the dependency of the current feature is obtained after those missing labels are predicted. If a label is predicted incorrectly, the algorithm may select features that are highly uncorrelated with the original labels, reducing its efficiency. To verify the accuracy of missing-label prediction, a similarity test is performed after each prediction, verifying the validity of the algorithm through the following formula:

Sim(x) = L_x / L_num

In the above formula, L_x denotes the number of labels on which the neighbors agree with instance x, and L_num is the total number of labels. FIG. 2 shows the results of the similarity test on 12 datasets, and FIG. 3 gives the details of those datasets. Arts, Recreation and Science all come from Yahoo and are widely used for text classification. GPositive and GNegative are used to predict the subcellular locations of proteins based on their sequences. Birds is audio data with 10-second audio clips used to predict bird species. Emotions is music data consisting of 593 songs and 6 emotion categories, such as amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely and angry-aggressive. Yeast is used to predict the functional classes of genes and includes microarray expression data and phylogenetic profiles of 2417 yeast genes. Image and Scene are image data consisting of 2000 and 2407 images, respectively. Enron is text data based on a collection of email messages categorized into 53 topic categories. Genbase is a dataset for the functional classification of proteins: each instance is a protein and each label is a protein class.
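The similarity measure Sim(x) = L_x / L_num can be sketched as follows. The reading of L_x as "the number of labels on which every neighbor matches instance x" is one plausible interpretation of the description, so the aggregation rule here is an assumption.

```python
import numpy as np

def similarity(x_labels, nbr_labels):
    # x_labels: (L_num,) label vector of instance x in {-1, +1}.
    # nbr_labels: (k, L_num) labels of its k neighbors.
    # A label counts toward L_x when every neighbor agrees with x on it
    # (assumed reading, see lead-in); Sim = L_x / L_num.
    agree = np.all(nbr_labels == x_labels, axis=0)
    return agree.sum() / len(x_labels)
```

For an instance with labels (1, −1, 1) and two neighbors agreeing on the first two labels but not the third, the similarity is 2/3.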
As can be seen from FIG. 2(a), the neighbors selected by the neighborhood relation have a high degree of similarity with the current instance. FIG. 2(b) shows that the minimum similarity is 65%, while on datasets with many labels the similarity reaches 90% and above, indicating that the proposed method is effective for label prediction.
② Given S_R(x_i), i.e., the neighbor set of x_i found through the mean neighborhood relation R, let CARD(S_R(x_i)) denote the CARD value of instance x_i, used to compute the label consistency of the instance with its neighbors.
③ After the feature set S is given, traverse each label in the label set L; if missing labels exist, predict them with the label prediction method. Then let X_S denote the set of all instances on the feature set S, traverse all instances in X_S, calculate the CARD value of each instance and accumulate the sum over all instances; after all instances in X_S have been processed, restore the missing labels and continue with the next label. This process is formulated as:

Dep_S = (1 / L_num) Σ_{i=1}^{L_num} [ (1/n) Σ_{j=1}^{n} CARD(S_R(x_j)) ]

In the above formula, L_num denotes the size of the label set L, i.e., the total number of labels; n denotes the size of the instance set X_S, i.e., the number of all instances on the feature set S; and CARD(S_R(x_j)) denotes the CARD value of instance x_j, used to calculate the consistency of x_j with its neighbors on label l_i. The resulting Dep_S denotes the dependency of the current feature set S on the labels, written γ(S). The dependency of a single feature f_i is computed the same way as that of a feature set: substituting S = {f_i} into the feature-set dependency calculation yields the dependency γ(f_i).
S11. Judge whether the feature f_i is an important feature, i.e., whether γ(S ∪ f_i) > Dep_S holds; if so, perform step S15; otherwise, perform steps S12-S14;
S12. Judge whether a redundancy update of the candidate set S is needed; if so, perform steps S13-S14; otherwise, proceed to step S16;
① In the feature selection process, it is impossible to perform an online redundancy update for every feature. To select high-quality features while reducing computational cost, the proposed algorithm introduces a judgment on feature dependency, in which the threshold is adjusted automatically according to the average dependency of the selected feature set.
② Determine whether a redundancy update of the candidate set S is required: when γ(S ∪ f_i) = Dep_S and γ(f_i) > mean_Dep both hold, the redundancy update operation must be performed on the candidate set S.
S13. Insert the feature f_i into the candidate set S, i.e., let S = S ∪ f_i;
S14. Perform the redundancy update operation on the candidate set S, then proceed to step S16;
① After the feature f_i is added to the candidate set S, i.e., S = S ∪ f_i, in order to treat all features in the candidate feature set fairly, one feature f_j is randomly selected from the candidate feature set S and its importance δ_S(f_j) is calculated, repeating until every feature has been evaluated once; the importance is calculated as follows:

δ_S(f_j) = γ(S) − γ(S − f_j)

② If the importance δ_S(f_j) = 0, the feature f_j is removed from the candidate set S, i.e., S = S − f_j; otherwise, the remaining features of the candidate feature set S continue to be evaluated.
③ The purpose of the redundancy update is to ensure that the algorithm maximizes the dependency of the candidate feature set on the labels and selects features with high correlation and low redundancy. If the importance δ_S(f_j) of a feature f_j equals 0, the dependency γ(S) of the candidate feature set is proven unchanged whether or not f_j is included, which shows that f_j is redundant; to keep the candidate feature set minimal, f_j is removed from it.
S15. Insert the feature f_i into the candidate set S, i.e., let S = S ∪ f_i, and update the value of Dep_S;
① The invention judges whether the feature f_i arriving at time t_i is important by maintaining the maximum dependency of the candidate feature set, i.e., features are selected with the goal of minimizing the number of selected features while keeping the dependency of the selected feature set on the label set L as high as possible.
② If adding the feature f_i to the candidate set S increases the dependency of S, i.e., γ(S ∪ f_i) > Dep_S, then to maintain the maximum dependency of the candidate feature set, f_i is added directly to S, i.e., S = S ∪ f_i, and the historical dependency is updated, i.e., Dep_S = γ(S). If γ(S ∪ f_i) = Dep_S, the feature f_i does not affect the maximum dependency of the candidate feature set, so the algorithm continues to judge whether a redundancy update of the candidate feature set is needed. If γ(S ∪ f_i) < Dep_S, the feature f_i is unimportant and requires no further processing.
S16. Update the average dependency threshold mean_Dep;
① Traverse every feature f_i in the candidate feature set S, calculate its dependency γ(f_i), and update the average dependency threshold mean_Dep according to the following formula:

mean_Dep = (1 / |S|) Σ_{f_i ∈ S} γ(f_i)

where |S| denotes the size of the candidate set S, i.e., the number of features it contains, γ(f_i) is the dependency of the feature f_i, and f_i ∈ S.
S17. Judge whether unprocessed features remain; if so, return to step S08; if not, output the optimal feature set S.
The advantages of the invention are as follows:
Based on the neighborhood rough set, the invention defines a new neighborhood relation that automatically selects an appropriate number of neighbors for every instance. Built on neighborhood rough set theory, the method requires no domain knowledge and, using the newly defined neighborhood relation, automatically selects suitable neighbors during online feature selection without specifying any parameters.
Addressing the problem that many existing online streaming feature selection algorithms can only perform feature selection on completely labeled datasets, the invention improves algorithm performance by predicting missing labels online in real time based on the similarity between an instance and its neighbors.
The invention considers both the importance of features and the size of the candidate feature set, keeping the candidate feature set as small as possible while maintaining its maximum dependency. It ultimately maximizes the dependency of the candidate feature set on the labels and selects features with high correlation and low redundancy.
The invention has wide application and is suitable for various datasets, for example, predicting bird species from audio, judging emotion categories from music data, and classifying protein functions. Finally, it should be noted that the embodiments described above are only for illustrating the technical solution of the invention, not limiting it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical schemes described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced with equivalents, without departing from the scope of the embodiments of the invention.

Claims (7)

1. A semi-supervised multi-label online stream feature selection method based on a neighborhood rough set, characterized by comprising the following steps:
S01. Define a label set L, where l_j ∈ L is the j-th label in L;
S02. Define the candidate feature set as S, and initialize S = ∅;
S03. Define the dependency of the candidate feature set S as γ(S);
S04. Define the historical dependency of the candidate feature set S as Deps, and initialize Deps = 0;
S05. Define the average dependency threshold of the candidate feature set S as mean_Dep, and initialize mean_Dep = 0;
S06. Define the arrival time of the i-th feature as t_i, and initialize t_i = 0;
S07. Define the feature arriving at time t_i as f_i;
S08. Judge whether missing labels exist; if so, go to step S09; if not, go to step S10;
S09. Obtain the neighbors of the instances with missing labels according to the mean neighborhood relation, predict the values of the missing labels by analyzing the labels of similar instances, and restore the labels once the required values are obtained;
S10. Calculate the dependency γ(f_i) of feature f_i;
S11. Judge whether f_i is an important feature, i.e., whether γ(S ∪ f_i) > Deps holds; if so, go to step S15; otherwise, perform steps S12-S14;
S12. Judge whether a redundancy update of the candidate set S is needed; if so, perform steps S13-S14; if not, go to step S16;
S13. Insert feature f_i into the candidate set S, i.e., let S = S ∪ f_i;
S14. Perform the redundancy update operation on the candidate set S, then go to step S16;
S15. Insert feature f_i into the candidate set S, i.e., let S = S ∪ f_i, and update the value of Deps;
S16. Update the average dependency threshold mean_Dep;
S17. Judge whether unprocessed features remain; if so, return to step S08; if not, output the most useful feature set S.
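The steps above can be sketched as a short online loop. This is a minimal illustrative sketch, not the patented implementation: `gamma` is an assumed black-box dependency function over feature lists, `recover_missing_labels` stands in for the label-recovery step S08-S09, and all identifiers are assumptions.

```python
# A minimal sketch of the online loop in claim 1 (steps S01-S17).

def select_stream_features(feature_stream, gamma, recover_missing_labels):
    S = []             # candidate feature set, initially empty (S02)
    Deps = 0.0         # historical dependency of S (S04)
    mean_Dep = 0.0     # average dependency threshold (S05)
    for f_i in feature_stream:               # features arrive one by one (S06-S07)
        recover_missing_labels()             # predict and restore missing labels (S08-S09)
        dep_f = gamma([f_i])                 # single-feature dependency (S10)
        if gamma(S + [f_i]) > Deps:          # important feature (S11, S15)
            S.append(f_i)
            Deps = gamma(S)
        elif gamma(S + [f_i]) == Deps and dep_f > mean_Dep:  # redundancy path (S12-S14)
            S.append(f_i)
            S = redundancy_update(S, gamma)
        # otherwise f_i is discarded
        mean_Dep = sum(gamma([f]) for f in S) / len(S) if S else 0.0  # S16
    return S                                 # most useful feature set (S17)

def redundancy_update(S, gamma):
    # drop features whose importance gamma(S) - gamma(S - f) is zero (S14)
    for f in list(S):
        if gamma(S) - gamma([g for g in S if g != f]) == 0:
            S = [g for g in S if g != f]
    return S
```

With a toy `gamma` that sums per-feature weights, a zero-weight feature is never admitted, while each weight-bearing feature raises the dependency and is kept.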
2. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of missing label prediction in steps S08-S09 are as follows:
① The distance between instance x and instance y on the candidate set S = {s_1, s_2, …, s_|S|} is denoted by Δ(x, y); here the Euclidean distance is used, calculated as follows:
Δ(x, y) = √( Σ_{i=1}^{|S|} (s_ix − s_iy)² )
where s_ix denotes the value of instance x on feature s_i (1 ≤ i ≤ |S|), and s_iy denotes the value of instance y on feature s_i;
② Let N_S(x_i) denote the neighbor sequence obtained by sorting, under feature subset S, the distances between x_i and the other instances, expressed as:
N_S(x_i) = { x_i^1, x_i^2, …, x_i^(n−1) }
where x_i^1 denotes the instance nearest to x_i, i.e., Δ(x_i, x_i^1) is the smallest, and x_i^(n−1) denotes the instance farthest from x_i, i.e., Δ(x_i, x_i^(n−1)) is the largest;
③ Let S_R(x_i) denote the neighbor set of x_i found, given N_S(x_i), by means of the mean neighborhood relation R; an instance x_j ∈ S_R(x_i) must satisfy the condition:
Δ(x_i, x_j) < 0.35 × ( Δ(x_i, x_i^(n−1)) − Δ(x_i, x_i^1) ) / (n − 1)
That is, the average distance between the instance and its neighbors is obtained by dividing the maximum distance minus the minimum distance by n − 1; if the distance between another instance and the current instance is less than this average distance multiplied by 0.35, that instance is regarded as a neighbor of the current instance;
④ Suppose the missing label of x_i is l_i, and its mean neighborhood set S_R(x_i) contains Pos positive samples, i.e., samples with l_i = 1, and Neg negative samples, i.e., samples with l_i = −1; the missing label is then predicted from the numbers of positive and negative samples of the corresponding label among the instance's neighbors.
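As an illustration of claims 2 and 3, the distance, the mean neighborhood relation with the 0.35 factor, and the majority-vote label recovery can be sketched as below. This is a hedged sketch: the data layout (instances as coordinate lists, labels in {1, −1, None}) and all function names are assumptions, and ties are broken toward the positive class here, which the claim does not specify.

```python
import math

# Sketch of claims 2-3: Euclidean distance over the candidate features,
# the mean neighborhood relation, and majority-vote missing-label recovery.

def euclidean(x, y):
    # distance between two instances over the candidate feature values
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_neighbors(i, X, factor=0.35):
    others = [j for j in range(len(X)) if j != i]
    dists = {j: euclidean(X[i], X[j]) for j in others}
    ordered = sorted(dists.values())
    n = len(X)
    # "average distance": (max distance - min distance) / (n - 1), per claim 2
    avg = (ordered[-1] - ordered[0]) / (n - 1)
    # x_j is a neighbor when its distance is below 0.35 times that average
    return [j for j in others if dists[j] < factor * avg]

def predict_missing_label(i, X, labels):
    # majority vote over the mean-neighborhood set; ties go to the positive
    # class here (an assumption -- the claim does not specify tie-breaking)
    votes = [labels[j] for j in mean_neighbors(i, X) if labels[j] is not None]
    return 1 if votes.count(1) >= votes.count(-1) else -1
```

On a toy set where two instances lie close to x_0 and one lies far away, only the close instances fall under the 0.35 threshold, and their labels decide the vote.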
3. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 2, characterized in that: the value 0.35 in the above formula is an empirically chosen parameter that yields higher efficiency.
4. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of calculating the dependency of a feature or feature set in steps S03 and S10 are as follows:
① Given S_R(x_i), i.e., the neighbor set of x_i found by means of the mean neighborhood relation R, let CARD(S_R(x_i)) denote the CARD value of instance x_i, used to compute the label consistency between the instance and its neighbors;
② Given the feature set S, traverse each label in the label set L; if missing labels exist, predict them with the label prediction method. Let X_S denote the set of all instances on feature set S; traverse all instances in X_S, compute the CARD value of each instance, and accumulate the CARD values of all instances; after all instances in X_S have been processed, restore the missing labels and continue with the next label. The above process is formulated as:
dep_S = ( Σ_{i=1}^{L_num} Σ_{j=1}^{n} CARD(S_R(x_j)) ) / (L_num × n)
In the above formula, L_num denotes the size of the label set L, i.e., the total number of labels; n denotes the size of the instance set X_S, i.e., the number of all instances on feature set S; and CARD(S_R(x_j)) denotes the CARD value of instance x_j, used to compute the consistency of x_j with its neighbors on label l_i. The resulting dep_S is the dependency of the current feature set S on the labels, denoted γ(S). The dependency of a single feature f_i is computed in the same way as that of a feature set: substituting S = {f_i} into the feature-set dependency computation yields γ(f_i).
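The dependency computation can be sketched as follows. The patent does not give a closed form for CARD, so it is modeled here as the fraction of an instance's neighbors that agree with it on the current label, an assumption made only so the sketch runs; `neighbors_of(j)` stands in for the mean neighborhood set S_R(x_j), and all names are illustrative.

```python
# A hedged sketch of the dependency computation in claim 4.

def card_value(j, labels, neighbors_of):
    nbrs = neighbors_of(j)
    if not nbrs:
        return 0.0
    # label consistency of instance j with its neighbors (assumed CARD model)
    return sum(1 for k in nbrs if labels[k] == labels[j]) / len(nbrs)

def dependency(n, label_matrix, neighbors_of):
    # dep_S: CARD summed over every label row and every instance,
    # then normalized by L_num * n as in the claim's formula
    total = sum(card_value(j, labels, neighbors_of)
                for labels in label_matrix   # one row per label l_i
                for j in range(n))
    return total / (len(label_matrix) * n)
```

For three instances that are all mutual neighbors, a fully consistent label row contributes 1 per instance, while a row with one dissenting instance contributes less, and the result is the normalized sum over both rows.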
5. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of determining whether feature f_i is an important feature in step S11 are as follows:
Judge whether feature f_i can improve the dependency of the candidate feature set S, i.e., compare γ(S ∪ f_i) with the historical dependency Deps of the candidate set S; if γ(S ∪ f_i) > Deps, feature f_i is proved to be an important feature: add f_i to the candidate set S, i.e., let S = S ∪ f_i, and update the historical dependency, i.e., let Deps = γ(S ∪ f_i); otherwise, perform the online redundancy-update judgment.
6. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of determining whether the candidate set S needs a redundancy update in steps S12-S14 are as follows:
① Judge whether the candidate set S needs a redundancy update: when the conditions γ(S ∪ f_i) = Deps and γ(f_i) > mean_Dep are both satisfied, perform the redundancy update operation on S;
② Add feature f_i to the candidate set S, i.e., let S = S ∪ f_i; to treat all features in the candidate feature set fairly, randomly select a feature f_j from the candidate feature set S and calculate its importance δ_S(f_j), repeating until every feature has been evaluated once, where the importance is calculated as follows:
δ_S(f_j) = γ(S) − γ(S − f_j)
③ If the importance δ_S(f_j) = 0, remove f_j from the candidate set S, i.e., let S = S − f_j; otherwise, continue evaluating the remaining features in the candidate feature set S.
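The redundancy update of steps ②-③ can be sketched as below; the random visiting order mirrors the claim's requirement to treat all features fairly. `gamma` is an assumed, externally supplied dependency function over feature lists, and the zero-importance test uses exact equality as in the claim.

```python
import random

# Sketch of the redundancy-update operation in claim 6: visit the features of
# the candidate set S in random order, compute each feature's importance
# delta_S(f_j) = gamma(S) - gamma(S - f_j), and drop zero-importance features.

def prune_redundant(S, gamma, seed=None):
    rng = random.Random(seed)
    order = S[:]
    rng.shuffle(order)                       # random, fair visiting order
    kept = S[:]
    for f in order:
        if f not in kept:                    # may already have been removed
            continue
        importance = gamma(kept) - gamma([g for g in kept if g != f])
        if importance == 0:                  # redundant: contributes nothing
            kept = [g for g in kept if g != f]
    return kept
```

With a weight-sum `gamma`, a zero-weight feature has zero importance in any visiting order and is pruned, while weight-bearing features survive.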
7. The semi-supervised multi-label online stream feature selection method based on a neighborhood rough set according to claim 1, characterized in that the specific steps of calculating the average dependency threshold mean_Dep in step S16 are as follows:
Traverse each feature f_i in the candidate feature set S, calculate its dependency γ(f_i), and update the average dependency threshold mean_Dep according to the following formula:
mean_Dep = ( Σ_{f_i ∈ S} γ(f_i) ) / |S|
where |S| denotes the size of the candidate set S, i.e., the number of features it contains, and γ(f_i) is the dependency of feature f_i, f_i ∈ S.
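The threshold update reduces to a simple arithmetic mean over single-feature dependencies; a sketch, with `gamma` an assumed dependency function as in the earlier claims:

```python
# Claim 7 as code: mean_Dep is the mean of gamma(f_i) over the candidate set S.

def mean_dep(S, gamma):
    if not S:
        return 0.0                  # threshold stays at its initial value
    return sum(gamma([f]) for f in S) / len(S)
```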
CN202111403199.9A 2021-11-24 2021-11-24 Semi-supervised multi-label online stream feature selection method based on neighborhood rough set Active CN114091607B (en)


Publications (2)

Publication Number Publication Date
CN114091607A CN114091607A (en) 2022-02-25
CN114091607B true CN114091607B (en) 2024-05-03






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant