CN114611592A - Semi-supervised feature selection method, system, medium, equipment and terminal - Google Patents

Semi-supervised feature selection method, system, medium, equipment and terminal Download PDF

Info

Publication number
CN114611592A
CN114611592A CN202210208158.2A
Authority
CN
China
Prior art keywords
score
feature
data
calculating
semi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210208158.2A
Other languages
Chinese (zh)
Inventor
孙建勋
曾洁
吴全旺
龚彦鹭
李德辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202210208158.2A priority Critical patent/CN114611592A/en
Publication of CN114611592A publication Critical patent/CN114611592A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2115Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination


Abstract

The invention belongs to the technical field of machine learning and data mining, and discloses a semi-supervised feature selection method, system, medium, device and terminal. The local data structure of a dataset lacking labels is evaluated using the natural Laplacian score (NLS), while the maximal information coefficient (MIC) is used to calculate the correlation between labels and features, thereby utilizing the small amount of label information in the dataset. NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance. In the method, natural neighbors are fused into the semi-supervised Laplacian method to generate a natural Laplacian score that is more sensitive to the local structure of the data; MIC evaluates the correlation between labels and features, and the conflict coefficient is innovatively used to weight and combine the two into a final NM score. Ranking features by this score evaluates feature importance well, and the whole method achieves higher performance and efficiency.

Description

Semi-supervised feature selection method, system, medium, equipment and terminal
Technical Field
The invention belongs to the technical field of machine learning and data mining, and particularly relates to a semi-supervised feature selection method, a semi-supervised feature selection system, a semi-supervised feature selection medium, semi-supervised feature selection equipment and a semi-supervised feature selection terminal.
Background
At present, feature selection is a data preprocessing method: a feature selection method selects a relevant feature subset from the original feature set to improve the performance of the learning algorithm, and it is a common preprocessing step for data mining and machine learning tasks. Feature selection can improve the interpretability of the learning model, reduce learning time, and avoid the curse of dimensionality. However, the conventional supervised feature selection method processes labeled data; when a large number of labels are missing, the supervised method cannot make good use of the information in all the data, and its feature-subset evaluation capability drops sharply. Semi-supervised feature selection is required to address this drawback.
Semi-supervised feature selection algorithms can mainly be classified into filter, wrapper and embedded methods. In general, the performance of a wrapper method depends on the classifier used, and wrapper methods are very inefficient because they use a single classification model or an ensemble model to predict the labels of the unlabeled data; embedded methods share this drawback. Compared with wrapper and embedded methods, a filter method needs no classification model during the selection process, is independent of the classification model, and considers only the structure of the data, and therefore has better performance and excellent feature-subset evaluation capability.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) The traditional supervised feature selection method processes labeled data; when a large number of labels are missing, it cannot make good use of the information in all the data, and its feature-subset evaluation capability drops sharply.
(2) The wrapper method is very inefficient because it uses a single classification model or an ensemble model to predict the labels of the unlabeled data.
The difficulty in solving the above problems and defects is as follows: when most labels are missing from a dataset, a classifier that uses only the labeled data suffers a large performance drop, so the classifier underfits and cannot achieve the expected effect. At the same time, too many redundant features reduce the time performance of the classifier. Data with a large number of missing labels is common: data in the financial field, for example, is generally of extremely high dimensionality, and because labels are difficult to obtain, such datasets mostly lack a large number of labels. Feature selection, as a data preprocessing method, selects a relevant feature subset from the original feature set to improve the performance of the learning algorithm; it improves the interpretability of the learning model, reduces learning time, and avoids the curse of dimensionality. However, the conventional supervised feature selection method processes labeled data; when many labels are missing, it cannot make good use of all the data, and its feature-subset evaluation capability drops sharply.
The significance of solving the problems and defects is as follows: a feature subset found by a semi-supervised feature selection algorithm improves the efficiency of the classifier, while the unlabeled data improves its accuracy. Today's datasets are very large and complex and often of extremely high dimensionality; because labels are difficult to obtain, financial datasets in particular mostly lack a large number of labels. Financial data is also sensitive in content and hard to acquire, so a large proportion of sample labels is likely to be missing, which strongly affects classifier accuracy. For datasets with many missing labels, semi-supervised feature selection is studied to improve the classification accuracy of subsequent tasks: the financial dataset is fed into the semi-supervised feature selection algorithm, a feature subset is extracted according to the requirements of the downstream classifier, and the classifier is trained on this subset so that its performance is higher.
Disclosure of Invention
The invention provides a semi-supervised feature selection method, system, medium, device and terminal, and particularly relates to a novel semi-supervised feature selection method, system, medium, device and terminal based on the natural Laplacian score (NLS) and the maximal information coefficient (MIC).
The present invention is achieved by a semi-supervised feature selection method that evaluates the local data structure of a label-deficient dataset using NLS while computing the correlation between labels and features using MIC, thereby utilizing the small amount of label information in the dataset; NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance.
Further, the feature selection method comprises the following steps:
step one, calculating the number of conflicts according to the neighborhood of the data and the label information;
step two, calculating the conflict ratio;
step three, judging whether every feature has been evaluated; after all features of the data are evaluated, sorting the features, otherwise continuing the evaluation;
step four, respectively calculating the MIC score and the NL score of the data;
step five, calculating the final NM Score of the data;
step six, sorting the features by importance according to their respective NM scores;
and step seven, returning the feature importance ranking result.
Steps one and two adapt the method to the dataset and make the algorithm parameter-free; a parameter-free algorithm has a wider range of application scenarios and avoids the large manpower and equipment cost of parameter tuning. Step three evaluates the features; the evaluation scores are used for the final feature ranking, the quality of the feature evaluation determines the quality of the feature selection algorithm, and a good evaluation criterion can assess feature importance more accurately. Steps four and five are the evaluation criteria, which compute the importance of each feature; the importance determines a feature's position in the ranking, and the earlier a feature is ranked, the more important it is. Steps six and seven turn the importance ranking into the final result: features can be adopted as needed to improve classifier performance, feature subsets of different sizes can be adopted to improve classifier efficiency, and more experimental settings can be explored.
Further, the calculating of the number of conflicts in step one and the conflict ratio in step two includes:
calculating the number of conflicts between the neighborhood and the label information in the data, where conflicts are counted as follows:
when the labels of two samples are the same but the samples do not belong to each other's natural neighbors, one conflict is recorded;
when the labels of two samples are different but the samples belong to each other's natural neighbors, one conflict is recorded;
the conflict ratio is calculated according to
λ = c / |Y|²,
where c is the number of conflicts and |Y|² is the square of the number of labels.
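The two conflict rules and the ratio above can be sketched in a few lines. This is a minimal illustration under the assumption that the natural-neighbor relation has already been computed and is supplied as a boolean matrix; the function name and arguments are hypothetical, not the patent's implementation, and the text's "number of labels" is read here as the number of labelled samples (it could also mean the number of classes):

```python
import numpy as np

def conflict_ratio(labels, natural_neighbor):
    """Conflict ratio lambda = c / |Y|^2 over the labelled samples.

    labels           : 1-D array of class labels.
    natural_neighbor : boolean (n, n) matrix; [i, j] is True when
                       sample j is a natural neighbor of sample i
                       (assumed precomputed elsewhere).
    """
    n = len(labels)
    c = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_label = labels[i] == labels[j]
            # "each other's natural neighbors" is read as a mutual relation
            mutual = natural_neighbor[i, j] and natural_neighbor[j, i]
            # rule 1: same label but not mutual natural neighbors
            # rule 2: different labels but mutual natural neighbors
            if (same_label and not mutual) or (not same_label and mutual):
                c += 1
    # |Y| interpreted as the number of labelled samples (an assumption)
    return c / (n ** 2)
```

Whether "natural neighbors of the opposite side" requires the relation in both directions is not spelled out in the text; the sketch assumes a mutual relation.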
Further, the calculating of the NL score of the data in step four includes:
fusing natural neighbors into the semi-supervised Laplacian algorithm and modifying the construction of the Laplacian matrices (the constructions of the intra-class and inter-class weight matrices are shown in the drawings). The final NL score is calculated as follows:
NLS(f_r) = (f_r^T L_w f_r) / (f_r^T L_b f_r),
where f_r is the r-th feature, and L_w and L_b are the intra-class and inter-class Laplacian matrices, respectively.
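As a rough sketch, an NL-style score can be computed as the classical Laplacian-score ratio between within-class and between-class variation. Since the patent's exact weight-matrix construction is given only in the drawings, the matrices S_w and S_b below are assumed to be precomputed from the natural neighbors and the partial labels; this is an illustrative stand-in under those assumptions, not the patented construction:

```python
import numpy as np

def graph_laplacian(S):
    """Unnormalised graph Laplacian L = D - S of a symmetric weight matrix."""
    return np.diag(S.sum(axis=1)) - S

def nl_score(X, S_w, S_b):
    """Ratio-style score per feature: (f^T L_w f) / (f^T L_b f).

    X   : (n_samples, n_features) data matrix.
    S_w : within-class weight matrix (assumed precomputed).
    S_b : between-class weight matrix (assumed precomputed).
    A smaller score means the feature varies little inside a class
    relative to its variation across classes.
    """
    L_w = graph_laplacian(S_w)
    L_b = graph_laplacian(S_b)
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        # small epsilon guards against a zero between-class term
        scores[r] = (f @ L_w @ f) / (f @ L_b @ f + 1e-12)
    return scores
```

Note that f^T L f equals half the weighted sum of squared feature differences over all sample pairs, which is why the ratio measures local-structure preservation.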
The calculating of the MIC score of the data includes:
the MIC score is the maximal information coefficient, used to measure the correlation between the label and a feature, and is calculated by the following formulas:
m_{u,v} = max I(U; V) / log min{u, v},
MIC(U, V) = max_{u·v ≤ B(n,α)} { m_{u,v} },
where the maximum defining m_{u,v} is taken over all grids with u columns and v rows, I(U; V) is the mutual information induced by a grid, and B(n, α) bounds the total grid size.
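For illustration, a crude MIC-style score between one feature and the discrete labels can be estimated by scanning equal-width binnings of the feature. The true MIC maximises normalised mutual information over all grids with u·v ≤ B(n, α); the sketch below fixes the label axis and scans only a few equal-width grids, so it is merely a lower-bound approximation with hypothetical names, not the full MIC search:

```python
import numpy as np
from math import log

def grid_mic(x, y, max_bins=4):
    """Equal-width-grid approximation of MIC between feature x and labels y."""
    classes = np.unique(y)
    if len(classes) < 2:
        return 0.0  # normalisation log min{u, v} would vanish
    best = 0.0
    for u in range(2, max_bins + 1):
        edges = np.linspace(x.min(), x.max(), u + 1)
        xb = np.digitize(x, edges[1:-1])  # bin index 0..u-1 per sample
        mi = 0.0  # empirical mutual information I(xb; y)
        for a in range(u):
            for c in classes:
                p_joint = np.mean((xb == a) & (y == c))
                if p_joint > 0:
                    mi += p_joint * log(p_joint / (np.mean(xb == a) * np.mean(y == c)))
        # normalise by log min{u, v} as in the MIC definition
        best = max(best, mi / log(min(u, len(classes))))
    return best
```

A full implementation would also optimise the grid edges dynamically, which is what distinguishes MIC from plain binned mutual information.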
Further, the calculating of the final NM Score of the data in step five includes:
the NM score is obtained by a weighted combination of the NL score and the MIC score, with the conflict ratio as the weight coefficient; the calculation formula is as follows:
S_NM(f_k) = λ · MIC(f_k) - (1 - λ) · NLS(f_k).
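The weighted combination above is a one-liner; the sketch below simply applies the stated formula, with λ the conflict ratio (names are hypothetical):

```python
import numpy as np

def nm_score(mic_scores, nls_scores, lam):
    """NM score: S_NM(f_k) = lam * MIC(f_k) - (1 - lam) * NLS(f_k).

    The NL score enters with a minus sign, as in the formula above,
    which is consistent with smaller Laplacian-style scores
    indicating better features.
    """
    mic_scores = np.asarray(mic_scores, dtype=float)
    nls_scores = np.asarray(nls_scores, dtype=float)
    return lam * mic_scores - (1.0 - lam) * nls_scores
```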
further, the feature importance ranking in the sixth step includes:
evaluating the importance of each feature of the data to obtain a corresponding score; and performing feature importance ranking on the features according to the scores to obtain a feature importance ranking result, and freely selecting the size of the feature subset according to the requirement.
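The ranking step then reduces to an argsort. A minimal sketch, assuming a higher S_NM means a more important feature (MIC enters the formula positively):

```python
import numpy as np

def rank_features(nm_scores, k=None):
    """Indices of features sorted by NM score, most important first.

    k, if given, truncates the ranking to the desired subset size.
    """
    order = np.argsort(np.asarray(nm_scores))[::-1]
    return order if k is None else order[:k]
```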
Another object of the present invention is to provide a semi-supervised feature selection system applying the semi-supervised feature selection method, the semi-supervised feature selection system comprising:
the conflict number calculation module is used for calculating the number of conflicts according to the neighborhood of the data and the label information;
the conflict ratio calculation module is used for calculating the conflict ratio;
the judging module is used for judging whether every feature has been evaluated; after all features of the data are evaluated, the features are ranked, otherwise the evaluation continues;
the score calculation module is used for calculating the MIC score and the NL score of the data respectively, and calculating the final NM Score of the data;
and the feature importance ranking module is used for ranking the features by importance according to their respective NM scores and returning the feature importance ranking result.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
evaluating the local data structure of the label-deficient dataset using NLS while computing the correlation between labels and features using MIC, thereby utilizing the small amount of label information in the dataset; NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
evaluating the local data structure of the label-deficient dataset using NLS while computing the correlation between labels and features using MIC, thereby utilizing the small amount of label information in the dataset; NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance.
Another object of the present invention is to provide an information data processing terminal for implementing the semi-supervised feature selection system.
By combining all the technical schemes, the invention has the following advantages and positive effects: semi-supervised feature selection is attracting increasing interest due to the curse of dimensionality and the high cost of manual labeling. In the semi-supervised feature selection method provided by the invention, natural neighbors are fused into the semi-supervised Laplacian method to generate a natural Laplacian method that is more sensitive to the local data structure; MIC is then used to evaluate the correlation between labels and features, and the conflict coefficient is used to weight and combine the two into a final NM score. Ranking features by this score evaluates feature importance well. Compared with traditional feature selection algorithms, the NM score of the invention innovatively combines label information and local structure information and constructs an adaptive conflict ratio parameter that can be computed adaptively on each dataset without parameters, so the whole method is parameter-free and has higher performance and efficiency.
The method is used for selecting features of financial data; it can reduce the dimensionality of high-dimensional financial datasets with a large number of missing labels while making use of the unlabeled data, greatly improving the availability of unlabeled financial data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a semi-supervised feature selection method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a semi-supervised feature selection method provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the NM Score algorithm provided by the embodiment of the present invention;
FIG. 4 is a block diagram of a semi-supervised feature selection system provided by embodiments of the present invention;
fig. 5 is a diagram illustrating the results on an Australian dataset provided by an embodiment of the invention.
Fig. 6 is a graphical representation of the results on an arrhytmia data set provided by an embodiment of the present invention.
FIG. 7 is a graphical representation of the results on the chess-krvkp dataset provided by an embodiment of the present invention.
FIG. 8 is a diagram illustrating the results of the climate-simulation data set according to the embodiment of the present invention.
Fig. 9 is a diagram illustrating results on a result data set on a CMC data set according to an embodiment of the present invention.
FIG. 10 is a graph illustrating the results on the cylinder-bases dataset provided by an embodiment of the present invention.
Fig. 11 is a diagram illustrating the result of the lymphographic dataset according to the embodiment of the present invention.
FIG. 12 is a diagram illustrating results on a page-blocks dataset according to an embodiment of the present invention.
Fig. 13 is a diagram illustrating the results on the Sonar dataset provided by the embodiment of the present invention.
Fig. 14 is a diagram illustrating the results on an Australian dataset provided by an embodiment of the invention.
Fig. 15 is a graphical representation of the results on an arrhytmia data set provided by an embodiment of the present invention.
FIG. 16 is a graphical representation of the results on the chess-krvkp dataset provided by an embodiment of the present invention.
FIG. 17 is a diagram illustrating the results of the climate-simulation data set provided by the embodiment of the present invention.
Fig. 18 is a diagram illustrating results on a result dataset on a CMC dataset according to an embodiment of the present invention.
FIG. 19 is a graph illustrating the results on the cylinder-bases dataset provided by an embodiment of the present invention.
Fig. 20 is a diagram illustrating the results on the lymphograph dataset provided by the embodiment of the present invention.
FIG. 21 is a diagram illustrating results on a page-blocks dataset according to an embodiment of the present invention.
Fig. 22 is a diagram illustrating the results on the Sonar dataset provided by the embodiment of the present invention.
In the figure: 1. a conflict number calculation module; 2. a conflict proportion calculation module; 3. a judgment module; 4. a score calculation module; 5. a feature importance ranking module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a method, a system, a medium, a device and a terminal for selecting semi-supervised features, which are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the semi-supervised feature selection method provided by the embodiment of the present invention includes the following steps:
s101, calculating the number of conflicts according to the neighborhood of the data and the label information;
s102, calculating the conflict ratio;
s103, judging whether every feature has been evaluated; after all features of the data are evaluated, sorting the features, otherwise continuing the evaluation;
s104, respectively calculating the MIC score and the NL score of the data;
s105, calculating the final NM Score of the data;
s106, sorting the features by importance according to their respective NM scores;
and s107, returning the feature importance ranking result.
A schematic diagram of a semi-supervised feature selection method provided by an embodiment of the present invention is shown in fig. 2, and a schematic diagram of an NM Score algorithm provided by an embodiment of the present invention is shown in fig. 3.
As shown in fig. 4, a semi-supervised feature selection system provided in an embodiment of the present invention includes:
the conflict number calculation module 1 is used for calculating the number of conflicts according to the neighborhood of the data and the label information;
the conflict ratio calculation module 2 is used for calculating the conflict ratio;
the judging module 3 is used for judging whether every feature has been evaluated; after all features of the data are evaluated, the features are ranked, otherwise the evaluation continues;
the score calculation module 4 is used for calculating the MIC score and the NL score of the data respectively, and calculating the final NM Score of the data;
and the feature importance ranking module 5 is used for ranking the features by importance according to their respective NM scores and returning the feature importance ranking result.
The technical solution of the present invention is further described below with reference to specific examples.
The invention provides a novel semi-supervised feature selection method based on the natural Laplacian score (NLS) and the maximal information coefficient (MIC), called NM Score. The method evaluates the local data structure of the label-deficient dataset using NLS, while MIC is used to compute the correlation between labels and features, thereby utilizing the small amount of label information in the dataset. NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance.
As shown in fig. 2, the semi-supervised feature selection method provided by the embodiment of the present invention includes the following steps:
s1: calculating the number of conflicts according to the neighborhood of the data and the label information;
s2: calculating the conflict ratio;
s3: judging whether every feature has been evaluated; after all features of the data are evaluated, sorting the features, otherwise continuing the evaluation;
s4: calculating the MIC score and the NL score of the data, respectively;
s5: calculating the final NM Score of the data;
s6: sorting the features by importance according to their respective NM scores;
and s7: returning the feature importance ranking result.
1. Conflict ratio: the number of conflicts between the neighborhood and the label information in the data is counted as follows: (1) when the labels of two samples are the same but the samples do not belong to each other's natural neighbors, one conflict is recorded; (2) when the labels of two samples are different but the samples belong to each other's natural neighbors, one conflict is recorded. The conflict ratio is calculated according to
λ = c / |Y|²,
where c is the number of conflicts and |Y|² is the square of the number of labels.
2. NL score: in order to make the semi-supervised Laplacian algorithm more sensitive to the local structure of the data, the natural neighborhood is incorporated into it and the construction of the Laplacian matrices is modified (the constructions of the intra-class and inter-class weight matrices are shown in the drawings). The final NL score is calculated as follows:
NLS(f_r) = (f_r^T L_w f_r) / (f_r^T L_b f_r),
where f_r is the r-th feature, and L_w and L_b are the intra-class and inter-class Laplacian matrices, respectively.
3. MIC score: the MIC score is the maximal information coefficient, used to measure the correlation between the label and a feature. It is calculated by the following formulas:
m_{u,v} = max I(U; V) / log min{u, v},
MIC(U, V) = max_{u·v ≤ B(n,α)} { m_{u,v} }.
4. NM score: the NM score is obtained by a weighted combination of the NL score and the MIC score, with the conflict ratio as the weight coefficient; the calculation formula is:
S_NM(f_k) = λ · MIC(f_k) - (1 - λ) · NLS(f_k).
5. Feature ranking: each feature of the data is evaluated for importance to obtain a corresponding score, and the features are ranked by these scores to obtain the feature importance ranking result. The size of the feature subset can then be freely chosen as needed.
Semi-supervised feature selection has attracted increasing interest due to the curse of dimensionality and the high cost of manual labeling. The NM Score method fuses natural neighbors into the semi-supervised Laplacian method to generate a natural Laplacian method that is more sensitive to the local data structure; MIC is then used to evaluate the correlation between labels and features, and a final NM score is obtained by weighting and combining the two with the conflict coefficient. Ranking features by this score provides a good assessment of feature importance. Compared with traditional feature selection algorithms, the NM score innovatively combines label information and local structure information and constructs an adaptive conflict ratio parameter that can be computed adaptively on each dataset without parameters, so the whole method is parameter-free and has higher performance and efficiency.
The technical effects of the present invention will be described in detail with reference to the experiments.
1. Data set
Most of the datasets come from the UCI machine learning repository; the last dataset is a financial dataset obtained by a web crawler. These datasets are described in Table 1. They have from tens to hundreds of features and hundreds of samples, and include both two-class and multi-class problems.
Table 1 data set description
(The contents of Table 1 are provided as an image in the original document.)
2. Results display
In the diagrams shown below (figs. 5 to 22), the abscissa represents the number of features and the ordinate represents the classification accuracy. The figures show the experimental results of the classical semi-supervised feature selection algorithms LSDF, Fisher and RRPC, together with the proposed NM Score, on different datasets tested using the KNN classifier and the SVM classifier.
On different datasets, different semi-supervised feature selection algorithms have their own advantages and disadvantages, so in practical applications a suitable feature selection method should be chosen according to the characteristics of the dataset. For example, for datasets with a larger number of samples, NM Score can be chosen to reduce time consumption, while for higher accuracy requirements the semi-supervised Fisher method or NM Score can be used.
The graph shown below demonstrates the performance of different semi-supervised feature selection algorithms under KNN and SVM classifiers on different UCI generic data sets. The test was performed on the Australian, arrymia, chess-krvkp data set using KNN, the NM Score performed best, and the highest efficiency was obtained with the fewest number of features.
Fig. 5 results on Australian dataset.
Fig. 6 results on arrhymia data set.
FIG. 7 results on the chess-krvkp dataset tested on the close-simulation, CMC, cylinder-bases dataset using KNN, with a slightly better NM Score.
FIG. 8 results on the close-simulation dataset.
Fig. 9 results on the results dataset on the CMC dataset.
FIG. 10 results on the cylinder-bases dataset tested on the Lymphogry, page-blocks, Sonar datasets using KNN, the performance of the four classifiers differed less on the Lymphogry and page-blocks datasets, but the NM Score performed well on the Sonar dataset.
FIG. 11 results on the Lymphography data set.
FIG. 12 results on page-blocks dataset.
FIG. 13 results on the Sonar data set.
Tested with SVM on the Australian, arrhythmia and chess-krvkp data sets, NM Score still performs best, reaching the highest accuracy with the fewest features, while the LSDF algorithm also performs well on arrhythmia.
FIG. 14 results on Australian dataset.
Fig. 15 results on the arrhythmia data set.
FIG. 16 results on the chess-krvkp data set.
Tested with SVM on the close-simulation, CMC and cylinder-bands data sets, NM Score is slightly better.
FIG. 17 results on the close-simulation dataset.
Figure 18 results on the CMC data set.
FIG. 19 results on the cylinder-bands data set.
Tested with SVM on the Lymphography, page-blocks and Sonar data sets, the four algorithms differ little on Lymphography and page-blocks, but NM Score performs well on Sonar.
Fig. 20 results on the Lymphography data set.
FIG. 21 results on page-blocks dataset.
Figure 22 results on Sonar dataset.
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When software is used in whole or in part, the implementation may take the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
The above description is intended only to illustrate the present invention and not to limit its scope; all modifications, equivalents and improvements made within the spirit and principle of the invention are intended to fall within the scope defined by the appended claims.

Claims (10)

1. A semi-supervised feature selection method, characterized in that the feature selection method evaluates the local data structure of the unlabeled part of a data set using NLS, while computing the correlation between labels and features using MIC, thereby exploiting the small amount of label information in the data set; NLS and MIC are adaptively combined according to the conflict ratio between neighborhoods and labels to determine the NM Score of each feature and evaluate its importance.
2. The semi-supervised feature selection method of claim 1, wherein the feature selection method comprises the steps of:
step one, calculating the number of conflicts according to the neighborhood of the data and the label information;
step two, calculating a conflict proportion;
step three, judging whether each feature has been evaluated; when all features of the data have been evaluated, sorting the features, otherwise continuing the evaluation;
step four, respectively calculating the MIC score and the NL score of the data;
step five, calculating the final NM Score of the data;
step six, sorting the importance of the features according to the respective NM scores of all the features;
and step seven, returning a feature importance sorting result.
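The seven steps above can be sketched as a single ranking routine. This is a minimal sketch, not the patented implementation: the function name and the three score callables (`nl_score`, `mic_score`, `conflict_ratio`) are hypothetical, since the claims fix no API.

```python
import numpy as np

def nm_score_rank(X, y, nl_score, mic_score, conflict_ratio):
    """Rank features by NM Score following the seven claimed steps.

    X: (n_samples, n_features) array; y: labels (with a sentinel such
    as -1 for unlabeled samples).  The three callables are stand-ins
    for the scores defined in the later claims.
    """
    lam = conflict_ratio(X, y)              # steps 1-2: conflicts -> ratio
    scores = np.empty(X.shape[1])
    for k in range(X.shape[1]):             # step 3: loop until all features scored
        f = X[:, k]
        # step 4: neighborhood-structure score and label-relevance score
        nls, mic = nl_score(f, X, y), mic_score(f, y)
        # step 5: final NM Score, weighted by the conflict ratio
        scores[k] = lam * mic - (1 - lam) * nls
    order = np.argsort(-scores)             # step 6: sort by importance
    return order, scores                    # step 7: return the ranking
```

With dummy score callables this already produces a usable feature ordering; the real scores are plugged in per claims 3 to 5.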
3. The semi-supervised feature selection method of claim 2, wherein calculating the number of conflicts in step one and the conflict ratio in step two comprises:
calculating the number of conflicts between the neighborhood and the label information in the data, wherein a conflict is counted as follows:
when the labels of two samples are the same but the samples are not natural neighbors of each other, a conflict is counted;
when the labels of two samples are different but the samples are natural neighbors of each other, a conflict is counted;
the conflict ratio is calculated according to
λ = c / |Y|²
where c is the number of conflicts and |Y|² is the square of the number of labels.
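A minimal sketch of this conflict count and ratio, assuming the natural-neighbor sets have already been computed elsewhere (the natural-neighbor search itself is not part of this claim, so the `natural_neighbors` input is a hypothetical precomputed structure):

```python
def conflict_ratio(labels, natural_neighbors):
    """Conflict ratio lambda = c / |Y|^2 over all labeled sample pairs.

    labels: dict {sample_index: label} for the labeled samples only.
    natural_neighbors: dict {sample_index: set of natural-neighbor
    indices}, assumed precomputed (hypothetical input format).
    """
    idx = list(labels)
    c = 0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            mutual = (j in natural_neighbors.get(i, set())
                      and i in natural_neighbors.get(j, set()))
            same = labels[i] == labels[j]
            # same label without mutual natural neighborhood -> conflict;
            # different label with mutual natural neighborhood -> conflict
            if same != mutual:
                c += 1
    return c / (len(idx) ** 2)
```

Note both conflict rules collapse to the single test `same != mutual`: a conflict is exactly a disagreement between the label view and the neighborhood view.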
4. The semi-supervised feature selection method of claim 2, wherein calculating the NL score for the data in step four comprises:
fusing natural neighbors into the semi-supervised Laplacian score algorithm and modifying the construction of the within-class and between-class Laplacian matrices accordingly;
the final NL score is calculated as follows:
NLS(f_r) = (f_r^T L_w f_r) / (f_r^T L_b f_r)
wherein f_r is the r-th feature, and L_w and L_b are the within-class Laplacian matrix and the between-class Laplacian matrix, respectively;
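Given the two Laplacian matrices, the NL score of one feature reduces to a pair of quadratic forms. The ratio form below follows the classical Laplacian-score formulation and is an assumption consistent with the symbols in the claim (the source shows the exact formula only as an image):

```python
import numpy as np

def nl_score(f, L_w, L_b):
    """NL score of one feature vector f: the ratio of the within-class
    to the between-class Laplacian quadratic forms.

    L_w, L_b: within-class / between-class Laplacian matrices, assumed
    already built from the natural neighbors and labels.
    """
    num = float(f @ L_w @ f)
    den = float(f @ L_b @ f)
    # A small within-class spread relative to the between-class spread
    # (low score) indicates a locality-preserving, discriminative feature.
    return num / den if den != 0 else np.inf
```

A Laplacian here is the usual degree-minus-weight matrix, so each quadratic form sums weighted squared feature differences over sample pairs.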
the MIC score of the calculated data comprises: the MIC score is a maximum mutual information coefficient used to measure the correlation between the tag and the feature, and is calculated by the following formula:
Figure FDA0003529994420000023
MIC(U,V)=maxuv≤B(n,α){mu,v}。
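The MIC computation can be approximated as below. Note this is only a rough sketch: the true MIC optimizes the partition boundaries for every u-by-v grid (the ApproxMaxMI procedure), whereas this version uses simple equal-width bins; dedicated libraries exist for the exact statistic.

```python
import numpy as np
from itertools import product

def mutual_info(x, y, u, v):
    """I(U;V) with x binned into u equal-width bins, y into v bins."""
    pxy, _, _ = np.histogram2d(x, y, bins=(u, v))
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def mic_approx(x, y, alpha=0.6):
    """Rough MIC: maximize m_{u,v} = I(U;V) / log min(u, v) over all
    grids with u * v <= B(n, alpha) = n ** alpha."""
    n = len(x)
    B = n ** alpha
    best = 0.0
    for u, v in product(range(2, int(B) + 1), repeat=2):
        if u * v > B:
            continue
        best = max(best, mutual_info(x, y, u, v) / np.log(min(u, v)))
    return best
```

Since I(U;V) ≤ log min(u, v), the score stays in [0, 1], with 1 reached for a noiseless functional relationship.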
5. The semi-supervised feature selection method as recited in claim 2, wherein calculating the final NM Score of the data in step five comprises: the NM score is obtained by a weighted combination of the NL score and the MIC score, with the weight coefficient given by the conflict ratio, according to the formula:
S_NM(f_k) = λ · MIC(f_k) - (1 - λ) · NLS(f_k).
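The weighted combination is a one-liner; the sign convention follows the formula above (NLS is subtracted because, under the assumed Laplacian-score convention, a lower NLS is better):

```python
def nm_score(mic, nls, lam):
    """Final NM Score: the conflict ratio lam trades label relevance
    (MIC, higher is better) against the neighborhood-structure score
    (NLS, lower is better, hence the minus sign)."""
    return lam * mic - (1 - lam) * nls
```

When conflicts are frequent (lam near 1) the labels dominate; when neighborhoods and labels agree (lam near 0) the unsupervised structure dominates.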
6. The semi-supervised feature selection method of claim 2, wherein the feature importance ranking in step six comprises: evaluating the importance of each feature of the data to obtain a corresponding score; ranking the features by their scores to obtain the feature importance ranking result, and freely selecting the size of the feature subset as required.
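The ranking and free choice of the subset size can be sketched as follows (a trivial but complete piece of the pipeline; the function name is ours):

```python
import numpy as np

def select_top_features(scores, k):
    """Rank features by NM Score (descending) and return the column
    indices of the top-k subset; k is chosen freely by the user."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    return order[:k]
```

The returned indices can be applied directly, e.g. `X[:, select_top_features(scores, k)]`.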
7. A semi-supervised feature selection system for implementing the semi-supervised feature selection method of any one of claims 1 to 6, the semi-supervised feature selection system comprising:
the conflict number calculation module is used for calculating the number of conflicts according to the neighborhood of the data and the label information;
the conflict proportion calculation module is used for calculating a conflict proportion;
the judging module is used for judging whether each feature is evaluated, and when all the features of the data are evaluated, the feature sorting is carried out, otherwise, the evaluation is continued;
a Score calculating module for calculating MIC Score and NL Score of the data, respectively, and calculating final Score NM Score of the data;
and the characteristic importance ranking module is used for ranking the characteristic importance according to the respective NM scores of all the characteristics and returning a characteristic importance ranking result.
8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of: evaluating the local data structure of the unlabeled part of the data set using NLS, while computing the correlation between labels and features using MIC, thereby exploiting the small amount of label information in the data set; adaptively combining NLS and MIC according to the conflict ratio between neighborhoods and labels, determining the NM Score of each feature, and evaluating its importance.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of: evaluating the local data structure of the unlabeled part of the data set using NLS, while computing the correlation between labels and features using MIC, thereby exploiting the small amount of label information in the data set; adaptively combining NLS and MIC according to the conflict ratio between neighborhoods and labels, determining the NM Score of each feature, and evaluating its importance.
10. An information data processing terminal, characterized in that the information data processing terminal is adapted to implement the semi-supervised feature selection system as claimed in claim 7.
CN202210208158.2A 2022-03-03 2022-03-03 Semi-supervised feature selection method, system, medium, equipment and terminal Pending CN114611592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210208158.2A CN114611592A (en) 2022-03-03 2022-03-03 Semi-supervised feature selection method, system, medium, equipment and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210208158.2A CN114611592A (en) 2022-03-03 2022-03-03 Semi-supervised feature selection method, system, medium, equipment and terminal

Publications (1)

Publication Number Publication Date
CN114611592A true CN114611592A (en) 2022-06-10

Family

ID=81861215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210208158.2A Pending CN114611592A (en) 2022-03-03 2022-03-03 Semi-supervised feature selection method, system, medium, equipment and terminal

Country Status (1)

Country Link
CN (1) CN114611592A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115239485A (en) * 2022-08-16 2022-10-25 苏州大学 Credit evaluation method and system based on forward iteration constraint scoring feature selection



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination