CN114611592A - Semi-supervised feature selection method, system, medium, equipment and terminal - Google Patents

Semi-supervised feature selection method, system, medium, equipment and terminal Download PDF

Info

Publication number
CN114611592A
CN114611592A CN202210208158.2A
Authority
CN
China
Prior art keywords
score
feature
data
calculating
semi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210208158.2A
Other languages
Chinese (zh)
Inventor
孙建勋
曾洁
吴全旺
龚彦鹭
李德辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202210208158.2A priority Critical patent/CN114611592A/en
Publication of CN114611592A publication Critical patent/CN114611592A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2115Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination


Abstract

The invention belongs to the technical field of machine learning and data mining, and discloses a semi-supervised feature selection method, system, medium, device and terminal. The local data structure of a dataset lacking labels is evaluated using the natural Laplacian score (NLS), while the maximal information coefficient (MIC) is used to calculate the correlation between labels and features, thereby utilizing the small amount of label information in the dataset. NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance. In the method, natural neighbors are fused into the semi-supervised Laplacian method to generate a natural Laplacian score that is more sensitive to the local structure of the data; MIC evaluates the correlation between labels and features, and the conflict coefficient is innovatively used to weight and combine the two into a final NM score. Ranking features by this score evaluates feature importance well, and the whole method achieves higher performance and efficiency.

Description

Semi-supervised feature selection method, system, medium, equipment and terminal
Technical Field
The invention belongs to the technical field of machine learning and data mining, and particularly relates to a semi-supervised feature selection method, a semi-supervised feature selection system, a semi-supervised feature selection medium, semi-supervised feature selection equipment and a semi-supervised feature selection terminal.
Background
At present, feature selection is a data preprocessing method: a feature selection method selects a relevant feature subset from the original feature set to improve the performance of the learning algorithm, and it is a common preprocessing step for data mining and machine learning tasks. Feature selection can improve the interpretability of the learning model, reduce learning time, and avoid the curse of dimensionality. However, the conventional supervised feature selection method processes labeled data; when a large number of labels are missing, the supervised method cannot make good use of the information in all the data, and its feature-subset evaluation capability drops sharply. Semi-supervised feature selection is required to address this drawback.
Semi-supervised feature selection algorithms can mainly be classified into filter, wrapper and embedded methods. In general, the performance of a wrapper method depends on the classifier used, and wrapper methods are very inefficient because they use a single classification model or an ensemble model to predict the labels of the unlabeled data; embedded methods share this drawback. Compared with wrapper and embedded methods, a filter method needs no classification model during the selection process, is independent of the classification model, and considers only the structure of the data, and therefore has better performance and excellent feature-subset evaluation capability.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) The traditional supervised feature selection method processes labeled data; when a large number of labels are missing, it cannot make good use of the information in all the data, and its feature-subset evaluation capability drops sharply.
(2) The wrapper method is very inefficient because it uses a single classification model or an ensemble model to predict the labels of the unlabeled data.
The difficulty in solving the above problems and defects is as follows: when most labels are missing from a dataset, a classifier that uses only the labeled data suffers a large performance drop, so the classifier underfits and cannot achieve the expected effect. At the same time, too many redundant features reduce the time performance of the classifier. Data with a large number of missing labels is common: data in the financial field, for example, is generally of extremely high dimensionality, and because labels are difficult to obtain, such datasets mostly lack a large number of labels. Feature selection, as a data preprocessing method, selects a relevant feature subset from the original feature set to improve the performance of the learning algorithm; it improves the interpretability of the learning model, reduces learning time, and avoids the curse of dimensionality. However, the conventional supervised feature selection method processes labeled data; when many labels are missing, it cannot make good use of all the data, and its feature-subset evaluation capability drops sharply.
The significance of solving the problems and defects is as follows: a feature subset found by a semi-supervised feature selection algorithm improves the efficiency of the classifier, while the unlabeled data improves its accuracy. Today's datasets are very large and complex and often of extremely high dimensionality; because labels are difficult to obtain, financial datasets in particular mostly lack a large number of labels. Financial data is also sensitive in content and hard to acquire, so a large proportion of sample labels is likely to be missing, which strongly affects classifier accuracy. For datasets with many missing labels, semi-supervised feature selection is studied to improve the classification accuracy of subsequent tasks: the financial dataset is fed into the semi-supervised feature selection algorithm, a feature subset is extracted according to the requirements of the downstream classifier, and the classifier is trained on this subset so that its performance is higher.
Disclosure of Invention
The invention provides a semi-supervised feature selection method, system, medium, device and terminal, and particularly relates to a novel semi-supervised feature selection method, system, medium, device and terminal based on the natural Laplacian score (NLS) and the maximal information coefficient (MIC).
The present invention is achieved by a semi-supervised feature selection method that evaluates the local data structure of a label-deficient dataset using NLS while computing the correlation between labels and features using MIC, thereby utilizing the small amount of label information in the dataset; NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance.
Further, the feature selection method comprises the following steps:
step one, calculating the number of conflicts according to the neighborhood of the data and the label information;
step two, calculating the conflict ratio;
step three, judging whether every feature has been evaluated; after all features of the data are evaluated, sorting the features, otherwise continuing the evaluation;
step four, respectively calculating the MIC score and the NL score of the data;
step five, calculating the final NM Score of the data;
step six, sorting the features by importance according to their respective NM scores;
and step seven, returning the feature importance ranking result.
Steps one and two adapt the method to the dataset and make the algorithm parameter-free; a parameter-free algorithm has a wider range of application scenarios and avoids the large manpower and equipment cost of parameter tuning. Step three evaluates the features; the evaluation scores are used for the final feature ranking, the quality of the feature evaluation determines the quality of the feature selection algorithm, and a good evaluation criterion can assess feature importance more accurately. Steps four and five are the evaluation criteria, which compute the importance of each feature; the importance determines a feature's position in the ranking, and the earlier a feature is ranked, the more important it is. Steps six and seven turn the importance ranking into the final result: features can be adopted as needed to improve classifier performance, feature subsets of different sizes can be adopted to improve classifier efficiency, and more experimental settings can be explored.
Further, the calculating of the number of conflicts in step one and the conflict ratio in step two includes:
calculating the number of conflicts between the neighborhood and the label information in the data, where conflicts are counted as follows:
when the labels of two samples are the same but the samples do not belong to each other's natural neighbors, one conflict is recorded;
when the labels of two samples are different but the samples belong to each other's natural neighbors, one conflict is recorded;
the conflict ratio is calculated according to
λ = c / |Y|²,
where c is the number of conflicts and |Y|² is the square of the number of labels.
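The two conflict rules and the ratio above can be sketched in a few lines. This is a minimal illustration under the assumption that the natural-neighbor relation has already been computed and is supplied as a boolean matrix; the function name and arguments are hypothetical, not the patent's implementation, and the text's "number of labels" is read here as the number of labelled samples (it could also mean the number of classes):

```python
import numpy as np

def conflict_ratio(labels, natural_neighbor):
    """Conflict ratio lambda = c / |Y|^2 over the labelled samples.

    labels           : 1-D array of class labels.
    natural_neighbor : boolean (n, n) matrix; [i, j] is True when
                       sample j is a natural neighbor of sample i
                       (assumed precomputed elsewhere).
    """
    n = len(labels)
    c = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_label = labels[i] == labels[j]
            # "each other's natural neighbors" is read as a mutual relation
            mutual = natural_neighbor[i, j] and natural_neighbor[j, i]
            # rule 1: same label but not mutual natural neighbors
            # rule 2: different labels but mutual natural neighbors
            if (same_label and not mutual) or (not same_label and mutual):
                c += 1
    # |Y| interpreted as the number of labelled samples (an assumption)
    return c / (n ** 2)
```

Whether "natural neighbors of the opposite side" requires the relation in both directions is not spelled out in the text; the sketch assumes a mutual relation.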
Further, the calculating of the NL score of the data in step four includes:
fusing natural neighbors into the semi-supervised Laplacian algorithm and modifying the construction of the Laplacian matrices (the constructions of the intra-class and inter-class weight matrices are shown in the drawings). The final NL score is calculated as follows:
NLS(f_r) = (f_r^T L_w f_r) / (f_r^T L_b f_r),
where f_r is the r-th feature, and L_w and L_b are the intra-class and inter-class Laplacian matrices, respectively.
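As a rough sketch, an NL-style score can be computed as the classical Laplacian-score ratio between within-class and between-class variation. Since the patent's exact weight-matrix construction is given only in the drawings, the matrices S_w and S_b below are assumed to be precomputed from the natural neighbors and the partial labels; this is an illustrative stand-in under those assumptions, not the patented construction:

```python
import numpy as np

def graph_laplacian(S):
    """Unnormalised graph Laplacian L = D - S of a symmetric weight matrix."""
    return np.diag(S.sum(axis=1)) - S

def nl_score(X, S_w, S_b):
    """Ratio-style score per feature: (f^T L_w f) / (f^T L_b f).

    X   : (n_samples, n_features) data matrix.
    S_w : within-class weight matrix (assumed precomputed).
    S_b : between-class weight matrix (assumed precomputed).
    A smaller score means the feature varies little inside a class
    relative to its variation across classes.
    """
    L_w = graph_laplacian(S_w)
    L_b = graph_laplacian(S_b)
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        # small epsilon guards against a zero between-class term
        scores[r] = (f @ L_w @ f) / (f @ L_b @ f + 1e-12)
    return scores
```

Note that f^T L f equals half the weighted sum of squared feature differences over all sample pairs, which is why the ratio measures local-structure preservation.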
The calculating of the MIC score of the data includes:
the MIC score is the maximal information coefficient, used to measure the correlation between the label and a feature, and is calculated by the following formulas:
m_{u,v} = max I(U; V) / log min{u, v},
MIC(U, V) = max_{u·v ≤ B(n,α)} { m_{u,v} },
where the maximum defining m_{u,v} is taken over all grids with u columns and v rows, I(U; V) is the mutual information induced by a grid, and B(n, α) bounds the total grid size.
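For illustration, a crude MIC-style score between one feature and the discrete labels can be estimated by scanning equal-width binnings of the feature. The true MIC maximises normalised mutual information over all grids with u·v ≤ B(n, α); the sketch below fixes the label axis and scans only a few equal-width grids, so it is merely a lower-bound approximation with hypothetical names, not the full MIC search:

```python
import numpy as np
from math import log

def grid_mic(x, y, max_bins=4):
    """Equal-width-grid approximation of MIC between feature x and labels y."""
    classes = np.unique(y)
    if len(classes) < 2:
        return 0.0  # normalisation log min{u, v} would vanish
    best = 0.0
    for u in range(2, max_bins + 1):
        edges = np.linspace(x.min(), x.max(), u + 1)
        xb = np.digitize(x, edges[1:-1])  # bin index 0..u-1 per sample
        mi = 0.0  # empirical mutual information I(xb; y)
        for a in range(u):
            for c in classes:
                p_joint = np.mean((xb == a) & (y == c))
                if p_joint > 0:
                    mi += p_joint * log(p_joint / (np.mean(xb == a) * np.mean(y == c)))
        # normalise by log min{u, v} as in the MIC definition
        best = max(best, mi / log(min(u, len(classes))))
    return best
```

A full implementation would also optimise the grid edges dynamically, which is what distinguishes MIC from plain binned mutual information.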
Further, the calculating of the final NM Score of the data in step five includes:
the NM score is obtained by a weighted combination of the NL score and the MIC score, with the conflict ratio as the weight coefficient; the calculation formula is as follows:
S_NM(f_k) = λ · MIC(f_k) - (1 - λ) · NLS(f_k).
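The weighted combination above is a one-liner; the sketch below simply applies the stated formula, with λ the conflict ratio (names are hypothetical):

```python
import numpy as np

def nm_score(mic_scores, nls_scores, lam):
    """NM score: S_NM(f_k) = lam * MIC(f_k) - (1 - lam) * NLS(f_k).

    The NL score enters with a minus sign, as in the formula above,
    which is consistent with smaller Laplacian-style scores
    indicating better features.
    """
    mic_scores = np.asarray(mic_scores, dtype=float)
    nls_scores = np.asarray(nls_scores, dtype=float)
    return lam * mic_scores - (1.0 - lam) * nls_scores
```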
further, the feature importance ranking in the sixth step includes:
evaluating the importance of each feature of the data to obtain a corresponding score; and performing feature importance ranking on the features according to the scores to obtain a feature importance ranking result, and freely selecting the size of the feature subset according to the requirement.
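The ranking step then reduces to an argsort. A minimal sketch, assuming a higher S_NM means a more important feature (MIC enters the formula positively):

```python
import numpy as np

def rank_features(nm_scores, k=None):
    """Indices of features sorted by NM score, most important first.

    k, if given, truncates the ranking to the desired subset size.
    """
    order = np.argsort(np.asarray(nm_scores))[::-1]
    return order if k is None else order[:k]
```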
Another object of the present invention is to provide a semi-supervised feature selection system applying the semi-supervised feature selection method, the semi-supervised feature selection system comprising:
the conflict number calculation module is used for calculating the number of conflicts according to the neighborhood of the data and the label information;
the conflict ratio calculation module is used for calculating the conflict ratio;
the judging module is used for judging whether every feature has been evaluated; after all features of the data are evaluated, the features are ranked, otherwise the evaluation continues;
the score calculation module is used for calculating the MIC score and the NL score of the data respectively, and calculating the final NM Score of the data;
and the feature importance ranking module is used for ranking the features by importance according to their respective NM scores and returning the feature importance ranking result.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
evaluating the local data structure of the label-deficient dataset using NLS while computing the correlation between labels and features using MIC, thereby utilizing the small amount of label information in the dataset; NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
evaluating the local data structure of the label-deficient dataset using NLS while computing the correlation between labels and features using MIC, thereby utilizing the small amount of label information in the dataset; NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance.
Another object of the present invention is to provide an information data processing terminal for implementing the semi-supervised feature selection system.
By combining all the technical schemes, the invention has the following advantages and positive effects: semi-supervised feature selection is attracting increasing interest due to the curse of dimensionality and the high cost of manual labeling. In the semi-supervised feature selection method provided by the invention, natural neighbors are fused into the semi-supervised Laplacian method to generate a natural Laplacian method that is more sensitive to the local data structure; MIC is then used to evaluate the correlation between labels and features, and the conflict coefficient is used to weight and combine the two into a final NM score. Ranking features by this score evaluates feature importance well. Compared with traditional feature selection algorithms, the NM score of the invention innovatively combines label information and local structure information and constructs an adaptive conflict ratio parameter that can be computed adaptively on each dataset without parameters, so the whole method is parameter-free and has higher performance and efficiency.
The method is used for selecting features of financial data; it can reduce the dimensionality of high-dimensional financial datasets with a large number of missing labels while making use of the unlabeled data, greatly improving the availability of unlabeled financial data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a semi-supervised feature selection method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a semi-supervised feature selection method provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the NM Score algorithm provided by the embodiment of the present invention;
FIG. 4 is a block diagram of a semi-supervised feature selection system provided by embodiments of the present invention;
fig. 5 is a diagram illustrating the results on an Australian dataset provided by an embodiment of the invention.
Fig. 6 is a graphical representation of the results on an arrhytmia data set provided by an embodiment of the present invention.
FIG. 7 is a graphical representation of the results on the chess-krvkp dataset provided by an embodiment of the present invention.
FIG. 8 is a diagram illustrating the results of the climate-simulation data set according to the embodiment of the present invention.
Fig. 9 is a diagram illustrating results on a result data set on a CMC data set according to an embodiment of the present invention.
FIG. 10 is a graph illustrating the results on the cylinder-bases dataset provided by an embodiment of the present invention.
Fig. 11 is a diagram illustrating the result of the lymphographic dataset according to the embodiment of the present invention.
FIG. 12 is a diagram illustrating results on a page-blocks dataset according to an embodiment of the present invention.
Fig. 13 is a diagram illustrating the results on the Sonar dataset provided by the embodiment of the present invention.
Fig. 14 is a diagram illustrating the results on an Australian dataset provided by an embodiment of the invention.
Fig. 15 is a graphical representation of the results on an arrhytmia data set provided by an embodiment of the present invention.
FIG. 16 is a graphical representation of the results on the chess-krvkp dataset provided by an embodiment of the present invention.
FIG. 17 is a diagram illustrating the results of the climate-simulation data set provided by the embodiment of the present invention.
Fig. 18 is a diagram illustrating results on a result dataset on a CMC dataset according to an embodiment of the present invention.
FIG. 19 is a graph illustrating the results on the cylinder-bases dataset provided by an embodiment of the present invention.
Fig. 20 is a diagram illustrating the results on the lymphograph dataset provided by the embodiment of the present invention.
FIG. 21 is a diagram illustrating results on a page-blocks dataset according to an embodiment of the present invention.
Fig. 22 is a diagram illustrating the results on the Sonar dataset provided by the embodiment of the present invention.
In the figure: 1. a conflict number calculation module; 2. a conflict proportion calculation module; 3. a judgment module; 4. a score calculation module; 5. a feature importance ranking module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a method, a system, a medium, a device and a terminal for selecting semi-supervised features, which are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the semi-supervised feature selection method provided by the embodiment of the present invention includes the following steps:
s101, calculating the number of conflicts according to the neighborhood of the data and the label information;
s102, calculating the conflict ratio;
s103, judging whether every feature has been evaluated; after all features of the data are evaluated, sorting the features, otherwise continuing the evaluation;
s104, respectively calculating the MIC score and the NL score of the data;
s105, calculating the final NM Score of the data;
s106, sorting the features by importance according to their respective NM scores;
and s107, returning the feature importance ranking result.
A schematic diagram of a semi-supervised feature selection method provided by an embodiment of the present invention is shown in fig. 2, and a schematic diagram of an NM Score algorithm provided by an embodiment of the present invention is shown in fig. 3.
As shown in fig. 4, a semi-supervised feature selection system provided in an embodiment of the present invention includes:
the conflict number calculation module 1 is used for calculating the number of conflicts according to the neighborhood of the data and the label information;
the conflict ratio calculation module 2 is used for calculating the conflict ratio;
the judging module 3 is used for judging whether every feature has been evaluated; after all features of the data are evaluated, the features are ranked, otherwise the evaluation continues;
the score calculation module 4 is used for calculating the MIC score and the NL score of the data respectively, and calculating the final NM Score of the data;
and the feature importance ranking module 5 is used for ranking the features by importance according to their respective NM scores and returning the feature importance ranking result.
The technical solution of the present invention is further described below with reference to specific examples.
The invention provides a novel semi-supervised feature selection method based on the natural Laplacian score (NLS) and the maximal information coefficient (MIC), called NM Score. The method evaluates the local data structure of the label-deficient dataset using NLS, while MIC is used to compute the correlation between labels and features, thereby utilizing the small amount of label information in the dataset. NLS and MIC are then adaptively combined according to the conflict ratio between the neighborhood and the labels to determine the NM Score of each feature and evaluate its importance.
As shown in fig. 2, the semi-supervised feature selection method provided by the embodiment of the present invention includes the following steps:
s1: calculating the number of conflicts according to the neighborhood of the data and the label information;
s2: calculating the conflict ratio;
s3: judging whether every feature has been evaluated; after all features of the data are evaluated, sorting the features, otherwise continuing the evaluation;
s4: calculating the MIC score and the NL score of the data, respectively;
s5: calculating the final NM Score of the data;
s6: sorting the features by importance according to their respective NM scores;
and s7: returning the feature importance ranking result.
1. Conflict ratio: the number of conflicts between the neighborhood and the label information in the data is counted as follows: (1) when the labels of two samples are the same but the samples do not belong to each other's natural neighbors, one conflict is recorded; (2) when the labels of two samples are different but the samples belong to each other's natural neighbors, one conflict is recorded. The conflict ratio is calculated according to
λ = c / |Y|²,
where c is the number of conflicts and |Y|² is the square of the number of labels.
2. NL score: in order to make the semi-supervised Laplacian algorithm more sensitive to the local structure of the data, the natural neighborhood is incorporated into it and the construction of the Laplacian matrices is modified (the constructions of the intra-class and inter-class weight matrices are shown in the drawings). The final NL score is calculated as follows:
NLS(f_r) = (f_r^T L_w f_r) / (f_r^T L_b f_r),
where f_r is the r-th feature, and L_w and L_b are the intra-class and inter-class Laplacian matrices, respectively.
3. MIC score: the MIC score is the maximal information coefficient, used to measure the correlation between the label and a feature. It is calculated by the following formulas:
m_{u,v} = max I(U; V) / log min{u, v},
MIC(U, V) = max_{u·v ≤ B(n,α)} { m_{u,v} }.
4. NM score: the NM score is obtained by a weighted combination of the NL score and the MIC score, with the conflict ratio as the weight coefficient; the calculation formula is:
S_NM(f_k) = λ · MIC(f_k) - (1 - λ) · NLS(f_k).
5. Feature ranking: each feature of the data is evaluated for importance to obtain a corresponding score, and the features are ranked by these scores to obtain the feature importance ranking result. The size of the feature subset can then be freely chosen as needed.
Semi-supervised feature selection has attracted increasing interest due to the curse of dimensionality and the high cost of manual labeling. The NM Score method fuses natural neighbors into the semi-supervised Laplacian method to generate a natural Laplacian method that is more sensitive to the local data structure; MIC is then used to evaluate the correlation between labels and features, and a final NM score is obtained by weighting and combining the two with the conflict coefficient. Ranking features by this score provides a good assessment of feature importance. Compared with traditional feature selection algorithms, the NM score innovatively combines label information and local structure information and constructs an adaptive conflict ratio parameter that can be computed adaptively on each dataset without parameters, so the whole method is parameter-free and has higher performance and efficiency.
The technical effects of the present invention will be described in detail with reference to the experiments.
1. Data set
Most of the datasets come from the UCI machine learning repository; the last dataset is a financial dataset obtained by a web crawler. These datasets are described in Table 1. They have from tens to hundreds of features and hundreds of samples, and include both two-class and multi-class problems.
Table 1 data set description
(The contents of Table 1 are provided as an image in the original document.)
2. Results display
In the diagrams shown below (figs. 5 to 22), the abscissa represents the number of features and the ordinate represents the classification accuracy. The figures show the experimental results of the classical semi-supervised feature selection algorithms LSDF, Fisher and RRPC, together with the proposed NM Score, on different datasets tested using the KNN classifier and the SVM classifier.
On different datasets, different semi-supervised feature selection algorithms have their own advantages and disadvantages, so in practical applications a suitable feature selection method should be chosen according to the characteristics of the dataset. For example, for datasets with a larger number of samples, NM Score can be chosen to reduce time consumption, while for higher accuracy requirements the semi-supervised Fisher method or NM Score can be used.
The graph shown below demonstrates the performance of different semi-supervised feature selection algorithms under KNN and SVM classifiers on different UCI generic data sets. The test was performed on the Australian, arrymia, chess-krvkp data set using KNN, the NM Score performed best, and the highest efficiency was obtained with the fewest number of features.
Fig. 5 results on Australian dataset.
Fig. 6 results on arrhymia data set.
FIG. 7 results on the chess-krvkp dataset tested on the close-simulation, CMC, cylinder-bases dataset using KNN, with a slightly better NM Score.
FIG. 8 results on the close-simulation dataset.
Fig. 9 results on the results dataset on the CMC dataset.
FIG. 10 results on the cylinder-bases dataset tested on the Lymphogry, page-blocks, Sonar datasets using KNN, the performance of the four classifiers differed less on the Lymphogry and page-blocks datasets, but the NM Score performed well on the Sonar dataset.
FIG. 11 results on the Lymphography data set.
FIG. 12 results on page-blocks dataset.
FIG. 13 results on the Sonar data set.
Tested with SVM on the Australian, arrhythmia and chess-krvkp data sets, NM Score still performs best, reaching the highest accuracy with the fewest features, while the LSDF algorithm also performs well on arrhythmia.
FIG. 14 results on Australian dataset.
Fig. 15 results on the arrhythmia data set.
FIG. 16 results on the chess-krvkp data set.
Tested with SVM on the close-simulation, CMC and cylinder-bands data sets, NM Score is slightly better.
FIG. 17 results on the close-simulation dataset.
Figure 18 results on the CMC data set.
FIG. 19 results on the cylinder-bands data set.
Tested with SVM on the Lymphography, page-blocks and Sonar data sets, the four algorithms differ little on Lymphography and page-blocks, but NM Score performs well on Sonar.
Fig. 20 results on the Lymphography data set.
FIG. 21 results on page-blocks dataset.
Figure 22 results on Sonar dataset.
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When software is used in whole or in part, the implementation may take the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
The above description is intended only to illustrate the present invention and not to limit its scope; all modifications, equivalents and improvements made within the spirit and principle of the invention are intended to fall within the scope defined by the appended claims.

Claims (10)

1. A semi-supervised feature selection method, characterized in that the feature selection method evaluates the local data structure of the unlabeled part of a data set using NLS, while computing the correlation between labels and features using MIC, thereby exploiting the small amount of label information in the data set; NLS and MIC are adaptively combined according to the conflict ratio between neighborhoods and labels to determine the NM Score of each feature and evaluate its importance.
2. The semi-supervised feature selection method of claim 1, wherein the feature selection method comprises the steps of:
step one, calculating the number of conflicts according to the neighborhood of the data and the label information;
step two, calculating a conflict proportion;
step three, judging whether each feature has been evaluated; when all features of the data have been evaluated, sorting the features, otherwise continuing the evaluation;
step four, respectively calculating the MIC score and the NL score of the data;
step five, calculating the final NM Score of the data;
step six, sorting the importance of the features according to the respective NM scores of all the features;
and step seven, returning a feature importance sorting result.
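The seven steps above can be sketched as a single ranking routine. This is a minimal sketch, not the patented implementation: the function name and the three score callables (`nl_score`, `mic_score`, `conflict_ratio`) are hypothetical, since the claims fix no API.

```python
import numpy as np

def nm_score_rank(X, y, nl_score, mic_score, conflict_ratio):
    """Rank features by NM Score following the seven claimed steps.

    X: (n_samples, n_features) array; y: labels (with a sentinel such
    as -1 for unlabeled samples).  The three callables are stand-ins
    for the scores defined in the later claims.
    """
    lam = conflict_ratio(X, y)              # steps 1-2: conflicts -> ratio
    scores = np.empty(X.shape[1])
    for k in range(X.shape[1]):             # step 3: loop until all features scored
        f = X[:, k]
        # step 4: neighborhood-structure score and label-relevance score
        nls, mic = nl_score(f, X, y), mic_score(f, y)
        # step 5: final NM Score, weighted by the conflict ratio
        scores[k] = lam * mic - (1 - lam) * nls
    order = np.argsort(-scores)             # step 6: sort by importance
    return order, scores                    # step 7: return the ranking
```

With dummy score callables this already produces a usable feature ordering; the real scores are plugged in per claims 3 to 5.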
3. The semi-supervised feature selection method of claim 2, wherein calculating the number of conflicts in step one and the conflict ratio in step two comprises:
calculating the number of conflicts between the neighborhood and the label information in the data, wherein a conflict is counted as follows:
when the labels of two samples are the same but the samples are not natural neighbors of each other, a conflict is counted;
when the labels of two samples are different but the samples are natural neighbors of each other, a conflict is counted;
the conflict ratio is calculated according to
λ = c / |Y|²
where c is the number of conflicts and |Y|² is the square of the number of labels.
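A minimal sketch of this conflict count and ratio, assuming the natural-neighbor sets have already been computed elsewhere (the natural-neighbor search itself is not part of this claim, so the `natural_neighbors` input is a hypothetical precomputed structure):

```python
def conflict_ratio(labels, natural_neighbors):
    """Conflict ratio lambda = c / |Y|^2 over all labeled sample pairs.

    labels: dict {sample_index: label} for the labeled samples only.
    natural_neighbors: dict {sample_index: set of natural-neighbor
    indices}, assumed precomputed (hypothetical input format).
    """
    idx = list(labels)
    c = 0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            mutual = (j in natural_neighbors.get(i, set())
                      and i in natural_neighbors.get(j, set()))
            same = labels[i] == labels[j]
            # same label without mutual natural neighborhood -> conflict;
            # different label with mutual natural neighborhood -> conflict
            if same != mutual:
                c += 1
    return c / (len(idx) ** 2)
```

Note both conflict rules collapse to the single test `same != mutual`: a conflict is exactly a disagreement between the label view and the neighborhood view.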
4. The semi-supervised feature selection method of claim 2, wherein calculating the NL score for the data in step four comprises:
fusing natural neighbors into the semi-supervised Laplacian score algorithm and modifying the construction of the within-class and between-class Laplacian matrices accordingly;
the final NL score is calculated as follows:
NLS(f_r) = (f_r^T L_w f_r) / (f_r^T L_b f_r)
wherein f_r is the r-th feature, and L_w and L_b are the within-class Laplacian matrix and the between-class Laplacian matrix, respectively;
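Given the two Laplacian matrices, the NL score of one feature reduces to a pair of quadratic forms. The ratio form below follows the classical Laplacian-score formulation and is an assumption consistent with the symbols in the claim (the source shows the exact formula only as an image):

```python
import numpy as np

def nl_score(f, L_w, L_b):
    """NL score of one feature vector f: the ratio of the within-class
    to the between-class Laplacian quadratic forms.

    L_w, L_b: within-class / between-class Laplacian matrices, assumed
    already built from the natural neighbors and labels.
    """
    num = float(f @ L_w @ f)
    den = float(f @ L_b @ f)
    # A small within-class spread relative to the between-class spread
    # (low score) indicates a locality-preserving, discriminative feature.
    return num / den if den != 0 else np.inf
```

A Laplacian here is the usual degree-minus-weight matrix, so each quadratic form sums weighted squared feature differences over sample pairs.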
the MIC score of the calculated data comprises: the MIC score is a maximum mutual information coefficient used to measure the correlation between the tag and the feature, and is calculated by the following formula:
Figure FDA0003529994420000023
MIC(U,V)=maxuv≤B(n,α){mu,v}。
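The MIC computation can be approximated as below. Note this is only a rough sketch: the true MIC optimizes the partition boundaries for every u-by-v grid (the ApproxMaxMI procedure), whereas this version uses simple equal-width bins; dedicated libraries exist for the exact statistic.

```python
import numpy as np
from itertools import product

def mutual_info(x, y, u, v):
    """I(U;V) with x binned into u equal-width bins, y into v bins."""
    pxy, _, _ = np.histogram2d(x, y, bins=(u, v))
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def mic_approx(x, y, alpha=0.6):
    """Rough MIC: maximize m_{u,v} = I(U;V) / log min(u, v) over all
    grids with u * v <= B(n, alpha) = n ** alpha."""
    n = len(x)
    B = n ** alpha
    best = 0.0
    for u, v in product(range(2, int(B) + 1), repeat=2):
        if u * v > B:
            continue
        best = max(best, mutual_info(x, y, u, v) / np.log(min(u, v)))
    return best
```

Since I(U;V) ≤ log min(u, v), the score stays in [0, 1], with 1 reached for a noiseless functional relationship.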
5. The semi-supervised feature selection method as recited in claim 2, wherein calculating the final NM Score of the data in step five comprises: the NM score is obtained by a weighted combination of the NL score and the MIC score, with the weight coefficient given by the conflict ratio, according to the formula:
S_NM(f_k) = λ · MIC(f_k) - (1 - λ) · NLS(f_k).
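The weighted combination is a one-liner; the sign convention follows the formula above (NLS is subtracted because, under the assumed Laplacian-score convention, a lower NLS is better):

```python
def nm_score(mic, nls, lam):
    """Final NM Score: the conflict ratio lam trades label relevance
    (MIC, higher is better) against the neighborhood-structure score
    (NLS, lower is better, hence the minus sign)."""
    return lam * mic - (1 - lam) * nls
```

When conflicts are frequent (lam near 1) the labels dominate; when neighborhoods and labels agree (lam near 0) the unsupervised structure dominates.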
6. The semi-supervised feature selection method of claim 2, wherein the feature importance ranking in step six comprises: evaluating the importance of each feature of the data to obtain a corresponding score; ranking the features by their scores to obtain the feature importance ranking result, and freely selecting the size of the feature subset as required.
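The ranking and free choice of the subset size can be sketched as follows (a trivial but complete piece of the pipeline; the function name is ours):

```python
import numpy as np

def select_top_features(scores, k):
    """Rank features by NM Score (descending) and return the column
    indices of the top-k subset; k is chosen freely by the user."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    return order[:k]
```

The returned indices can be applied directly, e.g. `X[:, select_top_features(scores, k)]`.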
7. A semi-supervised feature selection system for implementing the semi-supervised feature selection method of any one of claims 1 to 6, the semi-supervised feature selection system comprising:
the conflict number calculation module is used for calculating the number of conflicts according to the neighborhood of the data and the label information;
the conflict proportion calculation module is used for calculating a conflict proportion;
the judging module is used for judging whether each feature is evaluated, and when all the features of the data are evaluated, the feature sorting is carried out, otherwise, the evaluation is continued;
a Score calculating module for calculating MIC Score and NL Score of the data, respectively, and calculating final Score NM Score of the data;
and the characteristic importance ranking module is used for ranking the characteristic importance according to the respective NM scores of all the characteristics and returning a characteristic importance ranking result.
8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of: evaluating the local data structure of the unlabeled part of the data set using NLS, while computing the correlation between labels and features using MIC, thereby exploiting the small amount of label information in the data set; adaptively combining NLS and MIC according to the conflict ratio between neighborhoods and labels, determining the NM Score of each feature, and evaluating its importance.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of: evaluating the local data structure of the unlabeled part of the data set using NLS, while computing the correlation between labels and features using MIC, thereby exploiting the small amount of label information in the data set; adaptively combining NLS and MIC according to the conflict ratio between neighborhoods and labels, determining the NM Score of each feature, and evaluating its importance.
10. An information data processing terminal, characterized in that the information data processing terminal is adapted to implement the semi-supervised feature selection system as claimed in claim 7.
CN202210208158.2A 2022-03-03 2022-03-03 Semi-supervised feature selection method, system, medium, equipment and terminal Pending CN114611592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210208158.2A CN114611592A (en) 2022-03-03 2022-03-03 Semi-supervised feature selection method, system, medium, equipment and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210208158.2A CN114611592A (en) 2022-03-03 2022-03-03 Semi-supervised feature selection method, system, medium, equipment and terminal

Publications (1)

Publication Number Publication Date
CN114611592A true CN114611592A (en) 2022-06-10

Family

ID=81861215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210208158.2A Pending CN114611592A (en) 2022-03-03 2022-03-03 Semi-supervised feature selection method, system, medium, equipment and terminal

Country Status (1)

Country Link
CN (1) CN114611592A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115239485A (en) * 2022-08-16 2022-10-25 苏州大学 Credit evaluation method and system based on forward iteration constraint scoring feature selection



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination