CN112529077A - Embedded feature selection method and equipment based on prior probability distribution - Google Patents

Embedded feature selection method and equipment based on prior probability distribution

Info

Publication number
CN112529077A
CN112529077A (application CN202011438665.2A)
Authority
CN
China
Prior art keywords: value, determining, preset, module, sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011438665.2A
Other languages
Chinese (zh)
Inventor
陈会
姜青山
刘薇
肖焯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011438665.2A
Publication of CN112529077A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 - Bayesian classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an embedded feature selection method and equipment based on a prior probability distribution, wherein the method comprises the following steps: step 1, acquiring the samples of the k-th class in a training data set; step 2, giving a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function, and the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem; step 3, determining a preset mean value based on the mean of the one-dimensional Gaussian distribution function used to determine that a sample belongs to a preset class, the one-dimensional Gaussian distribution function being used to estimate the conditional probability in Bayes' theorem; step 4, determining an intermediate value based on the values of the samples and the preset mean value; step 5, determining the sum of the intermediate values subjected to the logarithmic operation; and step 6, determining the weights of the k-th class based on the preset constant, the intermediate values and the sum value. The scheme obtains the weights simply, quickly and effectively; the attributes with large weights can then be selected to represent each category, reducing data redundancy.

Description

Embedded feature selection method and equipment based on prior probability distribution
Technical Field
The invention relates to the technical field of computers, in particular to an embedded feature selection method and equipment based on prior probability distribution.
Background
Classification, i.e., using labeled samples to assign unlabeled samples to known classes, is a supervised learning technique. At present there are many classifiers with good performance, such as Decision Tree (DT), Logistic Regression (LR), Naive Bayes (NB) and neural networks. With the development of information technology, however, we increasingly face high-dimensional data: data volumes at the zettabyte scale and thousands of features. The resulting curse of dimensionality degrades the performance of classification.
Feature selection is an important data-mining preprocessing technique that attempts to remove redundant attributes from high-dimensional data. Conventional feature extraction methods include Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), Local Linear Embedding (LLE), the ReliefF algorithm, and the like. Currently there are three basic techniques for feature selection: filter, wrapper and embedded methods. The filter method selects features that have a strong correlation with the target variable but ignores the correlations among the features. The wrapper method relies on the correlation coefficients of a linear model; when the absolute value of a correlation coefficient is small, the AUC obtained by the model does not change or decreases only slightly. The NB classifier has the advantages of interpretability, simplicity, practicality, scalability and good incremental-learning ability, and for these reasons it is widely used for classification problems encountered in data mining. However, the usual naive Bayes is based on the conditional independence assumption and cannot be used directly in practical applications: all attributes play the same role in a given class (w_1 = w_2 = ... = w_D, where w denotes the weight, i.e., the importance, of an attribute). To alleviate the conditional independence assumption, some researchers have investigated feature weighting methods. However, these methods are almost independent of the NB classification process itself and act only as a separate processing step.
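For orientation, the following is a minimal sketch (with an invented toy dataset, not taken from the patent) of a standard Gaussian naive Bayes classifier as provided by scikit-learn; under the conditional independence assumption every attribute enters the product of class-conditional densities with the same implicit weight, which is exactly the limitation the weighting methods discussed below try to relax:

# Minimal sketch: plain Gaussian naive Bayes, where all D attributes are
# treated as equally important within each class. The dataset below is an
# invented toy example for illustration only.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                               # 200 samples, D = 10 attributes
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)   # label driven mainly by attribute 0

clf = GaussianNB().fit(X, y)
print(clf.predict_proba(X[:3]))
# Every attribute contributes identically to the product of conditional
# densities, even though only attribute 0 is informative here.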
In the prior art, in order to deal with the conditional independence assumption and embed feature weighting into the feature selection algorithm itself, scholars at home and abroad have carried out related research. One study proposed a feature-weighted naive Bayes (FWNB) that learns, from a Gaussian distribution, one global weight per feature over all classes; its feature weights can be denoted w_j, meaning that the j-th attribute has the same weight for every class. Furthermore, Chen et al. proposed subspace weighted naive Bayes (SWNB), in which each class has its own weights, denoted w_kj; for class k each attribute plays the same role, but the weights differ for different classes. SWNB iteratively optimizes the weight values using Newton's method. In addition, class-specific attribute weighted naive Bayes (CAWNB) has been proposed, which learns weights by maximizing a conditional log-likelihood (CLL) objective function and minimizing a mean squared error (MSE) objective function, and optimizes the weight matrix with L-BFGS-M.
Specifically, most current research methods perform feature selection as a data preprocessing step that is separate from the overall algorithm, and most of them rely mainly on optimization computations, which require considerable algorithm time.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an embedded feature selection method and equipment based on prior probability distribution.
The embodiment of the invention provides an embedded feature selection method based on prior probability distribution, which comprises the following steps:
step 1, acquiring the samples of the k-th class in a training data set;
step 2, giving a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function; the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem;
step 3, determining a preset mean value based on the mean of the one-dimensional Gaussian distribution function used to determine that a sample belongs to a preset class; the one-dimensional Gaussian distribution function is used to estimate the conditional probability in Bayes' theorem;
step 4, determining an intermediate value based on the values of the samples and the preset mean value;
step 5, determining the sum of the intermediate values subjected to the logarithmic operation;
and step 6, determining the weights of the k-th class based on the preset constant, the intermediate values and the sum value.
In a specific embodiment, the method further comprises the following steps:
the value of K is changed and steps 1-6 are repeated to determine the weights of all classes in the training dataset.
In a specific embodiment, the method further comprises the following steps:
determining a final class of the sample based on the weights.
In a specific embodiment, the intermediate value is determined based on the following formula:
[formula image not reproduced]
wherein X_kj is the intermediate value, μ_kj is the preset mean value, x_ij is the value of the sample, and |c_k| is the number of samples belonging to the k-th class in the training samples.
In a specific embodiment, the weight is determined based on the following formula:
[formula image not reproduced]
wherein τ is the preset constant, α is the hyperparameter, w_kj is the weight, X_kj is the intermediate value, D is the number of attributes (the dimension of a sample), λ_1 is the introduced Lagrange multiplier, |c_k| is the number of samples belonging to the k-th class in the training samples, and σ_k is the standard deviation of the class-k samples estimated by the one-dimensional Gaussian function.
The embodiment of the invention also provides embedded feature selection equipment based on prior probability distribution, which comprises the following steps:
the acquisition module is used for acquiring the samples of the k-th class in the training data set;
the first determining module is used for giving a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function; the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem;
a second determination module that determines a preset mean value based on an average value of one-dimensional Gaussian distribution functions used to determine that the sample belongs to a preset class; the one-dimensional Gaussian distribution function is used for estimating the conditional probability of Bayes theorem;
an intermediate value module for determining an intermediate value based on the values of the samples and the preset mean value;
the sum module is used for determining the sum of the intermediate values subjected to logarithmic operation;
and the weight module is used for determining the weights of the k-th class based on the preset constant, the intermediate value and the sum value.
In a specific embodiment, the device further comprises:
and the processing module is used for replacing the K value and sequentially and repeatedly executing the acquisition module to the weight module so as to determine the weights of all classes in the training data set.
In a specific embodiment, the device further comprises:
a classification module to determine a final classification of the sample based on the weights.
In a specific embodiment, the intermediate value is determined based on the following formula:
[formula image not reproduced]
wherein X_kj is the intermediate value, μ_kj is the preset mean value, x_ij is the value of the sample, and |c_k| is the number of samples belonging to the k-th class in the training samples.
An embodiment of the present invention further provides a computer storage medium, in which a program for executing the above method is stored.
Therefore, the embodiments of the invention provide an embedded feature selection method and equipment based on a prior probability distribution, wherein the method comprises the following steps: step 1, acquiring the samples of the k-th class in a training data set; step 2, giving a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function, and the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem; step 3, determining a preset mean value based on the mean of the one-dimensional Gaussian distribution function used to determine that a sample belongs to a preset class, the one-dimensional Gaussian distribution function being used to estimate the conditional probability in Bayes' theorem; step 4, determining an intermediate value based on the values of the samples and the preset mean value; step 5, determining the sum of the intermediate values subjected to the logarithmic operation; and step 6, determining the weights of the k-th class based on the preset constant, the intermediate values and the sum value. By introducing the Dirichlet distribution to estimate the prior probability of Bayesian theory, the scheme obtains an analytic expression for the algorithm output, so the weights can be computed simply, quickly and effectively. After the weights are obtained, the attributes with large weights can be selected to represent each category, reducing data redundancy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of an embedded feature selection method based on prior probability distribution according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an embedded feature selection device based on prior probability distribution according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an embedded feature selection device based on prior probability distribution according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a frame structure of a computer storage medium according to an embodiment of the present invention.
Detailed Description
Various embodiments of the present disclosure will be described more fully hereinafter. The present disclosure is capable of various embodiments and of modifications and variations therein. However, it should be understood that: there is no intention to limit the various embodiments of the disclosure to the specific embodiments disclosed herein, but rather, the disclosure is to cover all modifications, equivalents, and/or alternatives falling within the spirit and scope of the various embodiments of the disclosure.
The terminology used in the various embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the present disclosure. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the present disclosure belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined in various embodiments of the present disclosure.
Example 1
Embodiment 1 of the invention discloses an embedded feature selection method based on a prior probability distribution which, as shown in Fig. 1, comprises the following steps:
Step 1: acquire the samples of the k-th class in the training data set. Specifically, for example, a class-k sample x_i = <x_i1, x_i2, ..., x_iD> is obtained from the training data set.
Step 2: give a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function; the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem.
First, in probability and statistics, the probability density of the Dirichlet distribution is:
f(x_1, ..., x_D; α_1, ..., α_D) = (1 / B(α)) · ∏_{j=1}^{D} x_j^(α_j − 1)
B(α) = ∏_{j=1}^{D} Γ(α_j) / Γ(∑_{j=1}^{D} α_j)
where α = (α_1, ..., α_D) is the hyperparameter vector, x_1, ..., x_D > 0 and x_1 + x_2 + ... + x_D = 1. A more common form is the symmetric Dirichlet distribution, in which all of α_1, ..., α_D take the same value α. Since there is usually no prior knowledge indicating that one component is better than the others, the symmetric form is often used when a Dirichlet prior is employed:
f(x_1, ..., x_D; α) = (1 / B(α)) · ∏_{j=1}^{D} x_j^(α − 1)
B(α) = Γ(α)^D / Γ(D·α)
When α = 1, the above formula reduces to a uniform distribution over the simplex regardless of the value of x; when α > 1 the distribution becomes more concentrated, and the components of a single sample tend to take similar values; when α < 1 the distribution becomes sharper, and in a single sample most components tend towards 0 while a few take larger values.
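A small illustrative sketch of this behaviour, drawing symmetric Dirichlet samples with numpy for several values of α:

# Illustrative sketch: samples from a symmetric Dirichlet prior for several
# values of alpha (alpha = 1 uniform over the simplex, alpha > 1 near-equal
# components, alpha < 1 sparse samples with most mass on few components).
import numpy as np

D = 5
rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    sample = rng.dirichlet([alpha] * D)
    print(f"alpha={alpha:>4}: {np.round(sample, 3)}")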
Specifically, Bayes' theorem can be expressed as P(k|x) P(x) = P(k) P(x|k); then
P(k|x) = P(k) P(x|k) / P(x)
Since P(x) does not vary with k, it can be treated as a constant, so P(k|x) is proportional to P(k) P(x|k). Based on the conditional independence assumption and the subspace weighting method, then:
P(k|x) ∝ P(k) · ∏_{j=1}^{D} P(x_j | k)^(w_kj)
In the probability-density formulation, Bayes' theorem is combined with density functions: the prior probability P(k) can be obtained from the density function f(k), and the likelihood P(x|k) from the density function f(x|k).
For classification, the probability that a sample xt to be detected belongs to which class is large is the same as the class, so that the parameters can be optimized to the maximum extent:
J_1(θ_k) = p(θ_k | c_k) ∝ p(θ_k) · p(c_k | θ_k);
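As an illustration of how the quantities above combine at prediction time, the sketch below evaluates the subspace-weighted log-posterior log P(k) + Σ_j w_kj · log f(x_j; μ_kj, σ_k) for one sample; all numerical values are invented placeholders, and the weights are assumed to have been obtained already:

# Illustrative computation of the subspace-weighted posterior described above.
# All parameter values are placeholders; in the method they would be estimated
# from the training data (priors, means, per-class standard deviations, weights).
import numpy as np
from scipy.stats import norm

def weighted_log_posterior(x, log_prior, mu, sigma, w):
    # x: (D,) sample; log_prior, sigma: (K,); mu, w: (K, D)
    log_cond = norm.logpdf(x[None, :], loc=mu, scale=sigma[:, None])  # (K, D) per-attribute log-densities
    return log_prior + (w * log_cond).sum(axis=1)                     # (K,) weighted log-posteriors (up to a constant)

scores = weighted_log_posterior(
    x=np.array([0.2, -1.0, 0.5]),
    log_prior=np.log([0.6, 0.4]),
    mu=np.array([[0.0, -1.0, 0.3], [1.0, 0.5, -0.2]]),
    sigma=np.array([1.0, 0.8]),
    w=np.array([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]),
)
print("predicted class:", int(np.argmax(scores)))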
step 3, determining a preset average value based on the average value of the one-dimensional Gaussian distribution function for determining that the sample belongs to a preset class; the one-dimensional Gaussian distribution function is used for estimating the conditional probability of Bayes theorem;
assuming that the probability that a sample belongs to class k is represented by a one-dimensional Gaussian distribution
f(x; μ_k, σ_k) = (1 / (√(2π) · σ_k)) · exp(−(x − μ_k)² / (2σ_k²))
where μ_k and σ_k denote the mean and standard deviation of the one-dimensional Gaussian distribution.
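For concreteness, a short sketch of how such per-class Gaussian parameters could be estimated from a labelled training matrix; using a single σ_k per class follows the notation of this description, while the array layout and pooling are assumptions of the sketch:

# Sketch: estimate mu_kj (mean of attribute j over the samples of class k) and
# a single sigma_k per class, pooled over all attributes of that class.
import numpy as np

def estimate_gaussian_params(X, y):
    classes = np.unique(y)
    mu = np.vstack([X[y == k].mean(axis=0) for k in classes])                      # shape (K, D)
    sigma = np.array([(X[y == k] - mu[i]).std() for i, k in enumerate(classes)])   # shape (K,)
    return classes, mu, sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 3, size=50)
classes, mu, sigma = estimate_gaussian_params(X, y)
print(classes, mu.shape, sigma)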
Step 4, determining a middle value based on the value of the sample and the preset average value;
specifically, the intermediate value is determined based on the following formula:
[formula image not reproduced]
wherein X_kj is the intermediate value, μ_kj is the preset mean value, x_ij is the value of the sample, and |c_k| is the number of samples belonging to the k-th class in the training samples.
Step 5, determining the sum of the intermediate values subjected to logarithmic operation;
Specifically, ln w_kj denotes the intermediate value subjected to the logarithmic operation, and the sum value is determined based on the formula:
[formula image not reproduced]
And 6, determining the weight of the K classes based on a preset constant, the intermediate value and the sum value.
The weights are determined based on the following formula:
[formula image not reproduced]
wherein τ is the preset constant, α is the hyperparameter, w_kj is the weight, X_kj is the intermediate value, D is the number of attributes (the dimension of a sample), λ_1 is the introduced Lagrange multiplier, |c_k| is the number of samples belonging to the k-th class in the training samples, and σ_k is the standard deviation of the class-k samples estimated by the one-dimensional Gaussian function.
Specifically, based on the dirichlet distribution and the gaussian distribution in the above steps 2 and 3, through logarithmic transformation, the objective function is defined as:
Figure BDA0002829399080000092
then:
Figure BDA0002829399080000093
wherein,
Figure BDA0002829399080000101
The constraint conditions are as follows:
[formula image not reproduced]
Taking the logarithm of both sides:
[formula image not reproduced]
Introducing a Lagrange multiplier, the objective function becomes:
[formula image not reproduced]
Taking the derivative of the objective function with respect to σ_k:
[formula image not reproduced]
then:
[formula (2), image not reproduced]
Taking the derivative of the objective function with respect to w_kj:
[formula image not reproduced]
When X_kj ≠ 0,
[formula (3), image not reproduced]
Taking the derivative of the objective function with respect to λ_1:
[formula image not reproduced]
Substituting (2) into (3) yields:
[formula (4), image not reproduced]
Substituting (4) into formula (2) yields:
[formula image not reproduced]
In addition, the scheme further includes: changing the value of k and repeating steps 1-6 to determine the weights of all classes in the training data set.
In this scheme, feature selection is embedded into the learning algorithm itself, and the Dirichlet distribution is introduced to estimate the prior probability of Bayesian theory, so that an analytic expression for the algorithm output is obtained and the weights can be computed simply, quickly and effectively. After the weights are obtained, the attributes with large weights can be selected to represent each category, reducing data redundancy. The class of a sample to be tested can also be judged using Bayesian theory.
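To make the overall flow concrete, the following end-to-end sketch estimates per-class Gaussian parameters, derives one weight per (class, attribute) pair, and keeps the attributes with the largest weights for each class. The closed-form weight expression of this scheme is given only as an image in the source text, so the sketch substitutes an explicitly illustrative stand-in weight: (α − 1) plus the attribute's summed Gaussian log-likelihood over the class samples, shifted to be non-negative and renormalised per class. The function names and the stand-in weight are assumptions of the sketch, not the patent's formula.

# End-to-end sketch of the embedded selection idea. The per-(class, attribute)
# weight used here is an illustrative stand-in, NOT the patent's closed form.
import numpy as np
from scipy.stats import norm

def fit_weights(X, y, alpha=2.0):
    classes = np.unique(y)
    weights = {}
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)                                        # mu_kj for each attribute j
        sigma_k = Xk.std() + 1e-9                                     # one sigma per class
        X_kj = norm.logpdf(Xk, loc=mu_k, scale=sigma_k).sum(axis=0)   # summed log-likelihood per attribute
        raw = (alpha - 1.0) + (X_kj - X_kj.min())                     # shift so every entry is at least alpha - 1
        weights[k] = raw / raw.sum()                                  # renormalise to sum to 1 per class
    return weights

def select_top_attributes(weights, m=3):
    # keep the m attributes with the largest weight for each class
    return {k: np.argsort(w)[::-1][:m] for k, w in weights.items()}

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)
print(select_top_attributes(fit_weights(X, y)))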
Example 2
Embodiment 2 of the present invention also discloses an embedded feature selection device based on prior probability distribution, as shown in fig. 2, including:
an obtaining module 201, configured to obtain a kth class sample in a training data set;
a first determining module 202, configured to give a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function; the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem;
a second determining module 203, configured to determine a preset mean value based on a mean value of a one-dimensional gaussian distribution function used to determine that the sample belongs to a preset class; the one-dimensional Gaussian distribution function is used for estimating the conditional probability of Bayes theorem;
an intermediate value module 204, configured to determine an intermediate value based on the values of the samples and the preset mean value;
a sum module 205, configured to determine a sum of the intermediate values subjected to the logarithm operation;
a weight module 206, configured to determine the weights of the k-th class based on the preset constant, the intermediate value, and the sum value.
In a specific embodiment, as shown in Fig. 3, the device further includes:
a processing module 207, configured to replace the value of k and repeatedly execute the obtaining module through the weight module in sequence, so as to determine the weights of all classes in the training data set.
In a specific embodiment, the device further comprises:
a classification module to determine a final classification of the sample based on the weights.
In a specific embodiment, the intermediate value is determined based on the following formula:
[formula image not reproduced]
wherein X_kj is the intermediate value, μ_kj is the preset mean value, x_ij is the value of the sample, and |c_k| is the number of samples belonging to the k-th class in the training samples.
In a specific embodiment, the weight is determined based on the following formula:
[formula image not reproduced]
wherein τ is the preset constant, α is the hyperparameter, w_kj is the weight, X_kj is the intermediate value, D is the number of attributes (the dimension of a sample), λ_1 is the introduced Lagrange multiplier, |c_k| is the number of samples belonging to the k-th class in the training samples, and σ_k is the standard deviation of the class-k samples estimated by the one-dimensional Gaussian function.
Example 3
Embodiment 3 of the present invention also discloses a computer storage medium, as shown in fig. 4, in which a program for executing the method described in embodiment 1 is stored.
Therefore, the embodiments of the invention provide an embedded feature selection method and equipment based on a prior probability distribution, wherein the method comprises the following steps: step 1, acquiring the samples of the k-th class in a training data set; step 2, giving a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function, and the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem; step 3, determining a preset mean value based on the mean of the one-dimensional Gaussian distribution function used to determine that a sample belongs to a preset class, the one-dimensional Gaussian distribution function being used to estimate the conditional probability in Bayes' theorem; step 4, determining an intermediate value based on the values of the samples and the preset mean value; step 5, determining the sum of the intermediate values subjected to the logarithmic operation; and step 6, determining the weights of the k-th class based on the preset constant, the intermediate values and the sum value. By introducing the Dirichlet distribution to estimate the prior probability of Bayesian theory, the scheme obtains an analytic expression for the algorithm output, so the weights can be computed simply, quickly and effectively. After the weights are obtained, the attributes with large weights can be selected to represent each category, reducing data redundancy.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above serial numbers of the embodiments of the invention are merely for description and do not represent the merits of the implementation scenarios.
The above disclosure is only a few specific implementation scenarios of the present invention, however, the present invention is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.

Claims (10)

1. An embedded feature selection method based on prior probability distribution is characterized by comprising the following steps:
step 1, acquiring the samples of the k-th class in a training data set;
step 2, giving a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function; the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem;
step 3, determining a preset mean value based on the mean of the one-dimensional Gaussian distribution function used to determine that a sample belongs to a preset class; the one-dimensional Gaussian distribution function is used to estimate the conditional probability in Bayes' theorem;
step 4, determining an intermediate value based on the values of the samples and the preset mean value;
step 5, determining the sum of the intermediate values subjected to the logarithmic operation;
and step 6, determining the weights of the k-th class based on the preset constant, the intermediate values and the sum value.
2. The method of claim 1, further comprising:
steps 1-6 are repeated to determine the weights of all classes in the training dataset.
3. The method of claim 1 or 2, further comprising:
determining a final class of the sample based on the weights.
4. The method of claim 1, wherein the intermediate value is determined based on the following formula:
[formula image not reproduced]
wherein X_kj is the intermediate value, μ_kj is the preset mean value, x_ij is the value of the sample, and |c_k| is the number of samples belonging to the k-th class in the training samples.
5. The method of claim 1, wherein the weight is determined based on the following formula:
[formula image not reproduced]
wherein τ is the preset constant, α is the hyperparameter, w_kj is the weight, X_kj is the intermediate value, D is the number of attributes (the dimension of a sample), λ_1 is the introduced Lagrange multiplier, |c_k| is the number of samples belonging to the k-th class in the training samples, and σ_k is the standard deviation of the class-k samples estimated by the one-dimensional Gaussian function.
6. An embedded feature selection device based on prior probability distribution, comprising:
the acquisition module is used for acquiring the samples of the k-th class in the training data set;
the first determining module is used for giving a preset constant, wherein the preset constant is obtained from the weights based on a Dirichlet distribution function; the Dirichlet distribution is used to estimate the prior probability in Bayes' theorem;
the second determining module is used for determining a preset mean value based on the mean of the one-dimensional Gaussian distribution function used to determine that a sample belongs to a preset class; the one-dimensional Gaussian distribution function is used to estimate the conditional probability in Bayes' theorem;
an intermediate value module for determining an intermediate value based on the values of the samples and the preset mean value;
the sum module is used for determining the sum of the intermediate values subjected to logarithmic operation;
and the weight module is used for determining the weights of the k-th class based on the preset constant, the intermediate value and the sum value.
7. The apparatus of claim 6, further comprising:
and the processing module is used for replacing the K value and sequentially and repeatedly executing the acquisition module to the weight module so as to determine the weights of all classes in the training data set.
8. The device of claim 6 or 7, further comprising:
a classification module to determine a final classification of the sample based on the weights.
9. The device of claim 6, wherein the intermediate value is determined based on the following formula:
[formula image not reproduced]
wherein X_kj is the intermediate value, μ_kj is the preset mean value, x_ij is the value of the sample, and |c_k| is the number of samples belonging to the k-th class in the training samples.
10. A computer storage medium having a program stored therein for executing the method of claims 1-5.
CN202011438665.2A 2020-12-10 2020-12-10 Embedded feature selection method and equipment based on prior probability distribution Pending CN112529077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011438665.2A CN112529077A (en) 2020-12-10 2020-12-10 Embedded feature selection method and equipment based on prior probability distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011438665.2A CN112529077A (en) 2020-12-10 2020-12-10 Embedded feature selection method and equipment based on prior probability distribution

Publications (1)

Publication Number Publication Date
CN112529077A (en) 2021-03-19

Family

ID=74999281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011438665.2A Pending CN112529077A (en) 2020-12-10 2020-12-10 Embedded feature selection method and equipment based on prior probability distribution

Country Status (1)

Country Link
CN (1) CN112529077A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination