CN110929801B - Improved Euclid distance KNN classification method and system - Google Patents


Info

Publication number
CN110929801B
Authority
CN
China
Prior art keywords
sample
training
class
projection
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911215801.9A
Other languages
Chinese (zh)
Other versions
CN110929801A (en)
Inventor
徐承俊
朱国宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201911215801.9A priority Critical patent/CN110929801B/en
Publication of CN110929801A publication Critical patent/CN110929801A/en
Application granted granted Critical
Publication of CN110929801B publication Critical patent/CN110929801B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a classification method and system based on an improved Euclid distance KNN. A data set is first acquired from a database and divided into a test set and a training set; a neighbor parameter K value is set; a projection vector w is calculated according to the LDA (Linear Discriminant Analysis) algorithm; a neighbor graph G(V, E) is constructed from the training set; for each data sample x_test in the test set, the K neighbors of x_test in the training set are found from the neighbor graph; the estimated value $\hat{f}(x_{test})$ of the data sample x_test is returned and the sample class is determined. The invention has the following advantages: (1) good noise resistance, which solves the problem that traditional KNN is sensitive to noise; (2) the improved Euclid distance replaces the Euclidean metric of traditional KNN, distinguishes samples better, and improves classification accuracy without increasing computational complexity.

Description

Improved Euclid distance KNN classification method and system
Technical Field
The invention relates to the technical field of data classification, in particular to a classification method and system based on an improved Euclid distance KNN.
Background
In the current big data era, data of all kinds are large in scale and wide in range, and must be classified and processed to facilitate further analysis. The KNN algorithm classifies data with the following basic idea: for any given sample to be classified, find its nearest K neighbors, then vote on its class according to the class attributes of those K neighbors. The distance measure of the KNN algorithm is mainly the Euclid (Euclidean) distance between the sample under test and the training samples. The KNN algorithm assumes that all samples correspond to points in the n-dimensional space R^n, and the nearest neighbors of a sample are defined by the standard Euclid distance. When judging a class, the KNN algorithm depends only on a very small number of adjacent samples; the class is determined mainly by the limited neighboring samples rather than by a class-domain discrimination method, so KNN is better suited than other classifiers to sample sets whose class domains overlap or cross extensively.
The KNN algorithm is a lazy learning method, which leads to slow classification, strong dependence on the sample-library capacity, and low data-processing efficiency when the sample data volume is large; in particular, its distance metric is sensitive to noise features when the samples contain noise.
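The classical KNN vote described above can be sketched in a few lines. This is a minimal NumPy illustration of standard KNN with the plain Euclidean metric (the baseline the patent improves upon); the function name and the toy data are illustrative, not from the patent:

```python
import numpy as np
from collections import Counter

def knn_predict(x_query, X_train, y_train, k=3):
    """Classical KNN: vote among the k training samples closest
    to x_query under the standard Euclidean distance."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclid distance to every training sample
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]                  # majority class

# toy example: two well-separated 2-D classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.2, 0.1]), X, y, k=3))  # -> 0
```

Because every prediction scans all training samples, classification is slow on large sample libraries, and a single noisy neighbor can flip the vote — the two weaknesses the invention targets.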
Disclosure of Invention
The invention provides an improved Euclid distance KNN classification method to solve the problem, described in the background, that the metric for calculating the distance is sensitive to noise features.
To achieve the above object, the present invention provides a classification method based on an improved Euclid distance KNN, comprising the following specific steps:
step1, acquiring a data set from the database, and dividing the data set into a test set and a training set;
step2, setting a neighbor parameter K value;
step3, solving a projection vector w of a training set according to a Linear Discriminant Analysis algorithm;
step4, constructing a neighbor graph G (V, E) according to the training set, wherein G represents the neighbor graph, V represents a node, namely each training sample in the training set, and E represents an edge connecting each training sample;
step5, for each data sample x_test in the test set, finding the K neighbors of x_test in the training set from the neighbor graph;
step6, returning the estimated value $\hat{f}(x_{test})$ of the data sample x_test, where

$$\hat{f}(x_{test}) = \arg\max_{v \in V} \sum_{i=1}^{K} \delta(v, f(x_i)),$$

f(x_i) represents the classification problem function, x_i represents the i-th training sample, v represents the class corresponding to a training sample, V = {v_1, v_2, …, v_s} denotes the set of data classes, $\hat{f}(x_{test})$ is the final class of the data sample x_test, and

$$\delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b. \end{cases}$$
further, Step2 sets K to 1,3,5,7,9,11,13, 15.
Further, the projection vector w in Step3 is calculated as follows.
Taking two-class classification as an example, the optimal projection vector w is solved by quantitative analysis.
Given N training samples with d-dimensional features $\{x_1, x_2, \ldots, x_N\}$, first find the mean value, i.e. the center point, of each class of training samples, where i = 1, 2:

$$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x.$$

Specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i represents the mean of the i-th class of training samples.
The projection of a training sample x onto w is calculated as y = w^T x, and the mean of the sample points after projecting the training samples onto w is

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i;$$

therefore, the projected mean value is the projection of the sample center point.
The best straight line is the one that separates the projected center points of the two classes as much as possible; the quantitative expression is

$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|.$$

The scatter value of each projected class is obtained as

$$\tilde{S}_i^2 = \sum_{y \in Y_i} (y - \tilde{\mu}_i)^2,$$

and the projection vector w is finally measured by the metric formula

$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{S}_1^2 + \tilde{S}_2^2}.$$

According to the above formula, it suffices to find the w that maximizes J(w); the solution is as follows.
Expanding the scatter-value formula:

$$\tilde{S}_i^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = w^T S_i w,$$

where

$$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$

is the scatter matrix.
Then let S_W = S_1 + S_2, where S_W is called the within-class scatter matrix, and let S_B = (μ_1 − μ_2)(μ_1 − μ_2)^T, where S_B is called the between-class scatter matrix.
J(w) is finally expressed as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}.$$

Take the derivative, normalizing the denominator beforehand: let $\|w^T S_W w\| = 1$, add a Lagrange multiplier, and differentiate to obtain

$$S_W^{-1} S_B w = \lambda w.$$

It follows that w is an eigenvector of the matrix $S_W^{-1} S_B$.
In particular, because $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w$, where the product of the latter two terms is a scalar constant, denoted λ_w,

$$S_W^{-1} S_B w = \lambda_w S_W^{-1}(\mu_1 - \mu_2) = \lambda w.$$

Since scaling w by any factor does not affect the result, the unknown constants λ and λ_w on both sides cancel, giving

$$w = S_W^{-1}(\mu_1 - \mu_2).$$

Therefore, only the means and scatter matrices of the original training samples are needed to calculate the optimal w.
Further, in Step4, the weight of an edge in the neighbor graph is determined by the formula

$$E(x_i, x_j) = \exp\left(-\frac{\sum_{l=1}^{m} \left(w^T (x_i^l - x_j^l)\right)^2}{t}\right),$$

where x^l represents the l-th feature vector of a training sample x, x_i and x_j represent the i-th and j-th training samples respectively, m is the number of feature vectors, t is an arbitrary constant, and w is the projection vector obtained in Step3.
Further, m takes the value 5, the feature vectors being respectively the stroke, contour, intersection-point, end-point and gray-level feature vectors of the image.
The invention also provides an improved Euclid distance KNN classification system, which comprises the following modules:
the data set acquisition module is used for acquiring a data set from a database and dividing the data set into a test set and a training set;
the parameter setting module is used for setting a neighbor parameter K value;
the projection vector w solving module is used for solving the projection vector w of the training set according to the Linear Discriminant Analysis algorithm;
the neighbor graph constructing module is used for constructing a neighbor graph G (V, E) according to the training set, wherein G represents the neighbor graph, V represents a node, namely each training sample in the training set, and E represents an edge connecting each training sample;
the K neighbor search module, for finding, for each data sample x_test in the test set, the K neighbors of x_test in the training set;
the sample class determination module, for returning the estimated value $\hat{f}(x_{test})$ of the data sample x_test, where

$$\hat{f}(x_{test}) = \arg\max_{v \in V} \sum_{i=1}^{K} \delta(v, f(x_i)),$$

f(x_i) represents the classification problem function, x_i represents the i-th training sample, v represents the class corresponding to a training sample, $\hat{f}(x_{test})$ is the final class of the data sample x_test, and

$$\delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b. \end{cases}$$
further, the setting K in the parameter setting module is 1,3,5,7,9,11,13, 15.
Furthermore, the projection vector w in the projection vector w solving module is calculated as follows.
Taking two-class classification as an example, the optimal projection vector w is solved by quantitative analysis.
Given N training samples with d-dimensional features $\{x_1, x_2, \ldots, x_N\}$, first find the mean value, i.e. the center point, of each class of training samples, where i = 1, 2:

$$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x.$$

Specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i represents the mean of the i-th class of training samples.
The projection of a training sample x onto w is calculated as y = w^T x, and the mean of the sample points after projecting the training samples onto w is

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i;$$

therefore, the projected mean value is the projection of the sample center point.
The best straight line is the one that separates the projected center points of the two classes as much as possible; the quantitative expression is

$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|.$$

The scatter value of each projected class is obtained as

$$\tilde{S}_i^2 = \sum_{y \in Y_i} (y - \tilde{\mu}_i)^2,$$

and the projection vector w is finally measured by the metric formula

$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{S}_1^2 + \tilde{S}_2^2}.$$

According to the above formula, it suffices to find the w that maximizes J(w); the solution is as follows.
Expanding the scatter-value formula:

$$\tilde{S}_i^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = w^T S_i w,$$

where

$$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$

is the scatter matrix.
Then let S_W = S_1 + S_2, where S_W is called the within-class scatter matrix, and let S_B = (μ_1 − μ_2)(μ_1 − μ_2)^T, where S_B is called the between-class scatter matrix.
J(w) is finally expressed as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}.$$

Take the derivative, normalizing the denominator beforehand: let $\|w^T S_W w\| = 1$, add a Lagrange multiplier, and differentiate to obtain

$$S_W^{-1} S_B w = \lambda w.$$

It follows that w is an eigenvector of the matrix $S_W^{-1} S_B$.
In particular, because $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w$, where the product of the latter two terms is a scalar constant, denoted λ_w,

$$S_W^{-1} S_B w = \lambda_w S_W^{-1}(\mu_1 - \mu_2) = \lambda w.$$

Since scaling w by any factor does not affect the result, the unknown constants λ and λ_w on both sides cancel, giving

$$w = S_W^{-1}(\mu_1 - \mu_2).$$

Therefore, only the means and scatter matrices of the original training samples are needed to calculate the optimal w.
Further, in the neighbor graph constructing module, the weight of an edge in the neighbor graph is determined by the formula

$$E(x_i, x_j) = \exp\left(-\frac{\sum_{l=1}^{m} \left(w^T (x_i^l - x_j^l)\right)^2}{t}\right),$$

where x^l represents the l-th feature vector of a training sample x, x_i and x_j represent the i-th and j-th training samples respectively, m is the number of feature vectors, t is an arbitrary constant, and w is the projection vector obtained by the projection vector w solving module.
Further, m takes the value 5, the feature vectors being respectively the stroke, contour, intersection-point, end-point and gray-level feature vectors of the image.
Compared with the prior art, the invention has the following beneficial effects. The invention provides an improved Euclid distance KNN classification method in which a neighbor parameter K is preset, a projection vector w is calculated by the LDA (Linear Discriminant Analysis) algorithm, and the training data set is built into a neighbor graph G(V, E), where G represents the neighbor graph, V represents the nodes, i.e. the data samples, and E represents the edges connecting the data samples; the edge weight is determined by the formula

$$E(x_i, x_j) = \exp\left(-\frac{\sum_{l=1}^{m} \left(w^T (x_i^l - x_j^l)\right)^2}{t}\right),$$

where x^l represents the value of the l-th feature of sample x, x_i and x_j represent the i-th and j-th samples respectively, t is an arbitrary constant, and w is the projection vector. For each data sample x_test in the test set, the K neighbors in the training set are found; the return value $\hat{f}(x_{test})$ of the KNN algorithm is the estimate of the class of x_test, i.e. the class is judged by the most common f value among the K training samples closest to x_test. Since the traditional KNN algorithm adopts the Euclid metric, whose distance measure is sensitive to noise features, the KNN algorithm is improved by replacing the traditional Euclid distance with the improved Euclid distance of this method. The method inherits the good distinguishability, noise immunity and robustness of the LDA projection vector, can distinguish and classify multidimensional data well, maintains high resolution and good computational performance, and can serve as a reference for similar KNN research.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the present invention is further described below with reference to the accompanying drawings and the embodiments.
FIG. 1 is a simplified flow chart of the classification method based on the improved Euclid distance KNN according to the present invention;
FIG. 2 is a schematic view of a training sample of the present invention projected onto a straight line;
FIG. 3 is a schematic view of a sample center projection of the present invention;
FIG. 4 is a diagram illustrating the present invention using LDA to solve the optimal projection vector w;
FIG. 5 is a graphical representation of the classification performance of the USPS data set in accordance with the present invention;
fig. 6 is a diagram illustrating the classification performance of the MNIST data set according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The detailed description of the embodiments of the present invention generally described and illustrated in the figures herein is not intended to limit the scope of the invention, which is claimed, but is merely representative of selected embodiments of the invention.
It should be noted that: like reference symbols in the following drawings indicate like items, and thus, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
Referring to fig. 1, fig. 1 is a simplified flow chart of the classification method based on the improved Euclid distance KNN according to the present invention. The embodiment is particularly applicable to the classification of data, and is executed in a Lie group machine learning development environment.
Step1, in this embodiment, the USPS data set is downloaded over the network; it comprises 10 classes, the digits 0-9, with 20,000 pictures in total, each an image of 32 × 32 (unit: pixel) size. The MNIST data set is also downloaded from the network; it comprises 10 classes, the digits 0-9, with 70,000 pictures in total, each an image of 28 × 28 (unit: pixel) size. The classification test is then carried out on the two data sets, each divided into a training data set and a test data set by a program written in the MATLAB language.
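The patent performs this split with a MATLAB program; a hedged NumPy equivalent of Step 1's random train/test partition is sketched below (the function name, test fraction and seed are illustrative assumptions, not the patent's code):

```python
import numpy as np

def split_dataset(X, y, test_fraction=0.3, seed=0):
    """Randomly partition a labelled data set into training and test
    sets, mirroring Step 1 of the method (done in MATLAB in the patent)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))               # shuffle sample indices
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# toy stand-in for a downloaded image data set
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2
X_tr, y_tr, X_te, y_te = split_dataset(X, y)
print(len(X_tr), len(X_te))  # -> 7 3
```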
It should be noted that the picture data in this embodiment have the following advantages: (1) the data volume is large and the classes are many, which is necessary for Lie group machine learning; (2) the sample images are diverse: the standard data sets used here cover a wide range of handwriting, and the images are strictly screened for different angles, illumination and definition, so the viewing angles of the images of each class differ considerably.
Step2, setting a neighbor parameter K value, wherein K in the method is 1,3,5,7,9,11,13 and 15;
step3, calculating a projection vector w of a training set according to an LDA (Linear Discriminant analysis) algorithm;
Given N training samples with d-dimensional features $\{x_1, x_2, \ldots, x_N\}$, of which N_1 training samples belong to class ω_1 and N_2 belong to class ω_2, N = N_1 + N_2.
The d-dimensional features are reduced in dimension while ensuring that the data feature information is not lost, i.e. the class of each sample can still be determined after dimension reduction. The optimal vector is denoted w (d-dimensional), and the projection of a training sample x (d-dimensional) onto w can be calculated as y = w^T x.
For simplicity and ease of understanding, first consider the case where the training sample x is two-dimensional. Intuitively, as shown in fig. 2, the circles and triangles represent two different classes of training samples; x is two-dimensional and comprises two feature values, x1 and x2. The straight line obtained is one capable of separating the two classes of training samples, and the line y = w^T x in fig. 2 separates training samples of different classes well. This is in fact the idea of LDA: maximize the between-class variance and minimize the within-class variance, i.e. reduce the differences within a class and broaden the differences between different classes.
The specific process of the quantitative analysis to find the optimal w is described below.
First, find the mean (center point) of each class of training samples, where i takes only two values (i = 1, 2):

$$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x.$$
Specifically, N_1 training samples belong to class ω_1 and N_2 training samples belong to class ω_2, N = N_1 + N_2, and μ_i represents the mean of the i-th class of training samples;
the mean of the sample points after projecting x onto w is given by

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i,$$

where the meaning of each symbol is as above; therefore, the projected mean value is the projection of the center point of the training samples.
The straight line that can make the two types of sample central points after projection separate as much as possible is the best straight line, and the quantitative expression is:
Figure BDA0002299461620000074
the larger the J (w), the better.
In practical applications, however, a large J(w) alone is not sufficient. As shown in fig. 3, when the sample points are uniformly distributed in ellipses, projecting onto the horizontal axis x1 achieves a large separation J(w) of the center points, but the sample points overlap on the x1 axis and cannot be separated. Projecting onto the vertical axis x2 gives a smaller J(w), yet the sample points can be separated. Therefore, the variance among the sample points must also be considered: the larger the variance, the more difficult it is to separate the sample points.
The dispersion of each projected class is measured by another quantity, called the scatter value, specifically

$$\tilde{S}_i^2 = \sum_{y \in Y_i} (y - \tilde{\mu}_i)^2.$$

The geometric meaning of the scatter value is the density of the sample points: the larger the value, the more dispersed they are, and vice versa the more concentrated.
In the present invention, it is necessary to separate different classes of sample points as well as possible while gathering similar samples as closely as possible, i.e. the larger the mean difference and the smaller the scatter value, the better. This is measured using J(w) and S through the metric formula

$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{S}_1^2 + \tilde{S}_2^2}.$$
according to the above formula, it is necessary to find w that maximizes J (w).
Expanding the hash value formula:
Figure BDA0002299461620000083
wherein order
Figure BDA0002299461620000084
I.e. a hash matrix.
Then, let Sw=S1+S2,SwCalled the Within-class dispersion degree matrix (Within-class scatter matrix). SB=(μ12)(μ12)T,SBReferred to as the inter-class dispersion matrix (Between-class scanner matrix).
J(w) is finally expressed as:

$$J(w) = \frac{w^T S_B w}{w^T S_W w}.$$
Take the derivative, normalizing the denominator before differentiating: if it is not normalized, w could be scaled by any factor and the formula would still hold, so w could not be determined. Therefore, in the present invention, let $\|w^T S_W w\| = 1$; after adding a Lagrange multiplier, differentiation gives

$$S_W^{-1} S_B w = \lambda w.$$

It follows that w is an eigenvector of the matrix $S_W^{-1} S_B$.
In particular, because $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w$, where the product of the latter two terms is a scalar constant, denoted λ_w,

$$S_W^{-1} S_B w = \lambda_w S_W^{-1}(\mu_1 - \mu_2) = \lambda w.$$

Since scaling w by any factor does not affect the result, the unknown constants λ and λ_w on both sides can be cancelled for simplicity, giving

$$w = S_W^{-1}(\mu_1 - \mu_2).$$

Only the means and scatter matrices of the original samples are needed to find the best w, as shown in fig. 4.
The above conclusions, although derived in 2 dimensions, hold for multiple dimensions as well. The eigenvector corresponding to the largest eigenvalue gives the best separation performance.
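The closed form w = S_W^{-1}(μ_1 − μ_2) derived above is straightforward to compute. A minimal NumPy sketch for the two-class case (the toy data, seed and unit normalization are ours, not the patent's):

```python
import numpy as np

def lda_projection_vector(X1, X2):
    """Two-class LDA as derived above: w = S_W^{-1} (mu1 - mu2),
    where S_W = S_1 + S_2 is the within-class scatter matrix."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)      # class-1 scatter matrix
    S2 = (X2 - mu2).T @ (X2 - mu2)      # class-2 scatter matrix
    Sw = S1 + S2
    w = np.linalg.solve(Sw, mu1 - mu2)  # avoids forming S_W^{-1} explicitly
    return w / np.linalg.norm(w)        # the scale of w is irrelevant, so normalise

# toy example: two clouds separated along the first feature axis
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(50, 2))
X2 = rng.normal([3, 0], 0.5, size=(50, 2))
w = lda_projection_vector(X1, X2)
print(abs(w[0]) > abs(w[1]))  # the discriminant direction is dominated by the separating axis
```

Using `np.linalg.solve` instead of an explicit inverse is the standard numerically stable way to apply S_W^{-1}.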
Step4, constructing a neighbor graph G (V, E) according to the training set;
A neighbor graph G(V, E) is constructed from the training set, where G represents the neighbor graph, V represents the nodes, i.e. the data samples, and E represents the edges connecting the data samples; the edge weight is determined by the formula

$$E(x_i, x_j) = \exp\left(-\frac{\sum_{l=1}^{m} \left(w^T (x_i^l - x_j^l)\right)^2}{t}\right),$$

where x^l represents the l-th feature vector of a training sample x; m is the number of feature vectors, whose value depends on the chosen data set — the feature vectors here are mainly the five of stroke, contour, intersection point, end point and gray level of the image, and their computation is prior art not detailed in the invention; x_i and x_j represent the i-th and j-th samples respectively, t is an arbitrary constant, and w is the projection vector.
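The edge-weight equation is printed as an image in the original, so its exact form is not recoverable from this extraction. The sketch below assumes a Gaussian (heat-kernel) weight over w-weighted squared feature differences, consistent with the symbols x_i, x_j, t and w defined in the text; the function name and the Gaussian form itself are assumptions, an illustration rather than the patent's formula:

```python
import numpy as np

def edge_weight(xi, xj, w, t=1.0):
    """Hedged sketch of the improved-Euclid edge weight: weight the
    per-feature squared differences by the LDA vector w, then apply a
    heat kernel of width t. (The exact formula in the patent is an
    image; this Gaussian form is an assumption.)"""
    d2 = float(w @ ((xi - xj) ** 2))  # w-weighted squared Euclid distance
    return np.exp(-d2 / t)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.3, 0.2])
print(edge_weight(xi, xj, w))  # identical samples -> weight 1.0
```

A weight of 1 for identical samples and a weight decaying toward 0 as the w-weighted distance grows is the behavior any such kernel must have, whatever the precise formula.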
Step5, for each data sample x in the test settextFinding data sample x from neighbor maptextK neighbors in the training set;
step6, return data sample xtextIs estimated value of
Figure BDA0002299461620000092
And the determination of the sample class is made.
The present invention treats the case where the objective function takes discrete values (a classification problem), i.e. the classification problem function can be described as f: R^n → V, where V = {v_1, v_2, …, v_s} denotes the set of data classes, corresponding to s classes. The estimate $\hat{f}(x_{test})$ of the KNN algorithm is the estimate of the class of the data sample x_test, i.e. the most common f value among the K training samples closest to x_test:

$$\hat{f}(x_{test}) = \arg\max_{v \in V} \sum_{i=1}^{K} \delta(v, f(x_i)),$$

where $\hat{f}(x_{test})$ is the final class of the data sample x_test, f(x_i) represents the classification problem function, x_i represents the i-th training sample, v represents the class corresponding to a training sample, and

$$\delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b. \end{cases}$$
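The argmax-over-δ vote of Step 6 can be sketched directly. Here a plain w-weighted Euclidean distance stands in for the patent's improved distance (whose exact formula is an image), and all names and the toy data are illustrative:

```python
import numpy as np

def knn_estimate(x_test, X_train, f_train, w, k, classes):
    """Return f_hat(x_test) = argmax_v sum_i delta(v, f(x_i)) over the
    k training samples nearest to x_test under a w-weighted distance
    (a stand-in for the patent's improved Euclid metric)."""
    d2 = ((X_train - x_test) ** 2) @ w  # w-weighted squared distances
    nearest = np.argsort(d2)[:k]        # indices of the k nearest neighbours
    # delta(a, b) = 1 if a == b else 0, summed per candidate class v
    scores = {v: int(np.sum(f_train[nearest] == v)) for v in classes}
    return max(scores, key=scores.get)  # the most common f value wins

X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0]])
f = np.array([0, 0, 1])
print(knn_estimate(np.array([0.1, 0.1]), X, f, np.array([1.0, 1.0]), k=3, classes=[0, 1]))  # -> 0
```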
table 1 shows the comparison of classification performance of the inventive method with the conventional KNN classification method on USPS datasets. As can be seen from the table, the classification accuracy of the method is obviously higher than that of the traditional KNN classification method.
TABLE 1 comparison of classification Performance of the inventive method with other methods on USPS datasets
[Table 1 appears as an image in the original publication and is not reproduced in this extraction.]
Table 2 shows the comparison of the classification performance of the inventive method with the conventional KNN classification method on the MNIST dataset. As can be seen from the table, the classification accuracy of the method is obviously higher than that of the traditional KNN classification method.
TABLE 2 comparison of classification performance of the method of the present invention on MNIST datasets with other methods
[Table 2 appears as an image in the original publication and is not reproduced in this extraction.]
With reference to figs. 5 and 6: fig. 5 is a classification performance diagram for the USPS data set according to the embodiment of the present invention, and fig. 6 for the MNIST data set. On the USPS data set the average classification accuracy of the method is 96%, while that of traditional KNN is 72%, i.e. the proposed method is 24 percentage points higher; on the MNIST data set the average classification accuracy is 95%, while traditional KNN achieves 88%, i.e. the proposed method is 7 percentage points higher. The statistical results show that the method of the invention is clearly superior to the traditional KNN method and has strong practicability.
The invention also provides an improved Euclid distance KNN classification system, which comprises the following modules:
the data set acquisition module is used for acquiring a data set from a database and dividing the data set into a test set and a training set;
the parameter setting module is used for setting a neighbor parameter K value;
the projection vector w solving module is used for solving the projection vector w of the training set according to the Linear Discriminant Analysis algorithm;
the neighbor graph constructing module is used for constructing a neighbor graph G (V, E) according to the training set, wherein G represents the neighbor graph, V represents a node, namely each training sample in the training set, and E represents an edge connecting each training sample;
the K neighbor search module, for finding, for each data sample x_test in the test set, the K neighbors of x_test in the training set from the neighbor graph;
the sample class determination module, for returning the estimated value $\hat{f}(x_{test})$ of the data sample x_test, where

$$\hat{f}(x_{test}) = \arg\max_{v \in V} \sum_{i=1}^{K} \delta(v, f(x_i)),$$

f(x_i) represents the classification problem function, x_i represents the i-th training sample, v represents the class corresponding to a training sample, V = {v_1, v_2, …, v_s} denotes the set of data classes, $\hat{f}(x_{test})$ is the final class of the data sample x_test, and

$$\delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b. \end{cases}$$
and setting K in the parameter setting module to be 1,3,5,7,9,11,13 and 15.
The projection vector w in the projection vector w solving module is calculated as follows.

Taking binary classification as an example, the optimal projection vector w is derived by quantitative analysis.

Given N training samples with d-dimensional features $\{x_1, x_2, \ldots, x_N\}$, first compute the mean, i.e. the center point, of each class of training samples, where $i = 1, 2$:

$\mu_i = \frac{1}{N_i} \sum_{x \in v_i} x$

Specifically, $N_1$ training samples belong to class $v_1$ and $N_2$ training samples belong to class $v_2$, with $N = N_1 + N_2$; $\mu_i$ denotes the mean of the i-th class of training samples.

The projection of a training sample x onto w is computed as $y = w^T x$, and the mean of the projected sample points of class i is:

$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in v_i} w^T x = w^T \mu_i$

Therefore, the projected mean is simply the projection of the class center point.

The best line is the one that separates the projected center points of the two classes as far as possible; quantitatively, maximize:

$|\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$

The scatter of each projected class is:

$\tilde{s}_i^2 = \sum_{x \in v_i} (w^T x - \tilde{\mu}_i)^2$

Finally, the projection vector w is evaluated through the criterion:

$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

According to the above formula, it suffices to find the w that maximizes J(w); the solution is as follows.

Expanding the scatter formula:

$\tilde{s}_i^2 = \sum_{x \in v_i} (w^T x - w^T \mu_i)^2 = w^T \Big[ \sum_{x \in v_i} (x - \mu_i)(x - \mu_i)^T \Big] w$

wherein we let

$S_i = \sum_{x \in v_i} (x - \mu_i)(x - \mu_i)^T$

which is the scatter matrix of class i. Then let $S_w = S_1 + S_2$, called the within-class scatter matrix, and $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, called the between-class scatter matrix.

J(w) is finally expressed as:

$J(w) = \frac{w^T S_B w}{w^T S_w w}$

To take the derivative, the denominator is first normalized: let $w^T S_w w = 1$; after adding a Lagrange multiplier $\lambda$ and setting the derivative to zero:

$S_B w = \lambda S_w w$, i.e. $S_w^{-1} S_B w = \lambda w$

It follows that w is an eigenvector of the matrix $S_w^{-1} S_B$.

In particular, since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w$, where the product of the latter two factors, $(\mu_1 - \mu_2)^T w$, is a scalar, denoted $\lambda_w$,

$S_w^{-1} S_B w = \lambda_w S_w^{-1}(\mu_1 - \mu_2) = \lambda w$

Since scaling w by any factor does not affect the result, the unknown constants $\lambda$ and $\lambda_w$ on both sides can be dropped for simplicity, giving

$w = S_w^{-1}(\mu_1 - \mu_2)$

Therefore, only the means and the within-class scatter matrix of the original training samples are needed to compute the optimal w.
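As a minimal NumPy sketch of the closed form this derivation arrives at, $w = S_w^{-1}(\mu_1 - \mu_2)$ (the function name `lda_projection` and the toy data are illustrative, not from the patent):

```python
import numpy as np

def lda_projection(X1, X2):
    # Two-class LDA closed form: w = Sw^{-1} (mu1 - mu2).
    # X1, X2: arrays of shape (N1, d) and (N2, d), one row per sample.
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Per-class scatter matrices S_i = sum_x (x - mu_i)(x - mu_i)^T
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2                            # within-class scatter matrix
    return np.linalg.solve(Sw, mu1 - mu2)   # solve Sw w = mu1 - mu2

# Two classes separated along the first axis:
X1 = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
X2 = X1 + np.array([5., 0.])
w = lda_projection(X1, X2)
print(w)  # points along the first axis: [-2.5, 0.]
```

`np.linalg.solve` is used instead of an explicit inverse of $S_w$, which is the numerically preferred way to apply $S_w^{-1}$ to a vector.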
In the neighbor graph constructing module, the weight of an edge in the neighbor graph is determined by the formula:

$E(x_i, x_j) = \exp\Big( -\frac{1}{t} \sum_{l=1}^{m} \big( w^T x_i^l - w^T x_j^l \big)^2 \Big)$

wherein $x^l$ denotes the l-th feature vector of a training sample x, $x_i$ and $x_j$ denote the i-th and j-th training samples respectively, m is the number of feature vectors, t is an arbitrary constant, and w is the projection vector obtained by the projection vector w solving module.
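The edge-weight formula itself survives in the source only as an image reference, so the sketch below assumes a heat-kernel weight over the m projected feature vectors, $\exp(-\sum_l (w^T x_i^l - w^T x_j^l)^2 / t)$ — consistent with the variables the text names ($x^l$, m, t, w) but an assumption, not the patent's verbatim formula; the helper name `edge_weight` is likewise illustrative:

```python
import numpy as np

def edge_weight(xi_feats, xj_feats, w, t=1.0):
    # Assumed heat-kernel edge weight over m projected feature vectors:
    # exp(-sum_l (w^T x_i^l - w^T x_j^l)^2 / t).
    d2 = sum(float(w @ (fi - fj)) ** 2 for fi, fj in zip(xi_feats, xj_feats))
    return float(np.exp(-d2 / t))

a = [np.array([1.0, 2.0]), np.array([0.0, 1.0])]
b = [np.array([1.0, 2.0]), np.array([0.0, 1.0])]
w = np.array([0.5, 0.5])
print(edge_weight(a, b, w))  # identical samples -> weight 1.0
```

Identical samples get the maximum weight 1.0, and the weight decays toward 0 as the projected feature distance grows, so nearer samples are connected by heavier edges.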
The specific implementation of each module corresponds to that of the respective step and is not described again here.
The above description covers only some embodiments of the present invention and is not intended to limit it; various modifications will be apparent to those skilled in the art. Any changes, equivalent substitutions or improvements made within the spirit and principle of the present invention shall fall within its scope. Note that like reference numerals and letters denote like items in the drawings; once an item is defined in one drawing, it need not be defined and explained again in subsequent drawings.

Claims (6)

1. A classification method based on an improved Euclid distance KNN is characterized by comprising the following steps:
step1, acquiring a data set from the database, and dividing the data set into a test set and a training set;
step2, setting a neighbor parameter K value;
step3, solving a projection vector w of a training set according to a Linear Discriminant Analysis algorithm;
step4, constructing a neighbor graph G (V, E) according to the training set, wherein G represents the neighbor graph, V represents a node, namely each training sample in the training set, and E represents an edge connecting each training sample;
in Step4, the weight of an edge in the neighbor graph is determined by the formula:

$E(x_i, x_j) = \exp\Big( -\frac{1}{t} \sum_{l=1}^{m} \big( w^T x_i^l - w^T x_j^l \big)^2 \Big)$

wherein $x^l$ denotes the l-th feature vector of a training sample x, $x_i$ and $x_j$ denote the i-th and j-th training samples respectively, m is the number of feature vectors, t denotes an arbitrary constant, and w denotes the projection vector obtained in Step3;
m takes the value 5, the feature vectors being the stroke, contour, intersection-point, end-point and gray-level feature vectors of the image;
Step5, for each data sample $x_{test}$ in the test set, finding the K neighbors of $x_{test}$ in the training set from the neighbor graph;

Step6, returning the estimated value of the data sample $x_{test}$:

$\hat{f}(x_{test}) = \arg\max_{v \in V'} \sum_{i=1}^{K} \delta(v, f(x_i))$

wherein

$\delta(a, b) = 1$ if $a = b$, and $\delta(a, b) = 0$ otherwise,

$f(x_i)$ denotes the class of the i-th training sample $x_i$, V denotes the class corresponding to a training sample, $V'$ denotes the set of data classes, and $\hat{f}(x_{test})$ is the final class of the data sample $x_{test}$.
2. The improved Euclid distance KNN classification method according to claim 1, characterized in that: the value range of K in Step2 is set to {1,3,5,7,9,11,13,15}.
3. The improved Euclid distance KNN classification method according to claim 1, characterized in that: the projection vector w in Step3 is calculated as follows.

Taking binary classification as an example, the optimal projection vector w is derived by quantitative analysis.

Given N training samples with d-dimensional features $\{x_1, x_2, \ldots, x_N\}$, first compute the mean, i.e. the center point, of each class of training samples, where $z = 1, 2$:

$\mu_z = \frac{1}{N_z} \sum_{x \in v_z} x$

Specifically, $N_1$ training samples belong to class $v_1$ and $N_2$ training samples belong to class $v_2$, with $N = N_1 + N_2$; $\mu_z$ denotes the mean of the class-z training samples.

The projection of a training sample x onto w is computed as $y = w^T x$, and the mean of the projected sample points of class z is:

$\tilde{\mu}_z = \frac{1}{N_z} \sum_{x \in v_z} w^T x = w^T \mu_z$

Therefore, the projected mean is simply the projection of the class center point.

The best line is the one that separates the projected center points of the two classes as far as possible; quantitatively, maximize:

$|\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$

The scatter of each projected class is:

$\tilde{s}_z^2 = \sum_{x \in v_z} (w^T x - \tilde{\mu}_z)^2$

Finally, the projection vector w is evaluated through the criterion:

$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

According to the above formula, it suffices to find the w that maximizes J(w); the solution is as follows.

Expanding the scatter formula:

$\tilde{s}_z^2 = \sum_{x \in v_z} (w^T x - w^T \mu_z)^2 = w^T \Big[ \sum_{x \in v_z} (x - \mu_z)(x - \mu_z)^T \Big] w$

wherein we let

$S_z = \sum_{x \in v_z} (x - \mu_z)(x - \mu_z)^T$

which is the scatter matrix of class z. Then let $S_w = S_1 + S_2$, called the within-class scatter matrix, and $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, called the between-class scatter matrix.

J(w) is finally expressed as:

$J(w) = \frac{w^T S_B w}{w^T S_w w}$

To take the derivative, the denominator is first normalized: let $w^T S_w w = 1$; after adding a Lagrange multiplier $\lambda$ and setting the derivative to zero:

$S_B w = \lambda S_w w$, i.e. $S_w^{-1} S_B w = \lambda w$

It follows that w is an eigenvector of the matrix $S_w^{-1} S_B$.

In particular, since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w$, where the product of the latter two factors, $(\mu_1 - \mu_2)^T w$, is a scalar, denoted $\lambda_w$,

$S_w^{-1} S_B w = \lambda_w S_w^{-1}(\mu_1 - \mu_2) = \lambda w$

Since scaling w by any factor does not affect the result, the unknown constants $\lambda$ and $\lambda_w$ on both sides can be dropped for simplicity, giving

$w = S_w^{-1}(\mu_1 - \mu_2)$

Therefore, only the means and the within-class scatter matrix of the original training samples are needed to compute the optimal w.
4. An improved Euclid distance KNN classification system is characterized by comprising the following modules:
the data set acquisition module is used for acquiring a data set from a database and dividing the data set into a test set and a training set;
the parameter setting module is used for setting a neighbor parameter K value;
the projection vector w solving module is used for solving a training set projection vector w according to the Linear Discriminant Analysis algorithm;
the neighbor graph constructing module is used for constructing a neighbor graph G (V, E) according to the training set, wherein G represents the neighbor graph, V represents a node, namely each training sample in the training set, and E represents an edge connecting each training sample;
in the neighbor graph constructing module, the weight of an edge in the neighbor graph is determined by the formula:

$E(x_i, x_j) = \exp\Big( -\frac{1}{t} \sum_{l=1}^{m} \big( w^T x_i^l - w^T x_j^l \big)^2 \Big)$

wherein $x^l$ denotes the l-th feature vector of a training sample x, $x_i$ and $x_j$ denote the i-th and j-th training samples respectively, m is the number of feature vectors, t denotes an arbitrary constant, and w denotes the projection vector obtained by the projection vector w solving module;

m takes the value 5, the feature vectors being the stroke, contour, intersection-point, end-point and gray-level feature vectors of the image;
the K neighbor search module is used for, for each data sample $x_{test}$ in the test set, finding the K neighbors of $x_{test}$ in the training set from the neighbor graph;

the sample class determination module is used for returning the estimated value of the data sample $x_{test}$:

$\hat{f}(x_{test}) = \arg\max_{v \in V'} \sum_{i=1}^{K} \delta(v, f(x_i))$

wherein

$\delta(a, b) = 1$ if $a = b$, and $\delta(a, b) = 0$ otherwise,

$f(x_i)$ denotes the class of the i-th training sample $x_i$, V denotes the class corresponding to a training sample, $V'$ denotes the set of data classes, and $\hat{f}(x_{test})$ is the final class of the data sample $x_{test}$.
5. the improved Euclid distance KNN classification system according to claim 4, characterized in that: the value range of K is set to be {1,3,5,7,9,11,13,15} in the parameter setting module.
6. The improved Euclid distance KNN classification system according to claim 4, characterized in that: the projection vector w in the projection vector w solving module is calculated as follows.

Taking binary classification as an example, the optimal projection vector w is derived by quantitative analysis.

Given N training samples with d-dimensional features $\{x_1, x_2, \ldots, x_N\}$, first compute the mean, i.e. the center point, of each class of training samples, where $z = 1, 2$:

$\mu_z = \frac{1}{N_z} \sum_{x \in v_z} x$

Specifically, $N_1$ training samples belong to class $v_1$ and $N_2$ training samples belong to class $v_2$, with $N = N_1 + N_2$; $\mu_z$ denotes the mean of the class-z training samples.

The projection of a training sample x onto w is computed as $y = w^T x$, and the mean of the projected sample points of class z is:

$\tilde{\mu}_z = \frac{1}{N_z} \sum_{x \in v_z} w^T x = w^T \mu_z$

Therefore, the projected mean is simply the projection of the class center point.

The best line is the one that separates the projected center points of the two classes as far as possible; quantitatively, maximize:

$|\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$

The scatter of each projected class is:

$\tilde{s}_z^2 = \sum_{x \in v_z} (w^T x - \tilde{\mu}_z)^2$

Finally, the projection vector w is evaluated through the criterion:

$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

According to the above formula, it suffices to find the w that maximizes J(w); the solution is as follows.

Expanding the scatter formula:

$\tilde{s}_z^2 = \sum_{x \in v_z} (w^T x - w^T \mu_z)^2 = w^T \Big[ \sum_{x \in v_z} (x - \mu_z)(x - \mu_z)^T \Big] w$

wherein we let

$S_z = \sum_{x \in v_z} (x - \mu_z)(x - \mu_z)^T$

which is the scatter matrix of class z. Then let $S_w = S_1 + S_2$, called the within-class scatter matrix, and $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, called the between-class scatter matrix.

J(w) is finally expressed as:

$J(w) = \frac{w^T S_B w}{w^T S_w w}$

To take the derivative, the denominator is first normalized: let $w^T S_w w = 1$; after adding a Lagrange multiplier $\lambda$ and setting the derivative to zero:

$S_B w = \lambda S_w w$, i.e. $S_w^{-1} S_B w = \lambda w$

It follows that w is an eigenvector of the matrix $S_w^{-1} S_B$.

In particular, since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w$, where the product of the latter two factors, $(\mu_1 - \mu_2)^T w$, is a scalar, denoted $\lambda_w$,

$S_w^{-1} S_B w = \lambda_w S_w^{-1}(\mu_1 - \mu_2) = \lambda w$

Since scaling w by any factor does not affect the result, the unknown constants $\lambda$ and $\lambda_w$ on both sides can be dropped for simplicity, giving

$w = S_w^{-1}(\mu_1 - \mu_2)$

Therefore, only the means and the within-class scatter matrix of the original training samples are needed to compute the optimal w.
CN201911215801.9A 2019-12-02 2019-12-02 Improved Euclid distance KNN classification method and system Active CN110929801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911215801.9A CN110929801B (en) 2019-12-02 2019-12-02 Improved Euclid distance KNN classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911215801.9A CN110929801B (en) 2019-12-02 2019-12-02 Improved Euclid distance KNN classification method and system

Publications (2)

Publication Number Publication Date
CN110929801A CN110929801A (en) 2020-03-27
CN110929801B true CN110929801B (en) 2022-05-13

Family

ID=69848393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911215801.9A Active CN110929801B (en) 2019-12-02 2019-12-02 Improved Euclid distance KNN classification method and system

Country Status (1)

Country Link
CN (1) CN110929801B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613184A (en) * 2020-12-29 2021-04-06 煤炭科学研究总院 Artificial intelligence method for judging distance of side slope collapse rockfall during earthquake occurrence
CN113162926B (en) * 2021-04-19 2022-08-26 西安石油大学 KNN-based network attack detection attribute weight analysis method

Citations (1)

Publication number Priority date Publication date Assignee Title
CN102208020A (en) * 2011-07-16 2011-10-05 西安电子科技大学 Human face recognition method based on optimal dimension scale cutting criterion

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
KR20030051554A (en) * 2003-06-07 2003-06-25 전명근 Face Recognition using fuzzy membership value
CN101673348B (en) * 2009-10-20 2012-05-09 哈尔滨工程大学 Human face recognition method based on supervision isometric projection
CN102073799A (en) * 2011-01-28 2011-05-25 重庆大学 Tumor gene identification method based on gene expression profile
CN103679207A (en) * 2014-01-02 2014-03-26 苏州大学 Handwriting number identification method and system
CN103854645B (en) * 2014-03-05 2016-08-24 东南大学 A kind of based on speaker's punishment independent of speaker's speech-emotion recognition method
CN107045621A (en) * 2016-10-28 2017-08-15 北京联合大学 Facial expression recognizing method based on LBP and LDA
CN107463920A (en) * 2017-08-21 2017-12-12 吉林大学 A kind of face identification method for eliminating partial occlusion thing and influenceing


Also Published As

Publication number Publication date
CN110929801A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN107368807B (en) Monitoring video vehicle type classification method based on visual word bag model
US8718380B2 (en) Representing object shapes using radial basis function support vector machine classification
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
CN108122008B (en) SAR image recognition method based on sparse representation and multi-feature decision-level fusion
Niu et al. Meta-metric for saliency detection evaluation metrics based on application preference
Rahtu et al. A new convexity measure based on a probabilistic interpretation of images
CN110942091B (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
JP4376145B2 (en) Image classification learning processing system and image identification processing system
EP2948877A1 (en) Content based image retrieval
Ling et al. How many clusters? A robust PSO-based local density model
CN107633065B (en) Identification method based on hand-drawn sketch
US9165184B2 (en) Identifying matching images
CN110008844B (en) KCF long-term gesture tracking method fused with SLIC algorithm
CN110929801B (en) Improved Euclid distance KNN classification method and system
Duin et al. Mode seeking clustering by KNN and mean shift evaluated
WO2015146113A1 (en) Identification dictionary learning system, identification dictionary learning method, and recording medium
CN111027609B (en) Image data weighted classification method and system
Chebbout et al. Comparative study of clustering based colour image segmentation techniques
CN111738319B (en) Clustering result evaluation method and device based on large-scale samples
CN103345621A (en) Face classification method based on sparse concentration index
Wang et al. SpecVAT: Enhanced visual cluster analysis
CN106980878B (en) Method and device for determining geometric style of three-dimensional model
CN110910497B (en) Method and system for realizing augmented reality map
CN113378620B (en) Cross-camera pedestrian re-identification method in surveillance video noise environment
CN107607723A (en) A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant