CN113792748B

CN113792748B - Method and device for creating diabetic retinopathy classifier based on feature extraction and double support vector machines

Info

Publication number: CN113792748B
Application number: CN202111366311.6A
Authority: CN
Inventors: 王天棋; 高慧; 孙艺; 王洲洋; 刘传昌; 高宇航; 龙中武; 徐懿
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2021-11-18
Filing date: 2021-11-18
Publication date: 2022-05-13
Anticipated expiration: 2041-11-18
Also published as: CN113792748A

Abstract

The invention provides a method and a device for establishing a diabetic retinopathy classifier based on feature extraction and a double-support vector machine, wherein the method comprises the steps of carrying out feature extraction on sample data, carrying out data optimization aiming at different tasks, controlling the membership of each sample point to a training set in a weighting mode by introducing the membership, and simultaneously balancing the constraint conditions of isolated points or noise points in training data by introducing relaxation variables so as to reduce errors caused by the isolated points or the noise points in the samples; furthermore, the cost is controlled in a weighting mode, a cost sensitive learning framework is used, and the cost is introduced in a fuzzy support vector machine through weighting, so that the error of expressing the data imbalance problem by an equation is reduced; further, by generating two independent and non-parallel hyperplanes, each hyperplane is brought close to one of the two categories while being far from the other, so that large-scale classification problems can be handled without the need for additional external optimizers.

Description

Method and device for establishing diabetic retinopathy classifier based on feature extraction and double support vector machines

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a device for creating a diabetic retinopathy classifier based on feature extraction and a double-support-vector machine.

Background

The traditional Support Vector Machine (SVM) algorithm has been successfully applied to the fields of image classification, bioinformatics, text classification and the like. In general, in order to solve the classification problem, a mathematical model is constructed by using a set of training examples to predict unknown class labels of test examples, so that the prediction result can maximally embody the good generalization capability of a classification algorithm.

When the number of instances representing one class is much smaller than the other classes, some imbalance data is generated in the samples. The classification of unbalanced data is also referred to as skewed class distribution. In the real classification process, the category with the least number of samples is usually the category in which researchers are most interested, and the situation of skewed distribution is also most likely to occur, and the characteristics of the data are as follows: small sample size, high inter-class overlap, small separation amount and the like. The existence of the data not only reduces the performance of the classifier and gradually increases the calculation complexity of the algorithm, but also causes the rapid rise of redundant data and the increase of noise and error mark data, so that the unbalanced classification condition occurs again, and an inertia cycle is formed in the classification process.

Disclosure of Invention

The embodiment of the invention provides a method and a device for creating a diabetic retinopathy classifier based on feature extraction and a double-support-vector machine, which are used for eliminating or improving one or more defects in the prior art and solving the problems of high outlier and poor training effect caused by small quantity of partial class instances.

The technical scheme of the invention is as follows:

the invention provides a method for creating a diabetic retinopathy classifier based on feature extraction and a double-support-vector machine, which comprises the following steps:

acquiring a training sample set, wherein the training sample set comprises a set number of classes, each class comprises a plurality of sample data, and each sample is provided with a class label; the sample data at least comprises images of retinal capillary aneurysm, bleeding spots, hard exudation, velveteen spots, venous beading, intraretinal microvascular abnormality and macular edema states, and corresponding states are marked as labels;

performing feature extraction on each sample data based on vector feature selection or matrix feature selection, wherein a lasso feature selection method is adopted for the vector-based feature selection, and an lr or p-norm-based feature selection method is adopted for the matrix-based feature selection;

introducing a fuzzy support vector machine, controlling the membership degree of each sample point to a training set in a weighting mode, and performing fuzzification processing on each sample data to reduce the membership degree of the isolated points and the noise points in the training sample set relative to the belonged classes; meanwhile, a relaxation variable is introduced into the fuzzy support vector machine, and a penalty factor is introduced based on a cost sensitive learning frame;

generating two independent and nonparallel hyperplanes based on the structure of the fuzzy support vector machine by using the training sample set after feature extraction, and enabling each hyperplane to be close to one of the two categories and to be far away from the other category at the same time so as to create a double classifier.

In some embodiments, a first set proportion of sample data in the training sample set is used to construct the dual classifier, and the sample data in the remainder of the training sample set is used to detect the accuracy of the classifier.

The invention has the beneficial effects that:

in the method and the device for establishing the diabetic retinopathy classifier based on the feature extraction and the double support vector machines, the method performs the feature extraction on sample data, performs the data optimization aiming at different tasks, controls the membership degree of each sample point to a training set by introducing the membership degree in a weighting mode, and simultaneously balances the constraint condition of a solitary point or a noise point in training data by introducing a relaxation variable so as to reduce the error caused by the solitary point or the noise point in the sample; furthermore, the cost is controlled in a weighting mode, a cost sensitive learning framework is used, and the cost is introduced in a fuzzy support vector machine through weighting, so that the error of expressing the data imbalance problem by an equation is reduced; further, by generating two independent and non-parallel hyperplanes, each hyperplane is brought close to one of the two categories while being far from the other, so that large-scale classification problems can be handled without the need for additional external optimizers.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a method for creating a diabetic retinopathy classifier based on feature extraction and a dual support vector machine according to an embodiment of the present invention;

FIG. 2 (a) is a graph of the effect of ordered regression based on the marginal sum strategy;

FIG. 2 (b) is a diagram of two example incremental support vector machines;

FIG. 3 is a graph of TSVM versus data set Australian under noisy and non-noisy conditions;

FIG. 4 is a graph of the results of the classifier's test on dataset Australian with and without noise according to the present invention;

FIG. 5 is a graph of TSVM versus data set Blood under noisy and non-noisy conditions;

FIG. 6 is a graph of the results of the classifier of the present invention testing a data set Blood under noisy and non-noisy conditions;

FIG. 7 is a graph of TSVM versus data set Heart under noisy and quiet conditions;

FIG. 8 is a graph of the results of the test of the classifier of the present invention on a data set Heart under noisy and non-noisy conditions;

FIG. 9 is an original drawing of diabetic retinopathy;

FIG. 10 is a hard exudate feature map obtained from feature extraction of FIG. 9;

FIG. 11 is a characteristic diagram of the microaneurysm of FIG. 9 obtained by feature extraction;

FIG. 12 is a characteristic diagram of the bleeding spot obtained from the feature extraction in FIG. 9;

FIG. 13 is a graph of TSVM versus accuracy of classification of diabetic retinopathy plots by the improved classifier of the present invention under noisy conditions;

FIG. 14 is a thermal map relating ozone levels;

FIG. 15 is a graph comparing TSVM and the improved classifier of the present invention to the accuracy of ozone water bisection detection under noisy conditions.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.

It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.

It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.

It is also noted herein that the term "coupled," if not specifically stated, may refer herein to not only a direct connection, but also an indirect connection in which an intermediate is present.

For the problem of performance reduction of the classifier caused by data imbalance when the number of partial sample instances is far smaller than that of other classes, some methods balance the distribution of the classes by resampling and constructing a data space, and perform priority processing in a preprocessing stage to reduce the result influenced by unbalanced data. However, such methods, while avoiding modifications to the learning algorithm, lack the assignment of weights, resulting in less than ideal newly constructed results. Some methods adopt a cost-sensitive learning mode, weights are distributed to different data samples according to importance, and the relationship between the samples and the membership degree of a cluster group is ignored. Some approaches employ ensemble learning, combining the outputs of several basic learners together to obtain a new classifier. However, although the learning effect is improved in this way, the overall calculation performance is often neglected.

Most of the above methods are based on local consideration, and although a certain range of problems can be solved, abnormal values still occur in many cases. Classification algorithms that deal with unbalanced data can present serious problems due to the presence of outliers. Training is poor because in the imbalance problem, a few data points may contain few samples and the weight of a few clustered data points may be high, resulting in outliers with higher weights. Although a square-loss function can be used to solve this problem, a linear equation is derived for both systems. However, if the similarity of two minority clusters is extremely high, the two non-parallel hyperplanes are extremely similar according to the two system equations and are likely to fall within the range represented by the linear equation of one system, so that the classification mode capability through the characteristics and the construction of the view and the hyperplane is weakened, the data noise is gradually increased, and subjective data input errors are inevitably caused. Meanwhile, the information for marking each sample is also seriously insufficient, which results in the generation of more serious abnormal values in the data, and finally, the performance of the classifier is gradually reduced, thereby forming an obstacle cycle.

The invention provides a classifier establishing method based on feature extraction and a double-support-vector machine, which comprises the following steps of S101-S103:

step S101: the method comprises the steps of obtaining a training sample set, wherein the training sample set comprises a set number of categories, each category comprises a plurality of sample data, and each sample is provided with a category label.

Step S102: and performing feature extraction on each sample data based on vector feature selection or matrix feature selection.

Step S103: introducing a fuzzy support vector machine, controlling the membership degree of each sample point to the training set in a weighting mode, and performing fuzzification processing on each sample data to reduce the membership degree of the isolated points and the noise points in the training sample set relative to the belonged classes; meanwhile, a relaxation variable is introduced into the fuzzy support vector machine, and a penalty factor is introduced based on a cost sensitive learning framework.

Step S104: two independent and nonparallel hyperplanes are generated based on the structure of the fuzzy support vector machine by utilizing the training sample set after feature extraction, and each hyperplane is close to one of two categories and is far away from the other category at the same time so as to create a double classifier.

In some embodiments, the sample data of the first set proportion in the training sample set is used for constructing a dual classifier, and the sample data of the rest part of the training sample set is used for detecting the precision of the classifier.

Specifically, the following describes a method for creating a diabetic retinopathy classifier based on feature extraction and a dual support vector machine, and the support vector machine obtained by improvement based on the present invention is labeled as PTSVM:

1. principle of classifier creation

1.1 analysis of ordered regression

To have

Order regression problem for individual classes, representing classes as consecutive integers

And the known sorting information is assigned with a value,

is shown as

The number of training samples for a class,

(ii) a First, the

A training sample is recorded

、

Wherein

Is to satisfy

The input space of (2). As figure 2 (a) shows the effect of ordered regression based on the marginal sum strategy,

and

are topologically parallel hyperplanes. When in use

When the support vector is located on the boundary between adjacent classes, pair

Maximization, and discrimination of the parallel hyperplane can be carried out. FIG. 2 (b) shows two example incremental support vector machines. It can be analyzed from the graph if a new sample is added (e.g. the sample is added

) Is a super plane

And hyperplane

The hyperplane in between, then the classifier needs to be adjusted.

After hypothesis training, a classifier is generated

As a weight, in

As a function of the coefficients:

when at the first

When the super-plane is parallel and topological, the method can obtain

Wherein

Wherein

Is as follows

A discrimination threshold for each hyperplane. Is provided with

The decision function is:

（1）

wherein the content of the first and second substances,

is the first

Personal identificationThe edge of the hyperplane, representing the shortest distance of the nearest sample from this hyperplane, passes through

Individual discrimination hyperplane and second

Or a first

And obtaining a class difference value. Based on marginal sum strategy, by

Maximizing the sum of all margins, and according to the intermediate constraints:

（2）

to obtain:

（3）

so that

（4）

（5）

（6）

（7）

（8）

Wherein the content of the first and second substances,

training sample

Mapping to a high-dimensional Reproduction Kernel Hilbert Space (RKHS) by a transformation function, and having a kernel function:

（9）

the inner product is expressed in RKHS. In addition to this, the present invention is,

is measured data

Non-negative relaxation variables, parameters, of the degree of misclassification of

Controlling the trade-off between error in the training samples and margin and maximization,

representing the relaxation variable.

1.2 introduction of relaxation variables

In combination with the analysis of 1.1, a relaxation variable is introduced, and a large number of constraints are involved in the processing of a large number of isolated points and noisy points in the process of dividing the hyperplane. Usually, these constraint conditions are of different types, and therefore these isolated points and noise points are in regions with variable ranges, a slack variable (residual variable) can be introduced in the process of normalizing linear or nonlinear programming, if equal to 0, the original state is converged, if greater than zero, constraint slack is determined, in the process of constraint condition judgment, equalisation is realized by adding (or subtracting) a new non-negative variable (namely, a slack variable) to the left of the inequality, and the initial coefficient of the slack variable in the objective function is zero.

The optimal representation for the hyperplane classification is:

（10）

wherein the content of the first and second substances,

in order to be a penalty factor,

in order to be a function of the relaxation variable,

which represents the degree of membership of the sample,

is a normal vector of the hyperplane,

is a dimension of the training set of fuzzy data,

is as follows

And identifies the edge of the hyperplane.

1.3 introduction of fuzzy support vector machine

In the traditional support vector machine algorithm, the hyperplane can be divided by effectively training each sample point. However, in the data set collected in actual situations, there are often isolated points or noise points with very small differences in the training samples, which can seriously affect the structure of the classifier, so that the classifier generates an overfitting phenomenon, resulting in a reduction in generalization performance of the classifier. In order to reduce the serious interference of the outlier isolated points or the noise points, a fuzzy support vector machine is introduced in combination with the membership function. The degree of membership of each sample point to the training set is controlled in a weighting mode, and the degree of membership of the isolated points and the noise points is obviously lower than that of other points. Thus, the total error of these solitons and noise will be significantly reduced.

Fuzzification processing is carried out on the sample data during classification, matrix transformation is carried out on the actually acquired data set, and a training set representing fuzzy data is as follows:

；（11）

wherein the content of the first and second substances,

，

，

(ii) a Then the optimal representation for the hyperplane classification is:

；（12）

wherein the content of the first and second substances,

in order to be a penalty factor,

in order to be a function of the relaxation variable,

which represents the degree of membership of the sample,

is a normal vector of the hyperplane,

is the dimension of the fuzzy data training set.

1.3 introducing cost control to account for sample skew

In the training phase of actual data samples, the noise and outlier nature of the isolated points can largely lead to an imbalance in the data set, but in most cases, it is unreasonable to express them in terms of equations. Therefore, on the basis of 1.3, the cost is controlled in a weighting mode, and the cost is introduced through weighting in a fuzzy support vector machine by using a cost sensitive learning framework, so that the error of the data imbalance problem represented by an equation is reduced. By positive and negative class penalty factors

And

to represent different costs, to represent the degree of contribution to different importance of misclassification of the data set, and to find the optimal hyperplane by solving the following problem.

Is used for training data sets

Of data points in dimensional space

Set of column vectors

Is shown as follows

An input mode is

Of 1 at

An output mode is

,

And

respectively positive and negative values; order to

A set of sample metrics is represented that are,

and

respectively representing a positive class sample index set and a negative class sample index set.

Then the hyperplane is optimally represented as

：

；（13）

Wherein the content of the first and second substances,

and

is a penalty factor that is a function of,

is a hyperplane normal vector and is,

is a relaxation variable;

calculating the performance of each weight coefficient by adopting an adaptive regression method

Using a weight

The loss caused by the solitary point or the noise point in the calculation process is represented by a penalty factor

When the sample tends to be changed to the limit, the sample is gradually driven to be classified into the correct hyperplane,

the degree, which is the correction to the initial sample, selects the correct sample population in successive iterations, expressed as:

（14）

wherein each weight coefficient

Corresponding to a weight value

Using penalty factor as weight

；

Combining equation (13), we obtain the following equation:

（15）

wherein the content of the first and second substances,

represents a hyperplane normal vector;

equations (10) and (15), eliminating the hyperplane normal vector yields the following equation:

（16）

elimination of the relaxation variables gives the following formula:

（17）

elimination of the penalty factor gives the following formula:

（18）

1.4 improved double support vector machine

The purpose of the dual support vector machine is to generate two independent and non-parallel hyperplanes. The distance between the hyperplane and the target class is different, and the control modes of relaxation variables, cost control and fuzzification provided in 1.2 are combined to break through the mode that all data are controlled by a constraint force, so that the problem of division of two secondary planning related classes with small scale and small change is solved, and the speed and the performance of the algorithm are improved.

If the binary classification problem is calculated in an n-dimensional real number space, the training data set is represented as:

（19）

show, is provided with

Let us order

Matrix A of (A) represents a positive class

In

A sample, an order

Matrix B of (a) represents a negative class

In

The number of the samples is one,

is a kernel matrix. The two separator hyperplanes are represented as:

（20）

（21）

（22）

order to

Is composed of

In the data set

The input of the order is carried out,

is a variance-like matrix;

in order to improve the marginal benefit and reduce the structural risk to the maximum extent, a regularization term is introduced into the formula, and then the optimization of the hyperplane is expressed as

：

； (23)

Wherein the content of the first and second substances,

、

is a regularization term;

further performing the following solution to find an optimal hyperplane:

（24）

elimination of the balance parameters gives:

（25）

（26）

wherein the content of the first and second substances,

、

if the weight vector is the weight vector, the weight vector is eliminated to obtain:

（27）

（28）

2. principle of algorithm

2.1 Linear Classification

For the binary classification problem, a linear loss function is introduced, and the original problem of the linear loss projection dual-support vector machine can be expressed as follows:

（29）

so that

（30）

（31）

（32）

So that

（33）

（34）

Is a positive parameter of the number of bits,

is the relaxation variable. The optimal value of the empirical risk, which can be derived from equations (17) and (18), is:

and

because of

And

which may be negative, and infinite problems may occur, so to balance the influence of each point on the projection class mean and to introduce the concept of a rough set, a weighted linear loss function with a weighting vector is introduced. The WLPTSVM formula is then given as follows.

（35）

So that

（36）

（37）

And

（38）

so that

（39）

（40）

Wherein:

（41）

（42）

is determined by the following equation:

（43）

（44）

wherein the content of the first and second substances,

and

is a parameter.

Before solving the problems of formulae (35) and (38), the problems of formulae (35) + (43) are geometrically explained, while the problems of formulae (38) and (44) are similar. For equation (35), the first term in the objective function is the hyperplane that controls the model complexity to find the optimal projection direction

. The second term in the objective function is to minimize the empirical risk, minimizing the intra-class variance of the projection samples of the own class. At the same time, the projection samples of the other category are as scattered as possible. In addition, the weight vector

The effect of each point will be balanced with the projection-like mean. During the training process, the control of the empirical risk needs to be compatible with the whole process. Therefore, from this viewpoint, the problems (35) and (38) of the sum are superior to those in the PTSVM.

To facilitate algorithm verification, the above problem can be solved by the following approximation algorithm. Considering the problem (35), and substituting the equality constraint into the objective function yields:

（45）

order:

（46）

（47）

then equation (45) is converted into

（48）

Will (48) pair w₁Is set to 0, one can obtain:

（49）

the solution to QPP (35) is obtained from the system of linear equations as follows.

（50）

Considering the problem (38), and substituting the equation constraint into the objective function yields:

（51）

（52）

order:

（53）

convert equation (51) to

（54）

Will be (54) opposite to

Setting the gradient of (c) to zero may result in:

（55）

then, a solution for QPP (38) can be obtained from the system of linear equations as follows.

（56）

In order to find suitable ones defined in (55) and (56)

And

and approximate solutions of problems (52) + (55) and (54) + (56), a weight setting method with two steps is constructed. In general, the first step is to solve the linear loss function problems (25) and (26) and find the corresponding

And

. The second step being using the compounds obtained

And

computing

And

then using the obtained

And

the solutions of the problems (29) and (32) are found and taken as the approximate solution required. The detailed algorithm is as follows:

step 2.1.1: given training input matrices A and B, let

，

With appropriate penalty parameters

In formulae (38) and (44)

And

。

step 2.1.2: from formulae (35) and (38)

And

calculating a relaxation variable

And

then obtained from the formulae (43) and (44)

And

wherein

，

。

Step 2.1.3: by using

and

Finding solutions of equations (38) and (44)

And

。

step 2.1.4: the decision is constructed as:

（57）

wherein the content of the first and second substances,

is an absolute value.

2.2 nonlinear Classification

For the non-linear classification problem, first define

And selects an appropriate kernel function

The weighted linear loss is then projected onto the twin support vector machine. The original problem of the non-linear version is represented as follows:

（58）

so that

（59）

（60）

（61）

So that

（62）

（63）

Wherein the content of the first and second substances,

and

are determined by equations (41) and (42).

The derivation process is similar to the linear case, assuming that

And

we can obtain solutions to problems (58) and (61) as follows.

（64）

（65）

Wherein the content of the first and second substances,

and

is a unit matrix, then a matrix

（66）

（67）

（68）

（69）

The specific algorithm is as follows:

step 2.2.1: given training input matrices A and B, let

，

With appropriate penalty parameters

Obtaining of the compounds of formulae (64) and (65)

、

And a kernel function K.

Step 2.2.2: from formulae (58) and (61)

And

calculating a relaxation variable

And

then obtained from the formulae (43) and (44)

And

wherein

，

。

Step 2.2.3: by using

And

finding solutions of equations (64) and (65)

And

。

step 2.2.4: constructing a decision as

（70）

Wherein the content of the first and second substances,

is an absolute value.

3. Comparison of the improved support vector machine of the present invention with a conventional twin vector machine

Twin Support Vector Machine (TSVM) is an effective classifier, especially for binary data, defined by the square of the norm distance in the objective lens. It is well known that norm distances are susceptible to outliers, which can lead to errors. The improved algorithm of the invention improves the precision of the test result under the condition of noise and no noise, and has robustness in multi-classification problems.

3.1 testing with dataset Australian

The number of samples of dataset Australian is 690, the data dimension is 14. Fig. 3 is generated by testing the original algorithm TSVM on the dataset Australian, fig. 3 is a description of the change of the classification accuracy of the original algorithm TSVM on the dataset Australian with the adjustment parameter, the red line in fig. 3 represents the change of the classification accuracy with the adjustment parameter under the noise-free condition, the optimal performance of the algorithm is only 77.81251%, the blue line represents the change of the classification accuracy with the adjustment parameter under the noise condition, and the optimal performance of the algorithm is only 95.625025%.

Fig. 4 is a result of a test of the improved algorithm on the dataset Australian of the present invention, which is a description of the change of the classification accuracy of the improved algorithm on the dataset Australian with the adjustment parameter, a red line in fig. 4 shows the change of the classification accuracy with the adjustment parameter under the noise-free condition, the optimal performance of the algorithm is improved to 93.12505%, a blue line shows the change of the classification accuracy with the adjustment parameter under the noise condition, the optimal performance of the algorithm is improved to 96.87505%, the accuracy of the algorithm under both conditions is higher than that of a graph represented by the conventional algorithm, and the improvement is more significant particularly under the noise condition. While under noisy conditions, optimal at C =0.5, the time complexity rate under noisy conditions is reduced compared to the red line of fig. 3 (optimal at C = 1).

3.2 testing with dataset Blood

The number of samples in the dataset Blood is 748 and the data dimension is 4. Fig. 5 is a graph generated by testing the original algorithm TSVM on the data set Blood, fig. 5 is a description of the classification accuracy of the original algorithm TSVM on the data set Blood as a function of the tuning parameter, a red line in the graph represents the classification accuracy as a function of the tuning parameter under a noise-free condition, the optimal performance of the algorithm is only 76.8751%, a blue line represents the classification accuracy as a function of the tuning parameter under a noise condition, and the optimal performance of the algorithm is only 93.4375%.

It can be seen that fig. 6 is generated by testing the improved algorithm on the data set Blood, and is a description of the change of the classification accuracy of the improved algorithm on the data set Blood with the adjustment parameter, a red line in the figure represents the change of the classification accuracy with the adjustment parameter under the noise-free condition, the optimal performance of the algorithm is improved to 87.8125%, a blue line represents the change of the classification accuracy with the adjustment parameter under the noise condition, the optimal performance of the algorithm is improved to 97.5%, the accuracy of the algorithm is higher than that of the graph represented by the conventional algorithm under both conditions, and the improvement is more significant particularly under the noise condition. Meanwhile, under the noiseless condition, the optimal time is reached when the C =0.0625, compared with the red line (the optimal time is reached when the C = 0.25) in the figure 6, under the noised condition, the optimal time is reached when the C =0.25, compared with the red line (the optimal time is reached when the C = 1) in the figure 5, and the time complexity rate under the two conditions is reduced.

3.3 testing with dataset her

The number of samples of the data set her is 270 and the data dimension is 13. Fig. 7 is a graph generated by testing the original algorithm TSVM on the data set Heart, and is a description of the change of the classification accuracy of the original algorithm on the data set Heart along with the adjustment parameter, the red line in the graph represents the change of the classification accuracy along with the adjustment parameter under the noise-free condition, the optimal performance of the algorithm is only 76.5625%, the blue line represents the change of the classification accuracy along with the adjustment parameter under the noise condition, and the optimal performance of the algorithm is only 93.75%.

Fig. 8 is a result of testing the improved algorithm on the data set Heart, and is a description of the change of the classification accuracy of the improved algorithm on the data set Heart along with the adjustment parameter, a red line in the figure represents the change of the classification accuracy along with the adjustment parameter under the noise-free condition, the optimal performance of the algorithm is improved to 88.4375%, a blue line represents the change of the classification accuracy along with the adjustment parameter under the noise condition, the optimal performance of the algorithm is improved to 98.75%, the accuracy of the algorithm is higher than that of the graph represented by the conventional algorithm under both conditions, and the improvement is more significant particularly under the noise condition. While under noisy conditions, optimal at C =0.0625, the time complexity rate under noisy conditions is reduced compared to the red line of fig. 1-2 (optimal at C = 0.125).

3.4 precision testing of each data set under noisy conditions using linear and Gaussian kernel functions

The method comprises the following steps of respectively adopting a traditional Support Vector Machine (SVM), a Twin Support Vector Machine (TSVM) and an improved algorithm classifier of the invention, testing data sets, namely Heart, Australian, Pima, Sonar, Spect, germ, Monk1, cancer, Ionodata, splice, cmc, flood and haberman based on a linear kernel function and a Gaussian kernel function, wherein the sample size and the data dimension of each data set are shown in a table 1:

TABLE 1

The test accuracy of the linear kernel function and the gaussian kernel function for each data set under noisy conditions is shown in tables 2 (a) and 2 (b):

TABLE 2 (a)

TABLE 2 (b)

Table 2 shows the accuracy and fluctuation degree of each data set test by the conventional Support Vector Machine (SVM), Twin Support Vector Machine (TSVM) and the classifier generated by the improved algorithm of the present invention, and it can be seen that through the comparison of the three algorithms, the improved algorithm of the present invention has improved accuracy, reduced standard deviation, and is more efficient and stable than the two conventional algorithms.

The invention is illustrated below with reference to specific examples:

example 1

A method for establishing a diabetic retinopathy classifier based on feature extraction and a double support vector machine is provided, and data distribution plays an important role in the classification work of diabetic retinopathy images. Lesions of retinal capillaries appear as aneurysms, bleeding spots, hard exudation, lint spots, venous beading, intraretinal microvascular abnormalities (IRMA), and macular edema. Diabetes can cause two types of retinopathy, proliferative and non-proliferative retinopathy. Diabetic retinopathy is one of the major blinding eye diseases, and when the abnormal pathological changes in the image have a very low rate and cannot be distinguished, the problem of data imbalance occurs, so that abnormal values and noise are continuously increased, and the speed and performance of classification are greatly influenced. The true cause of retinal capillary pathology is indistinguishable. In view of the above problems, the present embodiment proposes an improved diabetic retinopathy detection scheme using the classifier creation method provided in steps S101 to S104. Starting from the motivation and mathematical expression of classification, based on the relation of proliferative and non-proliferative retinopathy, combining the relation between three characteristics of hard effusion, microaneurysm and bleeding spot of pathological changes, applying two modes of vector-based characteristic selection and matrix-based characteristic selection, wherein the vector-based characteristic selection can adopt a lasso characteristic selection method, the matrix-based characteristic selection can adopt a lr and p-norm-based characteristic selection method, in addition, the fast classification method of the double-bounded support vector machine and the WRSVM are used for solving the characteristics of outlier generated in the imbalance problem, can construct a dual classifier for setting weight values in two steps aiming at the problem of nonlinear classification of diabetic retinopathy images, adopts a mode of feature selection to identify multiple tasks, the problem that the classification of the unbalanced distribution of the image data cannot be effectively processed is solved without any additional external optimizer.

Specifically, first, feature extraction is performed on sample data for a specific task, and as shown in the original image of diabetic retinopathy shown in fig. 9, data including hard exudate features shown in fig. 10, data including microaneurysm features shown in fig. 11, and data including bleeding spot features shown in fig. 12 can be extracted by feature selection.

Further, a classifier is constructed based on the following steps:

the first step is as follows: the method comprises the steps of constructing a mapping function from a training sample of diabetic retinopathy to a category, and then judging the shortest distance from a specific hyperplane to a lesion sample closest to the adjacent category in all parallel topologically superplanes, so that the edge of the hyperplane can be clearly divided; and based on a marginal sum strategy, maximizing the sum of all margins, and finally considering the constraint between all pathological changes and non-pathological change types to solve the main classification problem.

The second step is that: and introducing a fuzzy support vector machine, performing membership grade division on each diabetic retinopathy sample data, performing fuzzification processing on the diabetic retinopathy sample data, and finding out the hyperplane optimization problem of optimally dividing the lesion type if the original data set is changed into a fuzzy data training set.

The third step: in order to solve the problem that the proportion of the abnormal lesion in the image is too low in the second step, and the data set is unbalanced, the samples are weighted. And applying a cost sensitive learning framework to a support vector machine to obtain the following weighted optimization problem. Through the optimization problem, an optimal hyperplane for dividing the type of the diabetic retinopathy training sample is found.

The fourth step: in the third step, the large coefficient is estimated with bias, so it is not optimal. To solve the problem of sample skewing caused by abnormal diabetic retinopathy, a penalty factor is used for each coefficient. And the importance degree of the loss brought by the outliers is represented, and when the penalty factor is infinite, the hyperplane is forced to correctly divide all samples, so that the samples are degenerated into a hard interval classifier.

The fifth step: after introducing the linear loss function, in order to balance the influence of each diabetic retina sample point on the projection-like mean, we follow the concept of a rough set, introducing a weighted linear loss function with a weighting vector, and the first term in the objective function is to control the model complexity to find the optimal projection direction. The second term in the objective function is to minimize the empirical risk, which attempts to minimize the intra-class variance of the projection samples of one lesion class, while the projection samples of another lesion class are as scattered as possible. Furthermore, the weighting vector will balance the effect of each point with the projection-like mean. During the training process, the empirical risk also attempts to achieve the consistency required for lesion image data classification. And finally, constructing a weight setting method with two steps according to the relation among the three characteristics of hard exudates, microaneurysms and bleeding spots, and carrying out classification calculation.

The test results were as follows:

fig. 13 is generated by testing a conventional twin support vector machine algorithm TSVM and a classifier constructed according to this embodiment on a noisy diabetic retinopathy image dataset, and is a description of the classification accuracy varying with the training times, a red line in the graph represents the performance representation of the original algorithm TSVM, a blue line represents the performance representation of the improved algorithm according to the present invention, and it is obvious from the model accuracy that the improved algorithm is higher than the graph represented by the conventional algorithm and has robustness to random noise.

In the present embodiment, the following three features are used: hard exudates, microaneurysms and bleeding spots. This example tests a noisy data set to conclude that: the improved model is the best choice for diabetic retinopathy detection, providing better results in terms of accuracy and robustness to noise. The improved selection classifier of the present invention also helps to reduce the time consumption of the classification of diabetes and non-diabetes corresponding to the data set.

Example 2

The pollution flashover accident is a serious accident having a wide influence on the power grid. The accurate and timely classification of the pollution severity is a key to prevent pollution flashover accidents and is a technical difficulty. In the embodiment, the method in steps S101 to S104 is adopted, and a supervision classifier is created for detecting the ozone level on the basis of extracting relevant features by using principal component analysis, so as to solve the problem of poor generalization caused by insufficient labeled samples in industrial application. The ozone level detection data set is used, and the specific data form can be referred to a standard ozone level characteristic correlation heat map, and comprises two ground ozone level data sets, one is a peak value set (eighthr. data) of eight hours, and the other is a peak value set (onehr. data) of one hour. These data were collected in houston, garwinton and brasorlia from 1998 to 2004.

Constructing a classifier based on the following steps:

the first step is as follows: the method comprises the steps of (1) judging the shortest distance from a specific hyperplane to the nearest polluted sample in an adjacent category by constructing a mapping function from ozone level detection training samples to categories and considering all parallel topologically-possible hyperplanes, and then dividing and identifying the edge of the hyperplane; and then based on a marginal sum strategy, maximizing the sum of all margins, and then solving the main classification problem on the basis of considering the constraint among all ozone pollution degrees.

The second step is that: and introducing a fuzzy support vector machine, performing membership degree division on each ozone level detection training sample data, performing fuzzification processing on the ozone level detection training sample data, and changing an original data set into a fuzzy data training set to obtain a hyperplane optimization problem seeking the optimal division pollution degree.

The third step: in order to solve the problems that the proportion of the change of the abnormal ozone pollution value in the image is too low and the data set is unbalanced in the second step, the sample is weighted, and a cost-sensitive learning frame is applied to a support vector machine to obtain a weighted optimization problem. Through the optimization problem, an optimal hyperplane for dividing the ozone level detection training sample types is found.

The fourth step: in the third step, the large coefficients are estimated with bias and are therefore not optimal. In order to solve the problem that the sample is deflected due to an abnormal ozone pollution value, a penalty factor is used for each coefficient, the importance degree of loss brought by outliers is represented, when the penalty factor is infinite, a hyperplane is forced to divide all samples correctly, and the hard interval classifier is degraded.

The fifth step: after introducing the linear loss function, in order to balance the influence of each ozone level detection sample point on the projection class mean value, following the concept of a rough set, a weighted linear loss function with a weighting vector is introduced, and the first term in the objective function is to control the model complexity to find the optimal projection direction. The second term in the objective function is to minimize the empirical risk, which attempts to minimize the intra-class variance of the projection samples of the self-contamination class, while the projection samples of the other lesion class are as scattered as possible. Furthermore, the weighting vector will balance the effect of each point with the projection-like mean. During the training process, the empirical risk also attempts to achieve the consistency required for the classification of the contamination level data. And finally, according to the relation among characteristics of local ozone peak prediction headwind, headwind ozone background level, precursor emission related factors, highest temperature, basic temperature generated by net ozone, total solar radiation amount of one day, sunrise wind speed, midday wind speed and the like, a weight setting method with two steps is constructed, and classification calculation is carried out.

The test results were as follows:

in the standard ozone water level characteristic correlation thermal map of fig. 14, each color block represents the ratio of the number of pixels classified by each contamination severity level to the total number of sample pixels. Fig. 15 is a result of testing the classifier constructed by the TSVM algorithm and the present embodiment on the noisy ozone level detection data set, and is a description of the classification accuracy varying with the training times, where the red line in the graph represents the performance representation of the TSVM algorithm of the original algorithm, and the blue line represents the performance representation of the improved algorithm of the present invention, and it is obvious from the model accuracy that the improved algorithm of the present invention is higher than the graph represented by the conventional algorithm, and is robust to random noise.

To sum up, in the method and the device for creating the diabetic retinopathy classifier based on feature extraction and a dual support vector machine, the method performs feature extraction on sample data, performs data optimization for different tasks, controls the membership of each sample point to a training set in a weighting mode by introducing the membership, and reduces errors caused by the singular points or noise points in the sample by introducing constraint conditions of slack variables to balance the singular points or noise points in the training data; furthermore, the cost is controlled in a weighting mode, a cost sensitive learning framework is used, and the cost is introduced in a fuzzy support vector machine through weighting, so that the error of expressing the data imbalance problem by an equation is reduced; further, by generating two independent and non-parallel hyperplanes, each hyperplane is brought close to one of the two categories while being far from the other, so that large-scale classification problems can be handled without the need for additional external optimizers.

Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.

It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.

Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for creating a diabetic retinopathy classifier based on feature extraction and a double support vector machine is characterized by comprising the following steps:

acquiring a training sample set, wherein the training sample set comprises a set number of classes, each class comprises a plurality of sample data, and each sample is provided with a class label; the sample data at least comprises images of retinal capillary aneurysm, bleeding spots, hard exudation, cotton velvet spots, venous beading, intraretinal microvascular abnormality and macular edema;

generating two independent and nonparallel hyperplanes based on the structure of the fuzzy support vector machine by using the training sample set after feature extraction, and enabling each hyperplane to be close to one of two categories and to be far away from the other category at the same time so as to create a double classifier; wherein, controlling the membership degree of each sample point to the training set by a weighting mode comprises:

；

wherein the content of the first and second substances,

，

，

；

introducing relaxation variables in the fuzzy support vector machine, including:

introducing a relaxation variable in the process of standardizing linear or nonlinear programming, if the relaxation variable is equal to 0, converging to an original state, and if the relaxation variable is greater than zero, determining constraint relaxation;

then after introducing the fuzzy vector machine and the relaxation variable, the optimization expression of the hyperplane is shown as

：

Wherein the content of the first and second substances,

in order to be a penalty factor,

in order to be a function of the relaxation variable,

which represents the degree of membership of the sample,

is a normal vector of the hyperplane,

is a dimension of the training set of fuzzy data,

is as follows

A hyperplane edge;

introducing a penalty factor based on a cost sensitive learning framework, comprising:

by positive and negative class penalty factors

And

representing different costs, representing the contribution degrees of different importance to the data set error classification, and finding the optimal hyperplane by solving the following problems;

is used for training data sets

Of data points in dimensional space

Set of column vectors

Is shown as follows

An input mode is

Of 1 at

An output mode is

,

And

respectively positive and negative values; order to

A set of sample metrics is represented that are,

and

respectively representing a positive sample index set and a negative sample index set;

then the hyperplane is optimally represented as

：

Wherein the content of the first and second substances,

and

is a penalty factor that is a function of,

is a hyperplane normal vector and is,

is a relaxation variable;

the performance is calculated for each coefficient using an adaptive regression method

Using a weight

is the correction of the initial sample, and the correct sample population is selected in successive iterations as:

wherein each coefficient

Corresponding to a weight value

Using penalty factors as weights

；

Introduction of

To obtain the following formula:

wherein the content of the first and second substances,

represents a hyperplane normal vector;

bonding of

And

removing the hyperplane normal vector yields the following equation:

elimination of the relaxation variables gives the following formula:

elimination of the penalty factor gives the following formula:

；

generating two independent and non-parallel hyperplanes based on the structure of the fuzzy support vector machine using the training sample set, such that each hyperplane is close to one of the two classes while being far from the other, creating a dual classifier comprising:

for a computational binary classification problem in an n-dimensional real space, the training sample set is represented as:

；

wherein, it is provided with

Let us order

Matrix A of (A) represents a set of positive sample indices

In

A sample, an order

Matrix B of (A) represents a set of negative class sample indices

In

A sample is obtained; order to

For a kernel matrix, then the two hyperplanes are represented as:

；

；

；

order to

Is composed of

In the data set

The input of the order is carried out,

is a variance-like matrix;

：

；

Wherein the content of the first and second substances,

、

is a regularization term;

further performing the following solution to find an optimal hyperplane:

；

elimination of the balance parameters gives:

；

；

wherein the content of the first and second substances,

、

。

2. the method of claim 1 wherein a weighted linear loss function with weight vectors is introduced to create the dual classifiers.

3. The method for creating a classifier for diabetic retinopathy based on feature extraction and dual support vector machines according to claim 2, characterized in that for the binary classification problem, a linear loss function is introduced, and the original problem of the linear loss projection dual support vector machine is expressed as:

；

so that

；

；

；

So that

；

；

Wherein the content of the first and second substances,

is a positive parameter of the number of bits,

is a relaxation variable; the optimal value for empirical risk was determined as:

and

the coarse set and the weighted linear loss function with the weighted vector are referred to, and the following first problem is solved:

；

so that

；

；

And solving a second problem:

；

so that

；

；

Wherein:

；

；

is determined by the following equation:

wherein the content of the first and second substances,

and

is a parameter;

the first problem is considered and the equality constraint is substituted into the objective function by the following approximation algorithm:

order:

；

；

then

To convert to:

；

will be the above type pair

Is set to 0, resulting in:

；

obtaining a solution to the first problem from a system of linear equations:

；

wherein the content of the first and second substances,

is an identity matrix;

considering the second problem and substituting the equation constraint into the objective function yields:

；

order:

；

will be provided with

To convert to:

；

will be provided with

Relative to

Setting the gradient of (d) to 0 yields:

；

then, the solution to the second problem is obtained from a system of linear equations as:

。

4. the method of claim 3, wherein for non-linear classification problems, defining a classifier for diabetic retinopathy based on feature extraction and dual support vector machines

And determining a kernel function

Projecting the weighted linear loss onto the twin support vector machine, the original problem of the non-linear version is represented as the third problem:

；

so that

；

；

And the fourth problem:

；

so that

；

；

Determining based on the same derivation process as linear classification

And

then, obtaining solutions of the third problem and the fourth problem, which are respectively:

；

；

wherein

And

is a unit matrix, then a matrix

；

。

5. The method of feature extraction and dual support vector machine based diabetic retinopathy classifier creation as claimed in claim 3, further comprising:

and using the sample data of the first set proportion in the training sample set to construct the double classifier, and detecting the precision of the classifier by adopting the sample data of the rest part of the training sample set.

6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 5 when executing the program.

7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.