CN107729926B

CN107729926B - Data amplification method and machine identification system based on high-dimensional space transformation

Info

Publication number: CN107729926B
Application number: CN201710899032.3A
Authority: CN
Inventors: 赵凤军; 吴斌; 贺小伟; 侯榆青; 易黄建; 曹欣; 王宾
Original assignee: Northwestern University
Current assignee: Northwestern University
Priority date: 2017-09-28
Filing date: 2017-09-28
Publication date: 2021-07-13
Anticipated expiration: 2037-09-28
Also published as: CN107729926A

Abstract

The invention belongs to the technical field of image processing and machine learning and discloses a data amplification method and a machine identification system based on high-dimensional space transformation. Background sample data is transformed from the original space to a high-dimensional space; the distribution of the target samples in the high-dimensional space is derived from the distribution histogram of the background samples, and high-dimensional target sample data is generated from it; a system of equations built from the distance function is then solved to transform the amplified data from the high-dimensional space back to the original space. By learning the distribution histogram of the negative samples, the invention amplifies the corresponding positive sample data set, which addresses the imbalance between positive and negative sample data in the machine learning model and improves classification performance, in particular the classification precision of positive samples. Because the distribution of the target sample data to be generated is obtained by statistical analysis of the background samples, the amplified data is more effective, and the sample overlapping and model overfitting that arise in the prior art when new target samples are synthesized from a small number of samples are avoided.

Description

Data amplification method and machine identification system based on high-dimensional space transformation

Technical Field

The invention belongs to the technical field of image processing and machine learning, and particularly relates to a data amplification method and a machine identification system based on high-dimensional space transformation.

Background

Machine learning studies how machines recognize existing knowledge and acquire new knowledge and skills, and it has been widely applied in fields such as image recognition, data mining and fault diagnosis. Machine learning first requires sample data to be processed and used for training. In practice, sample data sets are often unbalanced: the number of negative samples is usually much greater than the number of positive samples, and training on such a data set degrades the classification performance of the classifier. For example, in vascular plaque identification, vessels with plaque make up only a small fraction of a vascular system sample, while most vessels are healthy. A classifier trained on such a sample has low accuracy: a normal vessel may be identified as a vessel with plaque, so that the patient's condition is misjudged, or a vessel with plaque may be identified as a normal vessel, so that treatment is delayed. Classifying unbalanced data correctly, and thereby improving classification accuracy, is therefore of great significance to this research field. At present there are two main approaches to handling an unbalanced data set: from the data perspective, the data set is balanced by sampling or amplifying the samples under study; from the algorithm perspective, the algorithm is improved to raise the performance of the classifier.
Traditional methods for handling an unbalanced data set from the data perspective fall into two groups. The first is sampling: the negative samples are subsampled until their number equals that of the original positive samples. This loses the information carried by the samples that are not selected; when the negative samples far outnumber the positive samples, most of the information in the data is lost and the number of samples taking part in training is seriously insufficient. The second is to increase the number of positive samples by data amplification, which analyzes the target samples and artificially synthesizes new samples from them to balance the data set, for example by simply copying positive samples, adding noise to them, or rotating or flipping them. Such simple data amplification easily causes sample overlapping and model overfitting and makes the model harder to train. To improve on simple data amplification, researchers have proposed new amplification algorithms. The SMOTE algorithm balances the data set by synthesizing new samples through linear interpolation between neighboring positive samples. It generates new samples for every positive sample and mitigates model overfitting, but it easily causes sample overlapping; it also ignores the influence that samples close to the classification boundary and isolated points have on the classification performance of the target samples, so the synthesis of new samples is somewhat blind. The BSMOTE algorithm builds on SMOTE: a nearest-neighbor algorithm classifies the target samples into noise samples, internal samples (samples far from the classification boundary) and boundary samples, and new samples are synthesized from the target samples on the classification boundary.
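For comparison with the method disclosed here, the linear interpolation that SMOTE-style algorithms rely on can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the reference SMOTE implementation: the function name `smote_like_interpolation`, the neighbor count `k`, and the demo points are hypothetical.

```python
import numpy as np

def smote_like_interpolation(X_pos, n_new, k=3, seed=0):
    """Synthesize n_new points on segments between positive samples and
    one of their k nearest positive neighbors (requires k < len(X_pos))."""
    rng = np.random.default_rng(seed)
    X_pos = np.asarray(X_pos, dtype=float)
    # Pairwise squared distances between positive samples
    diff = X_pos[:, None, :] - X_pos[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_pos), size=n_new)        # seed sample per new point
    nb = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_pos[base] + gap * (X_pos[nb] - X_pos[base])

# Hypothetical positive samples on the unit square
X_pos_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
new_points = smote_like_interpolation(X_pos_demo, n_new=10)
```

Every synthetic point lies on a segment between two existing positive samples, which is why such methods cannot place samples outside the region already covered by the positives and tend to produce overlapping samples.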

In summary, the problems of the prior art are as follows: because new samples are synthesized from an analysis of the target samples alone, sample overlapping and neglect of boundary and isolated points easily occur. These limitations of the training samples make the classifier inaccurate and limit the achievable improvement in target-sample classification performance; for example, sample overlapping may cause model overfitting, and neglecting boundary and isolated points may cause those sample points to be misclassified.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a data amplification method and a machine identification system based on high-dimensional space transformation.

The invention is realized as follows. The data amplification method based on high-dimensional space transformation transforms the background sample data from the original space to a high-dimensional space; obtains the distribution of the target samples in the high-dimensional space from the distribution histogram of the background samples and generates high-dimensional target sample data; and solves a system of equations built from the distance function to transform the amplification data from the high-dimensional space to the original space.

Further, the data amplification method based on the high-dimensional space transformation comprises the following steps:

Step one: divide the data samples into positive samples and negative samples, where the positive samples are the target samples and the negative samples are the background samples; for each background sample, calculate the squared Euclidean distance to all background samples to obtain the high-dimensional space transformation of the background samples, so that the background sample data is transformed from the original space to the high-dimensional space;

Step two: compute the histogram of the high-dimensional background samples in each dimension and normalize the distribution of the sample data in each dimension; take the complement of the normalized background histogram to obtain the histogram distribution of the target samples in each dimension, and normalize it to obtain the probability distribution of the target samples; from the probability distribution in each dimension, determine the number of sample points to be generated and their value range; generate preliminary target sample data from the probability distribution of each dimension, and randomly shuffle the values within each dimension to generate the target sample data of the high-dimensional space;

Step three: take the distance between a background sample point and a generated target sample point as the distance function, and use it to obtain a system of distance function equations between the background sample points and one data point of the amplification data; take the difference of each pair of adjacent equations in the system, then transpose terms and combine coefficients to obtain a non-homogeneous system of linear equations in that data point; generalize the solution for one point to all points of the data to be generated to obtain a matrix equation in the low-dimensional amplification data, solve the matrix equation, and thereby transform the amplification data from the high-dimensional space to the original space to obtain the amplified target sample data.

Further, transforming the background sample data from the original space to the high-dimensional space in the first step specifically includes:

(1) dividing the original data into research samples and background samples, where the number of background samples is $N$ and the background sample points are $x_{01}, x_{02}, \ldots, x_{0n}, \ldots, x_{0N}$; each sample point comprises $Q$-dimensional data, and the $i$-th sample is the row vector $x_{0i} = [x_{0i1}, x_{0i2}, \ldots, x_{0iq}, \ldots, x_{0iQ}]$;

(2) for each background sample data point $x_{0i}$, calculating the squared Euclidean distance to all background sample data points to obtain $d_{i,1}, d_{i,2}, \ldots, d_{i,n}, \ldots, d_{i,N}$, where

$$d_{i,n} = \|x_{0i}-x_{0n}\|_2^2 = (x_{0i1}-x_{0n1})^2 + (x_{0i2}-x_{0n2})^2 + \cdots + (x_{0iq}-x_{0nq})^2 + \cdots + (x_{0iQ}-x_{0nQ})^2 \quad (1 \le i \le N,\ 1 \le n \le N)$$

and $\|x_{0i}-x_{0n}\|_2$ denotes the L2 norm of $(x_{0i}-x_{0n})$; this finally gives the $N$-dimensional space sample data of the background samples:

$$\begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,N} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N,1} & d_{N,2} & \cdots & d_{N,N} \end{bmatrix}$$
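Step one amounts to building an N × N matrix of pairwise squared Euclidean distances between background samples. A minimal NumPy sketch, assuming the background samples are stacked as the rows of an (N, Q) array; the function name `squared_distance_matrix` and the demo data are illustrative and not taken from the patent:

```python
import numpy as np

def squared_distance_matrix(X0):
    """Return D with D[i, n] = ||X0[i] - X0[n]||_2^2 for an (N, Q) array X0."""
    X0 = np.asarray(X0, dtype=float)
    sq_norms = np.sum(X0 ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, evaluated for all pairs at once
    D = sq_norms[:, None] - 2.0 * (X0 @ X0.T) + sq_norms[None, :]
    return np.maximum(D, 0.0)   # clip tiny negative values caused by round-off

# Three hypothetical background samples with Q = 2 features
X0_demo = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D_demo = squared_distance_matrix(X0_demo)
```

Row i of the result is the N-dimensional representation of background sample i, so each column can be read as one dimension of the high-dimensional space.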

Further, generating the target sample data of the high-dimensional space in step two specifically includes:

(1) computing, dimension by dimension, the histograms of the $N$ data in the high-dimensional space transformation of the background samples, dividing each dimension of the data into $h$ equal intervals;

(2) counting the samples in each interval, denoted $y_t$, where $y_t$ is a row vector holding the sample count of each interval of the $t$-th dimension data in the high-dimensional space transformation of the background samples, and normalizing by dividing the interval sample counts $y_t$ of that dimension by the maximum sample count over all intervals, $y_t' = y_t / \max(y_t)$;

(3) complementing and normalizing the normalized interval sample counts $y_t'$ to obtain the probability distribution $p_t$ of the target samples;

(4) calculating the number of target sample data points to be generated in each interval of that dimension, $k_t = M \times p_t$, where $k_t$ is a row vector holding the number of data points to generate in each interval of the $t$-th dimension and $M$ is the number of data points to be generated; randomly generating $k_t$ data points in each interval according to a uniform distribution, and recording the generated target sample data as $l_{1,t}, l_{2,t}, \ldots, l_{m,t}, \ldots, l_{M,t}$;

(5) performing the above process on every dimension of the sample data in the high-dimensional space transformation of the background samples to generate every dimension of the high-dimensional sample data of the $M$ data points to be amplified, and randomly shuffling the sample data within each dimension to obtain the high-dimensional space sample data of the amplified data:

$$\begin{bmatrix} l_{1,1} & l_{1,2} & \cdots & l_{1,N} \\ l_{2,1} & l_{2,2} & \cdots & l_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ l_{M,1} & l_{M,2} & \cdots & l_{M,N} \end{bmatrix}$$
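The per-dimension procedure of step two can be sketched as follows. Several details are assumptions made for illustration because the text does not spell them out: the complement is taken as 1 minus the max-normalized count and then divided by its sum; the counts M × p are rounded to integers that sum to M by a largest-remainder rule; and a flat histogram falls back to a uniform distribution. The function names are hypothetical.

```python
import numpy as np

def sample_dimension(d_t, M, h, rng):
    """Generate M values for one dimension from the complemented histogram."""
    counts, edges = np.histogram(d_t, bins=h)   # y_t over h equal-width intervals
    y_norm = counts / counts.max()              # divide by the maximum count
    comp = 1.0 - y_norm                         # assumed form of the complement
    total = comp.sum()
    # Assumption: a flat histogram (all complements zero) falls back to uniform
    p = comp / total if total > 0 else np.full(h, 1.0 / h)
    exact = M * p                               # k_t = M * p_t, generally not integer
    k = np.floor(exact).astype(int)
    leftover = int(M - k.sum())
    # Assumption: hand leftover points to the largest fractional parts
    k[np.argsort(exact - k)[::-1][:leftover]] += 1
    values = np.concatenate(
        [rng.uniform(edges[j], edges[j + 1], size=k[j]) for j in range(h)]
    )
    rng.shuffle(values)                         # random shuffle within the dimension
    return values

def generate_high_dim_targets(D, M, h=10, seed=0):
    """Apply sample_dimension to every column of D; returns an (M, N) array."""
    rng = np.random.default_rng(seed)
    return np.column_stack(
        [sample_dimension(D[:, t], M, h, rng) for t in range(D.shape[1])]
    )

# Hypothetical stand-in for the high-dimensional background data
D_demo = np.abs(np.random.default_rng(1).normal(size=(40, 5)))
L_demo = generate_high_dim_targets(D_demo, M=30, h=8)
```

Under these assumptions, the interval holding the most background samples receives no generated points, while sparsely populated intervals receive the most.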

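Further, transforming the amplification data from the high-dimensional space to the original space in step three specifically includes:

(1) denoting the $M$ amplified sample points as $x_1, x_2, \ldots, x_m, \ldots, x_M$, where each sample point comprises $Q$-dimensional data and the $i$-th sample is the row vector $x_i = [x_{i1}, x_{i2}, \ldots, x_{iq}, \ldots, x_{iQ}]$; from the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the $m$-th generated target sample point and $x_{0n}$ is the $n$-th background sample point, the system of distance function equations between the background sample points and the amplification data is obtained:

$$\|x_m - x_{0n}\|_2^2 = l_{m,n}, \qquad n = 1, 2, \ldots, N$$

(2) expanding the quadratic terms of the system of distance function equations and taking the difference between the $n$-th and the $(n+1)$-th equations, where $1 \le n \le N-1$, gives a linear equation for the $m$-th data point of the amplification data:

$$\bigl(\|x_{0n}\|_2^2 - 2\,x_{0n}x_m^{\mathsf T}\bigr) - \bigl(\|x_{0,n+1}\|_2^2 - 2\,x_{0,n+1}x_m^{\mathsf T}\bigr) = l_{m,n} - l_{m,n+1}$$

after transposing terms and combining coefficients, the following is obtained:

$$2\,(x_{0,n+1} - x_{0n})\,x_m^{\mathsf T} = (l_{m,n} - l_{m,n+1}) + \bigl(\|x_{0,n+1}\|_2^2 - \|x_{0n}\|_2^2\bigr)$$

writing the system of equations as a matrix equation:

$$2\begin{bmatrix} x_{02} - x_{01} \\ \vdots \\ x_{0N} - x_{0,N-1} \end{bmatrix} x_m^{\mathsf T} = \begin{bmatrix} l_{m,1} - l_{m,2} \\ \vdots \\ l_{m,N-1} - l_{m,N} \end{bmatrix} + \begin{bmatrix} \|x_{02}\|_2^2 - \|x_{01}\|_2^2 \\ \vdots \\ \|x_{0N}\|_2^2 - \|x_{0,N-1}\|_2^2 \end{bmatrix}$$

a point $x_m$ of the amplification data is calculated by solving this matrix equation;

(3) generalizing the calculation of one point $x_m$ of the amplification data to all $M$ points gives a matrix equation for the data points to be generated:

$$AX = B + C$$

where $A$ is the coefficient matrix on the left-hand side above (including the factor 2), $X = [x_1^{\mathsf T}, x_2^{\mathsf T}, \ldots, x_M^{\mathsf T}]$, $B$ is the $(N-1) \times M$ matrix whose entry in row $n$ and column $m$ is $l_{m,n} - l_{m,n+1}$, and $C = [c, c, \ldots, c]$ repeats $M$ times the column vector $c$ whose $n$-th entry is $\|x_{0,n+1}\|_2^2 - \|x_{0n}\|_2^2$.

Solving the above system of equations gives the unknown $X = A^{-1}(B + C)$, where $A^{-1}$ denotes the pseudo-inverse of the matrix $A$; the resulting data are the amplification data points, and the amplification data are thereby transformed from the high-dimensional space to the original space.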
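The back-transformation can be checked numerically. Subtracting adjacent distance equations cancels the quadratic term in x_m and leaves 2(x_{0,n+1} − x_{0n})·x_m = (l_{m,n} − l_{m,n+1}) + (‖x_{0,n+1}‖² − ‖x_{0n}‖²), which the sketch below stacks into AX = B + C and solves with the pseudo-inverse. The sign convention and the split of the right-hand side into B and C are one consistent choice assumed here; the function name `back_transform` and the demo data are illustrative.

```python
import numpy as np

def back_transform(X0, L):
    """Recover X (M, Q) from squared distances L (M, N) to background X0 (N, Q)."""
    X0 = np.asarray(X0, dtype=float)
    L = np.asarray(L, dtype=float)
    A = 2.0 * (X0[1:] - X0[:-1])                   # (N-1, Q): 2 (x_{0,n+1} - x_{0n})
    B = (L[:, :-1] - L[:, 1:]).T                   # (N-1, M): l_{m,n} - l_{m,n+1}
    sq = np.sum(X0 ** 2, axis=1)
    c = sq[1:] - sq[:-1]                           # ||x_{0,n+1}||^2 - ||x_{0n}||^2
    C = np.repeat(c[:, None], L.shape[0], axis=1)  # C = [c, c, ..., c]
    X = np.linalg.pinv(A) @ (B + C)                # (Q, M), least-squares solution
    return X.T

# Consistency check with exact distances from hypothetical points
rng = np.random.default_rng(0)
X0_demo = rng.normal(size=(20, 3))                 # N = 20 background samples, Q = 3
X_true = rng.normal(size=(5, 3))                   # M = 5 points to recover
L_exact = ((X_true[:, None, :] - X0_demo[None, :, :]) ** 2).sum(axis=2)
X_rec = back_transform(X0_demo, L_exact)
```

With exact distances the points are recovered up to round-off. Distances generated dimension by dimension from histograms are generally not mutually consistent, in which case the pseudo-inverse returns the least-squares solution.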

Another object of the present invention is to provide a machine recognition system using the data amplification method based on high-dimensional spatial transformation.

Another object of the present invention is to provide an image recognition system using the data augmentation method based on high-dimensional spatial transformation.

The advantages and positive effects of the invention are as follows. In a machine learning model, a classifier trained on the original samples has low classification performance because the number of positive samples is insufficient. By learning the distribution histogram of the negative samples, the invention amplifies the corresponding positive sample data set, which addresses the imbalance between positive and negative sample data, improves classification performance, and in particular greatly improves the classification precision of positive samples. The invention performs statistical analysis on the background samples (negative samples) to obtain the data distribution of the target samples (positive samples) to be generated and then generates the target samples. This addresses the neglect of boundary and isolated points when target samples are generated by traditional methods, improves the validity of the amplified data, and avoids the sample overlapping and model overfitting that arise when traditional methods synthesize new target samples from a small number of samples.

Drawings

FIG. 1 is a flowchart of a method for amplifying data based on high-dimensional spatial features according to an embodiment of the present invention.

Fig. 2 is a region selection diagram of sample spatial feature extraction according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating an embodiment of generating amplification data in a data amplification method based on high-dimensional spatial transformation according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

By learning the distribution histogram of the negative samples, the invention amplifies the corresponding positive sample data set and addresses the imbalance between positive and negative sample data in the machine learning model. Statistical analysis of the background samples (negative samples) yields the data distribution of the target samples (positive samples) to be generated, from which the target samples are generated, improving the validity of the amplified data.

The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.

As shown in fig. 1, the data amplification method based on high-dimensional spatial transformation according to the embodiment of the present invention includes the following steps:

S101: preprocess the samples and transform the background sample data from the original space to a high-dimensional space;

S102: perform histogram statistics and analysis on the high-dimensional background sample data, obtain the distribution of the target samples in the high-dimensional space, and generate high-dimensional target sample data;

S103: perform the equation-set transformation using the distance function and transform the amplification data to the original space.

The application of the principles of the present invention will now be described in further detail with reference to the accompanying drawings.

As shown in fig. 3, the data amplification method based on high-dimensional spatial transformation provided in the embodiment of the present invention specifically includes the following steps:

(1) preprocessing the samples and transforming the background sample data from the original space to the high-dimensional space;

(1a) the data used in this example are cross-sectional images of blood vessels in the human vascular system, taken perpendicular to the centerline;

(1b) selecting normal blood vessel cross-section images as background samples and blood vessel plaque cross-section images as target samples; the number of background samples is recorded as $N$, and the background sample points are $x_{01}, x_{02}, \ldots, x_{0n}, \ldots, x_{0N}$;

(1c) as shown in Fig. 2, taking the center point of the current background sample as the circle center, sampling on circles at distances of 1, 3 and 5 voxels from the center point, with sampling angles of 90 degrees, 45 degrees and 30 degrees respectively from the innermost circle outward, to obtain 24 sampling regions;

(1d) extracting the features of the background sample: the average gray value of each region is the mean gray value of all voxels in the region, giving 24 features $[x_{0i1}, x_{0i2}, \ldots, x_{0i24}]$, where $i$ denotes the $i$-th background sample; the average curvature of each region is calculated and recorded as the curvature feature of the region, giving 24 features $[x_{0i25}, x_{0i26}, \ldots, x_{0i48}]$; texture features are obtained from 90 filtered texture maps produced by two-dimensional Gabor filtering, giving the feature vector $[x_{0i49}, x_{0i50}, \ldots, x_{0i72}]$; the Hessian matrix of each point is calculated to obtain three eigenvalues representing the direction of the point, giving the feature vector $[x_{0i73}, x_{0i74}, \ldots, x_{0i144}]$;

(1e) applying the above sampling scheme to every background sample and calculating its feature vector, so that each background sample point consists of four types of features with dimension $Q = 144$, and the $i$-th sample is the row vector $x_{0i} = [x_{0i1}, x_{0i2}, \ldots, x_{0iq}, \ldots, x_{0iQ}]$;

(1f) for each background sample data point $x_{0i}$, calculating the squared Euclidean distance to all background sample data points to obtain $d_{i,1}, d_{i,2}, \ldots, d_{i,n}, \ldots, d_{i,N}$, where

$$d_{i,n} = \|x_{0i}-x_{0n}\|_2^2 = (x_{0i1}-x_{0n1})^2 + (x_{0i2}-x_{0n2})^2 + \cdots + (x_{0iq}-x_{0nq})^2 + \cdots + (x_{0iQ}-x_{0nQ})^2 \quad (1 \le i \le N,\ 1 \le n \le N)$$

and $\|x_{0i}-x_{0n}\|_2$ denotes the L2 norm of $(x_{0i}-x_{0n})$; this finally gives the $N$-dimensional space sample data of the background samples:

$$\begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,N} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N,1} & d_{N,2} & \cdots & d_{N,N} \end{bmatrix}$$
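The region count and feature layout of sub-steps (1c) to (1e) can be sanity-checked with a few lines of arithmetic. The per-region feature counts used below (one gray, one curvature, one texture and three Hessian eigenvalue features per region) are inferred from the index ranges given in (1d) and should be read as an assumption; the variable names are illustrative.

```python
# Sampling angle in degrees on each circle, keyed by distance in voxels
angles_deg = {1: 90, 3: 45, 5: 30}
regions_per_circle = {r: 360 // a for r, a in angles_deg.items()}
n_regions = sum(regions_per_circle.values())      # 4 + 8 + 12

# Assumed number of features contributed by each region, per feature type
features_per_region = {
    "gray": 1,
    "curvature": 1,
    "gabor_texture": 1,
    "hessian_eigenvalues": 3,
}

index_ranges = {}
start = 1
for name, per_region in features_per_region.items():
    count = per_region * n_regions
    index_ranges[name] = (start, start + count - 1)  # 1-based, inclusive
    start += count
Q = start - 1                                     # total feature dimension
```

Under that assumption the computed ranges are 1 to 24, 25 to 48, 49 to 72 and 73 to 144, matching the indices in (1d) and the stated dimension Q = 144.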

(2) analyzing the background sample data of the high-dimensional space and generating the target sample data of the high-dimensional space, as follows:

(2a) computing, dimension by dimension, the histograms of the $N$ data in the high-dimensional space transformation of the background samples, dividing each dimension of the data into $h$ equal intervals;

(2b) counting the samples in each interval, denoted $y_t$, where $y_t$ is a row vector holding the sample count of each interval of the $t$-th dimension data in the high-dimensional space transformation of the background samples, and normalizing by dividing the interval sample counts $y_t$ of that dimension by the maximum sample count over all intervals, $y_t' = y_t / \max(y_t)$;

(2c) complementing and normalizing the normalized interval sample counts $y_t'$ to obtain the probability distribution $p_t$ of the target samples;

(2d) calculating the number of target sample data points to be generated in each interval of that dimension, $k_t = M \times p_t$, where $k_t$ is a row vector holding the number of data points to generate in each interval of the $t$-th dimension and $M$ is the number of data points to be generated; randomly generating $k_t$ data points in each interval according to a uniform distribution, and recording the generated target sample data as $l_{1,t}, l_{2,t}, \ldots, l_{m,t}, \ldots, l_{M,t}$;

(2e) performing the above process on every dimension of the sample data in the high-dimensional space transformation of the background samples to generate every dimension of the high-dimensional sample data of the $M$ data points to be amplified, and randomly shuffling the sample data within each dimension to obtain the high-dimensional space sample data of the amplified data:

$$\begin{bmatrix} l_{1,1} & l_{1,2} & \cdots & l_{1,N} \\ l_{2,1} & l_{2,2} & \cdots & l_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ l_{M,1} & l_{M,2} & \cdots & l_{M,N} \end{bmatrix}$$

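(3) transforming the amplified data from the high-dimensional space to the original space, as follows:

(3a) denoting the $M$ amplified sample points as $x_1, x_2, \ldots, x_m, \ldots, x_M$, where each sample point comprises $Q$-dimensional data and the $i$-th sample is the row vector $x_i = [x_{i1}, x_{i2}, \ldots, x_{iq}, \ldots, x_{iQ}]$; from the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the $m$-th generated target sample point and $x_{0n}$ is the $n$-th background sample point, the system of distance function equations between the background sample points and the amplification data is obtained:

$$\|x_m - x_{0n}\|_2^2 = l_{m,n}, \qquad n = 1, 2, \ldots, N$$

(3b) expanding the quadratic terms of the system of distance function equations and taking the difference between the $n$-th and the $(n+1)$-th equations ($1 \le n \le N-1$) gives, after transposing terms and combining coefficients, the linear equation for the $m$-th data point of the amplification data:

$$2\,(x_{0,n+1} - x_{0n})\,x_m^{\mathsf T} = (l_{m,n} - l_{m,n+1}) + \bigl(\|x_{0,n+1}\|_2^2 - \|x_{0n}\|_2^2\bigr)$$

the system of equations can be written as a matrix equation:

$$2\begin{bmatrix} x_{02} - x_{01} \\ \vdots \\ x_{0N} - x_{0,N-1} \end{bmatrix} x_m^{\mathsf T} = \begin{bmatrix} l_{m,1} - l_{m,2} \\ \vdots \\ l_{m,N-1} - l_{m,N} \end{bmatrix} + \begin{bmatrix} \|x_{02}\|_2^2 - \|x_{01}\|_2^2 \\ \vdots \\ \|x_{0N}\|_2^2 - \|x_{0,N-1}\|_2^2 \end{bmatrix}$$

by solving this matrix equation, a point $x_m$ of the amplification data can be calculated;

(3c) generalizing the calculation of one point $x_m$ of the amplification data to all $M$ points gives a matrix equation for the data points to be generated:

$$AX = B + C$$

where $A$ is the coefficient matrix on the left-hand side above (including the factor 2), $X = [x_1^{\mathsf T}, x_2^{\mathsf T}, \ldots, x_M^{\mathsf T}]$, $B$ is the $(N-1) \times M$ matrix whose entry in row $n$ and column $m$ is $l_{m,n} - l_{m,n+1}$, and

$$C = [c, c, \ldots, c],$$

with $c$ the column vector whose $n$-th entry is $\|x_{0,n+1}\|_2^2 - \|x_{0n}\|_2^2$.

Solving the above system of equations gives the unknown $X = A^{-1}(B + C)$, where $A^{-1}$ denotes the pseudo-inverse of the matrix $A$; the resulting data are the amplification data points, which completes the transformation of the amplification data from the high-dimensional space to the original space.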

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A data amplification method based on high-dimensional space transformation is characterized in that the data amplification method based on high-dimensional space transformation transforms background sample data from an original space to a high-dimensional space; obtaining the distribution of a high-dimensional space target sample based on the distribution histogram of the background sample, and generating high-dimensional space target sample data; performing equation set transformation by using a distance function, and transforming the amplification data from a high-dimensional space to an original space;

selecting normal blood vessel cross-section images as background samples and blood vessel plaque cross-section images as target samples, the number of background samples being recorded as $N$ and the background sample points being $x_{01}, x_{02}, \ldots, x_{0n}, \ldots, x_{0N}$;

The data amplification method based on the high-dimensional spatial transformation comprises the following steps:

step two, computing the histogram of the high-dimensional background samples in each dimension and normalizing the distribution of the sample data in each dimension; taking the complement of the normalized background histogram to obtain the histogram distribution of the target samples in each dimension, and normalizing it to obtain the probability distribution of the target samples; determining, from the probability distribution in each dimension, the number of sample points to be generated and their value range; generating preliminary target sample data from the probability distribution of each dimension in this way, and randomly shuffling the values within each dimension to generate the target sample data of the high-dimensional space;

step three, taking the distance between a background sample point and a generated target sample point as the distance function, and obtaining through it a system of distance function equations between the background sample points and one data point of the amplification data; taking the difference of each pair of adjacent equations in the system, then transposing terms and combining coefficients to obtain a non-homogeneous system of linear equations in that data point; generalizing the solution for one point to all points of the data to be generated to obtain a matrix equation in the low-dimensional amplification data, solving the matrix equation, and transforming the amplification data from the high-dimensional space to the original space to obtain the amplified target sample data.

2. The method according to claim 1, wherein transforming the background sample data from the original space to the high-dimensional space in the first step specifically comprises:

(2) for each background sample data point $x_{0i}$, calculating the squared Euclidean distance to all background sample data points to obtain $d_{i,1}, d_{i,2}, \ldots, d_{i,n}, \ldots, d_{i,N}$, where

$$d_{i,n} = \|x_{0i}-x_{0n}\|_2^2 = (x_{0i1}-x_{0n1})^2 + (x_{0i2}-x_{0n2})^2 + \cdots + (x_{0iq}-x_{0nq})^2 + \cdots + (x_{0iQ}-x_{0nQ})^2 \quad (1 \le i \le N,\ 1 \le n \le N)$$

and $\|x_{0i}-x_{0n}\|_2$ denotes the L2 norm of $(x_{0i}-x_{0n})$; this finally gives the $N$-dimensional space sample data of the background samples:

$$\begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,N} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N,1} & d_{N,2} & \cdots & d_{N,N} \end{bmatrix}$$

3. The method according to claim 1, wherein generating the target sample data of the high-dimensional space in the second step specifically comprises:

(4) calculating the number of target sample data points to be generated in each interval of that dimension, $k_t = M \times p_t$, where $k_t$ is a row vector holding the number of data points to generate in each interval of the $t$-th dimension and $M$ is the number of data points to be generated; randomly generating $k_t$ data points in each interval according to a uniform distribution, and recording the generated target sample data as $l_{1,t}, l_{2,t}, \ldots, l_{m,t}, \ldots, l_{M,t}$;

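4. The method for data amplification based on high-dimensional space transformation according to claim 1, wherein transforming the amplified data from the high-dimensional space to the original space in step three specifically comprises:

(1) denoting the $M$ amplified sample points as $x_1, x_2, \ldots, x_m, \ldots, x_M$, where each sample point comprises $Q$-dimensional data and the $i$-th sample is the row vector $x_i = [x_{i1}, x_{i2}, \ldots, x_{iq}, \ldots, x_{iQ}]$; from the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the $m$-th generated target sample point and $x_{0n}$ is the $n$-th background sample point, the system of distance function equations between the background sample points and the amplification data is obtained:

$$\|x_m - x_{0n}\|_2^2 = l_{m,n}, \qquad n = 1, 2, \ldots, N$$

(2) expanding the quadratic terms of the system of distance function equations and taking the difference between the $n$-th and the $(n+1)$-th equations ($1 \le n \le N-1$) gives the linear equation for the $m$-th data point of the amplification data:

$$2\,(x_{0,n+1} - x_{0n})\,x_m^{\mathsf T} = (l_{m,n} - l_{m,n+1}) + \bigl(\|x_{0,n+1}\|_2^2 - \|x_{0n}\|_2^2\bigr)$$

the system of equations can be written as a matrix equation:

$$AX = B + C$$

where $A$ is the $(N-1) \times Q$ matrix whose $n$-th row is $2\,(x_{0,n+1} - x_{0n})$, $X = [x_1^{\mathsf T}, x_2^{\mathsf T}, \ldots, x_M^{\mathsf T}]$, $B$ is the $(N-1) \times M$ matrix whose entry in row $n$ and column $m$ is $l_{m,n} - l_{m,n+1}$, and

$$C = [c, c, \ldots, c],$$

with $c$ the column vector whose $n$-th entry is $\|x_{0,n+1}\|_2^2 - \|x_{0n}\|_2^2$.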

5. A machine recognition system using the high-dimensional spatial transform-based data amplification method according to any one of claims 1 to 4.

6. An image recognition system using the high-dimensional spatial transform-based data augmentation method of any one of claims 1 to 4.