CN107729926B - Data amplification method and machine identification system based on high-dimensional space transformation - Google Patents

Info

Publication number
CN107729926B
Authority
CN
China
Prior art keywords
data
sample
dimensional space
background
amplification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710899032.3A
Other languages
Chinese (zh)
Other versions
CN107729926A (en
Inventor
赵凤军
吴斌
贺小伟
侯榆青
易黄建
曹欣
王宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern University
Original Assignee
Northwestern University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern University filed Critical Northwestern University
Priority to CN201710899032.3A priority Critical patent/CN107729926B/en
Publication of CN107729926A publication Critical patent/CN107729926A/en
Application granted granted Critical
Publication of CN107729926B publication Critical patent/CN107729926B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/758 Involving statistics of pixels or of feature values, e.g. histogram matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image processing and machine learning, and discloses a data amplification method and a machine identification system based on high-dimensional space transformation. Background sample data is transformed from the original space to a high-dimensional space; the distribution of the target samples in the high-dimensional space is obtained from the distribution histogram of the background samples, and high-dimensional target sample data is generated; an equation-set transformation using the distance function then transforms the amplified data from the high-dimensional space back to the original space. By learning the distribution histogram of the negative samples, the invention amplifies the corresponding positive sample data set, which addresses the mismatch between positive and negative sample data in machine learning models and improves classification performance, in particular the classification precision of positive samples. Because the distribution of the target sample data to be generated is obtained by statistical analysis of the background samples, the data amplification is more effective, and the sample overlap and model overfitting that arise in the prior art when new target samples are synthesized from a small number of samples are avoided.

Description

Data amplification method and machine identification system based on high-dimensional space transformation
Technical Field
The invention belongs to the technical field of image processing and machine learning, and particularly relates to a data amplification method and a machine identification system based on high-dimensional space transformation.
Background
Machine learning studies how machines recognize existing knowledge and acquire new knowledge and skills, and it has been widely applied in fields such as image recognition, data mining, and fault diagnosis. In machine learning, sample data must first be processed and used for training. In practice, sample data sets are often unbalanced: the number of negative samples is usually much greater than the number of positive samples, and training on such a data set degrades the classification performance of the classifier. In blood vessel plaque identification, for example, vessels with plaque make up only a small share of the samples from a vascular system, and most samples are healthy vessels. A classifier trained on such samples has low accuracy: a normal vessel may be identified as a vessel with plaque, so that the patient's condition is misjudged, or a vessel with plaque may be identified as a normal vessel, so that treatment is delayed. Correctly classifying unbalanced data, and thereby improving classification accuracy, is therefore of great significance to this research field. At present, unbalanced data sets are handled in two main ways: from the data perspective, the data set is balanced by sampling or amplifying the research samples; from the algorithm perspective, the algorithm is improved to raise the performance of the classifier.
Traditional methods for handling an unbalanced data set from the data perspective fall into two groups. The first is sampling: the negative samples are subsampled until their number equals that of the original positive sample set. This discards the information carried by the samples that are not selected; when the negative samples far outnumber the positive samples, most of the information in the research samples is lost and the number of samples available for training becomes seriously insufficient. The second is to increase the number of positive samples by data amplification, which analyzes the target samples and artificially synthesizes new samples from them to balance the data set, for example by copying positive samples, adding noise to them, or rotating or flipping them. Such simple data amplification easily causes sample overlap and model overfitting and makes the model harder to train. To improve on simple data amplification, some scholars have proposed new amplification algorithms. The SMOTE algorithm, for example, balances a data set by synthesizing new samples through linear interpolation between positive samples that lie close together. It generates new samples for every positive sample and mitigates model overfitting, but it easily causes sample overlap; it also ignores the influence that samples near the classification boundary and isolated points have on the classification performance of the target samples, so new samples are synthesized somewhat blindly. The BSMOTE algorithm builds on SMOTE: it uses a nearest neighbor algorithm to divide the target samples into noise samples, internal samples (samples far from the classification boundary), and boundary samples, and it synthesizes new samples from the target samples on the classification boundary.
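The linear interpolation that the SMOTE description above refers to can be illustrated with a minimal sketch. This is not the reference SMOTE implementation; the function name `smote_like_sample`, the neighbour count `k`, and the way the base sample is chosen are assumptions made for illustration only.

```python
import random

def smote_like_sample(positives, k=2, rng=None):
    """Synthesize one point by linear interpolation between a positive
    sample and one of its k nearest positive neighbours (illustrative)."""
    rng = rng or random.Random()
    base = rng.choice(positives)
    # Rank the other positive samples by squared Euclidean distance to the base.
    others = [p for p in positives if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation weight in [0, 1)
    # The new point lies on the segment between the base and its neighbour.
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote_like_sample(positives, k=2, rng=random.Random(0))
```

Because every synthetic point lies on a segment between two existing positive samples, densely interpolated sets tend to overlap, which is the weakness the passage above attributes to this family of methods.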
In summary, the problems of the prior art are as follows: new samples are synthesized from an analysis of the target samples, which easily leads to sample overlap and to boundaries and isolated points being ignored. The resulting limitations of the training samples make the classifier inaccurate and limit how far the classification performance on target samples can be improved; for example, sample overlap may cause model overfitting, and ignoring boundaries and isolated points may cause those sample points to be misclassified.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a data amplification method and a machine identification system based on high-dimensional space transformation.
The invention is realized as a data amplification method based on high-dimensional space transformation, which transforms background sample data from the original space to a high-dimensional space; obtains the distribution of the target samples in the high-dimensional space from the distribution histogram of the background samples and generates high-dimensional target sample data; and performs an equation-set transformation using the distance function to transform the amplified data from the high-dimensional space to the original space.
Further, the data amplification method based on the high-dimensional space transformation comprises the following steps:
step one, dividing the data samples into positive samples and negative samples, wherein the positive samples are the target samples and the negative samples are the background samples; calculating the squared Euclidean distance between each background sample and all background samples to obtain the high-dimensional space transformation of the background samples, so that the background sample data is transformed from the original space to the high-dimensional space;
step two, computing the histogram of the high-dimensional background samples in each dimension and normalizing the distribution of the sample data in each dimension; complementing the normalized background histogram to obtain the histogram distribution of the target samples in each dimension, and normalizing it to obtain the probability distribution of the target samples; determining, from the probability distribution in each dimension, the number of sample points to be generated in that dimension and their value ranges; generating preliminary target sample data from the probability distribution of each dimension, and randomly shuffling the order of the values within each dimension to generate the target sample data of the high-dimensional space;
step three, taking the distance between a background sample point and a generated target sample point as the distance function, and obtaining from it a system of distance-function equations relating the background sample points to a given data point in the amplified data; subtracting adjacent equations of the system, then transposing terms and combining coefficients, to obtain a non-homogeneous system of linear equations in that point of the data to be generated; solving for that point and generalizing the solution to all points in the data to be generated, which gives a matrix equation in the low-dimensional amplified data to be generated; and solving the matrix equation, which transforms the amplified data from the high-dimensional space to the original space and yields the amplified target sample data.
Further, transforming the background sample data from the original space to the high-dimensional space in the first step specifically includes:
(1) dividing the original data into research samples and background samples, wherein the number of background samples is $N$ and the background sample points are $x_{01}, x_{02}, \dots, x_{0n}, \dots, x_{0N}$; each sample point comprises $Q$-dimensional data, and the $i$-th sample is the row vector $x_{0i} = [x_{0i1}, x_{0i2}, \dots, x_{0iq}, \dots, x_{0iQ}]$;
(2) for each background sample data point $x_{0i}$, calculating the squared Euclidean distance between it and all background sample data points to obtain $d_{i,1}, d_{i,2}, \dots, d_{i,n}, \dots, d_{i,N}$, where $d_{i,n} = \|x_{0i} - x_{0n}\|_2^2 = (x_{0i1} - x_{0n1})^2 + (x_{0i2} - x_{0n2})^2 + \dots + (x_{0iq} - x_{0nq})^2 + \dots + (x_{0iQ} - x_{0nQ})^2$ $(1 \le i \le N,\ 1 \le n \le N)$ and $\|x_{0i} - x_{0n}\|_2$ denotes the L2 norm of $(x_{0i} - x_{0n})$; this finally yields the N-dimensional space sample data of the background samples:
Figure RE-GDA0001494400790000031
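The transformation in step one can be sketched in a few lines of plain Python. The sketch below assumes the result is the N×N array whose entry (i, n) is the squared Euclidean distance $d_{i,n}$ defined above; the function name `squared_distance_transform` and the toy data are illustrative and do not come from the patent.

```python
def squared_distance_transform(background):
    """Map N background samples (Q features each) to N rows holding the
    squared Euclidean distances to every background sample."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    # Row i holds d_{i,1}, ..., d_{i,N}: sample i expressed in N dimensions.
    return [[sq_dist(xi, xn) for xn in background] for xi in background]

# Three background samples with Q = 2 features become three 3-dimensional rows.
background = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = squared_distance_transform(background)
```

Each row of `D` is one background sample in the N-dimensional space; the diagonal is zero and the array is symmetric because the distance is symmetric.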
Further, generating the target sample data of the high-dimensional space in step two specifically includes:
(1) computing, dimension by dimension, the histograms of the N-dimensional data in the high-dimensional space transformation of the background samples, the data of each dimension being divided equally into $h$ intervals;
(2) counting the samples in each interval, denoted $y_t$, where $y_t$ is a row vector holding the sample count of each interval for the $t$-th dimension of the high-dimensional space transformation of the background samples, and normalizing the interval sample counts $y_t$ of that dimension by dividing them by the maximum sample count over all intervals:
Figure RE-GDA0001494400790000041
(3) complementing and standardizing the normalized interval sample counts $y_t'$ to obtain the probability distribution of the target samples:
Figure RE-GDA0001494400790000042
(4) calculating the number of target sample data points to be generated in each interval of that dimension, $k_t = M \times p_t$, where $k_t$ is a row vector holding the count of data to be generated in each interval of the $t$-th dimension and $M$ is the number of data points to be generated; randomly generating $k_t$ data points in each interval according to a uniform distribution, and recording the generated target sample data as $l_{1,t}, l_{2,t}, \dots, l_{m,t}, \dots, l_{M,t}$;
(5) performing the above process on every dimension of the high-dimensional space transformation of the background samples to generate every dimension of the high-dimensional sample data for the $M$ data points to be amplified, and randomly shuffling the sample data within each dimension to obtain the high-dimensional space sample data of the amplified data:
Figure RE-GDA0001494400790000043
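The equation images for items (2), (3) and (5) are not reproduced in this text. One reading that is consistent with the surrounding definitions is sketched below; it is a reconstruction under assumptions, not the original figures. The interval index $j$, the use of $1 - y'_{t,j}$ as the complement, and the symbol $L$ for the generated array are all assumed, and the maximum is taken here over the $h$ intervals of dimension $t$.

```latex
y'_{t,j} = \frac{y_{t,j}}{\max_{1 \le r \le h} y_{t,r}}, \qquad
p_{t,j} = \frac{1 - y'_{t,j}}{\sum_{r=1}^{h} \bigl(1 - y'_{t,r}\bigr)}, \qquad
k_{t,j} = M \, p_{t,j}, \qquad j = 1, \dots, h

L = \begin{bmatrix}
l_{1,1} & l_{1,2} & \cdots & l_{1,N} \\
l_{2,1} & l_{2,2} & \cdots & l_{2,N} \\
\vdots  & \vdots  &        & \vdots  \\
l_{M,1} & l_{M,2} & \cdots & l_{M,N}
\end{bmatrix}
```

Under this reading, intervals that are crowded with background samples receive a low probability and sparse intervals receive a high one. The product $M \, p_{t,j}$ is generally not an integer, so an implementation has to choose a rounding rule, which the text does not specify.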
Further, transforming the amplified data from the high-dimensional space to the original space in step three specifically includes:
(1) denoting the $M$ amplified sample points as $x_1, x_2, \dots, x_m, \dots, x_M$, wherein each sample point comprises $Q$-dimensional data and the $i$-th sample is the row vector $x_i = [x_{i1}, x_{i2}, \dots, x_{iq}, \dots, x_{iQ}]$; from the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the $m$-th sample point of the generated target samples and $x_{0n}$ is the $n$-th sample point of the background samples, the system of distance-function equations relating the background sample points to the amplified data is obtained:
Figure RE-GDA0001494400790000051
(2) expanding the quadratic terms of the distance-function equation system and taking the difference between the $n$-th and $(n+1)$-th equations, where $1 \le n \le N$, to obtain the linear equations for the $m$-th data point of the amplified data:
Figure RE-GDA0001494400790000052
after transposing terms and combining coefficients, the linear equations become:
Figure RE-GDA0001494400790000053
writing the system of equations as a matrix equation:
Figure RE-GDA0001494400790000054
solving the matrix equation gives a point $x_m$ of the amplified data;
(3) generalizing the calculation of a point $x_m$ of the amplified data to all $M$ points, which gives the matrix equation for the data points to be generated:
AX=B+C;
wherein
Figure RE-GDA0001494400790000055
Figure RE-GDA0001494400790000056
Figure RE-GDA0001494400790000061
Solving the above equation system gives the unknown $X = A^{-1}(B + C)$, where $A^{-1}$ denotes the pseudo-inverse of the matrix $A$; the result is the amplified data points, and the amplified data has thereby been transformed from the high-dimensional space to the original space.
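Because the equation images of step three are missing from this text, the derivation can be retraced from the distance function alone. The block below is one self-consistent reconstruction; the sign and scaling conventions chosen for $A$, $B$ and $c$ are assumptions and may differ from the original figures, and the index is restricted to $n = 1, \dots, N-1$ so that equation $n+1$ exists.

```latex
\begin{aligned}
l_{m,n} &= \|x_m - x_{0n}\|_2^2
         = \|x_m\|_2^2 - 2\, x_{0n} x_m^{\mathsf T} + \|x_{0n}\|_2^2, \\
l_{m,n} - l_{m,n+1} &= 2\,(x_{0,n+1} - x_{0n})\, x_m^{\mathsf T}
         + \|x_{0n}\|_2^2 - \|x_{0,n+1}\|_2^2, \\
2\,(x_{0,n+1} - x_{0n})\, x_m^{\mathsf T} &=
         \underbrace{l_{m,n} - l_{m,n+1}}_{B_{n,m}}
         + \underbrace{\|x_{0,n+1}\|_2^2 - \|x_{0n}\|_2^2}_{c_n}.
\end{aligned}
```

Stacking the rows $2\,(x_{0,n+1} - x_{0n})$ into $A \in \mathbb{R}^{(N-1) \times Q}$, the columns $x_m^{\mathsf T}$ into $X \in \mathbb{R}^{Q \times M}$, and repeating the column $c = (c_1, \dots, c_{N-1})^{\mathsf T}$ $M$ times to form $C$ gives $AX = B + C$. The quadratic term $\|x_m\|_2^2$ cancels in the difference, which is why the resulting system is linear in $x_m$.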
Another object of the present invention is to provide a machine recognition system using the data amplification method based on high-dimensional spatial transformation.
Another object of the present invention is to provide an image recognition system using the data augmentation method based on high-dimensional spatial transformation.
The invention has the following advantages and positive effects. In a machine learning model, a classifier trained on the original samples performs poorly when there are too few positive samples. By learning the distribution histogram of the negative samples, the invention amplifies the corresponding positive sample data set, which addresses the mismatch between positive and negative sample data in the machine learning model and improves classification performance, in particular the classification precision of positive samples. The invention performs statistical analysis on the background samples (negative samples) to obtain the data distribution of the target samples (positive samples) to be generated and then generates the target samples. This addresses the neglect of boundaries and isolated points when target samples are generated by traditional methods, improves the validity of the amplified data, and avoids the sample overlap and model overfitting that occur when traditional methods synthesize new target samples from a small number of samples.
Drawings
FIG. 1 is a flowchart of a method for amplifying data based on high-dimensional spatial features according to an embodiment of the present invention.
Fig. 2 is a region selection diagram of sample spatial feature extraction according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an embodiment of generating amplification data in a data amplification method based on high-dimensional spatial transformation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
According to the invention, through learning the distribution histogram of the negative sample, the corresponding positive sample data set is amplified, and the problem of mismatching of the positive sample data and the negative sample data in the machine learning model is solved; and performing statistical analysis based on the background sample (negative sample) to obtain the data distribution of the target sample (positive sample) to be generated, so as to generate the target sample and improve the validity of the amplified data.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.
As shown in fig. 1, the data amplification method based on high-dimensional spatial transformation according to the embodiment of the present invention includes the following steps:
S101: preprocessing the samples and transforming the background sample data from the original space to a high-dimensional space;
S102: performing histogram statistics and analysis on the high-dimensional background sample data, obtaining the distribution of the high-dimensional target samples, and generating high-dimensional target sample data;
S103: performing an equation-set transformation using the distance function and transforming the amplified data to the original space.
The application of the principles of the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in fig. 3, the data amplification method based on high-dimensional spatial transformation provided in the embodiment of the present invention specifically includes the following steps:
(1) preprocessing a sample, and transforming background sample data from an original space to a high-dimensional space;
(1a) the data used in this example are cross-sectional images of blood vessels along a direction perpendicular to the centerline in the human vascular system;
(1b) selecting normal blood vessel cross-section images as background samples and blood vessel plaque cross-section images as target samples, and recording the number of background samples as $N$, the background sample points being $x_{01}, x_{02}, \dots, x_{0n}, \dots, x_{0N}$;
(1c) as shown in fig. 2, taking the center point of the current background sample as the circle center, sampling on circles at 1, 3 and 5 voxels from the sample center point, with sampling angles of 90 degrees, 45 degrees and 30 degrees in sequence from the innermost circle outward, to obtain 24 sampling regions;
(1d) extracting the features of the background sample: the average gray value of each region is the mean gray value of all voxels in the region, giving the 24-element feature vector $[x_{0i1}, x_{0i2}, \dots, x_{0i24}]$, where $i$ denotes the $i$-th background sample; the average curvature of each region is calculated and recorded as the curvature feature of the region, giving the 24-element feature vector $[x_{0i25}, x_{0i26}, \dots, x_{0i48}]$; texture features are obtained from 90 filtered texture maps produced by two-dimensional Gabor filtering, giving the feature vector $[x_{0i49}, x_{0i50}, \dots, x_{0i72}]$; the Hessian matrix of each point is calculated to obtain three eigenvalues representing the direction at that point, giving the feature vector $[x_{0i73}, x_{0i74}, \dots, x_{0i144}]$;
(1e) applying the sampling scheme above to each background sample and calculating its feature vector, giving data of dimension $Q = 144$ in which each background sample point is composed of the four types of features, the $i$-th sample being the row vector $x_{0i} = [x_{0i1}, x_{0i2}, \dots, x_{0iq}, \dots, x_{0iQ}]$;
(1f) for each background sample data point $x_{0i}$, calculating the squared Euclidean distance between it and all background sample data points to obtain $d_{i,1}, d_{i,2}, \dots, d_{i,n}, \dots, d_{i,N}$, where $d_{i,n} = \|x_{0i} - x_{0n}\|_2^2 = (x_{0i1} - x_{0n1})^2 + (x_{0i2} - x_{0n2})^2 + \dots + (x_{0iq} - x_{0nq})^2 + \dots + (x_{0iQ} - x_{0nQ})^2$ $(1 \le i \le N,\ 1 \le n \le N)$ and $\|x_{0i} - x_{0n}\|_2$ denotes the L2 norm of $(x_{0i} - x_{0n})$; this finally yields the N-dimensional space sample data of the background samples:
Figure RE-GDA0001494400790000081
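The region layout in (1c) and the feature count in (1e) can be checked with a short sketch. The starting angle of 0 degrees, the use of continuous (x, y) offsets, and the name `sampling_centres` are assumptions; the patent text gives only the radii, the angular steps, and the resulting counts.

```python
import math

def sampling_centres(radii=(1, 3, 5), step_degrees=(90, 45, 30)):
    """Return (x, y) offsets of region centres on concentric circles
    around the sample centre point, innermost circle first."""
    centres = []
    for radius, step in zip(radii, step_degrees):
        # A circle sampled every `step` degrees holds 360 // step regions.
        for k in range(360 // step):
            angle = math.radians(k * step)
            centres.append((radius * math.cos(angle), radius * math.sin(angle)))
    return centres

centres = sampling_centres()
# Per region: 1 gray value, 1 curvature value, 1 texture value, 3 Hessian eigenvalues.
Q = len(centres) * (1 + 1 + 1 + 3)
```

The three circles contribute 4, 8 and 12 regions, which gives the 24 sampling regions, and six values per region give the feature dimension Q = 144 stated in (1e).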
(2) analyzing the high-dimensional background sample data and generating the high-dimensional target sample data, as follows:
(2a) computing, dimension by dimension, the histograms of the N-dimensional data in the high-dimensional space transformation of the background samples, the data of each dimension being divided equally into $h$ intervals;
(2b) counting the samples in each interval, denoted $y_t$, where $y_t$ is a row vector holding the sample count of each interval for the $t$-th dimension of the high-dimensional space transformation of the background samples, and normalizing the interval sample counts $y_t$ of that dimension by dividing them by the maximum sample count over all intervals:
Figure RE-GDA0001494400790000082
(2c) complementing and standardizing the normalized interval sample counts $y_t'$ to obtain the probability distribution of the target samples:
Figure RE-GDA0001494400790000091
(2d) calculating the number of target sample data points to be generated in each interval of that dimension, $k_t = M \times p_t$, where $k_t$ is a row vector holding the count of data to be generated in each interval of the $t$-th dimension and $M$ is the number of data points to be generated; randomly generating $k_t$ data points in each interval according to a uniform distribution, and recording the generated target sample data as $l_{1,t}, l_{2,t}, \dots, l_{m,t}, \dots, l_{M,t}$;
(2e) performing the above process on every dimension of the high-dimensional space transformation of the background samples to generate every dimension of the high-dimensional sample data for the $M$ data points to be amplified, and randomly shuffling the sample data within each dimension to obtain the high-dimensional space sample data of the amplified data:
Figure RE-GDA0001494400790000092
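Steps (2a) to (2e) can be sketched for a single dimension as follows. This is a hedged illustration: the function name `generate_dimension`, the reading of the complement as one minus the normalized count, and the rounding of $M \times p_t$ to an integer are assumptions, since the equation images are not reproduced here. The sketch also assumes that the values are not all equal and that the histogram is not flat, because either case leads to a division by zero.

```python
import random

def generate_dimension(values, h, M, rng):
    """Generate about M values for one dimension from the complemented
    histogram of the background values in that dimension."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / h
    counts = [0] * h
    for v in values:
        j = min(int((v - lo) / width), h - 1)  # clamp the maximum into the last interval
        counts[j] += 1
    peak = max(counts)
    # Complement: intervals that are sparse in the background get a high weight.
    complement = [1 - c / peak for c in counts]
    total = sum(complement)
    probs = [c / total for c in complement]
    generated = []
    for j, p in enumerate(probs):
        k = round(M * p)  # rounding is an assumption; the text gives k_t = M x p_t
        start = lo + j * width
        generated += [rng.uniform(start, start + width) for _ in range(k)]
    rng.shuffle(generated)  # random shuffling within the dimension
    return generated

background_dim = [0, 0, 0, 0, 1, 2, 2, 4]
new_values = generate_dimension(background_dim, h=4, M=8, rng=random.Random(1))
```

Running the function once per dimension and stacking the results column by column gives the high-dimensional array of generated points. Because of the rounding, the number of generated values can differ slightly from M for other inputs.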
(3) transforming the amplified data from the high-dimensional space to the original space, as follows:
(3a) denoting the $M$ amplified sample points as $x_1, x_2, \dots, x_m, \dots, x_M$, wherein each sample point comprises $Q$-dimensional data and the $i$-th sample is the row vector $x_i = [x_{i1}, x_{i2}, \dots, x_{iq}, \dots, x_{iQ}]$; from the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the $m$-th sample point of the generated target samples and $x_{0n}$ is the $n$-th sample point of the background samples, the system of distance-function equations relating the background sample points to the amplified data can be obtained:
Figure RE-GDA0001494400790000093
(3b) expanding the quadratic terms of the distance-function equation system and taking the difference between the $n$-th and $(n+1)$-th equations $(1 \le n \le N)$ gives the linear equations for the $m$-th data point of the amplified data:
Figure RE-GDA0001494400790000094
after transposing terms and combining coefficients, the linear equations become:
Figure RE-GDA0001494400790000101
the system of equations can be written as a matrix equation:
Figure RE-GDA0001494400790000102
by solving the matrix equation, a point $x_m$ of the amplified data can be calculated;
(3c) generalizing the calculation of a point $x_m$ of the amplified data to all $M$ points, which gives the matrix equation for the data points to be generated:
AX=B+C;
wherein
Figure RE-GDA0001494400790000103
Figure RE-GDA0001494400790000104
C=[c,c,...,c],
Figure RE-GDA0001494400790000105
Solving the above equation system gives the unknown $X = A^{-1}(B + C)$, where $A^{-1}$ denotes the pseudo-inverse of the matrix $A$; the result is the amplified data points, and the transformation of the amplified data from the high-dimensional space to the original space is complete.
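A minimal sketch of step (3) for a single point is given below. It assumes one particular sign convention obtained by expanding $l_{m,n} = \|x_m - x_{0n}\|_2^2$ and subtracting adjacent equations: row $n$ of $A$ is $2(x_{0,n+1} - x_{0n})$ and the right-hand side is $l_{m,n} - l_{m,n+1} + \|x_{0,n+1}\|_2^2 - \|x_{0n}\|_2^2$. The original figures may arrange the terms differently. The pseudo-inverse is replaced here by the normal equations, which give the same least-squares solution only when $A$ has full column rank; the helper names are hypothetical.

```python
def solve_linear(G, b):
    """Solve the square system G x = b by Gaussian elimination with partial pivoting."""
    n = len(G)
    aug = [row[:] + [bi] for row, bi in zip(G, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def recover_point(background, l_m):
    """Recover x_m from its squared distances l_m[n] to each background point."""
    N, Q = len(background), len(background[0])
    norm2 = [sum(v * v for v in x) for x in background]
    # Row n of A is 2 (x_{0,n+1} - x_{0n}); rhs[n] is the matching entry of B + C.
    A = [[2 * (background[n + 1][q] - background[n][q]) for q in range(Q)]
         for n in range(N - 1)]
    rhs = [l_m[n] - l_m[n + 1] + norm2[n + 1] - norm2[n] for n in range(N - 1)]
    # Least squares through the normal equations A^T A x = A^T rhs.
    AtA = [[sum(A[n][p] * A[n][q] for n in range(N - 1)) for q in range(Q)]
           for p in range(Q)]
    Atb = [sum(A[n][p] * rhs[n] for n in range(N - 1)) for p in range(Q)]
    return solve_linear(AtA, Atb)

# Round trip: distances computed from a known point lead back to that point.
background = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
target = [0.5, -1.5]
l_m = [sum((t - b) ** 2 for t, b in zip(target, x0)) for x0 in background]
recovered = recover_point(background, l_m)
```

In the method itself the values $l_{m,n}$ are generated from histograms rather than computed from a real point, so they need not correspond exactly to any point in the original space; a least-squares solution then returns the closest fit to the linear system instead of an exact preimage.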
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A data amplification method based on high-dimensional space transformation is characterized in that the data amplification method based on high-dimensional space transformation transforms background sample data from an original space to a high-dimensional space; obtaining the distribution of a high-dimensional space target sample based on the distribution histogram of the background sample, and generating high-dimensional space target sample data; performing equation set transformation by using a distance function, and transforming the amplification data from a high-dimensional space to an original space;
selecting normal blood vessel cross-section images as background samples and blood vessel plaque cross-section images as target samples, and recording the number of background samples as $N$, the background sample points being $x_{01}, x_{02}, \dots, x_{0n}, \dots, x_{0N}$;
The data amplification method based on the high-dimensional spatial transformation comprises the following steps:
step one, dividing the data samples into positive samples and negative samples, wherein the positive samples are the target samples and the negative samples are the background samples; calculating the squared Euclidean distance between each background sample and all background samples to obtain the high-dimensional space transformation of the background samples, so that the background sample data is transformed from the original space to the high-dimensional space;
step two, computing the histogram of the high-dimensional background samples in each dimension and normalizing the distribution of the sample data in each dimension; complementing the normalized background histogram to obtain the histogram distribution of the target samples in each dimension, and normalizing the histogram distribution to obtain the probability distribution of the target samples; determining, from the probability distribution in each dimension, the number of sample points to be generated in that dimension and their value ranges; generating preliminary target sample data from the probability distribution of each dimension in this way, and randomly shuffling the order of the values within each dimension to generate the target sample data of the high-dimensional space;
step three, taking the distance between a background sample point and a generated target sample point as the distance function, and obtaining from it a system of distance-function equations relating the background sample points to a given data point in the amplified data; subtracting adjacent equations of the system, then transposing terms and combining coefficients, to obtain a non-homogeneous system of linear equations in that point of the data to be generated; generalizing the solution for that point to all points in the data to be generated, which gives a matrix equation in the low-dimensional amplified data to be generated; and solving the matrix equation, which transforms the amplified data from the high-dimensional space to the original space and yields the amplified target sample data.
2. The method according to claim 1, wherein transforming the background sample data from the original space to the high-dimensional space in the first step specifically comprises:
(1) dividing the original data into research (target) samples and background samples, wherein the number of background samples is N and the background sample points are $x_{01}, x_{02}, \dots, x_{0n}, \dots, x_{0N}$, each sample point comprises Q-dimensional data, and the i-th sample is a row vector $x_{0i} = [x_{0i1}, x_{0i2}, \dots, x_{0iq}, \dots, x_{0iQ}]$;
(2) for each background sample data point $x_{0i}$, calculating the squared Euclidean distances to all background sample data points to obtain $d_{i,1}, d_{i,2}, \dots, d_{i,n}, \dots, d_{i,N}$, where $d_{i,n} = \|x_{0i} - x_{0n}\|_2^2 = (x_{0i1} - x_{0n1})^2 + (x_{0i2} - x_{0n2})^2 + \dots + (x_{0iq} - x_{0nq})^2 + \dots + (x_{0iQ} - x_{0nQ})^2$ $(1 \le i \le N,\ 1 \le n \le N)$ and $\|x_{0i} - x_{0n}\|_2$ denotes the L2 norm of $(x_{0i} - x_{0n})$, finally obtaining the N-dimensional space sample data of the background samples:
Figure FDA0003079525860000021
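The transformation in claim 2 can be illustrated with a minimal Python sketch. It assumes the background samples are given as a list of equal-length lists of floats; the function name `squared_distance_matrix` and the sample values are hypothetical and do not come from the patent.

```python
def squared_distance_matrix(background):
    """Return the N x N matrix whose (i, n) entry is ||x0_i - x0_n||_2^2."""
    return [
        [sum((a - b) ** 2 for a, b in zip(xi, xn)) for xn in background]
        for xi in background
    ]


# Illustrative data only: N = 3 background points in a Q = 2 original space.
background = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = squared_distance_matrix(background)
```

Row i of `D` is the N-dimensional representation of background point i, so the matrix is symmetric with a zero diagonal.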
3. The method according to claim 1, wherein generating the target sample data of the high-dimensional space in the second step specifically comprises:
(1) computing, dimension by dimension, histograms of the N data points in the high-dimensional space transformation of the background samples, the data of each dimension being divided equally into h intervals;
(2) counting the samples in each interval, denoted $y_t$, where $y_t$ is a row vector giving the sample count of each interval of the t-th dimension of the high-dimensional space transformation of the background samples, and normalizing the interval sample counts $y_t$ of that dimension by dividing them by the maximum sample count over all intervals:
Figure FDA0003079525860000022
(3) complementing the normalized interval sample counts $y_t'$ and normalizing the result to obtain the probability distribution of the target samples:
Figure FDA0003079525860000023
(4) calculating the number of target sample data points to be generated in each interval of the dimension, $k_t = M \times p_t$, where $k_t$ is a row vector giving the count of data points generated in each interval of the t-th dimension and M is the number of data points to be generated; randomly generating $k_t$ data points in each interval according to a uniform distribution, and recording the generated target sample data as $l_{1,t}, l_{2,t}, \dots, l_{m,t}, \dots, l_{M,t}$;
(5) performing the above process on each dimension of the sample data in the high-dimensional space transformation of the background samples to generate, dimension by dimension, the high-dimensional space sample data of the M data points to be amplified, and randomly shuffling the sample data within each dimension to obtain the high-dimensional space sample data of the amplified data:
Figure FDA0003079525860000031
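Steps (1) to (5) of claim 3 can be sketched in Python as follows. The claim text does not say how the products $M \times p_t$ are rounded to integers or what happens when every interval holds the same count, so this sketch assumes largest-remainder rounding and a uniform fallback. It also assumes the h intervals span the minimum to the maximum of each dimension. All function names and the sample matrix are hypothetical.

```python
import random


def complement_probabilities(values, h):
    """Steps (1)-(3): histogram, divide by the peak count, complement, renormalize."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / h or 1.0  # assumption: guard against a constant dimension
    counts = [0] * h
    for v in values:
        counts[min(int((v - lo) / width), h - 1)] += 1
    peak = max(counts)
    complement = [1.0 - c / peak for c in counts]
    total = sum(complement)
    # Assumption: fall back to a uniform distribution if every interval is equally full.
    probs = [c / total for c in complement] if total > 0.0 else [1.0 / h] * h
    edges = [lo + j * width for j in range(h + 1)]
    return probs, edges


def allocate_counts(probs, m):
    """Step (4): integer counts k_t close to m * p_t that sum to m (assumed rounding)."""
    raw = [m * p for p in probs]
    counts = [int(r) for r in raw]
    order = sorted(range(len(raw)), key=lambda j: raw[j] - counts[j], reverse=True)
    for j in order[: m - sum(counts)]:
        counts[j] += 1
    return counts


def sample_dimension(values, h, m, rng):
    """Steps (4)-(5) for one dimension: uniform draws per interval, then shuffle."""
    probs, edges = complement_probabilities(values, h)
    samples = []
    for j, k in enumerate(allocate_counts(probs, m)):
        samples.extend(rng.uniform(edges[j], edges[j + 1]) for _ in range(k))
    rng.shuffle(samples)
    return samples


def generate_high_dim_targets(D, h, m, seed=0):
    """Apply the per-dimension procedure to every column of D; returns m rows."""
    rng = random.Random(seed)
    columns = [sample_dimension([row[t] for row in D], h, m, rng)
               for t in range(len(D[0]))]
    return [list(point) for point in zip(*columns)]


# Illustrative data only: a 3 x 3 squared-distance matrix of background samples.
D = [[0.0, 25.0, 100.0], [25.0, 0.0, 25.0], [100.0, 25.0, 0.0]]
L = generate_high_dim_targets(D, h=4, m=8, seed=0)
```

Because each dimension is sampled and shuffled independently, a generated row of `L` is not guaranteed to be an exactly realizable distance vector, which is why the return to the original space is treated as a fitting problem rather than an exact inversion.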
4. The method for data amplification based on high-dimensional space transformation according to claim 1, wherein transforming the amplified data from the high-dimensional space to the original space in step three specifically comprises:
(1) denoting the M amplified sample points as $x_1, x_2, \dots, x_m, \dots, x_M$, where each sample point comprises Q-dimensional data and the i-th sample is a row vector $x_i = [x_{i1}, x_{i2}, \dots, x_{iq}, \dots, x_{iQ}]$; from the distance function $l_{m,n} = \|x_m - x_{0n}\|_2^2$ $(1 \le m \le M,\ 1 \le n \le N)$, where $x_m$ is the m-th sample point of the generated target samples and $x_{0n}$ is the n-th sample point of the background samples, a system of distance-function equations relating the background sample points to the amplification data is obtained:
Figure FDA0003079525860000032
(2) expanding the quadratic terms of the system of distance-function equations and taking the difference between the n-th and the (n+1)-th equations $(1 \le n \le N)$ gives the linear equations for the m-th data point of the amplification data:
Figure FDA0003079525860000033
after rearranging terms and combining coefficients, the linear equations become:
Figure FDA0003079525860000034
the system of equations can be written as a matrix equation:
Figure FDA0003079525860000035
by solving the matrix equation, a given point $x_m$ of the amplification data can be calculated;
(3) generalizing the procedure for calculating a given point $x_m$ of the amplification data to all M points gives a matrix equation for the data points to be generated:
AX=B+C;
wherein
Figure FDA0003079525860000041
Figure FDA0003079525860000042
C=[c,c,...,c],
Figure FDA0003079525860000043
solving the above system of equations gives the unknown $X = A^{-1}(B + C)$, where $A^{-1}$ denotes the pseudo-inverse of the matrix A; the resulting data are the amplification data points, which completes the transformation of the amplification data from the high-dimensional space to the original space.
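The linearization in claim 4 can be sketched for a single point $x_m$ as follows. The claim's equations survive only as figure references, so the signs and row ordering here come from expanding $\|x_m - x_{0n}\|_2^2 - \|x_m - x_{0,n+1}\|_2^2$ directly and are an assumption about the exact layout of A, B, and C. The sketch uses the N-1 differences of consecutive equations and solves the normal equations, which matches the pseudo-inverse solution only when A has full column rank. The names `solve_linear` and `recover_point` are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def recover_point(background, distances):
    """Least-squares x with ||x - x0_n||^2 close to distances[n] for every n."""
    norms = [sum(v * v for v in p) for p in background]
    rows, rhs = [], []
    for n in range(len(background) - 1):
        p, q = background[n], background[n + 1]
        # l_n - l_{n+1} = 2 (q - p) . x + |p|^2 - |q|^2, rearranged for x.
        rows.append([2.0 * (qv - pv) for pv, qv in zip(p, q)])
        rhs.append(distances[n] - distances[n + 1] + norms[n + 1] - norms[n])
    dim = len(background[0])
    # Normal equations (A^T A) x = A^T b; assumes A^T A is invertible.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    atb = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(dim)]
    return solve_linear(ata, atb)


# Illustrative check: exact distances from a known point should recover that point.
background = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
target = [0.5, 1.5]
distances = [sum((t - b) ** 2 for t, b in zip(target, p)) for p in background]
recovered = recover_point(background, distances)
```

With distances generated by the histogram procedure rather than computed from a real point, the system is generally inconsistent and the result is a least-squares fit rather than an exact solution.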
5. A machine recognition system using the high-dimensional spatial transform-based data amplification method according to any one of claims 1 to 4.
6. An image recognition system using the high-dimensional spatial transform-based data augmentation method of any one of claims 1 to 4.
CN201710899032.3A 2017-09-28 2017-09-28 Data amplification method and machine identification system based on high-dimensional space transformation Active CN107729926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710899032.3A CN107729926B (en) 2017-09-28 2017-09-28 Data amplification method and machine identification system based on high-dimensional space transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710899032.3A CN107729926B (en) 2017-09-28 2017-09-28 Data amplification method and machine identification system based on high-dimensional space transformation

Publications (2)

Publication Number Publication Date
CN107729926A (en) 2018-02-23
CN107729926B (en) 2021-07-13

Family

ID=61208384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710899032.3A Active CN107729926B (en) 2017-09-28 2017-09-28 Data amplification method and machine identification system based on high-dimensional space transformation

Country Status (1)

Country Link
CN (1) CN107729926B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182286A (en) * 2018-01-29 2018-06-19 重庆交通大学 A kind of highway maintenance detection and virtual interactive interface method based on Internet of Things
CN108491456A (en) * 2018-03-02 2018-09-04 西安财经学院 The processing method of purchase information is sold in a kind of insurance service based on big data
CN108164291A (en) * 2018-03-22 2018-06-15 广西鸿光农牧有限公司 A kind of chicken manure fertilizer device for making
CN108388203A (en) * 2018-04-09 2018-08-10 衢州学院 A kind of intelligent numerical control machine tool heat dissipation monitoring system
CN108549281A (en) * 2018-04-11 2018-09-18 湖南城市学院 A kind of architectural design safe escape method of calibration and system
CN109344904B (en) * 2018-10-16 2020-10-30 杭州睿琪软件有限公司 Method, system and storage medium for generating training samples
CN109919183B (en) * 2019-01-24 2020-12-18 北京大学 Image identification method, device and equipment based on small samples and storage medium
CN110033417B (en) * 2019-04-12 2023-06-13 江西财经大学 Image enhancement method based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268593A (en) * 2014-09-22 2015-01-07 华东交通大学 Multiple-sparse-representation face recognition method for solving small sample size problem
CN104751191A (en) * 2015-04-23 2015-07-01 重庆大学 Sparse self-adaptive semi-supervised manifold learning hyperspectral image classification method
CN106096640A (en) * 2016-05-31 2016-11-09 合肥工业大学 A kind of feature dimension reduction method of multi-mode system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010124284A1 (en) * 2009-04-24 2010-10-28 Hemant Virkar Methods for mapping data into lower dimensions

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Construction for true three-dimensional imaging display system and analysis based on state-space model; Yi Yu et al.; 2015 IEEE International Conference on Mechatronics and Automation (ICMA); 2015-09-03; pp. 2437-2442 *
A new algorithm for feature space transformation of color images and its application; Wang Shoujue et al.; Acta Electronica Sinica; 2007-02-28; Vol. 35, No. 2; pp. 193-196 *

Also Published As

Publication number Publication date
CN107729926A (en) 2018-02-23

Similar Documents

Publication Publication Date Title
CN107729926B (en) Data amplification method and machine identification system based on high-dimensional space transformation
CN109493308B (en) Medical image synthesis and classification method for generating confrontation network based on condition multi-discrimination
CN106056595B (en) Based on the pernicious assistant diagnosis system of depth convolutional neural networks automatic identification Benign Thyroid Nodules
CN110930416B (en) MRI image prostate segmentation method based on U-shaped network
CN107194937B (en) Traditional Chinese medicine tongue picture image segmentation method in open environment
CN103400388B (en) A kind of method utilizing RANSAC to eliminate Brisk key point error matching points pair
CN111784721B (en) Ultrasonic endoscopic image intelligent segmentation and quantification method and system based on deep learning
Zhao et al. Adaptive logit adjustment loss for long-tailed visual recognition
CN107977642A (en) A kind of High Range Resolution target identification method of kernel adaptive average discriminant analysis
CN107194329B (en) One-dimensional range profile identification method based on adaptive local sparse preserving projection
CN113450328B (en) Medical image key point detection method and system based on improved neural network
CN107862680B (en) Target tracking optimization method based on correlation filter
CN110942472B (en) Nuclear correlation filtering tracking method based on feature fusion and self-adaptive blocking
CN109712149B (en) Image segmentation method based on wavelet energy and fuzzy C-means
CN111881933A (en) Hyperspectral image classification method and system
CN110516525A (en) SAR image target recognition method based on GAN and SVM
CN112489096A (en) Remote sensing image change detection method under low registration precision based on graph matching model
Liu et al. Sagan: Skip-attention gan for anomaly detection
CN110766657A (en) Laser interference image quality evaluation method
CN107729863B (en) Human finger vein recognition method
CN111639555B (en) Finger vein image noise accurate extraction and adaptive filtering denoising method and device
CN109002828A (en) Image texture characteristic extracting method based on mean value bounce mark transformation
CN116342653A (en) Target tracking method, system, equipment and medium based on correlation filter
CN109886212A (en) From the method and apparatus of rolling fingerprint synthesis fingerprint on site
CN116047418A (en) Multi-mode radar active deception jamming identification method based on small sample

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant