CN107885854B - semi-supervised cross-media retrieval method based on feature selection and virtual data generation - Google Patents
semi-supervised cross-media retrieval method based on feature selection and virtual data generation
- Publication number
- CN107885854B (application CN201711124618.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- text
- projection
- class
- virtual data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a method based on feature selection and virtual data generation. The method generates virtual data points according to the characteristics of the training data in order to expand the training set, and applies an l2,1-norm constraint while learning two pairs of projection matrices. Specifically, a class center is found for each class of images and texts, new data points are randomly generated around the class centers to form new training data, and the new data are used to learn the two pairs of projection matrices while the l2,1 norm performs feature selection. The method not only generates random data points to improve the diversity of the training data, but can also select more discriminative and information-rich features when learning the two pairs of projection matrices.
Description
Technical Field
The invention relates to a cross-media retrieval method, and in particular to a semi-supervised cross-media retrieval method based on feature selection and virtual data generation.
Background
Cross-media retrieval refers to using data of one modality as the query to retrieve data of other modalities that carry the same semantic information. Taking pictures and texts as examples, pictures can be used to retrieve texts with the corresponding semantic information (abbreviated I2T), or texts can be used to retrieve pictures with the corresponding semantic information (abbreviated T2I).
In cross-media retrieval, the most important issue is that data of different modalities have different feature representations lying in spaces of different dimensionality, so the similarity between heterogeneous data cannot be compared directly. The main concern of the cross-media retrieval field is therefore how to bridge this semantic gap. Popular solutions are subspace learning methods, which aim to learn a latent semantic space in which the similarity of heterogeneous data can be measured directly: a pair of projection matrices is learned that maps the data of the different modalities into the latent semantic space, where their similarity can then be measured. A representative method is Canonical Correlation Analysis (CCA), which learns a pair of projection matrices maximizing the correlation between the heterogeneous data when the different modalities are mapped into the semantic space; later methods such as GMMFA, GMLDA and MDCR further combine correlation analysis with semantic (linear) regression.
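To make the subspace-learning idea concrete, here is a minimal CCA sketch (an illustration of the background technique, not the patent's own method): it learns a pair of projection matrices that maximize the correlation of the two views after projection. The function name and the ridge term `reg` are assumptions added for numerical stability.

```python
import numpy as np

def cca(X, T, k, reg=1e-6):
    """Minimal CCA: learn a pair of projection matrices (A, B) maximizing
    the correlation between the two views. Columns of X (dx, n) and
    T (dt, n) are paired samples; k is the subspace dimension."""
    X = X - X.mean(axis=1, keepdims=True)
    T = T - T.mean(axis=1, keepdims=True)
    n = X.shape[1]
    Cxx = X @ X.T / n + reg * np.eye(X.shape[0])   # within-view covariances
    Ctt = T @ T.T / n + reg * np.eye(T.shape[0])
    Cxt = X @ T.T / n                              # cross-view covariance

    def inv_sqrt(C):
        # symmetric inverse square root via eigendecomposition
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Wx, Wt = inv_sqrt(Cxx), inv_sqrt(Ctt)
    U, _, Vt = np.linalg.svd(Wx @ Cxt @ Wt)        # whitened cross-covariance
    return Wx @ U[:, :k], Wt @ Vt.T[:, :k]         # projection pair (A, B)
```

For retrieval, both modalities would be projected with the learned pair and compared directly in the shared space.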
However, the common cross-media retrieval task has directionality, namely image-retrieves-text (I2T) and text-retrieves-image (T2I). The above methods learn only a single pair of projection matrices and do not emphasize the importance of the query data: specifically, in the I2T task the picture is more decisive for learning the projection matrices, and likewise in the T2I task the importance of the text should be emphasized more.
Secondly, current methods only aim at learning a more effective projection matrix from the perspective of how to measure the similarity between heterogeneous data, so that more accurate comparisons can be made in the semantic space; they ignore the selection of more information-rich and more discriminative features when learning the projection matrix. Therefore, based on MDCR, a semi-supervised method capable of randomly generating virtual data points is invented, and the l2,1 norm is adopted to select features.
Disclosure of Invention
The invention provides a semi-supervised cross-media retrieval technique based on feature selection and pseudo-random data generation. A traditional cross-media retrieval method is either a supervised method trained only on labeled data or a semi-supervised method that selects part of the unlabeled data and adds it to training. In contrast, the present method expands the training data with randomly generated virtual data points and adopts the l2,1 norm to select features. In general, our approach considers both the diversity of the training data and the selection of effective features.
The specific technical scheme of the invention is as follows:
A semi-supervised cross-media retrieval method based on feature selection and virtual data generation comprises the following steps:
Step 1: given a data set G = {(x_i, t_i)}, i = 1, ..., n, where n represents the total number of data pairs, x_i represents a picture feature and t_i represents a text feature, the picture and text feature matrices can be represented as X_G = [x_1, x_2, ..., x_{n-1}, x_n] and T_G = [t_1, t_2, ..., t_{n-1}, t_n];
Step 2: generate pseudo-random virtual data points and expand the original data set. The specific method is as follows: for each class of data in X_G and T_G, calculate the mean of each dimension of that class's data; the new vector formed by the means of all dimensions serves as the class center of the class. Then, taking the mean of each dimension as the center, randomly generate n' values above and below it, and combine the random values over all dimensions to generate n' pseudo-random virtual data points. Add the pseudo-random virtual data points to the original data set to obtain the expanded data set G_all; the expanded picture and text feature matrices are represented as X = [x_1, ..., x_n, x'_1, x'_2, ..., x'_{n'}] and T = [t_1, ..., t_n, t'_1, t'_2, ..., t'_{n'}];
Step 3: construct the objective function:
Define the objective function as the minimization over U and V of C(U, V) + L(U, V) + N(U, V) (1):
wherein U and V represent the pair of projection matrices to be learned; C(U, V) is a correlation analysis term that enables multi-modal data to keep their pairwise neighbor relations in the latent semantic space; L(U, V) is a linear regression term from the image or text modality feature space to the semantic space, used to keep the neighbor relations of different-modality data with the same semantics; and N(U, V) is a regularization term used for feature selection;
According to formula (1), the objective functions for the image-retrieves-text (I2T) and text-retrieves-image (T2I) retrieval tasks are obtained as follows:
(1) the objective function of I2T is:
wherein U1, V1 are the projection matrices to be learned in the I2T task, corresponding respectively to U and V in formula (1); β is a balance coefficient with 0 ≤ β ≤ 1; and Y is the semantic matrix;
(2) the objective function for T2I is:
wherein U2, V2 are the projection matrices to be learned in the T2I task, corresponding to U and V in formula (1);
Step 4: obtain the optimal projection matrices through an iterative solution method:
Since formulas (2) and (3) are non-convex, they are solved with a controlled-variable (alternating) method: take the partial derivatives with respect to U and V respectively and set them equal to zero to obtain the values of the projection matrices U and V; then iterate continuously until convergence to obtain the optimal values of U and V.
Specifically, in step 3, N(U, V) = λ1||U||2,1 + λ2||V||2,1, where λ1 and λ2 balance the two regularization terms and are both positive numbers; this constraint term enables more discriminative and information-rich features to be selected when learning the projection matrices.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
1. Data set processing:
Wikipedia: contains 10 classes in total and 2866 picture-text pairs. We selected 2173 picture-text pairs as the initial training data, with the remainder as test data. The picture feature is a 4096-dimensional CNN feature, and the text feature is a 100-dimensional LDA feature.
Pascal Sentence: 20 classes in total, with 50 picture-text pairs per class. We selected 30 picture-text pairs in each class as initial training data, with the remainder as test data. The picture feature is a 4096-dimensional CNN feature, and the text feature is a 100-dimensional LDA feature.
INRIA-Websearch: 353 classes in total and 71478 picture-text pairs. We randomly selected 70% of them as initial training data and the rest as test data. The picture feature is a 4096-dimensional CNN feature, and the text feature is a 1000-dimensional LDA feature.
2. The method comprises the following specific implementation steps:
step 1: given data setn represents the total number of data pairs, xiRepresenting a picture feature, tiRepresenting text features, then the picture and text feature matrices can be represented as: xG=[x1,x2,...,xn-1,xn]And TG=[t1,t2,t3,...,tn-1,tn]。
Step 2: generate pseudo-random virtual data points and expand the original data set. The specific method is as follows: for each class of data in X_G and T_G, calculate the mean of each dimension of that class's data; the new vector formed by the means of all dimensions serves as the class center of the class. Then, taking the mean of each dimension as the center, randomly generate n' values above and below it, and combine the random values over all dimensions to generate n' pseudo-random virtual data points. Add the pseudo-random virtual data points to the original data set to obtain the expanded data set G_all; the expanded picture and text feature matrices are represented as X = [x_1, ..., x_n, x'_1, x'_2, ..., x'_{n'}] and T = [t_1, ..., t_n, t'_1, t'_2, ..., t'_{n'}].
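A sketch of this step under stated assumptions: the text says values are generated "above and below" each class-center dimension but does not specify the perturbation law, so a uniform half-width `scale` is assumed here, and all names are illustrative.

```python
import numpy as np

def generate_virtual_data(X, labels, n_new, scale=0.1, seed=0):
    """For each class, compute the class center (per-dimension mean of that
    class's samples) and draw n_new virtual points around it.

    X: (d, n) feature matrix (columns are samples); labels: (n,) class ids.
    `scale` is an ASSUMED uniform half-width for the random perturbation.
    Returns the virtual points (d, n_new * n_classes) and their labels.
    """
    rng = np.random.default_rng(seed)
    new_cols, new_labels = [], []
    for c in np.unique(labels):
        center = X[:, labels == c].mean(axis=1)          # class center
        # each dimension perturbed independently around its class mean
        pts = center[:, None] + rng.uniform(-scale, scale,
                                            size=(X.shape[0], n_new))
        new_cols.append(pts)
        new_labels.extend([c] * n_new)
    return np.hstack(new_cols), np.array(new_labels)
```

The returned columns would be appended to X_G (and analogously to T_G) to form the expanded matrices X and T.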
Step 3: construct the objective function:
Define the objective function as the minimization over U and V of C(U, V) + L(U, V) + N(U, V) (1):
wherein U and V represent the pair of projection matrices to be learned; C(U, V) is a correlation analysis term that enables multi-modal data to keep their pairwise neighbor relations in the latent semantic space; L(U, V) is a linear regression term from the image or text modality feature space to the semantic space, used to keep the neighbor relations of different-modality data with the same semantics; and N(U, V) is a regularization term used for feature selection;
According to formula (1), the objective functions for the image-retrieves-text (I2T) and text-retrieves-image (T2I) retrieval tasks are obtained as follows:
(1) the objective function of I2T is:
wherein U1, V1 are the projection matrices to be learned in the I2T task; β is a balance coefficient with 0 ≤ β ≤ 1; Y is the semantic matrix; and N(U1, V1) = λ1||U1||2,1 + λ2||V1||2,1, where λ1 and λ2 balance the two regularization terms and are both positive numbers;
(2) the objective function for T2I is:
wherein U2, V2 are the projection matrices to be learned in the T2I task, and N(U2, V2) = λ1||U2||2,1 + λ2||V2||2,1.
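The equation images for objectives (2) and (3) are not reproduced in this text. From the closed-form updates for U1 and V1 given in the solving step below, and consistent with the MDCR formulation the method builds on, the two objectives plausibly take the following form (a hedged reconstruction, not the verbatim patent formulas):

```latex
% (2) I2T: the image side regresses onto the semantic matrix Y
\min_{U_1,V_1}\;
  \beta\,\lVert X^{\top}U_1 - Y\rVert_F^2
  +(1-\beta)\,\lVert X^{\top}U_1 - T^{\top}V_1\rVert_F^2
  +\lambda_1\lVert U_1\rVert_{2,1}
  +\lambda_2\lVert V_1\rVert_{2,1}

% (3) T2I: symmetrically, the text side regresses onto Y
\min_{U_2,V_2}\;
  \beta\,\lVert T^{\top}V_2 - Y\rVert_F^2
  +(1-\beta)\,\lVert X^{\top}U_2 - T^{\top}V_2\rVert_F^2
  +\lambda_1\lVert U_2\rVert_{2,1}
  +\lambda_2\lVert V_2\rVert_{2,1}
```

Setting the partial derivative of the I2T objective with respect to U_1 to zero yields exactly the stated update U1 = (XX^T + λ1R11)^{-1}[βXY + (1−β)XT^TV1], which is what supports this reconstruction.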
Step 4: obtain the optimal projection matrices through an iterative solution method:
Since formulas (2) and (3) are non-convex, they are solved with a controlled-variable (alternating) method: take the partial derivatives with respect to U and V respectively and set them equal to zero to obtain the values of the projection matrices U and V; then iterate continuously until convergence to obtain the optimal values of U and V.
In particular, the derivative of the l2,1 norm can be handled via the trace: for a matrix U, ||U||2,1 = Tr(U^T R U), where R is a diagonal matrix determined by the rows of U (in the standard smoothed form, its i-th diagonal entry is 1/(2||u_i||2 + ε)), u_i is the i-th row of U, and ε is a tiny positive real number that avoids division by zero.
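A small numeric sketch of this construction; the exact diagonal entries of R are not reproduced in the text, so the standard smoothed form R_ii = 1/(2||u_i||2 + ε) is assumed here, which makes the (smoothed) subgradient of ||U||_{2,1} equal to 2RU, matching the closed-form updates used later.

```python
import numpy as np

def l21_norm(U):
    """l2,1 norm: the sum of the Euclidean norms of the rows of U."""
    return float(np.linalg.norm(U, axis=1).sum())

def l21_diag(U, eps=1e-8):
    """Diagonal matrix R with R_ii = 1 / (2 ||u_i||_2 + eps), where u_i is
    the i-th row of U; eps keeps all-zero rows finite (ASSUMED form).
    With this R, the smoothed subgradient of ||U||_{2,1} is 2 R U."""
    return np.diag(1.0 / (2.0 * np.linalg.norm(U, axis=1) + eps))
```

Rows of U with small norm get large diagonal weights, which pushes them further toward zero in the next update — this is what realizes feature selection.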
3. Evaluation criteria (mAP)
We evaluate the final retrieval performance using the mean average precision (mAP) criterion. First, we define the average precision (AP) for each query:
where N represents the total number of samples in the test data; rel(i) = 1 when the i-th retrieved result has the same class label as the query, and rel(i) = 0 otherwise; and P(i) represents the precision of the top-i ranking results. The mAP is then the mean of the AP values over all queries.
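The AP formula image is not reproduced; assuming the standard definition AP = (1/R) Σ_i P(i)·rel(i), with R the number of relevant results, the criterion can be sketched as:

```python
import numpy as np

def average_precision(rel):
    """AP of one query: rel is the binary relevance list over the ranked
    results (rel[i] = 1 when the i-th result shares the query's class
    label). Standard definition assumed: AP = (1/R) * sum_i P(i)*rel(i)."""
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # P(i)
    return float((precision_at_i * rel).sum() / rel.sum())

def mean_average_precision(rel_lists):
    """mAP: the mean of AP over all queries."""
    return float(np.mean([average_precision(r) for r in rel_lists]))
```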
4. Algorithm implementation
(1)I2T:
Input: picture feature matrix XG, text feature matrix TG, sample label matrix Y, and parameters λ1, λ2, β.
Generate virtual data: first, for each class of data, compute the mean of each dimension; the vector of these means is the class center of the class. Then, taking the mean of each dimension as the center, randomly generate n' values above and below it, and combine the random values over all dimensions to form n' virtual data points. Finally, add the generated virtual data to the input picture and text feature matrices to obtain the new training picture feature matrix X and text feature matrix T.
Initialization: initialize the projection matrices U1 and V1 as identity matrices.
Solve for the optimal solution: according to U1 = (XX^T + λ1R11)^{-1}[βXY + (1−β)XT^T V1] and V1 = [(1−β)TT^T + λ2R12]^{-1}(1−β)TX^T U1, iterate continuously until the result converges to obtain the optimal U1, V1.
The pseudo code for this process is as follows:
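The pseudo-code image is not reproduced in this text. The following NumPy sketch implements the two closed-form updates stated above; the rectangular "identity" initialization and the diagonal entries of R11, R12 (standard smoothed l2,1 form) are assumptions.

```python
import numpy as np

def l21_reweight(M, eps=1e-8):
    # Diagonal re-weighting from the smoothed l2,1 subgradient (ASSUMED form)
    return np.diag(1.0 / (2.0 * np.linalg.norm(M, axis=1) + eps))

def solve_i2t(X, T, Y, beta=0.5, lam1=0.01, lam2=0.01, n_iter=100):
    """Alternate the two closed-form updates until (approximate) convergence:
        U1 = (X X^T + lam1 R11)^{-1} [beta X Y + (1-beta) X T^T V1]
        V1 = [(1-beta) T T^T + lam2 R12]^{-1} (1-beta) T X^T U1
    X: (dx, n) image features, T: (dt, n) text features, Y: (n, c) labels."""
    dx, dt, c = X.shape[0], T.shape[0], Y.shape[1]
    U1, V1 = np.eye(dx, c), np.eye(dt, c)  # rectangular identity init (assumed)
    for _ in range(n_iter):
        R11 = l21_reweight(U1)
        U1 = np.linalg.solve(X @ X.T + lam1 * R11,
                             beta * X @ Y + (1 - beta) * X @ T.T @ V1)
        R12 = l21_reweight(V1)
        V1 = np.linalg.solve((1 - beta) * T @ T.T + lam2 * R12,
                             (1 - beta) * T @ X.T @ U1)
    return U1, V1
```

The routine would be run on the expanded matrices X and T from step 2; the T2I case is symmetric, with the text side carrying the regression onto Y.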
(2)T2I:
Similarly to the I2T task, the optimal projection matrices U2, V2 are obtained.
5. Comparison of results
We performed experiments on the three data sets and compared against 7 other currently popular methods (PLS, CCA, SM, SCM, GMMFA, GMLDA, MDCR); the results table shows that the method of the present invention achieves better retrieval performance on the different data sets.
Claims (2)
1. A semi-supervised cross-media retrieval method based on feature selection and virtual data generation, comprising the following steps:
Step 1: given a data set G = {(x_i, t_i)}, i = 1, ..., n, where n represents the total number of data pairs, x_i represents a picture feature and t_i represents a text feature, the picture and text feature matrices can be represented as X_G = [x_1, x_2, ..., x_{n-1}, x_n] and T_G = [t_1, t_2, ..., t_{n-1}, t_n];
Step 2: generate pseudo-random virtual data points and expand the original data set. The specific method is as follows: for each class of data in X_G and T_G, calculate the mean of each dimension of that class's data; the new vector formed by the means of all dimensions serves as the class center of the class. Then, taking the mean of each dimension as the center, randomly generate n' values above and below it, and combine the random values over all dimensions to generate n' pseudo-random virtual data points. Add the pseudo-random virtual data points to the original data set to obtain the expanded data set G_all; the expanded picture and text feature matrices are represented as X = [x_1, ..., x_n, x'_1, x'_2, ..., x'_{n'}] and T = [t_1, ..., t_n, t'_1, t'_2, ..., t'_{n'}];
Step 3: construct the objective function:
wherein U and V represent the pair of projection matrices to be learned; C(U, V) is a correlation analysis term that enables multi-modal data to keep their pairwise neighbor relations in the latent semantic space; L(U, V) is a linear regression term from the image or text modality feature space to the semantic space, used to keep the neighbor relations of different-modality data with the same semantics; and N(U, V) is a regularization term used for feature selection;
According to formula (1), the objective functions for the image-retrieves-text (I2T) and text-retrieves-image (T2I) retrieval tasks are obtained as follows:
(1) the objective function of I2T is:
wherein U1, V1 are the projection matrices to be learned in the I2T task, corresponding respectively to U and V in formula (1); β is a balance coefficient with 0 ≤ β ≤ 1; and Y is the semantic matrix;
(2) the objective function for T2I is:
wherein U2, V2 are the projection matrices to be learned in the T2I task, corresponding to U and V in formula (1);
Step 4: obtain the optimal projection matrices through an iterative solution method:
Since formulas (2) and (3) are non-convex, they are solved with a controlled-variable (alternating) method: take the partial derivatives with respect to U and V respectively and set them equal to zero to obtain the values of the projection matrices U and V; then iterate continuously until convergence to obtain the optimal values of U and V.
2. The semi-supervised cross-media retrieval method based on feature selection and virtual data generation according to claim 1, wherein: in step 3, N(U, V) = λ1||U||2,1 + λ2||V||2,1, where λ1 and λ2 balance the two regularization terms and are both positive numbers; this constraint term enables more discriminative and information-rich features to be selected when learning the projection matrices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711124618.9A CN107885854B (en) | 2017-11-14 | 2017-11-14 | semi-supervised cross-media retrieval method based on feature selection and virtual data generation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107885854A CN107885854A (en) | 2018-04-06 |
CN107885854B (en) | 2020-01-31
Family
ID=61776703
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711124618.9A Active CN107885854B (en) | 2017-11-14 | 2017-11-14 | semi-supervised cross-media retrieval method based on feature selection and virtual data generation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107885854B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109857892B (en) * | 2018-12-29 | 2022-12-02 | 西安电子科技大学 | Semi-supervised cross-modal Hash retrieval method based on class label transfer |
CN109784405B (en) * | 2019-01-16 | 2020-09-08 | 山东建筑大学 | Cross-modal retrieval method and system based on pseudo-tag learning and semantic consistency |
CN112419324B (en) * | 2020-11-24 | 2022-04-19 | 山西三友和智慧信息技术股份有限公司 | Medical image data expansion method based on semi-supervised task driving |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105069173A (en) * | 2015-09-10 | 2015-11-18 | 天津中科智能识别产业技术研究院有限公司 | Rapid image retrieval method based on supervised topology keeping hash |
US9332137B2 (en) * | 2012-09-28 | 2016-05-03 | Interactive Memories Inc. | Method for form filling an address on a mobile computing device based on zip code lookup |
CN106462642A (en) * | 2014-06-24 | 2017-02-22 | 谷歌公司 | Methods, Systems And Media For Performing Personalized Actions On Mobile Devices Associated With A Media Presentation Device |
Non-Patent Citations (1)
Title |
---|
Modality-Dependent Cross-Media Retrieval; YUNCHAO WEI et al.; ACM Transactions on Intelligent Systems and Technology; 2016-03-31; pp. 57-69 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||