CN114943017A - Cross-modal retrieval method based on similarity zero sample hash - Google Patents

Publication number: CN114943017A (application CN202210696434.4A; granted as CN114943017B)
Authority: CN (China)
Prior art keywords: similarity, modal, hash, cross, sample
Legal status: Granted
Application number: CN202210696434.4A
Original language: Chinese (zh)
Other versions: CN114943017B
Inventor
舒振球
永凯玲
余正涛
高盛祥
毛存礼
Current Assignee: Kunming University of Science and Technology
Original Assignee: Kunming University of Science and Technology
Application filed by Kunming University of Science and Technology
Priority: CN202210696434.4A
Granted and published as CN114943017B
Legal status: Active

Classifications

    • G06F16/9014 — Indexing; data structures therefor; storage structures: hash tables
    • G06F16/33 — Information retrieval of unstructured textual data: querying
    • G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
    • G06F16/53 — Information retrieval of still image data: querying
    • G06F16/55 — Information retrieval of still image data: clustering; classification
    • G06F16/90335 — Details of database functions independent of the retrieved data types: query processing
    • G06F16/906 — Details of database functions independent of the retrieved data types: clustering; classification
    • G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal retrieval method based on similarity zero sample hash. A new zero sample hash framework is provided to fully mine supervised semantic information; the framework combines intra-modal similarity, inter-modal similarity, semantic labels and class attributes to guide the learning of the zero sample hash codes. In this framework, both intra-modal and inter-modal similarities are considered: the intra-modal similarity captures the manifold structure and feature similarity of the multi-modal data, and the inter-modal similarity captures the semantic correlation between the modalities. In addition, semantic labels and class attributes are embedded into the hash codes, so that a more discriminative hash code is learned for each instance. Moreover, thanks to the embedding of class attributes, the relationship between visible and invisible classes is well captured in the hash codes, so that attribute knowledge can be transferred from the visible classes to the invisible classes. The invention thereby realizes high-precision retrieval of zero sample cross-modal data.

Description

Cross-modal retrieval method based on similarity zero sample hash
Technical Field
The invention relates to a cross-modal retrieval method based on similarity zero sample hash, and belongs to the field of cross-modal hash retrieval.
Background
Most existing cross-modal hash retrieval methods are studied on visible-class datasets. However, with the explosive growth of multimedia data, a large number of new concepts (invisible classes) are emerging. Retraining existing cross-modal hash models by collecting data for every new concept is not feasible, as it would consume considerable time and storage. Therefore, it is necessary to propose a cross-modal hash model whose training data contains no new concepts but which can still handle them. Zero sample learning, by contrast, can identify classes of data that have never been seen: a trained classifier is able not only to recognize the classes present in the training set, but also to distinguish data from unseen classes. This makes zero sample learning a research focus for invisible-class retrieval tasks.
Zero sample learning has been widely applied to single-modality retrieval tasks over the past few years. Some researchers achieve latent semantic transfer by projecting labels into a word embedding space. Others have proposed a zero sample hash based on asymmetric similarity matrices to improve knowledge transfer from visible classes to invisible classes. Still others have proposed a zero sample learning model for multi-label image retrieval that predicts the labels of invisible-class data with an instance-concept consistency ranking algorithm. However, the above work addresses single-modality retrieval, and research on invisible-class cross-modal retrieval remains insufficient. In the big-data era, with new concepts continuously emerging, existing cross-modal retrieval methods have the following problems: (1) they consider only visible-class data and ignore invisible-class data, so such models are unsuitable for cross-modal data retrieval in the big-data era; (2) most methods do not use class attribute information in hash learning, which hinders the transfer of knowledge from visible to invisible classes; (3) the few existing zero sample cross-modal retrieval methods fail to train models using intra-modal similarity, inter-modal similarity, class labels and class attributes simultaneously.
Disclosure of Invention
In view of the above existing challenges, the present invention provides a cross-modal retrieval method based on similarity zero sample hash. The invention is used for solving the cross-modal retrieval problem containing invisible class data by fusing intra-modal similarity, inter-modal similarity, label information and class attributes.
In order to achieve the purpose of the invention, the technical scheme of the cross-modal retrieval method based on similarity zero sample hash is as follows: the invention provides a novel zero sample hash framework that fully mines supervised semantic information; the framework combines intra-modal similarity, inter-modal similarity, semantic labels and class attributes to guide the learning of the zero sample hash codes. In this framework, both intra-modal and inter-modal similarities are considered: the intra-modal similarity captures the feature and semantic similarity among samples within a modality, and the inter-modal similarity captures the semantic correlation between modalities. In addition, semantic labels and class attributes are embedded into the hash codes, so that a more discriminative hash code is learned for each instance. Moreover, thanks to the embedding of class attributes, the relationship between visible and invisible classes is well captured in the hash codes, so that supervised knowledge can be transferred from the visible classes to the invisible classes. The invention comprises the following steps:
step1, acquiring a cross-modal data set, and performing feature extraction and class attribute vector extraction on the cross-modal data set;
step2, processing of cross-modal data set: processing the existing cross-modal data set into a cross-modal zero sample data set; the original data set is firstly divided into a training set and a query set, then 20% of classes are randomly selected from all classes of the original data set as invisible classes, and the rest classes are visible classes. For a zero sample cross-modal retrieval scene, the method takes a sample pair corresponding to an invisible class in an original query set as a new query set; taking the sample pair corresponding to the visible class in the original training set as a new training set; the retrieval set consists of an original training set;
step3, learning an objective function: the intra-modal similarity, the inter-modal similarity, the semantic label, the class attribute, the hash code and the hash function are fused and learned into the same frame, so that the target function is obtained, and the hash code with more discriminative performance is learned;
step4, performing iterative update of the objective function: iteratively updating the variable matrix in the target function obtained in the last step until the target function converges or reaches the maximum iteration times to obtain a hash function and a hash code of a training set;
step5, performing zero-sample cross-modality retrieval: and inputting a query sample, and obtaining a hash code of the query sample according to the hash function obtained at Step 4. The hash codes of the query samples are substituted into the retrieval set for query, and because the query is performed in a binary space, the query result is obtained by calculating the Hamming distance between the query sample and each sample in the retrieval set. The sample corresponding to the minimum hamming distance in the search set is the query result for us.
Further, the cross-modality retrieved data set includes a plurality of sample pairs, each sample pair including: text, images, and corresponding semantic tags.
Further, in Step1, image features are extracted with the VGG-16 model; text features are extracted with a bag-of-words model; and class attributes are extracted with the GloVe method, which produces a corresponding word vector for each class name to form the class attribute matrix.
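As a rough illustration of the class-attribute extraction, the per-class word vectors can be stacked into an attribute matrix. `word_vectors` here stands in for a loaded GloVe lookup table (a hypothetical dict, not part of the patent text), and averaging the words of a multi-word class name is an added assumption:

```python
import numpy as np

def class_attribute_matrix(class_names, word_vectors):
    """Build a class-attribute matrix A by stacking the word vector of each
    class name. `word_vectors` is a hypothetical mapping: word -> vector.
    Multi-word class names are averaged as a simple fallback (assumption)."""
    rows = []
    for name in class_names:
        parts = name.lower().split()
        vecs = [word_vectors[p] for p in parts if p in word_vectors]
        if not vecs:
            raise KeyError(f"no vector available for class name: {name}")
        rows.append(np.mean(vecs, axis=0))
    return np.vstack(rows)  # shape: (num_classes, embedding_dim)
```

Each row of the resulting matrix then serves as the attribute vector of one class.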
Further, in Step2, to ensure the generalization ability of the model, each time data enters the model for training, the data set is processed and divided by random selection. The average over multiple training runs is taken as the final result.
Further, the intra-modal similarity in Step3 is divided into a feature similarity calculated from the Euclidean distance and a semantic similarity measured by the Jaccard similarity.
Further, the inter-modality similarity in Step3 refers to semantic similarity between instances of different modalities, and the semantic similarity is measured by label semantic information.
Further, the objective function obtained in Step3 comprises two parts: hash code learning and hash function learning. Hash code learning refers to learning the hash codes by combining intra-modal similarity, inter-modal similarity, semantic labels and class attributes; hash function learning refers to learning the hash functions through a least-squares regression problem. Putting hash code learning and hash function learning into the same model strengthens the semantic relation between the hash codes and the hash functions, realizing high-precision zero sample cross-modal retrieval.
Further, the iterative update in Step4 takes the objective function obtained in Step3 as the original function. This function is clearly not optimal and needs to be optimized. Although the objective function as a whole is non-convex, when all but one matrix variable are fixed, the subproblem in the remaining variable is convex and convenient to update. An alternating iteration algorithm is therefore adopted to update the matrix variables until the objective function converges or the maximum number of iterations is reached, finally yielding the optimal hash codes and hash functions.
Further, in Step3, the intra-modal and inter-modal similarities are linked to the hash codes through a kernel-based supervised hashing (KSH) optimization model, so that embedding the similarities into the hash codes enhances their semantic information; the semantic labels and class attributes are related to the hash codes through label reconstruction, embedding the labels in the hash codes and further enhancing the semantic information they contain; and by embedding class attributes in the hash codes, attribute knowledge in the visible classes is transferred to the invisible classes, realizing retrieval of the invisible classes.
Further, in Step4, since the original model is a non-convex problem, optimizing it directly is difficult. However, when the other variables are fixed and only one variable is optimized, the resulting subproblem is convex and can be solved directly. Each variable is optimized in this way in turn until convergence or the maximum number of iterations is reached, giving the optimal result.
The invention has the beneficial effects that:
The invention provides a cross-modal retrieval method based on similarity zero sample hash. The method overcomes the limitation that most existing cross-modal retrieval methods cannot handle zero sample data. It learns the hash codes by simultaneously using intra-modal similarity, inter-modal similarity and class attributes, so that the relationship between visible and invisible classes is well captured and supervised knowledge is transferred from the visible classes to the invisible classes. Furthermore, to exploit the supervised label information, the invention improves accuracy by embedding the label information into the attribute space. A more discriminative hash code can therefore be generated by the proposed model. In addition, the invention provides a discrete optimization scheme to solve the proposed model, effectively avoiding quantization error.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
FIG. 2 is a flow chart of the SAZH model iterative update of the present invention.
Detailed Description
The following description is exemplary in nature and is intended to further illustrate the invention with reference to the accompanying drawings.
Example one
Fig. 1 is a flowchart of a cross-modal retrieval method based on similarity zero sample hash according to the present invention.
In this example, referring to fig. 1, the method of the present invention specifically comprises the following processes:
1. Acquire a cross-modal data set and perform feature extraction and class attribute vector extraction on it. In this example, the data set used includes both image and text modalities, with labels corresponding one-to-one to the sample pairs. In Step1, the class attributes are extracted with the GloVe method, which produces a corresponding word vector for each class name to form the class attribute matrix.
2. Processing of the cross-modal data set. Since the problem addressed by the invention is zero sample cross-modal retrieval, the acquired cross-modal data set cannot be used directly; it must be processed to conform to the zero sample cross-modal retrieval scenario. The specific processing method is as follows:
the original data set is firstly divided into a training set and a query set, then 20% of classes are randomly selected from all classes of the original data set as invisible classes, and the rest classes are visible classes. For a zero sample cross-modal retrieval scene, the method takes a sample pair corresponding to an invisible class in an original query set as a new query set; taking the sample pairs corresponding to the visible classes in the original training set as a new training set; the search set is composed of the original training set.
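The class-level part of the split described above can be sketched in Python; `zero_shot_split` is a hypothetical helper name, the 20% ratio follows the text, and only the seen/unseen class partition is shown (not the regrouping of sample pairs):

```python
import numpy as np

def zero_shot_split(labels, unseen_ratio=0.2, rng=None):
    """Randomly partition the classes present in `labels` into visible (seen)
    and invisible (unseen) classes, as required by the zero sample scenario.
    Returns (seen_classes, unseen_classes)."""
    rng = np.random.default_rng(rng)
    classes = np.unique(labels)
    n_unseen = max(1, int(round(unseen_ratio * len(classes))))
    unseen = rng.choice(classes, size=n_unseen, replace=False)
    seen = np.setdiff1d(classes, unseen)
    return seen, unseen
```

The new query set would then keep only pairs whose class is in `unseen`, and the new training set only pairs whose class is in `seen`.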
In the present invention, a set of multi-modal data is given as

$$O = \{ o_i \}_{i=1}^{n}, \qquad o_i = \left( x_i^{(1)}, x_i^{(2)}, l_i \right),$$

where $o_i$ is a multi-modal data point, $x_i^{(1)}$ is the feature vector of the i-th instance of the image modality, $x_i^{(2)}$ is the feature vector of the i-th instance of the text modality, $l_i$ is the common label vector shared by the i-th instances of the two modalities, and $n$ is the total number of instances in the data set.

After processing and partitioning the data set, the multi-modal data of the training set is denoted $\{ o_i \}_{i=1}^{n_s}$, where $n_s$ is the number of training samples.
3. Intra-modal similarity, inter-modal similarity, label information, class attributes, hash codes and hash functions are fused into the same framework for learning, giving the objective function and yielding more discriminative hash codes. The learning models of the individual modules are described in detail below:
3.1 Intra-modality similarity learning
The intra-modal similarity is divided into a feature similarity calculated from the Euclidean distance and a semantic similarity measured by the Jaccard similarity. Because the Euclidean distance is simple to compute and reflects the distance between two vectors, it is adopted as the feature similarity measure. First, the Euclidean distance between $x_i^{(t)}$ and $x_j^{(t)}$ is

$$d_{ij}^{(t)} = \left\| x_i^{(t)} - x_j^{(t)} \right\|_2 ,$$

from which the feature similarity $S^{F(t)}_{ij}$ between $x_i^{(t)}$ and $x_j^{(t)}$ is obtained [equation image in original], where $x_i^{(t)}$ and $x_j^{(t)}$ denote the i-th and j-th samples of the t-th modality, and $t = 1, 2$ indicates the two modalities considered in the present invention.

Furthermore, semantic similarity is measured with the Jaccard similarity:

$$S^{J(t)}_{ij} = \frac{\left| l_i \cap l_j \right|}{\left| l_i \cup l_j \right|} ,$$

where $|l_i|$ is the number of labels assigned to the i-th instance in the t-th modality. The labels of an instance depend on its features, so the semantic similarity is positively correlated with the feature similarity of the corresponding instances. Therefore, the feature similarity and the semantic similarity between data can be combined to obtain the intra-modality learning model [equation image in original], in which $S^{(tt)}_{ij}$ is the total intra-modality similarity, $S^{F(t)}_{ij}$ is the feature similarity between two samples, and $S^{J(t)}_{ij}$ is the semantic similarity measured by the Jaccard method.
3.2 Inter-modality similarity learning
The inter-modality similarity refers to the semantic similarity between instances of different modalities, measured through the label semantic information.

Specifically, in the present invention, the inter-modal similarity is computed from the class label matrix. Let $L \in \{0,1\}^{n \times c}$ be the corresponding label matrix, where $L_{ij} = 1$ indicates that $X_{i*}$ belongs to class $j$ and $L_{ij} = 0$ otherwise, and $c$ denotes the number of classes. The inter-modal similarity matrix $S^{(12)}$ can then be constructed from the label matrix: if $L_{i*} L_{j*}^{T} > 0$, then $X_{i*}$ and $X_{j*}$ are similar; otherwise $X_{i*}$ and $X_{j*}$ are dissimilar.
3.3 Hash function learning
The hash functions in the present invention are learned by minimizing a least-squares regression problem [equation image in original], where β is a non-negative parameter, $B_1$ and $B_2$ are the hash codes of the image and text modalities respectively, and $W_1$ and $W_2$ are the projection matrices of the image and text modalities respectively.
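The exact regression form is rendered as an image in the source; assuming the common linear form $B \approx XW$, the projection has the usual ridge-regularized closed-form solution (the ridge term `lam` is an added numerical-stability assumption, not stated in the text):

```python
import numpy as np

def learn_hash_projection(X, B, lam=1e-3):
    """Closed-form least-squares solution W = (X^T X + lam*I)^{-1} X^T B,
    mapping features X (n x d) to (relaxed) hash codes B (n x r)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ B)
```

One such projection would be learned per modality ($W_1$ for images, $W_2$ for text).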
3.4 Similarity-preserving learning
Combined with the kernel-based supervised hashing (KSH) optimization model, the similarity-preserving learning model provided by the invention comprehensively considers intra-modal and inter-modal similarities [model expression rendered as an image in the original], where $S^{(11)}$ and $S^{(22)}$ are the intra-modality similarity matrices of the image and text modalities respectively, and $S^{(12)}$ is the inter-modality similarity matrix between the image and text modalities.
3.5 Class attribute and label embedding
The invention embeds the label information into the hash codes, which makes full use of the label information to generate optimized binary codes and gives stronger robustness when processing large-scale data. The optimized hash codes are thus obtained by optimizing the model of equation (5) [image in original], where α is a non-negative parameter and $C_1$ and $C_2$ are the projection matrices that project the image and text hash codes into the labels.

In addition, class attribute information is added to the proposed model; this not only helps generate more discriminative hash codes but, more importantly, realizes the transfer of attribute knowledge from visible classes to invisible classes, thereby addressing the zero sample cross-modal retrieval problem. The class attribute information is embedded by inserting the class attribute matrix, built from the per-class word vectors, into the projection matrices of equation (5), so that label information and class attribute information are embedded into hash code learning simultaneously. Equation (5) is accordingly updated to equation (6) [image in original], where $V_1$ and $V_2$ are the converted projection matrices, augmented with class attribute information, that project the image and text hash codes into the labels.
3.6 Objective function
Combining the above components, the objective function of the invention is obtained as equation (7) [image in original], in which a regularization term $R(\cdot)$ prevents overfitting and γ is the parameter controlling it; $X^{(1)}$ and $X^{(2)}$ are the feature matrices of the image and text modalities respectively; $Y$ is the label matrix; $A$ is the class attribute matrix; $S^{(11)}$ and $S^{(22)}$ are the intra-modality similarity matrices of the image and text modalities respectively; $S^{(12)}$ is the inter-modality similarity matrix between the image and text modalities; $W_1$, $W_2$, $V_1$, $V_2$ are projection matrices; and α and β are non-negative parameters.
4. Performing the iterative update of the objective function: the objective function obtained in the previous step is updated iteratively until it converges or the maximum number of iterations is reached, giving the hash functions and the hash codes of the training set.
The function (7) is not yet optimal and needs to be updated iteratively. Clearly, the overall objective function is a non-convex optimization problem; an efficient alternating iteration algorithm is therefore proposed to solve it.
Specifically, referring to fig. 2, the optimization procedure for equation (7) is as follows:
$B_1$-step: fix the variables $W_1, W_2, V_1, V_2, B_2$; with respect to $B_1$, equation (7) then simplifies to a subproblem [image in original]. Setting the derivative with respect to $B_1$ to zero yields a closed-form solution for $B_1$ [image in original].

$B_2$-step: the update is analogous to that of $B_1$ and yields a closed-form solution for $B_2$ [image in original].
$V_1$-step: fix the variables $W_1, W_2, V_2, B_1, B_2$; with respect to $V_1$, equation (7) then simplifies to a subproblem [image in original]. Setting the derivative with respect to $V_1$ to zero gives equation (12) [image in original]. Defining $B_{11} = A A^{T}$ [the definitions of $A_{11}$ and $C_{11}$ are rendered as images in the original], equation (12) can be rewritten as

$$A_{11} V_1 + V_1 B_{11} = C_{11} \qquad (13)$$

Equation (13) is a Sylvester equation, which can be solved using the sylvester function in MATLAB.
$V_2$-step: similarly, with respect to $V_2$ we have

$$A_{22} V_2 + V_2 B_{22} = C_{22} \qquad (14)$$

where $B_{22} = A A^{T}$ [the definitions of $A_{22}$ and $C_{22}$ are rendered as images in the original].

$W_1$-step: similarly, with respect to $W_1$ we have

$$A_{33} W_1 + W_1 B_{33} = C_{33} \qquad (15)$$

where $B_{33} = B_1^{T} B_1$ [the definitions of $A_{33}$ and $C_{33}$ are rendered as images in the original].

$W_2$-step: similarly, with respect to $W_2$ we have

$$A_{44} W_2 + W_2 B_{44} = C_{44} \qquad (16)$$

where $B_{44} = B_2^{T} B_2$ [the definitions of $A_{44}$ and $C_{44}$ are rendered as images in the original].
Equation (7) is optimized through the above steps until the function converges or the maximum number of iterations is reached, at which point the iteration stops.
5. Query and zero sample cross-modal retrieval: first obtain the hash codes corresponding to the retrieval set; then input a query sample and obtain its hash code with the hash function obtained in the previous step. The hash code of the query sample is then matched against the retrieval set. The specific implementation steps are as follows:
Let the query samples of the image and text modalities correspond to the feature matrices $\tilde{X}^{(1)}$ and $\tilde{X}^{(2)}$, and let $W_1$ and $W_2$ be the projection matrices obtained in the previous step. The hash codes corresponding to the query samples are obtained by

$$\tilde{B}_1 = \operatorname{sign}\left( \tilde{X}^{(1)} W_1 \right), \qquad \tilde{B}_2 = \operatorname{sign}\left( \tilde{X}^{(2)} W_2 \right).$$

In this embodiment, two main retrieval tasks are performed: image query text and text query image.
Because the query task of the invention is carried out in a binary space, the query result is obtained by computing the Hamming distance between the query sample and each sample in the retrieval set; the sample with the minimum Hamming distance in the retrieval set is the query result.
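For ±1 codes, the Hamming distance follows from an inner product, since $b_i \cdot b_j = r - 2\,\mathrm{hamming}(b_i, b_j)$. A minimal sketch of the encoding and ranking (the sign-based encoder is a standard choice, assumed here):

```python
import numpy as np

def encode(X, W):
    """Hash a feature matrix with a learned projection: B = sign(XW),
    with entries in {-1, +1}."""
    return np.where(X @ W >= 0, 1, -1)

def hamming_retrieve(b_query, B_retrieval):
    """Rank retrieval-set codes (n x r, entries +/-1) by Hamming distance
    to the query code (length r); returns indices, nearest first."""
    r = B_retrieval.shape[1]
    dist = (r - B_retrieval @ b_query) // 2  # hamming = (r - dot)/2 for +/-1 codes
    return np.argsort(dist, kind="stable")
```

The first index returned corresponds to the minimum-Hamming-distance sample, i.e. the query result.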
In order to illustrate the effect of the present invention, the following further describes the technical solution of the present invention by specific examples:
1. Simulation conditions
The invention adopts MATLAB software for experimental simulation. Experiments are performed on the cross-modal dataset Wiki (containing image and text modalities), which includes two query tasks: (1) text query image (Text2Img); (2) image query text (Img2Text). The parameters in the experiment are set to α = 1e-2, β = 1e5 and γ = 1e-4.
2. Simulation content
The method provided by the invention is compared with existing non-zero-sample cross-modal hash retrieval methods: (1) collaborative matrix factorization hashing (CMFH); (2) joint and individual matrix factorization hashing (JIMFH); (3) discrete robust matrix factorization hashing (DRMFH); (4) asymmetric supervised consistent and specific hashing (ASCSH); (5) label consistent matrix factorization hashing (LCMFFH); with zero-sample single-modality hash retrieval methods: (1) zero sample hashing based on supervised knowledge transfer (TSK); (2) the attribute hashing algorithm (AH) for zero sample image retrieval; and with zero-sample cross-modal hash retrieval methods: (1) cross-modal attribute hashing (CMAH); (2) the orthogonal hashing algorithm (CHOP) for zero sample cross-modal retrieval. For the zero-sample single-modality hash retrieval methods, the hash codes of the image and text modalities are obtained separately through the single-modality model before performing the query tasks below.
3. Simulation results
The simulation experiments report the results of the comparison methods and of the method provided by the invention on the Wiki dataset. To match the zero-sample cross-modal retrieval scenario, 20% of the classes in the Wiki dataset are randomly selected as unseen classes. The Wiki dataset contains 8 classes in total, so two classes are randomly selected as unseen classes according to the experimental setting; the remaining data are processed in the same way as described for the dataset of the invention.
In the simulation, a widely used index, the mean of the average precision (mAP), is used to measure the performance of the SAZH method proposed by the present invention and of the other comparison methods. Given a query and a list of retrieval results, the average precision (AP) is defined as:
$$\mathrm{AP} = \frac{1}{N}\sum_{r=1}^{R} P(r)\,\delta(r)$$
(R is the length of the retrieved result list.)
where N is the number of relevant instances in the retrieval set, P(r) is the precision of the top r retrieved instances, and δ(r) = 1 if the r-th retrieved instance is a true neighbour of the query, otherwise δ(r) = 0. The APs of all queries are then averaged to obtain the mAP. The evaluation rule is that the larger the mAP value, the better the performance.
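The AP and mAP computation defined above can be sketched as follows (a minimal illustration; the function names and toy relevance lists are ours, not from the reported experiments):

```python
import numpy as np

def average_precision(relevance):
    """AP for one ranked result list.

    relevance: sequence of 0/1 flags; relevance[r] = 1 if the (r+1)-th
    retrieved instance is a true neighbour of the query (delta(r) above).
    """
    relevance = np.asarray(relevance, dtype=float)
    n_relevant = relevance.sum()  # N in the formula
    if n_relevant == 0:
        return 0.0
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_r = np.cumsum(relevance) / ranks  # P(r)
    return float((precision_at_r * relevance).sum() / n_relevant)

def mean_average_precision(relevance_lists):
    """Average the per-query APs to obtain the mAP."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Relevant items at ranks 1 and 3: AP = (1/2) * (1/1 + 2/3) = 5/6.
ap = average_precision([1, 0, 1, 0])
m = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

The second query has its only relevant item at rank 2 (AP = 1/2), so the mAP of the two queries is (5/6 + 1/2) / 2 = 2/3.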
The hash codes in the simulation experiments have lengths of 8 bits, 12 bits, 16 bits and 32 bits; the corresponding mAP values of the SAZH method proposed by the present invention and of the other comparison methods are shown in Tables 1 and 2.
TABLE 1 mAP values on Text query image (Text2Img) task for all methods on Wiki dataset
Figure BDA0003702774900000102
TABLE 2 mAP values on image query (Img2Text) task for all methods on Wiki dataset
Figure BDA0003702774900000103
Figure BDA0003702774900000111
As can be seen from Tables 1 and 2, in the zero-sample cross-modal retrieval scenario on the Wiki dataset, the SAZH method proposed by the present invention achieves higher mAP values on both query tasks than all the comparison methods, which further demonstrates the superiority of the proposed SAZH method for zero-sample cross-modal retrieval.
The above-mentioned embodiments only express specific embodiments of the present invention, and their description is relatively specific and detailed, but they shall not therefore be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these all fall within the scope of protection of the present invention.

Claims (8)

1. A cross-modal retrieval method based on similarity zero-sample hash, characterized in that the method comprises the following specific steps:
step1, acquiring a cross-modal data set, and extracting the characteristics and the class attributes of the cross-modal data set;
step2, processing of cross-modal data set: processing the existing cross-modal data set into a cross-modal zero sample data set;
step3, learning an objective function: the intra-modal similarity, the inter-modal similarity, the semantic labels, the class attributes, the hash codes and the hash function are fused into the same framework for learning, so that the objective function is obtained and more discriminative hash codes are learned;
step4, performing iterative updating of the objective function: the variable matrices in the objective function obtained at Step3 are updated iteratively until the objective function converges or the maximum number of iterations is reached, yielding the hash function and the hash codes of the training set;
step5, performing zero-sample cross-modal retrieval: the hash codes corresponding to the retrieval set are obtained first, then the hash codes of the query set are solved through the hash function obtained at Step4 and put into the retrieval set for querying; the query result is obtained by computing the Hamming distance between each sample in the query set and each sample in the retrieval set, and the sample with the minimum Hamming distance is the final query result.
2. The cross-modal retrieval method based on similarity zero-sample hashing according to claim 1, wherein: in Step1, class attributes are extracted, and a Glove method is adopted to extract a corresponding word vector for each class name to form a class attribute matrix.
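As a hedged illustration of claim 2 (not part of the claim), the class-attribute matrix can be assembled from pretrained word vectors roughly as follows. The 3-dimensional vectors below are placeholders of our own invention, not real GloVe embeddings, which are typically 50-300 dimensional and loaded from a pretrained vector file:

```python
import numpy as np

# Placeholder lookup standing in for pretrained GloVe word vectors.
glove = {
    "art":     np.array([0.1, 0.3, -0.2]),
    "biology": np.array([0.4, -0.1, 0.2]),
    "music":   np.array([0.0, 0.2, 0.5]),
}

def class_attribute_matrix(class_names, embeddings):
    """Stack one word vector per class name into the attribute matrix A."""
    return np.stack([embeddings[name] for name in class_names])

A = class_attribute_matrix(["art", "biology", "music"], glove)
```

Each row of A is the word vector of one class name, giving the class-attribute matrix used by the method.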
3. The cross-modal retrieval method based on similarity zero-sample hash as claimed in claim 1, wherein: the specific method of Step2 comprises the following steps: firstly, dividing an original data set into a training set and a query set, then randomly selecting 20% of classes from all classes of the original data set as invisible classes, and selecting the rest classes as visible classes; for a zero sample cross-modal retrieval scene, taking a sample pair corresponding to an invisible class in an original query set as a new query set; taking the sample pair corresponding to the visible class in the original training set as a new training set; the search set is composed of the original training set.
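A minimal sketch of the seen/unseen class split described in claim 3 (an illustration only, not the patented implementation; the label array and function names are ours):

```python
import numpy as np

def zero_shot_split(labels, unseen_fraction=0.2, seed=0):
    """Randomly mark a fraction of classes as unseen and split the samples.

    labels: 1-D integer class label per paired (image, text) sample.
    Returns boolean masks over the samples: (seen_mask, unseen_mask).
    Samples of seen classes form the new training set; samples of unseen
    classes form the new query set.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    n_unseen = max(1, int(round(unseen_fraction * len(classes))))
    unseen_classes = rng.choice(classes, size=n_unseen, replace=False)
    unseen_mask = np.isin(labels, unseen_classes)
    return ~unseen_mask, unseen_mask

labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 0, 1])  # 8 classes, as in Wiki
seen_mask, unseen_mask = zero_shot_split(labels)
```

With 8 classes and a 20% fraction, two classes are marked unseen, mirroring the Wiki setting in the description.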
4. The cross-modal retrieval method based on similarity zero-sample hashing according to claim 1, wherein: the intra-modal similarity in Step3 is divided into feature similarity calculated by Euclidean similarity and semantic similarity measured by Jaccard similarity.
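The two intra-modal similarity measures named in claim 4 can be sketched as follows (an illustration only; the claim does not specify how the Euclidean distance is turned into a similarity, so the exp(−d²) mapping here is an assumption of ours):

```python
import numpy as np

def euclidean_similarity(X):
    """Feature similarity within one modality: larger when samples are closer."""
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d2 = np.maximum(d2, 0)  # clip tiny negatives caused by rounding
    return np.exp(-d2)      # assumed distance-to-similarity mapping

def jaccard_similarity(L):
    """Semantic similarity from 0/1 multi-label rows: |a ∩ b| / |a ∪ b|."""
    inter = L @ L.T
    union = L.sum(1)[:, None] + L.sum(1)[None, :] - inter
    return inter / np.maximum(union, 1)

L = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 1]])
S_sem = jaccard_similarity(L)
S_feat = euclidean_similarity(np.array([[0., 0.], [3., 4.]]))
```

For the toy labels, samples 0 and 1 share one of three labels (Jaccard 1/3), while each sample is fully similar to itself (diagonal 1).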
5. The cross-modal retrieval method based on similarity zero-sample hashing according to claim 1, wherein: the inter-modality similarity in Step3 refers to the semantic similarity between instances of different modalities, and the semantic similarity is measured by label semantic information.
6. The cross-modal retrieval method based on similarity zero-sample hashing according to claim 1, wherein: the target function obtained in Step3 comprises two parts, namely hash code learning and a hash function, wherein the hash code learning refers to learning the hash code by combining intra-modal similarity, inter-modal similarity, semantic labels and class attributes; the learning of the hash function refers to learning the hash function through the least square regression problem, and the learning of the hash code and the learning of the hash function are put into the same model for learning, so that the semantic relation between the hash code and the hash function is enhanced, and the high-precision zero-sample cross-modal retrieval is realized.
7. The cross-modal retrieval method based on similarity zero-sample hashing according to claim 1, wherein: the iterative updating of the objective function in Step4 takes the objective function obtained in Step3 as the original function; this function is not yet optimal and needs to be optimized. Because the objective function is a non-convex problem, fixing all other variables and updating one matrix variable at a time turns each subproblem into a convex problem, which facilitates the update; the matrix variables are updated with the alternating iteration algorithm until the objective function converges or the maximum number of iterations is reached, finally obtaining the optimal hash codes and the optimal hash function.
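The fix-one-update-one scheme of claim 7 can be illustrated on a stand-in problem. Since the patented objective itself is published only as an image, the least-squares factorization below is a generic example of alternating minimization, not the actual update rules of the invention:

```python
import numpy as np

def alternate_minimize(X, k=2, iters=50, tol=1e-6, seed=0):
    """Alternating updates for min ||X - U V^T||_F^2.

    The joint problem is non-convex, but with V fixed the update of U is a
    convex least-squares problem (and vice versa), so each factor is solved
    in closed form in turn until the objective stops decreasing or the
    iteration cap is reached.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[0], k))
    V = rng.standard_normal((X.shape[1], k))
    prev = np.inf
    for _ in range(iters):
        U = np.linalg.lstsq(V, X.T, rcond=None)[0].T  # fix V, update U
        V = np.linalg.lstsq(U, X, rcond=None)[0].T    # fix U, update V
        obj = np.linalg.norm(X - U @ V.T) ** 2
        if prev - obj < tol:  # converged
            break
        prev = obj
    return U, V, obj

X = np.outer([1., 2., 3.], [1., 0., -1.])  # rank-1 target matrix
U, V, obj = alternate_minimize(X, k=1)
```

On the rank-1 toy matrix, the alternating updates drive the objective to (numerically) zero within a couple of iterations, illustrating the convergence criterion of the claim.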
8. The cross-modal retrieval method based on similarity zero-sample hashing according to claim 1, wherein: the objective function in Step3 is:
Figure FDA0003702774890000021
where the regularization term, shown in the original publication as the image Figure FDA0003702774890000022, prevents the model from overfitting; γ is the parameter controlling the regularization term; X^(1) and X^(2) are the feature matrices of the image and text modalities, respectively; Y is the label matrix; A is the class attribute matrix; S^(11) and S^(22) are the intra-modal similarity matrices of the image and text modalities, respectively, and S^(12) is the inter-modal similarity matrix between the image and text modalities; W_1, W_2, V_1 and V_2 are projection matrices; α and β are non-negative parameters; and n_s is the number of training samples.
CN202210696434.4A 2022-06-20 2022-06-20 Cross-modal retrieval method based on similarity zero sample hash Active CN114943017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210696434.4A CN114943017B (en) 2022-06-20 2022-06-20 Cross-modal retrieval method based on similarity zero sample hash


Publications (2)

Publication Number Publication Date
CN114943017A true CN114943017A (en) 2022-08-26
CN114943017B CN114943017B (en) 2024-06-18

Family

ID=82911208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210696434.4A Active CN114943017B (en) 2022-06-20 2022-06-20 Cross-modal retrieval method based on similarity zero sample hash

Country Status (1)

Country Link
CN (1) CN114943017B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244484A (en) * 2023-05-11 2023-06-09 山东大学 Federal cross-modal retrieval method and system for unbalanced data
CN116244483A (en) * 2023-05-12 2023-06-09 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN117992805A (en) * 2024-04-07 2024-05-07 武汉商学院 Zero sample cross-modal retrieval method and system based on tensor product graph fusion diffusion

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059198A (en) * 2019-04-08 2019-07-26 浙江大学 A kind of discrete Hash search method across modal data kept based on similitude
CN111460077A (en) * 2019-01-22 2020-07-28 大连理工大学 Cross-modal Hash retrieval method based on class semantic guidance
CN112364195A (en) * 2020-10-22 2021-02-12 天津大学 Zero sample image retrieval method based on attribute-guided countermeasure hash network
CN113342922A (en) * 2021-06-17 2021-09-03 北京邮电大学 Cross-modal retrieval method based on fine-grained self-supervision of labels


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XUANWU LIU et al.: "Cross modal zero shot hashing", 2019 IEEE International Conference on Data Mining (ICDM), 30 January 2020 (2020-01-30), pages 1-9 *
YU JUN (庾骏): "Research on cross-modal hashing learning algorithms and their applications", China Doctoral Dissertations Full-text Database, Information Science and Technology, 15 April 2021 (2021-04-15), pages 140-12 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244484A (en) * 2023-05-11 2023-06-09 山东大学 Federal cross-modal retrieval method and system for unbalanced data
CN116244484B (en) * 2023-05-11 2023-08-08 山东大学 Federal cross-modal retrieval method and system for unbalanced data
CN116244483A (en) * 2023-05-12 2023-06-09 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN117992805A (en) * 2024-04-07 2024-05-07 武汉商学院 Zero sample cross-modal retrieval method and system based on tensor product graph fusion diffusion

Also Published As

Publication number Publication date
CN114943017B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
WO2022068196A1 (en) Cross-modal data processing method and device, storage medium, and electronic device
WO2023000574A1 (en) Model training method, apparatus and device, and readable storage medium
CN110070909B (en) Deep learning-based multi-feature fusion protein function prediction method
CN114943017B (en) Cross-modal retrieval method based on similarity zero sample hash
CN110674323B (en) Unsupervised cross-modal Hash retrieval method and system based on virtual label regression
CN112131404A (en) Entity alignment method in four-risk one-gold domain knowledge graph
WO2022068195A1 (en) Cross-modal data processing method and device, storage medium and electronic device
CN110347932B (en) Cross-network user alignment method based on deep learning
Saito et al. Robust active learning for the diagnosis of parasites
CN113177132B (en) Image retrieval method based on depth cross-modal hash of joint semantic matrix
CN112364174A (en) Patient medical record similarity evaluation method and system based on knowledge graph
CN109376796A (en) Image classification method based on active semi-supervised learning
CN110647904A (en) Cross-modal retrieval method and system based on unmarked data migration
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
US20200320440A1 (en) System and Method for Use in Training Machine Learning Utilities
CN112199532A (en) Zero sample image retrieval method and device based on Hash coding and graph attention machine mechanism
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
Amiri et al. Automatic image annotation using semi-supervised generative modeling
CN114093445B (en) Patient screening marking method based on partial multi-marking learning
Bhardwaj et al. Computational biology in the lens of CNN
Zhou et al. Unsupervised multiple network alignment with multinominal gan and variational inference
CN114579794A (en) Multi-scale fusion landmark image retrieval method and system based on feature consistency suggestion
CN113535947A (en) Multi-label classification method and device for incomplete data with missing labels
CN114764865A (en) Data classification model training method, data classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant