CN114067385A - Cross-modal face retrieval Hash method based on metric learning - Google Patents
- Publication number: CN114067385A (application CN202111175867.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/253 — Pattern recognition; analysing; fusion techniques of extracted features
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
- G06N3/084 — Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
Abstract
The invention discloses a cross-modal face retrieval hash method based on metric learning, which comprises the following steps: 1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image test set and a face video training set, and performing face alignment and yaw angle extraction; 2) training the constructed co-expression generation network, which consists of a feature extraction network and a co-expression mapping network; 3) using the trained co-expression generation network and ITQ iterative quantization to extract the hashes of the image co-expressions in the face image test set and of the video co-expressions in the face video training set, and ranking the retrieval items by the Hamming distance between the image hash and the video hash. The cross-modal face hash function of the present invention can generate image and video hashes that are robust and discriminative both within and between modalities. The invention is tested on the YTC face video data set and achieves high cross-modal retrieval accuracy.
Description
Technical Field
The invention relates to the technical field of multimedia processing, in particular to a cross-modal face retrieval hash method based on metric learning.
Background
Cross-modal face retrieval refers to retrieving face videos of a given face ID using an image of that ID, or retrieving the corresponding face images using a face video of that ID. With the creation and spread of multimedia information becoming increasingly wide, users' retrieval needs have also grown far more diverse, so cross-modal face retrieval has broad application scenarios: for example, related films and TV series can be retrieved on a video-playing website by character, or particular clips within a series can be located by selecting a character.
Generally, an image contains only a two-dimensional spatial description, while a video contains spatial variation over continuous time, so the two modalities differ substantially. The predominant way to reduce inter-modal differences is to project the two kinds of features into a common space in which image features and video features are semantically consistent. However, for images and videos the inter-modal difference is reflected not only in the gap between the temporal and spatial domains but also in viewing-angle differences: the faces in face images and face videos may differ greatly in viewing angle, and such differences can involve complex variations such as occlusion, nonlinear distortion and position change, which disperse the features of different modalities in the common space and degrade cross-modal retrieval accuracy. This complex inter-modal difference makes cross-modal face retrieval a challenging problem.
In order to solve the above problems, the present invention provides a two-stage method: in the first stage, video features and image features robust to viewing-angle changes are extracted; in the second stage, the two feature spaces are mapped to a common space, the two heterogeneous features are aligned in the common space by metric learning, and finally the image hash and video hash are obtained by ITQ iterative quantization.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a cross-modal face retrieval hash method based on metric learning, which can extract hashes that are robust and separable within and between modalities and improve the consistency and accuracy of cross-modal metric learning.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: the cross-modal face retrieval hashing method based on metric learning comprises the following steps:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction;
2) training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network; during training, training a feature extraction network by using a face image training set and a face video training set, updating parameters of the feature extraction network by using cross entropy loss, training a co-expression mapping network by using the face image training set and the face video training set, updating parameters of the co-expression mapping network by using cross entropy loss and metric learning loss, and finely adjusting parameters of the trained feature extraction network;
3) in order to simulate a real use scenario, using the trained co-expression generation network and ITQ iterative quantization to extract the hashes of the image co-expressions in the face image test set and of the video co-expressions in the face video training set, and ranking the retrieval items by the Hamming distance between the image hash and the video hash.
Further, in step 1), each video in the face video data set is cut into a plurality of video segments of the same length; 70% of the videos in the face video data set form the face video training set, every frame of these videos is extracted to form the face image training set, and 1 frame is randomly extracted from each of the remaining 30% of the videos to serve as the face image test set; the faces in the video frames and images of the data set are aligned using a pre-trained MTCNN, a deep face region detection network, and the aligned video frames and images are scaled to H × W, where H is the height of a picture and W is its width; the Yaw angle parameter Yaw of the images in the face image training set and face image test set is extracted using a pre-trained Hopenet, a landmark-free deep face pose estimation network.
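The split and cropping of step 1) can be sketched as follows. This is an illustrative sketch, not part of the claims: the function names, the shuffle-based split, and dropping the incomplete tail clip are assumptions; only the 70/30 ratio and fixed clip length come from the text.

```python
import numpy as np

def split_dataset(video_ids, train_ratio=0.7, seed=0):
    """Split video IDs into a 70% video training set and a 30% pool
    from which one frame per video is later drawn as the image test set."""
    rng = np.random.default_rng(seed)
    ids = np.array(video_ids)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_ratio)
    return ids[:n_train].tolist(), ids[n_train:].tolist()

def crop_clips(n_frames, clip_len=30):
    """Cut a video of n_frames into consecutive clips of clip_len frames,
    dropping the incomplete tail (one plausible reading of 'same length')."""
    return [(s, s + clip_len) for s in range(0, n_frames - clip_len + 1, clip_len)]
```

For the YTC setting of the embodiment, `clip_len=30` matches the 30-frame segments described later.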
Further, in step 2), for the feature extraction network, the following operations are performed:
2.11) constructing a feature extraction network of the image, wherein the feature extraction network of the image comprises a convolutional neural network phi (-) with a VGG structure, a yaw angle residual mapping sigma and a classifier, wherein the sigma consists of a plurality of fully-connected layers and nonlinear layer network structures, and the classifier consists of a fully-connected layer and a Softmax layer;
2.12) constructing a feature extraction network of the video, wherein the feature extraction network of the video comprises a convolutional neural network phi (-) sharing weight with the feature extraction network of the image, an attention network and a classifier, and the classifier consists of a full connection layer and a Softmax layer;
2.13) initializing the parameters of the feature extraction network, pre-training the feature extraction network using a CASIA-WebFace dataset, which is a large-scale public face image dataset comprising 494414 face images from 10575 categories collected from the network, and initializing the number of iterations i1=0;
2.14) determine whether the iteration count satisfies i1 < e1 × N1 / Nbatch, where e1 denotes the number of epochs, N1 the total number of samples in the face image training set, and Nbatch the number of image samples per iteration; if the condition is not met, end the iteration; if it is met, go to step 2.15);
2.15) randomly select Nc classes, and from the face image training set and face video training set of each class randomly select a number of images and videos as the data for one iteration;
2.16) input the i-th sample x_i^I in the face image training set; first extract φ(x_i^I) using the VGG network; then add the product of the Yaw angle parameter Yaw and the yaw-angle residual mapping σ(φ(x_i^I)) to the original feature φ(x_i^I), obtaining the image feature:

f_i^I = φ(x_i^I) + Yaw · σ(φ(x_i^I))
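A minimal numpy sketch of the yaw-residual feature of step 2.16, under the reading f = φ(x) + Yaw · σ(φ(x)) with σ a small two-layer MLP. The MLP width, random weights, and variable names are illustrative assumptions; the 512-d feature size follows the embodiment.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def yaw_residual_feature(phi_x, yaw, W1, b1, W2, b2):
    """Image feature under one plausible reading of step 2.16:
    f = phi(x) + yaw * sigma(phi(x)), where sigma is a two-layer MLP
    (fully connected + ReLU + fully connected)."""
    residual = W2 @ relu(W1 @ phi_x + b1) + b2   # yaw-angle residual mapping sigma
    return phi_x + yaw * residual

# toy setup: 512-d VGG feature as in the embodiment, 128-unit hidden layer (assumed)
rng = np.random.default_rng(0)
d = 512
phi_x = rng.standard_normal(d)
W1, b1 = rng.standard_normal((128, d)) * 0.01, np.zeros(128)
W2, b2 = rng.standard_normal((d, 128)) * 0.01, np.zeros(d)
f = yaw_residual_feature(phi_x, yaw=0.0, W1=W1, b1=b1, W2=W2, b2=b2)
```

Note that with zero bias b2, a frontal face (Yaw = 0) leaves the VGG feature unchanged, which matches the intent of aligning features of different yaw angles toward the frontal view.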
2.17) input the i-th sample v_i in the face video training set; first extract the video frame features φ(v_i) using the VGG network; then fuse the video frame features with a multi-layer attention mechanism. In the first layer, compute the inner products of m kernel vectors q_{j1} (j1 = 1, …, m) with the video frame features φ(v_i), and obtain an importance coefficient vector for each frame through a ReLU nonlinear mapping, where the j1-th importance coefficient vector e_{j1} is expressed as:

e_{j1} = ReLU(q_{j1}^T φ(v_i))

then normalize e_{j1} over the frames with the Softmax function to obtain m groups of weight vectors a_{j1}, where the j1-th group weight vector a_{j1} is expressed as:

a_{j1} = Softmax(e_{j1})

and pool the frame features with each weight vector to obtain m intermediate tensors u_{j1} = Σ_t a_{j1,t} φ(v_i)_t. In the second layer, compute the corresponding importance coefficients e'_{j2} from the tensors obtained in the previous layer, where the j2-th importance coefficient e'_{j2} is expressed as:

e'_{j2} = ReLU(q'_{j2}^T u_{j2})

where q'_{j2} denotes the j2-th kernel vector of the second layer; then normalize with the Softmax function to obtain m weight coefficients a'_{j2}, where the j2-th weight coefficient a'_{j2} is expressed as:

a'_{j2} = Softmax(e'_{j2})

and fuse the video frame features into one video feature f_i^V = Σ_{j2} a'_{j2} u_{j2}; finally, classify the video feature f_i^V with the classifier;
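The two-layer attention fusion of step 2.17 can be sketched in numpy as follows. The patent's exact formulas are given as images, so the kernel shapes, the softmax axes, and the pooling form are one plausible reading, not a definitive implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def two_layer_attention(frames, kernels1, kernels2):
    """Fuse per-frame features (T x d) into one video feature (d,).
    Layer 1: each of m kernels scores every frame (ReLU of inner products),
    scores are softmax-normalized over frames and give m pooled tensors.
    Layer 2: m second-layer kernels score the pooled tensors; a softmax over
    those m scores weights the final fusion."""
    scores1 = np.maximum(frames @ kernels1.T, 0.0)                 # (T, m) importance coefficients
    weights1 = softmax(scores1, axis=0)                            # per-kernel weights over frames
    pooled = weights1.T @ frames                                   # (m, d) intermediate tensors
    scores2 = np.maximum(np.sum(pooled * kernels2, axis=1), 0.0)   # (m,) second-layer coefficients
    weights2 = softmax(scores2)                                    # m weight coefficients
    return weights2 @ pooled                                       # fused video feature (d,)
```

In the embodiment m = 3 and d = 512; the sketch works for any shapes with `kernels1` and `kernels2` of shape (m, d).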
2.18) use cross-entropy losses as supervision for the image features and video features. The cross-entropy loss L_CE^I of the image features is expressed as:

L_CE^I = −(1/N) Σ_i log( exp(w_{l_i^I}^T f_i^I) / Σ_c exp(w_c^T f_i^I) )

where l_i^I denotes the label of the i-th image, the Softmax output gives the predicted distribution P, and w denotes the weight vectors of the fully connected layer in the image classifier; the cross-entropy loss L_CE^V of the video features is expressed as:

L_CE^V = −(1/N) Σ_i log( exp(ŵ_{l_i^V}^T f_i^V) / Σ_c exp(ŵ_c^T f_i^V) )

where l_i^V denotes the label of the i-th video and ŵ the weight vectors of the fully connected layer in the video classifier;
update the parameters θ_f of the feature extraction network by the back-propagation algorithm:

θ_f^(i1+1) = θ_f^(i1) − l1 · d(L_CE^I + L_CE^V)/dθ_f

where θ_f^(i1) are the feature extraction network parameters at iteration i1, θ_f^(i1+1) those at iteration i1+1, l1 is the learning rate for the batch, and d denotes the derivative;
2.110)i1=i1+1, go to 2.14).
Further, in step 2), for the co-expression mapping network, the following operations are performed:
2.21) constructing a common expression mapping network of the images, wherein the network comprises two full connection layers and a Softmax layer, inputting image characteristics, outputting the common expression of the images by the first full connection layer, and classifying the common expression of the images by a classifier by the second full connection layer and the Softmax layer;
2.22) constructing a video co-expression mapping network, wherein the network comprises two full connection layers and a Softmax layer, video characteristics are input, the first full connection layer outputs video co-expression, and the second full connection layer and the Softmax layer form a classifier for classifying the video co-expression;
2.23) initializing parameters of the coexpression mapping network, number of iterations i2=0;
2.24) determine whether the iteration count satisfies i2 < e2 × N1 / Nbatch, where e2 denotes the number of epochs, N1 the total number of samples in the face image training set, and Nbatch the number of samples per iteration; if the condition is not met, end the iteration; if it is met, go to 2.25);
2.25) randomly select Nc classes, and from the face image training set and face video training set of each class randomly select a number of images and videos as the data for one iteration;
2.26) use the co-expression generation network to obtain the image co-expression r_i^I and the video co-expression r_i^V, where r_i^I denotes the co-expression of the i-th image and r_i^V the co-expression of the i-th video;
2.27) classify the image and video co-expressions with the classifiers of the co-expression mapping networks and compute their cross-entropy losses, where l_i^I denotes the label of the i-th image and l_i^V the label of the i-th video;
2.28) construct global semi-hard triplets and local semi-hard triplets to maintain the inter-modal feature relations; "semi-hard" means selecting, for every positive pair, all negatives that are farther than the positive but within the distance margin. The metric learning loss L_Triplet minimizes the Euclidean distance of positive pairs and maximizes that of negative pairs; the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:

L_local = (1/(N_I + N_V)) Σ max(0, ||r_i − r_j||² − ||r_i − r_k||² + margin)
L_global = (1/(N_I + N_V)) Σ max(0, ||r_i − c_j||² − ||r_i − c_k||² + margin)

where N_I denotes the number of semi-hard triplets using r_i^I as anchor, N_V the number using r_i^V as anchor, r_i and r_j are co-expressions of the same class, r_i and r_k co-expressions of different classes, c_j is the class center of the same class as r_i, and c_k the class center of a different class. When the Softmax function is used, minimizing the cross-entropy loss is equivalent to minimizing the distance between the co-expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layer of the co-expression mapping network, and maximizing the distance between the co-expressions and the weight vectors of the other classes; therefore the weight vectors of the second fully connected layers in the image and video co-expression mapping networks are used to represent the class centers;
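The semi-hard local triplet mining of step 2.28 can be sketched as follows. This is an illustrative numpy reading: the brute-force loops, the averaging over mined triplets, and squared vs. plain Euclidean distance are assumptions (the patent's formulas are images); only the semi-hard band (negatives farther than the positive but within the margin) follows the text.

```python
import numpy as np

def semi_hard_triplet_loss(emb, labels, margin=2.0):
    """Semi-hard local triplet loss sketch: for each anchor/positive pair,
    keep negatives that are farther than the positive but within `margin`
    of it (the 'semi-hard' band), and penalize d(a,p) - d(a,n) + margin."""
    n = len(emb)
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    total, count = 0.0, 0
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            for k in range(n):
                if labels[k] == labels[a]:
                    continue
                # semi-hard: negative farther than the positive, within margin
                if d[a, p] < d[a, k] < d[a, p] + margin:
                    total += d[a, p] - d[a, k] + margin
                    count += 1
    return total / max(count, 1)
```

The global variant of the text would replace the positive/negative samples with same-class and different-class centers (the classifier weight vectors of FC3/FC6); only the distance targets change.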
2.29) the overall optimized loss function is:

L = L_CE + α · (L_local + L_global)

where L_CE is the cross-entropy loss of the feature extraction network, and α is the proportionality coefficient balancing the intra-modal and inter-modal losses; update the parameters θ_g of the co-expression mapping network using the back-propagation algorithm while fine-tuning the parameters θ_f of the feature extraction network, with l2 the learning rate for the batch:

θ_g^(i2+1) = θ_g^(i2) − l2 · dL/dθ_g
θ_f^(i2+1) = θ_f^(i2) − l2 · dL/dθ_f

where θ_g^(i2) and θ_g^(i2+1) are the co-expression mapping network parameters at iterations i2 and i2+1, θ_f^(i2) and θ_f^(i2+1) the feature extraction network parameters at iterations i2 and i2+1, and d denotes the derivative;
2.210)i2=i2+1, go to 2.24).
Further, the step 3) comprises the following steps:
3.1) take the face image test set as the query set and the face video training set as the retrieval set, and use the trained co-expression generation network to extract the co-expression matrices r_query and r_retrieval of the query set and the retrieval set respectively;
3.2) iterative training using PCA dimensionality reduction combined with ITQ iterative quantization: first, a PCA projection matrix W of r_retrieval is obtained using PCA; then a rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional representation of r_retrieval and its binary hash, where R serves to balance the variances of the different dimensions;
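The PCA + ITQ training of step 3.2 can be sketched as below. A sketch under stated assumptions: the SVD-based PCA, the random orthogonal initialization, and the iteration count are illustrative; the alternation between binarization and the orthogonal Procrustes refit is the standard ITQ scheme the text names.

```python
import numpy as np

def itq(features, n_bits, n_iter=50, seed=0):
    """PCA + ITQ sketch: project zero-centered features to n_bits
    dimensions with PCA (W), then alternate between binarizing
    B = sgn(VR) and refitting the rotation R from the SVD of B^T V,
    which minimizes the quantization error ||B - VR||_F."""
    rng = np.random.default_rng(seed)
    X = features - features.mean(axis=0)
    # PCA projection W: top n_bits right-singular vectors of X
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:n_bits].T
    V = X @ W
    # random orthogonal initialization of the rotation R
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.sign(V @ R)
        U, _, Vt2 = np.linalg.svd(B.T @ V)
        R = (U @ Vt2).T  # orthogonal Procrustes solution for fixed B
    return W, R
```

With W and R in hand, the hashes of step 3.3 are simply `np.sign(r @ W @ R)` for both the query and retrieval matrices.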
3.3) use W and R to extract the binary hash matrices B_query and B_retrieval of the query set and the retrieval set:

B_query = sgn(r_query W R)
B_retrieval = sgn(r_retrieval W R)
Wherein sgn represents a sign function;
3.4) compute the Hamming distances between the binary hash of each query item and all binary hashes in the retrieval set, rank the retrieval set by Hamming distance, and return the ranked retrieval set.
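The ranking of step 3.4 can be sketched as follows; the function name and the stable-sort tie-breaking are illustrative choices, while the ±1 code convention matches the sgn() hashes of step 3.3.

```python
import numpy as np

def hamming_rank(query_hash, retrieval_hashes):
    """Rank retrieval items by Hamming distance to one query's binary hash.
    Hashes are +/-1 vectors as produced by sgn(); for such codes the
    Hamming distance is simply the number of disagreeing positions."""
    dist = np.sum(query_hash[None, :] != retrieval_hashes, axis=1)
    order = np.argsort(dist, kind="stable")   # closest items first
    return order, dist[order]
```

Usage: for each query row of B_query, call `hamming_rank(row, B_retrieval)` and return the retrieval set permuted by `order`.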
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the invention provides a two-stage cross-modal face image-video hash generation method that can generate image and video hashes that are robust and discriminative both within and between modalities.
2. The method aligns the image characteristics of different yaw angles through a light-weight yaw angle residual mapping, and obtains the image characteristics with robustness to the change of the yaw angle.
3. The method of the invention fuses the video frame characteristics through a multi-layer attention mechanism, and obtains the video characteristics with robustness.
4. Unlike traditional metric learning methods, the proposed method constructs global triplets between samples and semantic class centers and combines them with local triplets, improving the consistency and accuracy of metric learning, while the selection of semi-hard sample pairs accelerates convergence of the loss function.
Drawings
FIG. 1 is an architectural diagram of the method of the present invention.
Fig. 2 is a schematic diagram of metric learning proposed by the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 1, the present embodiment discloses a cross-modal face retrieval hashing method based on metric learning, which specifically includes the following main technical steps:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction; the specific situation is as follows:
the data set selected in this example is the YTC face video data set containing 1910 face videos from 47 celebrities collected from the Youtube video website, each video in the face video data set being cropped into a plurality of video segments of 30 frames in length.
70% of the videos in the YTC face video data set form the face video training set; every frame of these videos is extracted to form the face image training set; and 1 frame is randomly extracted from each of the remaining 30% of videos to serve as the face image test set.
The pre-trained MTCNN, which is a deep face region detection network, is used to align the faces in the video frames and images in the data set, and the aligned video frames and images are scaled to 64 × 64.
The Yaw angle parameter Yaw of the images in the face image training set and face image test set is extracted using a pre-trained Hopenet, a landmark-free deep face pose estimation network.
2) Training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network.
The method for training the feature extraction network by using the face image training set and the face video training set and updating the parameters of the feature extraction network by using cross entropy loss comprises the following steps:
2.11) construct the feature extraction network of the image, comprising a convolutional neural network φ(·) with a VGG structure, a yaw-angle residual mapping σ and a classifier, where σ consists of two fully connected layers with a nonlinear layer, and the classifier consists of a fully connected layer FC1 and a Softmax layer;
2.12) constructing a feature extraction network of the video, wherein the feature extraction network of the video comprises a convolutional neural network phi (-) sharing weight with the image feature extraction network, an attention network and a classifier, and the classifier consists of a full connection layer FC4 and a Softmax layer;
2.13) initialize the parameters of the feature extraction network and pre-train it with the CASIA-WebFace data set, a large-scale public face image data set comprising 494414 face images of 10575 categories collected from the web; initialize the iteration count i1 = 0 and the initial learning rate l1 = 1e-3;
2.14) determine whether the iteration count satisfies i1 < e1 × N1 / Nbatch, where e1 = 1 denotes the number of epochs, N1 the total number of samples in the face image training set, and Nbatch = 60 the number of image samples per iteration; if the condition is not met, end the iteration; if it is met, go to 2.15);
2.15) randomly selecting 12 classes, and then randomly selecting 5 images and 5 videos as data of one iteration in a face image training set and a face video training set of each class;
2.16) input the i-th sample x_i^I in the face image training set; first extract φ(x_i^I) using the VGG network; then add the product of the Yaw angle parameter Yaw and the yaw-angle residual mapping σ(φ(x_i^I)) to the original feature φ(x_i^I), obtaining the image feature:

f_i^I = φ(x_i^I) + Yaw · σ(φ(x_i^I))

of dimension 512 × 1; finally, classify the image feature f_i^I with the classifier, where the classification feature dimension is 47 × 1 and a Dropout layer is added after FC1 to reduce overfitting;
2.17) input the i-th sample v_i in the face video training set; first extract the video frame features φ(v_i) using the VGG network; then fuse the video frame features with a multi-layer attention mechanism. In the first layer, compute the inner products of m = 3 kernel vectors q_{j1} with the video frame features φ(v_i), and obtain an importance coefficient vector for each frame through a ReLU nonlinear mapping, where the j1-th importance coefficient vector e_{j1} is expressed as:

e_{j1} = ReLU(q_{j1}^T φ(v_i))

then normalize e_{j1} over the frames with the Softmax function to obtain m groups of weight vectors a_{j1}:

a_{j1} = Softmax(e_{j1})

and pool the frame features with each weight vector to obtain m intermediate tensors u_{j1}. In the second layer, compute the corresponding importance coefficients e'_{j2} from the tensors obtained in the previous layer:

e'_{j2} = ReLU(q'_{j2}^T u_{j2})

where q'_{j2} denotes the j2-th kernel vector of the second layer; then normalize with the Softmax function to obtain m weight coefficients a'_{j2}:

a'_{j2} = Softmax(e'_{j2})

and fuse the video frame features into one video feature f_i^V = Σ_{j2} a'_{j2} u_{j2} of dimension 512 × 1; finally, classify the video feature f_i^V with the classifier, where the classification feature dimension is 47 × 1 and a Dropout layer is added after FC4 to reduce overfitting;
2.18) use cross-entropy losses as supervision for the image features and video features. The cross-entropy loss L_CE^I of the image features is expressed as:

L_CE^I = −(1/N) Σ_i log( exp(w_{l_i^I}^T f_i^I) / Σ_c exp(w_c^T f_i^I) )

where l_i^I denotes the label of the i-th image, the Softmax output gives the predicted distribution P, and w denotes the weight vectors of the fully connected layer in the image classifier; the cross-entropy loss L_CE^V of the video features is expressed as:

L_CE^V = −(1/N) Σ_i log( exp(ŵ_{l_i^V}^T f_i^V) / Σ_c exp(ŵ_c^T f_i^V) )

where l_i^V denotes the label of the i-th video and ŵ the weight vectors of the fully connected layer in the video classifier;
2.19) the overall optimized loss function is:

L_CE = L_CE^I + L_CE^V

update the parameters θ_f of the feature extraction network by the back-propagation algorithm:

θ_f^(i1+1) = θ_f^(i1) − l1 · dL_CE/dθ_f

where θ_f^(i1) are the feature extraction network parameters at iteration i1, θ_f^(i1+1) those at iteration i1+1, l1 is the learning rate for the batch, and d denotes the derivative; the gradient parameters are updated with an Adam optimizer, with Adam's β1 and β2 set to 0.9 and 0.99 respectively;
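For reference, a single Adam parameter update with the embodiment's β1 = 0.9, β2 = 0.99 can be sketched in numpy; the bias-correction form is the standard Adam rule, and the function and variable names are illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update (beta1=0.9, beta2=0.99 as in the embodiment);
    m and v are the running first- and second-moment estimates, t >= 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

At t = 1 the bias correction makes the step roughly lr · sign(grad), which is why Adam converges quickly in the early training stage described here.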
2.110)i1=i1+1, go to 2.14).
The method comprises the following steps of training a co-expression mapping network by using a face image training set and a face video training set, updating parameters of the co-expression mapping network by using cross entropy loss and metric learning loss, and finely adjusting parameters of a trained feature extraction network, and comprises the following steps:
2.21) construct the co-expression mapping network of the images, comprising two fully connected layers FC2 and FC3 and a Softmax layer; image features are input, FC2 outputs the image co-expression of dimension 48 × 1, and FC3 together with the Softmax layer forms a classifier that classifies the image co-expression;
2.22) construct the co-expression mapping network of the videos, comprising two fully connected layers FC5 and FC6 and a Softmax layer; video features are input, FC5 outputs the video co-expression of dimension 48 × 1, and FC6 together with the Softmax layer forms a classifier that classifies the video co-expression;
2.23) initialize the parameters of the co-expression mapping network, the iteration count i2 = 0 and the initial learning rate l2 = 1e-3;
2.24) determine whether the iteration count satisfies i2 < e2 × N1 / Nbatch, where e2 = 20 denotes the number of epochs, N1 the total number of samples in the face image training set, and Nbatch = 60 the number of samples per iteration; if the condition is not met, end the iteration; if it is met, go to 2.25);
2.25) randomly selecting 12 classes, and then randomly selecting 5 images and 5 videos as data of one iteration in a face image training set and a face video training set of each class;
2.26) form the co-expression generation network from the trained feature extraction network and the co-expression mapping network, and use it to obtain the image co-expression r_i^I and the video co-expression r_i^V, where r_i^I denotes the co-expression of the i-th image and r_i^V the co-expression of the i-th video;
2.27) classify the image and video co-expressions with the classifiers of the co-expression mapping networks and compute their cross-entropy losses, where l_i^I denotes the label of the i-th image and l_i^V the label of the i-th video;
2.28) construct global and local semi-hard triplets as shown in fig. 2; "semi-hard" means selecting, for every positive pair, all negatives that are farther than the positive but within the distance margin, here margin = 2. The metric learning loss L_Triplet is used to maintain the inter-modal feature relations by minimizing the Euclidean distance of positive pairs and maximizing that of negative pairs; the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:

L_local = (1/(N_I + N_V)) Σ max(0, ||r_i − r_j||² − ||r_i − r_k||² + margin)
L_global = (1/(N_I + N_V)) Σ max(0, ||r_i − c_j||² − ||r_i − c_k||² + margin)

where N_I denotes the number of semi-hard triplets using r_i^I as anchor, N_V the number using r_i^V as anchor, r_i and r_j are co-expressions of the same class, r_i and r_k co-expressions of different classes, c_j is the class center of the same class as r_i, and c_k the class center of a different class. When the Softmax function is used, minimizing the cross-entropy loss is equivalent to minimizing the distance between the co-expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layers FC3, FC6 of the co-expression mapping networks, and maximizing the distance to the weight vectors of the other classes; therefore the weight vectors of FC3 and FC6 are used to represent the class centers;
2.29) the overall optimized loss function is:
L = L_CE + α·(L_local + L_global)
wherein L_CE is the cross-entropy loss of the feature extraction network, and α is the proportionality coefficient balancing the intra-modal loss and the inter-modal loss; the parameter θ_g of the co-expression mapping network is updated using the back-propagation algorithm while the parameter θ_f of the feature extraction network is fine-tuned:
θ_g^(i2+1) = θ_g^(i2) - l_2 · dL/dθ_g^(i2)
θ_f^(i2+1) = θ_f^(i2) - l_2 · dL/dθ_f^(i2)
wherein θ_g^(i2) is the co-expression mapping network parameter of the i2-th iteration, θ_g^(i2+1) is the co-expression mapping network parameter of the (i2+1)-th iteration, θ_f^(i2) is the feature extraction network parameter of the i2-th iteration, θ_f^(i2+1) is the feature extraction network parameter of the (i2+1)-th iteration, l_2 is the learning rate of the batch, and d represents the derivative; the Adam optimizer and the SGD optimizer are combined for the back-propagated gradient updates, ensuring fast convergence of the loss function as well as the generalization ability of the algorithm: the first 10 epochs are trained with the Adam optimizer, with β1 and β2 of Adam set to 0.9 and 0.99 respectively; the last 10 epochs are trained with the SGD optimizer, with momentum set to 0.9; when training with the SGD optimizer, L2 regularization with coefficient 0.0025 is added to the network parameters to enhance the generalization ability of the network;
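The two-phase optimization schedule described above can be sketched as a small helper; the function name is illustrative, and the hyperparameters are those stated in the text (10 Adam epochs with β1 = 0.9 and β2 = 0.99, then 10 SGD epochs with momentum 0.9 and L2 coefficient 0.0025):

```python
def optimizer_schedule(epoch):
    """Return the optimizer configuration for a given epoch, following the
    Adam-then-SGD schedule: Adam early for fast convergence, SGD with
    momentum and L2 regularization later for better generalization."""
    if epoch < 10:  # first phase: Adam
        return {"optimizer": "Adam", "betas": (0.9, 0.99), "weight_decay": 0.0}
    # second phase: SGD; weight_decay implements the L2 regularization term
    return {"optimizer": "SGD", "momentum": 0.9, "weight_decay": 0.0025}

for e in (0, 9, 10, 19):
    print(e, optimizer_schedule(e)["optimizer"])
```

In a framework such as PyTorch these configurations would map directly onto the Adam and SGD optimizers, with the weight-decay parameter supplying the L2 term.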
2.210)i2=i2+1, go to 2.24).
3) In order to simulate a real usage scenario, the trained common expression generation network and ITQ iterative quantization are used to extract the hashes of the image common expressions in the face image test set and the hashes of the video common expressions in the face video training set, and the retrieval items are ranked according to the Hamming distance between the image hashes and the video hashes, comprising the following steps:
3.1) taking the face image test set as the query set and the face video training set as the retrieval set, and extracting the common expression matrices r_query and r_retrieval of the query set and the retrieval set respectively with the trained common expression generation network;
3.2) performing iterative training with PCA dimensionality reduction combined with ITQ iterative quantization: first, the PCA projection matrix W of r_retrieval is obtained; then the rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional expression of r_retrieval and its binary hash, wherein R is used to balance the variances of the different dimensions; the hash code length is 48;
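The PCA + ITQ training of step 3.2 can be sketched as follows; this is a minimal NumPy version in the style of Gong and Lazebnik's ITQ, alternating between updating the binary codes and solving an orthogonal Procrustes problem for the rotation R (function and variable names are illustrative, not the patented implementation):

```python
import numpy as np

def train_itq(r_retrieval, n_bits=48, n_iter=50, seed=0):
    """Learn a PCA projection W and an ITQ rotation R from the
    retrieval-set common expressions (one row per retrieval item)."""
    rng = np.random.default_rng(seed)
    X = r_retrieval - r_retrieval.mean(axis=0)   # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:n_bits].T                            # top principal directions
    V = X @ W                                    # low-dimensional expression
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.sign(V @ R)                       # fix R, update binary codes
        U, _, Zt = np.linalg.svd(V.T @ B)        # fix B, update rotation
        R = U @ Zt                               # orthogonal Procrustes solution
    return W, R
```

The learned W and R are then applied exactly as in step 3.3, B = sgn(rWR), so the rotation spreads the quantization error evenly across the hash bits.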
3.3) obtaining the binary hash matrices B_query and B_retrieval of the query set and the retrieval set using W and R:
B_query = sgn(r_query W R)
B_retrieval = sgn(r_retrieval W R)
wherein sgn represents the sign function;
and 3.4) calculating the binary hash of each query item in the query set and the Hamming distance of all the binary hashes in the retrieval set, sequencing the retrieval set according to the Hamming distance, and returning the sequenced retrieval set.
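The ranking of step 3.4 can be sketched as follows; a minimal NumPy illustration with hypothetical ±1 hash codes of length 4:

```python
import numpy as np

def hamming_rank(b_query, B_retrieval):
    """Rank the retrieval set by Hamming distance to a single query hash;
    for +/-1 codes, the Hamming distance is the number of disagreeing bits."""
    dists = np.sum(b_query[None, :] != B_retrieval, axis=1)
    order = np.argsort(dists, kind="stable")  # ascending Hamming distance
    return order, dists[order]

# hypothetical +/-1 codes as produced by sgn(rWR)
B_retrieval = np.array([[ 1, -1,  1,  1],
                        [-1, -1,  1,  1],
                        [-1,  1, -1, -1]])
b_query = np.array([1, -1, 1, 1])
order, d = hamming_rank(b_query, B_retrieval)
print(order.tolist(), d.tolist())  # [0, 1, 2] [0, 1, 4]
```

In practice the ±1 codes would be packed into bit words and the distance computed with XOR and popcount, but the ranking returned is the same.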
The experimental results are as follows:
in the YTC data set, the mAP of the face image training set retrieved by the face image test set is 0.6659, and the mAP of the face image training set retrieved by the face image test set is 0.6829.
In conclusion, the method can fully utilize intra-modal information to extract robust and separable single-modal features, and meanwhile uses the semantic centers of different modalities to fully mine inter-modal similarity information, thereby achieving a better effect in the cross-modal face retrieval task and being worthy of popularization.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (5)
1. The cross-modal face retrieval hashing method based on metric learning is characterized by comprising the following steps of:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction;
2) training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network; during training, training a feature extraction network by using a face image training set and a face video training set, updating parameters of the feature extraction network by using cross entropy loss, training a co-expression mapping network by using the face image training set and the face video training set, updating parameters of the co-expression mapping network by using cross entropy loss and metric learning loss, and finely adjusting parameters of the trained feature extraction network;
3) in order to simulate a real usage scenario, using the trained common expression generation network and ITQ iterative quantization to extract the hashes of the image common expressions in the face image test set and the hashes of the video common expressions in the face video training set, and ranking the retrieval items according to the Hamming distance between the image hashes and the video hashes.
2. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in the step 1), each video in the face video data set is cut into a plurality of video segments of the same length; 70% of the videos in the face video data set form the face video training set, every frame of the videos in the face video training set is extracted to form the face image training set, and 1 frame is randomly extracted from each of the remaining 30% of the videos in the face video data set to serve as the face image test set; the faces in the video frames and images of the data set are aligned using a pre-trained MTCNN, wherein MTCNN is a deep face region detection network, and the aligned video frames and images are scaled to H x W, wherein H is the height of a picture and W is the width of a picture; the Yaw angle parameter Yaw of the images of the face image training set and the face image test set is extracted using a pre-trained Hopenet, wherein Hopenet is a keypoint-free deep face pose estimation network.
3. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in step 2), for the feature extraction network, the following operations are performed:
2.11) constructing a feature extraction network of the image, which comprises a convolutional neural network φ(·) with a VGG structure, a yaw-angle residual mapping σ, and a classifier, wherein σ consists of several fully connected layers and nonlinear layers, and the classifier consists of a fully connected layer and a Softmax layer;
2.12) constructing a feature extraction network of the video, which comprises a convolutional neural network φ(·) sharing weights with the feature extraction network of the image, an attention network, and a classifier, wherein the classifier consists of a fully connected layer and a Softmax layer;
2.13) initializing the feature extraction network parameters: the feature extraction network is pre-trained on the CASIA-WebFace dataset, a large-scale public face image dataset collected from the Internet comprising 494414 face images of 10575 classes; the iteration number is initialized as i_1 = 0;
2.14) determining whether the iteration number satisfies i_1 < e_1·N_1/N_batch, wherein e_1 denotes the number of epochs, N_1 represents the total number of samples of the face image training set, and N_batch represents the number of image samples of one iteration; if the condition is not met, the iteration ends; if the condition is met, go to step 2.15);
2.15) randomly selecting N_c classes, and randomly selecting a number of images and a number of videos from the face image training set and the face video training set of each class as the data of one iteration;
2.16) inputting the i-th sample x_i^I of the face image training set: first, the original feature φ(x_i^I) is extracted with the VGG network; second, the product of the Yaw angle parameter Yaw and the yaw-angle residual mapping σ(φ(x_i^I)) is added to the original feature φ(x_i^I) to obtain the image feature:
f_i^I = φ(x_i^I) + Yaw · σ(φ(x_i^I))
2.17) inputting the i-th sample x_i^V of the face video training set: first, the video frame features f_t^V are extracted with the VGG network; second, the video frame features are fused with a multi-layer attention mechanism: the first layer computes the inner products of m kernel vectors k_{j1}^(1) with the video frame features and obtains the importance coefficient vector of each frame through a ReLU nonlinear mapping, wherein the j1-th importance coefficient vector e_{j1} is expressed as:
e_{j1,t} = ReLU((k_{j1}^(1))^T f_t^V)
then the Softmax function normalizes e_{j1} to obtain m groups of weight vectors a_{j1}, wherein the j1-th group of weight vectors a_{j1} is expressed as:
a_{j1,t} = exp(e_{j1,t}) / Σ_t' exp(e_{j1,t'})
and the weighted frame features are aggregated into m vectors u_{j1} = Σ_t a_{j1,t} f_t^V; in the second layer, the corresponding importance coefficients are computed from the tensors obtained in the previous layer, wherein the j2-th importance coefficient e_{j2} is expressed as:
e_{j2} = (k_{j2}^(2))^T u_{j2}
wherein k_{j2}^(2) represents the j2-th kernel vector of the second layer; then the Softmax function normalizes e_{j2} to obtain m weight coefficients w_{j2}, wherein the j2-th weight coefficient w_{j2} is expressed as:
w_{j2} = exp(e_{j2}) / Σ_{j2'} exp(e_{j2'})
the video frame features are thereby fused into one video feature f_i^V = Σ_{j2} w_{j2} u_{j2}, and finally the classifier is used to classify the video feature f_i^V;
2.18) supervising the obtained image features and video features with the cross-entropy loss; the cross-entropy loss L_CE^I of the image features is expressed as:
L_CE^I = -(1/N) Σ_i log P(l_i^I | f_i^I)
wherein l_i^I represents the label of the i-th image, P represents the feature distribution given by the Softmax over the weight vectors W^I of the fully connected layer in the classifier, and N is the number of samples; the cross-entropy loss L_CE^V of the video features is expressed as:
L_CE^V = -(1/N) Σ_i log P(l_i^V | f_i^V)
wherein l_i^V represents the label of the i-th video, and W^V represents the weight vectors of the fully connected layer in the classifier;
updating the parameter θ_f of the feature extraction network by the back-propagation algorithm:
θ_f^(i1+1) = θ_f^(i1) - l_1 · d(L_CE^I + L_CE^V)/dθ_f^(i1)
wherein θ_f^(i1) is the feature extraction network parameter of the i1-th iteration, θ_f^(i1+1) is the feature extraction network parameter of the (i1+1)-th iteration, l_1 is the learning rate of the batch, and d represents the derivative;
2.110)i1=i1+1, go to 2.14).
4. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in step 2), for the co-expression mapping network, the following operations are performed:
2.21) constructing a common expression mapping network of the images, wherein the network comprises two full connection layers and a Softmax layer, inputting image characteristics, outputting the common expression of the images by the first full connection layer, and classifying the common expression of the images by a classifier by the second full connection layer and the Softmax layer;
2.22) constructing a video co-expression mapping network, wherein the network comprises two full connection layers and a Softmax layer, video characteristics are input, the first full connection layer outputs video co-expression, and the second full connection layer and the Softmax layer form a classifier for classifying the video co-expression;
2.23) initializing parameters of the coexpression mapping network, number of iterations i2=0;
2.24) determining whether the iteration number satisfies i_2 < e_2·N_1/N_batch, wherein e_2 denotes the number of epochs, N_1 represents the total number of samples of the face image training set, and N_batch represents the number of samples of one iteration; if the condition is not met, the iteration ends; if the condition is met, go to 2.25);
2.25) randomly selecting N_c classes, and randomly selecting a number of images and a number of videos from the face image training set and the face video training set of each class as the data of one iteration;
2.26) obtaining the image common expression r_i^I and the video common expression r_i^V using the common expression generation network, wherein r_i^I represents the common expression of the i-th image and r_i^V represents the common expression of the i-th video;
wherein l_i^I represents the label of the i-th image, and l_i^V represents the label of the i-th video;
2.28) constructing global semi-hard triplets and local semi-hard triplets to maintain the inter-modal feature relationship; a semi-hard triplet selects, for each anchor, all positive sample pairs together with the negative sample pairs whose distance exceeds the positive-pair distance by less than the margin; the metric learning loss L_Triplet minimizes the Euclidean distance of positive sample pairs and maximizes the Euclidean distance of negative sample pairs; the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:
L_local = (1/N_I) Σ max(0, ||r_i^I - r_j||² - ||r_i^I - r_k||² + margin) + (1/N_V) Σ max(0, ||r_i^V - r_j||² - ||r_i^V - r_k||² + margin)
L_global = (1/N_I) Σ max(0, ||r_i^I - c_j||² - ||r_i^I - c_k||² + margin) + (1/N_V) Σ max(0, ||r_i^V - c_j||² - ||r_i^V - c_k||² + margin)
wherein N_I denotes the number of semi-hard triplets with r_i^I as anchor, N_V denotes the number of semi-hard triplets with r_i^V as anchor, r_i, r_j represent common expressions of the same class, r_i, r_k represent common expressions of different classes, c_j represents the class center of the same class as r_i, and c_k represents the class center of a different class from r_i; when the Softmax function is used, minimizing the cross-entropy loss function is equivalent to minimizing the distance between the common expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layer of the co-expression mapping network, and maximizing the distance between the common expressions and the weight vectors of the other classes; the class centers are therefore represented by the weight vectors of the second fully connected layers of the image co-expression mapping network and the video co-expression mapping network;
2.29) the overall optimized loss function is:
L = L_CE + α·(L_local + L_global)
wherein L_CE is the cross-entropy loss of the feature extraction network, and α is the proportionality coefficient balancing the intra-modal loss and the inter-modal loss; the parameter θ_g of the co-expression mapping network is updated using the back-propagation algorithm while the parameter θ_f of the feature extraction network is fine-tuned, l_2 being the learning rate of the batch:
θ_g^(i2+1) = θ_g^(i2) - l_2 · dL/dθ_g^(i2)
θ_f^(i2+1) = θ_f^(i2) - l_2 · dL/dθ_f^(i2)
wherein θ_g^(i2) is the co-expression mapping network parameter of the i2-th iteration, θ_g^(i2+1) is the co-expression mapping network parameter of the (i2+1)-th iteration, θ_f^(i2) is the feature extraction network parameter of the i2-th iteration, θ_f^(i2+1) is the feature extraction network parameter of the (i2+1)-th iteration, and d represents the derivative;
2.210)i2=i2+1, go to 2.24).
5. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: the step 3) comprises the following steps:
3.1) taking the face image test set as the query set and the face video training set as the retrieval set, and extracting the common expression matrices r_query and r_retrieval of the query set and the retrieval set respectively with the trained common expression generation network;
3.2) performing iterative training with PCA dimensionality reduction combined with ITQ iterative quantization: first, the PCA projection matrix W of r_retrieval is obtained; then the rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional expression of r_retrieval and its binary hash, wherein R is used to balance the variances of the different dimensions;
3.3) obtaining the binary hash matrices B_query and B_retrieval of the query set and the retrieval set using W and R:
B_query = sgn(r_query W R)
B_retrieval = sgn(r_retrieval W R)
wherein sgn represents the sign function;
and 3.4) calculating the binary hash of each query item in the query set and the Hamming distance of all the binary hashes in the retrieval set, sequencing the retrieval set according to the Hamming distance, and returning the sequenced retrieval set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111175867.7A CN114067385B (en) | 2021-10-09 | 2021-10-09 | Cross-modal face retrieval hash method based on metric learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114067385A true CN114067385A (en) | 2022-02-18 |
CN114067385B CN114067385B (en) | 2024-05-31 |
Family
ID=80234405
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111175867.7A Active CN114067385B (en) | 2021-10-09 | 2021-10-09 | Cross-modal face retrieval hash method based on metric learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114067385B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114360032A (en) * | 2022-03-17 | 2022-04-15 | 北京启醒科技有限公司 | Polymorphic invariance face recognition method and system |
CN114638002A (en) * | 2022-03-21 | 2022-06-17 | 华南理工大学 | Compressed image encryption method supporting similarity retrieval |
CN114866345A (en) * | 2022-07-05 | 2022-08-05 | 支付宝(杭州)信息技术有限公司 | Processing method, device and equipment for biological recognition |
CN115063845A (en) * | 2022-06-20 | 2022-09-16 | 华南理工大学 | Finger vein identification method based on lightweight network and deep hash |
CN115240249A (en) * | 2022-07-07 | 2022-10-25 | 湖北大学 | Feature extraction classification measurement learning method and system for face recognition and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175248A (en) * | 2019-04-04 | 2019-08-27 | 中国科学院信息工程研究所 | A kind of Research on face image retrieval and device encoded based on deep learning and Hash |
CN111753190A (en) * | 2020-05-29 | 2020-10-09 | 中山大学 | Meta learning-based unsupervised cross-modal Hash retrieval method |
CN112800292A (en) * | 2021-01-15 | 2021-05-14 | 南京邮电大学 | Cross-modal retrieval method based on modal specificity and shared feature learning |
CN113190699A (en) * | 2021-05-14 | 2021-07-30 | 华中科技大学 | Remote sensing image retrieval method and device based on category-level semantic hash |
Also Published As
Publication number | Publication date |
---|---|
CN114067385B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107122809B (en) | Neural network feature learning method based on image self-coding | |
CN114067385A (en) | Cross-modal face retrieval Hash method based on metric learning | |
Zheng et al. | A deep and autoregressive approach for topic modeling of multimodal data | |
WO2021159769A1 (en) | Image retrieval method and apparatus, storage medium, and device | |
CN108491430B (en) | Unsupervised Hash retrieval method based on clustering characteristic directions | |
Zafar et al. | Image classification by addition of spatial information based on histograms of orthogonal vectors | |
CN108427740B (en) | Image emotion classification and retrieval algorithm based on depth metric learning | |
Hu et al. | Learning dual-pooling graph neural networks for few-shot video classification | |
CN110598022B (en) | Image retrieval system and method based on robust deep hash network | |
CN111104555A (en) | Video hash retrieval method based on attention mechanism | |
CN112712127A (en) | Image emotion polarity classification method combined with graph convolution neural network | |
Menaga et al. | Deep learning: a recent computing platform for multimedia information retrieval | |
CN107220597B (en) | Key frame selection method based on local features and bag-of-words model human body action recognition process | |
CN113344069B (en) | Image classification method for unsupervised visual representation learning based on multi-dimensional relation alignment | |
Cao et al. | Facial expression recognition algorithm based on the combination of CNN and K-Means | |
Yao | Key frame extraction method of music and dance video based on multicore learning feature fusion | |
Liu et al. | A multimodal approach for multiple-relation extraction in videos | |
Oussama et al. | A fast weighted multi-view Bayesian learning scheme with deep learning for text-based image retrieval from unlabeled galleries | |
Dong et al. | A supervised dictionary learning and discriminative weighting model for action recognition | |
Bibi et al. | Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval | |
CN115100694A (en) | Fingerprint quick retrieval method based on self-supervision neural network | |
An et al. | Near duplicate product image detection based on binary hashing | |
Benuwa et al. | Deep locality‐sensitive discriminative dictionary learning for semantic video analysis | |
CN105279489B (en) | A kind of method for extracting video fingerprints based on sparse coding | |
Huang et al. | Baggage image retrieval with attention-based network for security checks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||