CN114067385A - Cross-modal face retrieval Hash method based on metric learning - Google Patents
- Publication number: CN114067385A (application CN202111175867.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/253 — Pattern recognition; analysing; fusion techniques of extracted features
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
- G06N3/084 — Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
Abstract
The invention discloses a cross-modal face retrieval hash method based on metric learning, which comprises the following steps: 1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image test set and a face video training set, and performing face alignment and yaw angle extraction; 2) training the constructed co-expression generation network, which consists of a feature extraction network and a co-expression mapping network; 3) using the trained co-expression generation network and ITQ iterative quantization to extract the hashes of the image co-expressions in the face image test set and of the video co-expressions in the face video training set, and ranking the retrieval items by the Hamming distance between the image hash and the video hash. The cross-modal face hash function of the present invention can generate image and video hashes that are robust and discriminative both within and between modalities. The invention is tested on the YTC face video data set and achieves high cross-modal retrieval accuracy.
Description
Technical Field
The invention relates to the technical field of multimedia processing, in particular to a cross-modal face retrieval hash method based on metric learning.
Background
Cross-modal face retrieval refers to retrieving face videos of a given face ID using an image of that ID, or retrieving the corresponding face images using a face video of that ID. With the creation and spread of multimedia information becoming increasingly wide, users' retrieval needs have also grown far more diverse, so cross-modal face retrieval has broad application scenarios: for example, related films and TV series can be retrieved on a video-playing website by character, or particular clips within a series can be located by selecting a character.
Generally, an image contains only a two-dimensional spatial description, while a video contains spatial variation over continuous time, so the two modalities differ substantially. The predominant way to reduce inter-modal differences is to project the two kinds of features into a common space in which image features and video features are semantically consistent. However, for images and videos the inter-modal difference is reflected not only in the gap between the temporal and spatial domains but also in viewing-angle differences: the faces in face images and face videos may differ greatly in viewing angle, and such differences can involve complex variations such as occlusion, nonlinear distortion and position change, which disperse the features of different modalities in the common space and degrade cross-modal retrieval accuracy. This complex inter-modal difference makes cross-modal face retrieval a challenging problem.
In order to solve the above problems, the present invention provides a two-stage method: in the first stage, video features and image features robust to viewing-angle changes are extracted; in the second stage, the two feature spaces are mapped to a common space, the two heterogeneous features are aligned in the common space by metric learning, and finally the image hash and video hash are obtained by ITQ iterative quantization.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a cross-modal face retrieval hash method based on metric learning, which can extract hashes that are robust and separable within and between modalities and improve the consistency and accuracy of cross-modal metric learning.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: the cross-modal face retrieval hashing method based on metric learning comprises the following steps:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction;
2) training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network; during training, training a feature extraction network by using a face image training set and a face video training set, updating parameters of the feature extraction network by using cross entropy loss, training a co-expression mapping network by using the face image training set and the face video training set, updating parameters of the co-expression mapping network by using cross entropy loss and metric learning loss, and finely adjusting parameters of the trained feature extraction network;
3) in order to simulate a real use scenario, using the trained co-expression generation network and ITQ iterative quantization to extract the hashes of the image co-expressions in the face image test set and of the video co-expressions in the face video training set, and ranking the retrieval items by the Hamming distance between the image hash and the video hash.
Further, in step 1), each video in the face video data set is cut into a plurality of video segments of the same length; 70% of the videos in the face video data set form the face video training set, every frame of these videos is extracted to form the face image training set, and 1 frame is randomly extracted from each of the remaining 30% of the videos to serve as the face image test set; the faces in the video frames and images of the data set are aligned using a pre-trained MTCNN, a deep face region detection network, and the aligned video frames and images are scaled to H × W, where H is the height of a picture and W is its width; the Yaw angle parameter Yaw of the images in the face image training set and face image test set is extracted using a pre-trained Hopenet, a landmark-free deep face pose estimation network.
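The split and cropping of step 1) can be sketched as follows. This is an illustrative sketch, not part of the claims: the function names, the shuffle-based split, and dropping the incomplete tail clip are assumptions; only the 70/30 ratio and fixed clip length come from the text.

```python
import numpy as np

def split_dataset(video_ids, train_ratio=0.7, seed=0):
    """Split video IDs into a 70% video training set and a 30% pool
    from which one frame per video is later drawn as the image test set."""
    rng = np.random.default_rng(seed)
    ids = np.array(video_ids)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_ratio)
    return ids[:n_train].tolist(), ids[n_train:].tolist()

def crop_clips(n_frames, clip_len=30):
    """Cut a video of n_frames into consecutive clips of clip_len frames,
    dropping the incomplete tail (one plausible reading of 'same length')."""
    return [(s, s + clip_len) for s in range(0, n_frames - clip_len + 1, clip_len)]
```

For the YTC setting of the embodiment, `clip_len=30` matches the 30-frame segments described later.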
Further, in step 2), for the feature extraction network, the following operations are performed:
2.11) constructing a feature extraction network of the image, wherein the feature extraction network of the image comprises a convolutional neural network phi (-) with a VGG structure, a yaw angle residual mapping sigma and a classifier, wherein the sigma consists of a plurality of fully-connected layers and nonlinear layer network structures, and the classifier consists of a fully-connected layer and a Softmax layer;
2.12) constructing a feature extraction network of the video, wherein the feature extraction network of the video comprises a convolutional neural network phi (-) sharing weight with the feature extraction network of the image, an attention network and a classifier, and the classifier consists of a full connection layer and a Softmax layer;
2.13) initializing the parameters of the feature extraction network, pre-training the feature extraction network using a CASIA-WebFace dataset, which is a large-scale public face image dataset comprising 494414 face images from 10575 categories collected from the network, and initializing the number of iterations i1=0;
2.14) determine whether the iteration count satisfies i1 < e1 × N1 / Nbatch, where e1 denotes the number of epochs, N1 the total number of samples in the face image training set, and Nbatch the number of image samples per iteration; if the condition is not met, end the iteration; if it is met, go to step 2.15);
2.15) randomly select Nc classes, and from the face image training set and face video training set of each class randomly select a number of images and videos as the data for one iteration;
2.16) input the i-th sample x_i^I in the face image training set; first extract φ(x_i^I) using the VGG network; then add the product of the Yaw angle parameter Yaw and the yaw-angle residual mapping σ(φ(x_i^I)) to the original feature φ(x_i^I), obtaining the image feature:

f_i^I = φ(x_i^I) + Yaw · σ(φ(x_i^I))
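A minimal numpy sketch of the yaw-residual feature of step 2.16, under the reading f = φ(x) + Yaw · σ(φ(x)) with σ a small two-layer MLP. The MLP width, random weights, and variable names are illustrative assumptions; the 512-d feature size follows the embodiment.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def yaw_residual_feature(phi_x, yaw, W1, b1, W2, b2):
    """Image feature under one plausible reading of step 2.16:
    f = phi(x) + yaw * sigma(phi(x)), where sigma is a two-layer MLP
    (fully connected + ReLU + fully connected)."""
    residual = W2 @ relu(W1 @ phi_x + b1) + b2   # yaw-angle residual mapping sigma
    return phi_x + yaw * residual

# toy setup: 512-d VGG feature as in the embodiment, 128-unit hidden layer (assumed)
rng = np.random.default_rng(0)
d = 512
phi_x = rng.standard_normal(d)
W1, b1 = rng.standard_normal((128, d)) * 0.01, np.zeros(128)
W2, b2 = rng.standard_normal((d, 128)) * 0.01, np.zeros(d)
f = yaw_residual_feature(phi_x, yaw=0.0, W1=W1, b1=b1, W2=W2, b2=b2)
```

Note that with zero bias b2, a frontal face (Yaw = 0) leaves the VGG feature unchanged, which matches the intent of aligning features of different yaw angles toward the frontal view.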
2.17) input the i-th sample v_i in the face video training set; first extract the video frame features φ(v_i) using the VGG network; then fuse the video frame features with a multi-layer attention mechanism. In the first layer, compute the inner products of m kernel vectors q_{j1} (j1 = 1, …, m) with the video frame features φ(v_i), and obtain an importance coefficient vector for each frame through a ReLU nonlinear mapping, where the j1-th importance coefficient vector e_{j1} is expressed as:

e_{j1} = ReLU(q_{j1}^T φ(v_i))

then normalize e_{j1} over the frames with the Softmax function to obtain m groups of weight vectors a_{j1}, where the j1-th group weight vector a_{j1} is expressed as:

a_{j1} = Softmax(e_{j1})

and pool the frame features with each weight vector to obtain m intermediate tensors u_{j1} = Σ_t a_{j1,t} φ(v_i)_t. In the second layer, compute the corresponding importance coefficients e'_{j2} from the tensors obtained in the previous layer, where the j2-th importance coefficient e'_{j2} is expressed as:

e'_{j2} = ReLU(q'_{j2}^T u_{j2})

where q'_{j2} denotes the j2-th kernel vector of the second layer; then normalize with the Softmax function to obtain m weight coefficients a'_{j2}, where the j2-th weight coefficient a'_{j2} is expressed as:

a'_{j2} = Softmax(e'_{j2})

and fuse the video frame features into one video feature f_i^V = Σ_{j2} a'_{j2} u_{j2}; finally, classify the video feature f_i^V with the classifier;
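The two-layer attention fusion of step 2.17 can be sketched in numpy as follows. The patent's exact formulas are given as images, so the kernel shapes, the softmax axes, and the pooling form are one plausible reading, not a definitive implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def two_layer_attention(frames, kernels1, kernels2):
    """Fuse per-frame features (T x d) into one video feature (d,).
    Layer 1: each of m kernels scores every frame (ReLU of inner products),
    scores are softmax-normalized over frames and give m pooled tensors.
    Layer 2: m second-layer kernels score the pooled tensors; a softmax over
    those m scores weights the final fusion."""
    scores1 = np.maximum(frames @ kernels1.T, 0.0)                 # (T, m) importance coefficients
    weights1 = softmax(scores1, axis=0)                            # per-kernel weights over frames
    pooled = weights1.T @ frames                                   # (m, d) intermediate tensors
    scores2 = np.maximum(np.sum(pooled * kernels2, axis=1), 0.0)   # (m,) second-layer coefficients
    weights2 = softmax(scores2)                                    # m weight coefficients
    return weights2 @ pooled                                       # fused video feature (d,)
```

In the embodiment m = 3 and d = 512; the sketch works for any shapes with `kernels1` and `kernels2` of shape (m, d).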
2.18) use cross-entropy losses as supervision for the image features and video features. The cross-entropy loss L_CE^I of the image features is expressed as:

L_CE^I = −(1/N) Σ_i log( exp(w_{l_i^I}^T f_i^I) / Σ_c exp(w_c^T f_i^I) )

where l_i^I denotes the label of the i-th image, the Softmax output gives the predicted distribution P, and w denotes the weight vectors of the fully connected layer in the image classifier; the cross-entropy loss L_CE^V of the video features is expressed as:

L_CE^V = −(1/N) Σ_i log( exp(ŵ_{l_i^V}^T f_i^V) / Σ_c exp(ŵ_c^T f_i^V) )

where l_i^V denotes the label of the i-th video and ŵ the weight vectors of the fully connected layer in the video classifier;
update the parameters θ_f of the feature extraction network by the back-propagation algorithm:

θ_f^(i1+1) = θ_f^(i1) − l1 · d(L_CE^I + L_CE^V)/dθ_f

where θ_f^(i1) are the feature extraction network parameters at iteration i1, θ_f^(i1+1) those at iteration i1+1, l1 is the learning rate for the batch, and d denotes the derivative;
2.110)i1=i1+1, go to 2.14).
Further, in step 2), for the co-expression mapping network, the following operations are performed:
2.21) constructing a common expression mapping network of the images, wherein the network comprises two full connection layers and a Softmax layer, inputting image characteristics, outputting the common expression of the images by the first full connection layer, and classifying the common expression of the images by a classifier by the second full connection layer and the Softmax layer;
2.22) constructing a video co-expression mapping network, wherein the network comprises two full connection layers and a Softmax layer, video characteristics are input, the first full connection layer outputs video co-expression, and the second full connection layer and the Softmax layer form a classifier for classifying the video co-expression;
2.23) initializing parameters of the coexpression mapping network, number of iterations i2=0;
2.24) determine whether the iteration count satisfies i2 < e2 × N1 / Nbatch, where e2 denotes the number of epochs, N1 the total number of samples in the face image training set, and Nbatch the number of samples per iteration; if the condition is not met, end the iteration; if it is met, go to 2.25);
2.25) randomly select Nc classes, and from the face image training set and face video training set of each class randomly select a number of images and videos as the data for one iteration;
2.26) use the co-expression generation network to obtain the image co-expression r_i^I and the video co-expression r_i^V, where r_i^I denotes the co-expression of the i-th image and r_i^V the co-expression of the i-th video;
2.27) classify the image and video co-expressions with the classifiers of the co-expression mapping networks and compute their cross-entropy losses, where l_i^I denotes the label of the i-th image and l_i^V the label of the i-th video;
2.28) construct global semi-hard triplets and local semi-hard triplets to maintain the inter-modal feature relations; "semi-hard" means selecting, for every positive pair, all negatives that are farther than the positive but within the distance margin. The metric learning loss L_Triplet minimizes the Euclidean distance of positive pairs and maximizes that of negative pairs; the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:

L_local = (1/(N_I + N_V)) Σ max(0, ||r_i − r_j||² − ||r_i − r_k||² + margin)
L_global = (1/(N_I + N_V)) Σ max(0, ||r_i − c_j||² − ||r_i − c_k||² + margin)

where N_I denotes the number of semi-hard triplets using r_i^I as anchor, N_V the number using r_i^V as anchor, r_i and r_j are co-expressions of the same class, r_i and r_k co-expressions of different classes, c_j is the class center of the same class as r_i, and c_k the class center of a different class. When the Softmax function is used, minimizing the cross-entropy loss is equivalent to minimizing the distance between the co-expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layer of the co-expression mapping network, and maximizing the distance between the co-expressions and the weight vectors of the other classes; therefore the weight vectors of the second fully connected layers in the image and video co-expression mapping networks are used to represent the class centers;
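The semi-hard local triplet mining of step 2.28 can be sketched as follows. This is an illustrative numpy reading: the brute-force loops, the averaging over mined triplets, and squared vs. plain Euclidean distance are assumptions (the patent's formulas are images); only the semi-hard band (negatives farther than the positive but within the margin) follows the text.

```python
import numpy as np

def semi_hard_triplet_loss(emb, labels, margin=2.0):
    """Semi-hard local triplet loss sketch: for each anchor/positive pair,
    keep negatives that are farther than the positive but within `margin`
    of it (the 'semi-hard' band), and penalize d(a,p) - d(a,n) + margin."""
    n = len(emb)
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    total, count = 0.0, 0
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            for k in range(n):
                if labels[k] == labels[a]:
                    continue
                # semi-hard: negative farther than the positive, within margin
                if d[a, p] < d[a, k] < d[a, p] + margin:
                    total += d[a, p] - d[a, k] + margin
                    count += 1
    return total / max(count, 1)
```

The global variant of the text would replace the positive/negative samples with same-class and different-class centers (the classifier weight vectors of FC3/FC6); only the distance targets change.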
2.29) the overall optimized loss function is:

L = L_CE + α · (L_local + L_global)

where L_CE is the cross-entropy loss of the feature extraction network, and α is the proportionality coefficient balancing the intra-modal and inter-modal losses; update the parameters θ_g of the co-expression mapping network using the back-propagation algorithm while fine-tuning the parameters θ_f of the feature extraction network, with l2 the learning rate for the batch:

θ_g^(i2+1) = θ_g^(i2) − l2 · dL/dθ_g
θ_f^(i2+1) = θ_f^(i2) − l2 · dL/dθ_f

where θ_g^(i2) and θ_g^(i2+1) are the co-expression mapping network parameters at iterations i2 and i2+1, θ_f^(i2) and θ_f^(i2+1) the feature extraction network parameters at iterations i2 and i2+1, and d denotes the derivative;
2.210)i2=i2+1, go to 2.24).
Further, the step 3) comprises the following steps:
3.1) take the face image test set as the query set and the face video training set as the retrieval set, and use the trained co-expression generation network to extract the co-expression matrices r_query and r_retrieval of the query set and the retrieval set respectively;
3.2) iterative training using PCA dimensionality reduction combined with ITQ iterative quantization: first, a PCA projection matrix W of r_retrieval is obtained using PCA; then a rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional representation of r_retrieval and its binary hash, where R serves to balance the variances of the different dimensions;
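The PCA + ITQ training of step 3.2 can be sketched as below. A sketch under stated assumptions: the SVD-based PCA, the random orthogonal initialization, and the iteration count are illustrative; the alternation between binarization and the orthogonal Procrustes refit is the standard ITQ scheme the text names.

```python
import numpy as np

def itq(features, n_bits, n_iter=50, seed=0):
    """PCA + ITQ sketch: project zero-centered features to n_bits
    dimensions with PCA (W), then alternate between binarizing
    B = sgn(VR) and refitting the rotation R from the SVD of B^T V,
    which minimizes the quantization error ||B - VR||_F."""
    rng = np.random.default_rng(seed)
    X = features - features.mean(axis=0)
    # PCA projection W: top n_bits right-singular vectors of X
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:n_bits].T
    V = X @ W
    # random orthogonal initialization of the rotation R
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.sign(V @ R)
        U, _, Vt2 = np.linalg.svd(B.T @ V)
        R = (U @ Vt2).T  # orthogonal Procrustes solution for fixed B
    return W, R
```

With W and R in hand, the hashes of step 3.3 are simply `np.sign(r @ W @ R)` for both the query and retrieval matrices.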
3.3) use W and R to extract the binary hash matrices B_query and B_retrieval of the query set and the retrieval set:

B_query = sgn(r_query W R)
B_retrieval = sgn(r_retrieval W R)
Wherein sgn represents a sign function;
3.4) compute the Hamming distances between the binary hash of each query item and all binary hashes in the retrieval set, rank the retrieval set by Hamming distance, and return the ranked retrieval set.
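The ranking of step 3.4 can be sketched as follows; the function name and the stable-sort tie-breaking are illustrative choices, while the ±1 code convention matches the sgn() hashes of step 3.3.

```python
import numpy as np

def hamming_rank(query_hash, retrieval_hashes):
    """Rank retrieval items by Hamming distance to one query's binary hash.
    Hashes are +/-1 vectors as produced by sgn(); for such codes the
    Hamming distance is simply the number of disagreeing positions."""
    dist = np.sum(query_hash[None, :] != retrieval_hashes, axis=1)
    order = np.argsort(dist, kind="stable")   # closest items first
    return order, dist[order]
```

Usage: for each query row of B_query, call `hamming_rank(row, B_retrieval)` and return the retrieval set permuted by `order`.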
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the invention provides a two-stage cross-modal face image-video hash generation method that can generate image and video hashes that are robust and discriminative both within and between modalities.
2. The method aligns the image characteristics of different yaw angles through a light-weight yaw angle residual mapping, and obtains the image characteristics with robustness to the change of the yaw angle.
3. The method of the invention fuses the video frame characteristics through a multi-layer attention mechanism, and obtains the video characteristics with robustness.
4. Unlike traditional metric learning methods, the proposed method constructs global triplets between samples and semantic class centers and combines them with local triplets, improving the consistency and accuracy of metric learning, while the selection of semi-hard sample pairs accelerates convergence of the loss function.
Drawings
FIG. 1 is an architectural diagram of the method of the present invention.
Fig. 2 is a schematic diagram of metric learning proposed by the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 1, the present embodiment discloses a cross-modal face retrieval hashing method based on metric learning, which specifically includes the following main technical steps:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction; the specific situation is as follows:
the data set selected in this example is the YTC face video data set containing 1910 face videos from 47 celebrities collected from the Youtube video website, each video in the face video data set being cropped into a plurality of video segments of 30 frames in length.
70% of the videos in the YTC face video data set form the face video training set; every frame of these videos is extracted to form the face image training set; and 1 frame is randomly extracted from each of the remaining 30% of videos to serve as the face image test set.
The pre-trained MTCNN, which is a deep face region detection network, is used to align the faces in the video frames and images in the data set, and the aligned video frames and images are scaled to 64 × 64.
The Yaw angle parameter Yaw of the images in the face image training set and face image test set is extracted using a pre-trained Hopenet, a landmark-free deep face pose estimation network.
2) Training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network.
The method for training the feature extraction network by using the face image training set and the face video training set and updating the parameters of the feature extraction network by using cross entropy loss comprises the following steps:
2.11) construct the feature extraction network of the image, comprising a convolutional neural network φ(·) with a VGG structure, a yaw-angle residual mapping σ and a classifier, where σ consists of two fully connected layers with a nonlinear layer, and the classifier consists of a fully connected layer FC1 and a Softmax layer;
2.12) constructing a feature extraction network of the video, wherein the feature extraction network of the video comprises a convolutional neural network phi (-) sharing weight with the image feature extraction network, an attention network and a classifier, and the classifier consists of a full connection layer FC4 and a Softmax layer;
2.13) initialize the parameters of the feature extraction network and pre-train it with the CASIA-WebFace data set, a large-scale public face image data set comprising 494414 face images of 10575 categories collected from the web; initialize the iteration count i1 = 0 and the initial learning rate l1 = 1e-3;
2.14) determine whether the iteration count satisfies i1 < e1 × N1 / Nbatch, where e1 = 1 denotes the number of epochs, N1 the total number of samples in the face image training set, and Nbatch = 60 the number of image samples per iteration; if the condition is not met, end the iteration; if it is met, go to 2.15);
2.15) randomly selecting 12 classes, and then randomly selecting 5 images and 5 videos as data of one iteration in a face image training set and a face video training set of each class;
2.16) input the i-th sample x_i^I in the face image training set; first extract φ(x_i^I) using the VGG network; then add the product of the Yaw angle parameter Yaw and the yaw-angle residual mapping σ(φ(x_i^I)) to the original feature φ(x_i^I), obtaining the image feature:

f_i^I = φ(x_i^I) + Yaw · σ(φ(x_i^I))

of dimension 512 × 1; finally, classify the image feature f_i^I with the classifier, where the classification feature dimension is 47 × 1 and a Dropout layer is added after FC1 to reduce overfitting;
2.17) input the i-th sample v_i in the face video training set; first extract the video frame features φ(v_i) using the VGG network; then fuse the video frame features with a multi-layer attention mechanism. In the first layer, compute the inner products of m = 3 kernel vectors q_{j1} with the video frame features φ(v_i), and obtain an importance coefficient vector for each frame through a ReLU nonlinear mapping, where the j1-th importance coefficient vector e_{j1} is expressed as:

e_{j1} = ReLU(q_{j1}^T φ(v_i))

then normalize e_{j1} over the frames with the Softmax function to obtain m groups of weight vectors a_{j1}:

a_{j1} = Softmax(e_{j1})

and pool the frame features with each weight vector to obtain m intermediate tensors u_{j1}. In the second layer, compute the corresponding importance coefficients e'_{j2} from the tensors obtained in the previous layer:

e'_{j2} = ReLU(q'_{j2}^T u_{j2})

where q'_{j2} denotes the j2-th kernel vector of the second layer; then normalize with the Softmax function to obtain m weight coefficients a'_{j2}:

a'_{j2} = Softmax(e'_{j2})

and fuse the video frame features into one video feature f_i^V = Σ_{j2} a'_{j2} u_{j2} of dimension 512 × 1; finally, classify the video feature f_i^V with the classifier, where the classification feature dimension is 47 × 1 and a Dropout layer is added after FC4 to reduce overfitting;
2.18) use cross-entropy losses as supervision for the image features and video features. The cross-entropy loss L_CE^I of the image features is expressed as:

L_CE^I = −(1/N) Σ_i log( exp(w_{l_i^I}^T f_i^I) / Σ_c exp(w_c^T f_i^I) )

where l_i^I denotes the label of the i-th image, the Softmax output gives the predicted distribution P, and w denotes the weight vectors of the fully connected layer in the image classifier; the cross-entropy loss L_CE^V of the video features is expressed as:

L_CE^V = −(1/N) Σ_i log( exp(ŵ_{l_i^V}^T f_i^V) / Σ_c exp(ŵ_c^T f_i^V) )

where l_i^V denotes the label of the i-th video and ŵ the weight vectors of the fully connected layer in the video classifier;
2.19) the overall optimized loss function is:

L_CE = L_CE^I + L_CE^V

update the parameters θ_f of the feature extraction network by the back-propagation algorithm:

θ_f^(i1+1) = θ_f^(i1) − l1 · dL_CE/dθ_f

where θ_f^(i1) are the feature extraction network parameters at iteration i1, θ_f^(i1+1) those at iteration i1+1, l1 is the learning rate for the batch, and d denotes the derivative; the gradient parameters are updated with an Adam optimizer, with Adam's β1 and β2 set to 0.9 and 0.99 respectively;
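For reference, a single Adam parameter update with the embodiment's β1 = 0.9, β2 = 0.99 can be sketched in numpy; the bias-correction form is the standard Adam rule, and the function and variable names are illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update (beta1=0.9, beta2=0.99 as in the embodiment);
    m and v are the running first- and second-moment estimates, t >= 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

At t = 1 the bias correction makes the step roughly lr · sign(grad), which is why Adam converges quickly in the early training stage described here.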
2.110)i1=i1+1, go to 2.14).
The method comprises the following steps of training a co-expression mapping network by using a face image training set and a face video training set, updating parameters of the co-expression mapping network by using cross entropy loss and metric learning loss, and finely adjusting parameters of a trained feature extraction network, and comprises the following steps:
2.21) construct the co-expression mapping network of the images, comprising two fully connected layers FC2 and FC3 and a Softmax layer; image features are input, FC2 outputs the image co-expression of dimension 48 × 1, and FC3 together with the Softmax layer forms a classifier that classifies the image co-expression;
2.22) construct the co-expression mapping network of the videos, comprising two fully connected layers FC5 and FC6 and a Softmax layer; video features are input, FC5 outputs the video co-expression of dimension 48 × 1, and FC6 together with the Softmax layer forms a classifier that classifies the video co-expression;
2.23) initialize the parameters of the co-expression mapping network, the iteration count i2 = 0 and the initial learning rate l2 = 1e-3;
2.24) determine whether the iteration count satisfies i2 < e2 × N1 / Nbatch, where e2 = 20 denotes the number of epochs, N1 the total number of samples in the face image training set, and Nbatch = 60 the number of samples per iteration; if the condition is not met, end the iteration; if it is met, go to 2.25);
2.25) randomly selecting 12 classes, and then randomly selecting 5 images and 5 videos as data of one iteration in a face image training set and a face video training set of each class;
2.26) form the co-expression generation network from the trained feature extraction network and the co-expression mapping network, and use it to obtain the image co-expression r_i^I and the video co-expression r_i^V, where r_i^I denotes the co-expression of the i-th image and r_i^V the co-expression of the i-th video;
2.27) classify the image and video co-expressions with the classifiers of the co-expression mapping networks and compute their cross-entropy losses, where l_i^I denotes the label of the i-th image and l_i^V the label of the i-th video;
2.28) construct global and local semi-hard triplets as shown in fig. 2; "semi-hard" means selecting, for every positive pair, all negatives that are farther than the positive but within the distance margin, here margin = 2. The metric learning loss L_Triplet is used to maintain the inter-modal feature relations by minimizing the Euclidean distance of positive pairs and maximizing that of negative pairs; the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:

L_local = (1/(N_I + N_V)) Σ max(0, ||r_i − r_j||² − ||r_i − r_k||² + margin)
L_global = (1/(N_I + N_V)) Σ max(0, ||r_i − c_j||² − ||r_i − c_k||² + margin)

where N_I denotes the number of semi-hard triplets using r_i^I as anchor, N_V the number using r_i^V as anchor, r_i and r_j are co-expressions of the same class, r_i and r_k co-expressions of different classes, c_j is the class center of the same class as r_i, and c_k the class center of a different class. When the Softmax function is used, minimizing the cross-entropy loss is equivalent to minimizing the distance between the co-expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layers FC3, FC6 of the co-expression mapping networks, and maximizing the distance to the weight vectors of the other classes; therefore the weight vectors of FC3 and FC6 are used to represent the class centers;
2.29) the overall optimized loss function is:
L = L_CE + α·(L_local + L_global)
wherein L_CE is the cross-entropy loss of the feature extraction network, and α is the proportionality coefficient balancing the intra-modal loss and the inter-modal loss; the parameter θ_g of the co-expression mapping network is updated using the back-propagation algorithm while the parameter θ_f of the feature extraction network is fine-tuned:
θ_g^(i2+1) = θ_g^(i2) - l_2 · dL/dθ_g^(i2)
θ_f^(i2+1) = θ_f^(i2) - l_2 · dL/dθ_f^(i2)
wherein θ_g^(i2) is the co-expression mapping network parameter of the i2-th iteration, θ_g^(i2+1) is the co-expression mapping network parameter of the (i2+1)-th iteration, θ_f^(i2) is the feature extraction network parameter of the i2-th iteration, θ_f^(i2+1) is the feature extraction network parameter of the (i2+1)-th iteration, l_2 is the learning rate of the batch, and d represents the derivative; the Adam optimizer and the SGD optimizer are combined for the back-propagated gradient updates, ensuring fast convergence of the loss function as well as the generalization ability of the algorithm: the first 10 epochs are trained with the Adam optimizer, with β1 and β2 of Adam set to 0.9 and 0.99 respectively; the last 10 epochs are trained with the SGD optimizer, with momentum set to 0.9; when training with the SGD optimizer, L2 regularization with coefficient 0.0025 is added to the network parameters to enhance the generalization ability of the network;
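The two-phase optimization schedule described above can be sketched as a small helper; the function name is illustrative, and the hyperparameters are those stated in the text (10 Adam epochs with β1 = 0.9 and β2 = 0.99, then 10 SGD epochs with momentum 0.9 and L2 coefficient 0.0025):

```python
def optimizer_schedule(epoch):
    """Return the optimizer configuration for a given epoch, following the
    Adam-then-SGD schedule: Adam early for fast convergence, SGD with
    momentum and L2 regularization later for better generalization."""
    if epoch < 10:  # first phase: Adam
        return {"optimizer": "Adam", "betas": (0.9, 0.99), "weight_decay": 0.0}
    # second phase: SGD; weight_decay implements the L2 regularization term
    return {"optimizer": "SGD", "momentum": 0.9, "weight_decay": 0.0025}

for e in (0, 9, 10, 19):
    print(e, optimizer_schedule(e)["optimizer"])
```

In a framework such as PyTorch these configurations would map directly onto the Adam and SGD optimizers, with the weight-decay parameter supplying the L2 term.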
2.210)i2=i2+1, go to 2.24).
3) In order to simulate a real usage scenario, the trained common expression generation network and ITQ iterative quantization are used to extract the hashes of the image common expressions in the face image test set and the hashes of the video common expressions in the face video training set, and the retrieval items are ranked according to the Hamming distance between the image hashes and the video hashes, comprising the following steps:
3.1) taking the face image test set as the query set and the face video training set as the retrieval set, and extracting the common expression matrices r_query and r_retrieval of the query set and the retrieval set respectively with the trained common expression generation network;
3.2) performing iterative training with PCA dimensionality reduction combined with ITQ iterative quantization: first, the PCA projection matrix W of r_retrieval is obtained; then the rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional expression of r_retrieval and its binary hash, wherein R is used to balance the variances of the different dimensions; the hash code length is 48;
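The PCA + ITQ training of step 3.2 can be sketched as follows; this is a minimal NumPy version in the style of Gong and Lazebnik's ITQ, alternating between updating the binary codes and solving an orthogonal Procrustes problem for the rotation R (function and variable names are illustrative, not the patented implementation):

```python
import numpy as np

def train_itq(r_retrieval, n_bits=48, n_iter=50, seed=0):
    """Learn a PCA projection W and an ITQ rotation R from the
    retrieval-set common expressions (one row per retrieval item)."""
    rng = np.random.default_rng(seed)
    X = r_retrieval - r_retrieval.mean(axis=0)   # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:n_bits].T                            # top principal directions
    V = X @ W                                    # low-dimensional expression
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.sign(V @ R)                       # fix R, update binary codes
        U, _, Zt = np.linalg.svd(V.T @ B)        # fix B, update rotation
        R = U @ Zt                               # orthogonal Procrustes solution
    return W, R
```

The learned W and R are then applied exactly as in step 3.3, B = sgn(rWR), so the rotation spreads the quantization error evenly across the hash bits.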
3.3) obtaining the binary hash matrices B_query and B_retrieval of the query set and the retrieval set using W and R:
B_query = sgn(r_query W R)
B_retrieval = sgn(r_retrieval W R)
wherein sgn represents the sign function;
and 3.4) calculating the binary hash of each query item in the query set and the Hamming distance of all the binary hashes in the retrieval set, sequencing the retrieval set according to the Hamming distance, and returning the sequenced retrieval set.
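The ranking of step 3.4 can be sketched as follows; a minimal NumPy illustration with hypothetical ±1 hash codes of length 4:

```python
import numpy as np

def hamming_rank(b_query, B_retrieval):
    """Rank the retrieval set by Hamming distance to a single query hash;
    for +/-1 codes, the Hamming distance is the number of disagreeing bits."""
    dists = np.sum(b_query[None, :] != B_retrieval, axis=1)
    order = np.argsort(dists, kind="stable")  # ascending Hamming distance
    return order, dists[order]

# hypothetical +/-1 codes as produced by sgn(rWR)
B_retrieval = np.array([[ 1, -1,  1,  1],
                        [-1, -1,  1,  1],
                        [-1,  1, -1, -1]])
b_query = np.array([1, -1, 1, 1])
order, d = hamming_rank(b_query, B_retrieval)
print(order.tolist(), d.tolist())  # [0, 1, 2] [0, 1, 4]
```

In practice the ±1 codes would be packed into bit words and the distance computed with XOR and popcount, but the ranking returned is the same.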
The experimental results are as follows:
in the YTC data set, the mAP of the face image training set retrieved by the face image test set is 0.6659, and the mAP of the face image training set retrieved by the face image test set is 0.6829.
In conclusion, the method can fully utilize intra-modal information to extract robust and separable single-modal features, and meanwhile uses the semantic centers of different modalities to fully mine inter-modal similarity information, thereby achieving a better effect in the cross-modal face retrieval task and being worthy of popularization.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (5)
1. The cross-modal face retrieval hashing method based on metric learning is characterized by comprising the following steps of:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction;
2) training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network; during training, training a feature extraction network by using a face image training set and a face video training set, updating parameters of the feature extraction network by using cross entropy loss, training a co-expression mapping network by using the face image training set and the face video training set, updating parameters of the co-expression mapping network by using cross entropy loss and metric learning loss, and finely adjusting parameters of the trained feature extraction network;
3) in order to simulate a real usage scenario, using the trained common expression generation network and ITQ iterative quantization to extract the hashes of the image common expressions in the face image test set and the hashes of the video common expressions in the face video training set, and ranking the retrieval items according to the Hamming distance between the image hashes and the video hashes.
2. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in the step 1), each video in the face video data set is cut into a plurality of video segments of the same length; 70% of the videos in the face video data set form the face video training set, every frame of the videos in the face video training set is extracted to form the face image training set, and 1 frame is randomly extracted from each of the remaining 30% of the videos in the face video data set to serve as the face image test set; the faces in the video frames and images of the data set are aligned using a pre-trained MTCNN, wherein MTCNN is a deep face region detection network, and the aligned video frames and images are scaled to H x W, wherein H is the height of a picture and W is the width of a picture; the Yaw angle parameter Yaw of the images of the face image training set and the face image test set is extracted using a pre-trained Hopenet, wherein Hopenet is a keypoint-free deep face pose estimation network.
3. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in step 2), for the feature extraction network, the following operations are performed:
2.11) constructing a feature extraction network of the image, which comprises a convolutional neural network φ(·) with a VGG structure, a yaw-angle residual mapping σ, and a classifier, wherein σ consists of several fully connected layers and nonlinear layers, and the classifier consists of a fully connected layer and a Softmax layer;
2.12) constructing a feature extraction network of the video, which comprises a convolutional neural network φ(·) sharing weights with the feature extraction network of the image, an attention network, and a classifier, wherein the classifier consists of a fully connected layer and a Softmax layer;
2.13) initializing the feature extraction network parameters: the feature extraction network is pre-trained on the CASIA-WebFace dataset, a large-scale public face image dataset collected from the Internet comprising 494414 face images of 10575 classes; the iteration number is initialized as i_1 = 0;
2.14) determining whether the iteration number satisfies i_1 < e_1·N_1/N_batch, wherein e_1 denotes the number of epochs, N_1 represents the total number of samples of the face image training set, and N_batch represents the number of image samples of one iteration; if the condition is not met, the iteration ends; if the condition is met, go to step 2.15);
2.15) randomly selecting N_c classes, and randomly selecting a number of images and a number of videos from the face image training set and the face video training set of each class as the data of one iteration;
2.16) inputting the i-th sample x_i^I of the face image training set: first, the original feature φ(x_i^I) is extracted with the VGG network; second, the product of the Yaw angle parameter Yaw and the yaw-angle residual mapping σ(φ(x_i^I)) is added to the original feature φ(x_i^I) to obtain the image feature:
f_i^I = φ(x_i^I) + Yaw · σ(φ(x_i^I))
2.17) inputting the i-th sample x_i^V of the face video training set: first, the video frame features f_t^V are extracted with the VGG network; second, the video frame features are fused with a multi-layer attention mechanism: the first layer computes the inner products of m kernel vectors k_{j1}^(1) with the video frame features and obtains the importance coefficient vector of each frame through a ReLU nonlinear mapping, wherein the j1-th importance coefficient vector e_{j1} is expressed as:
e_{j1,t} = ReLU((k_{j1}^(1))^T f_t^V)
then the Softmax function normalizes e_{j1} to obtain m groups of weight vectors a_{j1}, wherein the j1-th group of weight vectors a_{j1} is expressed as:
a_{j1,t} = exp(e_{j1,t}) / Σ_t' exp(e_{j1,t'})
and the weighted frame features are aggregated into m vectors u_{j1} = Σ_t a_{j1,t} f_t^V; in the second layer, the corresponding importance coefficients are computed from the tensors obtained in the previous layer, wherein the j2-th importance coefficient e_{j2} is expressed as:
e_{j2} = (k_{j2}^(2))^T u_{j2}
wherein k_{j2}^(2) represents the j2-th kernel vector of the second layer; then the Softmax function normalizes e_{j2} to obtain m weight coefficients w_{j2}, wherein the j2-th weight coefficient w_{j2} is expressed as:
w_{j2} = exp(e_{j2}) / Σ_{j2'} exp(e_{j2'})
the video frame features are thereby fused into one video feature f_i^V = Σ_{j2} w_{j2} u_{j2}, and finally the classifier is used to classify the video feature f_i^V;
2.18) supervising the obtained image features and video features with the cross-entropy loss; the cross-entropy loss L_CE^I of the image features is expressed as:
L_CE^I = -(1/N) Σ_i log P(l_i^I | f_i^I)
wherein l_i^I represents the label of the i-th image, P represents the feature distribution given by the Softmax over the weight vectors W^I of the fully connected layer in the classifier, and N is the number of samples; the cross-entropy loss L_CE^V of the video features is expressed as:
L_CE^V = -(1/N) Σ_i log P(l_i^V | f_i^V)
wherein l_i^V represents the label of the i-th video, and W^V represents the weight vectors of the fully connected layer in the classifier;
updating the parameter θ_f of the feature extraction network by the back-propagation algorithm:
θ_f^(i1+1) = θ_f^(i1) - l_1 · d(L_CE^I + L_CE^V)/dθ_f^(i1)
wherein θ_f^(i1) is the feature extraction network parameter of the i1-th iteration, θ_f^(i1+1) is the feature extraction network parameter of the (i1+1)-th iteration, l_1 is the learning rate of the batch, and d represents the derivative;
2.110)i1=i1+1, go to 2.14).
4. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in step 2), for the co-expression mapping network, the following operations are performed:
2.21) constructing a common expression mapping network of the images, wherein the network comprises two full connection layers and a Softmax layer, inputting image characteristics, outputting the common expression of the images by the first full connection layer, and classifying the common expression of the images by a classifier by the second full connection layer and the Softmax layer;
2.22) constructing a video co-expression mapping network, wherein the network comprises two full connection layers and a Softmax layer, video characteristics are input, the first full connection layer outputs video co-expression, and the second full connection layer and the Softmax layer form a classifier for classifying the video co-expression;
2.23) initializing parameters of the coexpression mapping network, number of iterations i2=0;
2.24) determining whether the iteration number satisfies i_2 < e_2·N_1/N_batch, wherein e_2 denotes the number of epochs, N_1 represents the total number of samples of the face image training set, and N_batch represents the number of samples of one iteration; if the condition is not met, the iteration ends; if the condition is met, go to 2.25);
2.25) randomly selecting N_c classes, and randomly selecting a number of images and a number of videos from the face image training set and the face video training set of each class as the data of one iteration;
2.26) obtaining the image common expression r_i^I and the video common expression r_i^V using the common expression generation network, wherein r_i^I represents the common expression of the i-th image and r_i^V represents the common expression of the i-th video;
wherein l_i^I represents the label of the i-th image, and l_i^V represents the label of the i-th video;
2.28) constructing global semi-hard triplets and local semi-hard triplets to maintain the inter-modal feature relationship; a semi-hard triplet selects, for each anchor, all positive sample pairs together with the negative sample pairs whose distance exceeds the positive-pair distance by less than the margin; the metric learning loss L_Triplet minimizes the Euclidean distance of positive sample pairs and maximizes the Euclidean distance of negative sample pairs; the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:
L_local = (1/N_I) Σ max(0, ||r_i^I - r_j||² - ||r_i^I - r_k||² + margin) + (1/N_V) Σ max(0, ||r_i^V - r_j||² - ||r_i^V - r_k||² + margin)
L_global = (1/N_I) Σ max(0, ||r_i^I - c_j||² - ||r_i^I - c_k||² + margin) + (1/N_V) Σ max(0, ||r_i^V - c_j||² - ||r_i^V - c_k||² + margin)
wherein N_I denotes the number of semi-hard triplets with r_i^I as anchor, N_V denotes the number of semi-hard triplets with r_i^V as anchor, r_i, r_j represent common expressions of the same class, r_i, r_k represent common expressions of different classes, c_j represents the class center of the same class as r_i, and c_k represents the class center of a different class from r_i; when the Softmax function is used, minimizing the cross-entropy loss function is equivalent to minimizing the distance between the common expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layer of the co-expression mapping network, and maximizing the distance between the common expressions and the weight vectors of the other classes; the class centers are therefore represented by the weight vectors of the second fully connected layers of the image co-expression mapping network and the video co-expression mapping network;
2.29) the overall optimized loss function is:
L = L_CE + α·(L_local + L_global)
wherein L_CE is the cross-entropy loss of the feature extraction network, and α is the proportionality coefficient balancing the intra-modal loss and the inter-modal loss; the parameter θ_g of the co-expression mapping network is updated using the back-propagation algorithm while the parameter θ_f of the feature extraction network is fine-tuned, l_2 being the learning rate of the batch:
θ_g^(i2+1) = θ_g^(i2) - l_2 · dL/dθ_g^(i2)
θ_f^(i2+1) = θ_f^(i2) - l_2 · dL/dθ_f^(i2)
wherein θ_g^(i2) is the co-expression mapping network parameter of the i2-th iteration, θ_g^(i2+1) is the co-expression mapping network parameter of the (i2+1)-th iteration, θ_f^(i2) is the feature extraction network parameter of the i2-th iteration, θ_f^(i2+1) is the feature extraction network parameter of the (i2+1)-th iteration, and d represents the derivative;
2.210)i2=i2+1, go to 2.24).
5. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: the step 3) comprises the following steps:
3.1) taking the face image test set as the query set and the face video training set as the retrieval set, and extracting the common expression matrices r_query and r_retrieval of the query set and the retrieval set respectively with the trained common expression generation network;
3.2) performing iterative training with PCA dimensionality reduction combined with ITQ iterative quantization: first, the PCA projection matrix W of r_retrieval is obtained; then the rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional expression of r_retrieval and its binary hash, wherein R is used to balance the variances of the different dimensions;
3.3) obtaining the binary hash matrices B_query and B_retrieval of the query set and the retrieval set using W and R:
B_query = sgn(r_query W R)
B_retrieval = sgn(r_retrieval W R)
wherein sgn represents the sign function;
and 3.4) calculating the binary hash of each query item in the query set and the Hamming distance of all the binary hashes in the retrieval set, sequencing the retrieval set according to the Hamming distance, and returning the sequenced retrieval set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111175867.7A CN114067385B (en) | 2021-10-09 | 2021-10-09 | Cross-modal face retrieval hash method based on metric learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114067385A true CN114067385A (en) | 2022-02-18 |
CN114067385B CN114067385B (en) | 2024-05-31 |
Family
ID=80234405
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111175867.7A Active CN114067385B (en) | 2021-10-09 | 2021-10-09 | Cross-modal face retrieval hash method based on metric learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114067385B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114360032A (en) * | 2022-03-17 | 2022-04-15 | 北京启醒科技有限公司 | Polymorphic invariance face recognition method and system |
CN114638002A (en) * | 2022-03-21 | 2022-06-17 | 华南理工大学 | Compressed image encryption method supporting similarity retrieval |
CN114866345A (en) * | 2022-07-05 | 2022-08-05 | 支付宝(杭州)信息技术有限公司 | Processing method, device and equipment for biological recognition |
CN115063845A (en) * | 2022-06-20 | 2022-09-16 | 华南理工大学 | Finger vein identification method based on lightweight network and deep hash |
CN115240249A (en) * | 2022-07-07 | 2022-10-25 | 湖北大学 | Feature extraction classification measurement learning method and system for face recognition and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175248A (en) * | 2019-04-04 | 2019-08-27 | 中国科学院信息工程研究所 | A kind of Research on face image retrieval and device encoded based on deep learning and Hash |
CN111753190A (en) * | 2020-05-29 | 2020-10-09 | 中山大学 | Meta learning-based unsupervised cross-modal Hash retrieval method |
CN112800292A (en) * | 2021-01-15 | 2021-05-14 | 南京邮电大学 | Cross-modal retrieval method based on modal specificity and shared feature learning |
CN113190699A (en) * | 2021-05-14 | 2021-07-30 | 华中科技大学 | Remote sensing image retrieval method and device based on category-level semantic hash |
Also Published As
Publication number | Publication date |
---|---|
CN114067385B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107122809B (en) | Neural network feature learning method based on image self-coding | |
CN114067385A (en) | Cross-modal face retrieval Hash method based on metric learning | |
Zheng et al. | A deep and autoregressive approach for topic modeling of multimodal data | |
WO2021159769A1 (en) | Image retrieval method and apparatus, storage medium, and device | |
CN108491430B (en) | Unsupervised Hash retrieval method based on clustering characteristic directions | |
Zafar et al. | Image classification by addition of spatial information based on histograms of orthogonal vectors | |
CN108427740B (en) | Image emotion classification and retrieval algorithm based on depth metric learning | |
Hu et al. | Learning dual-pooling graph neural networks for few-shot video classification | |
CN110598022B (en) | Image retrieval system and method based on robust deep hash network | |
CN111104555A (en) | Video hash retrieval method based on attention mechanism | |
CN112712127A (en) | Image emotion polarity classification method combined with graph convolution neural network | |
Menaga et al. | Deep learning: a recent computing platform for multimedia information retrieval | |
CN107220597B (en) | Key frame selection method based on local features and bag-of-words model human body action recognition process | |
CN113344069B (en) | Image classification method for unsupervised visual representation learning based on multi-dimensional relation alignment | |
Cao et al. | Facial expression recognition algorithm based on the combination of CNN and K-Means | |
Yao | Key frame extraction method of music and dance video based on multicore learning feature fusion | |
Liu et al. | A multimodal approach for multiple-relation extraction in videos | |
Oussama et al. | A fast weighted multi-view Bayesian learning scheme with deep learning for text-based image retrieval from unlabeled galleries | |
Dong et al. | A supervised dictionary learning and discriminative weighting model for action recognition | |
Bibi et al. | Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval | |
CN115100694A (en) | Fingerprint quick retrieval method based on self-supervision neural network | |
An et al. | Near duplicate product image detection based on binary hashing | |
Benuwa et al. | Deep locality‐sensitive discriminative dictionary learning for semantic video analysis | |
CN105279489B (en) | A kind of method for extracting video fingerprints based on sparse coding | |
Huang et al. | Baggage image retrieval with attention-based network for security checks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||