CN114067385A - Cross-modal face retrieval Hash method based on metric learning - Google Patents

Cross-modal face retrieval Hash method based on metric learning

Info

Publication number
CN114067385A
CN114067385A (application CN202111175867.7A)
Authority
CN
China
Prior art keywords
video
network
face
expression
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111175867.7A
Other languages
Chinese (zh)
Other versions
CN114067385B (en)
Inventor
梁籍云
沃焱
韩国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202111175867.7A priority Critical patent/CN114067385B/en
Publication of CN114067385A publication Critical patent/CN114067385A/en
Application granted granted Critical
Publication of CN114067385B publication Critical patent/CN114067385B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/214: PHYSICS; COMPUTING; ELECTRIC DIGITAL DATA PROCESSING; Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/253: PHYSICS; COMPUTING; ELECTRIC DIGITAL DATA PROCESSING; Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N 3/045: PHYSICS; COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/084: PHYSICS; COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; Computing arrangements based on biological models; Neural networks; Learning methods; Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a cross-modal face retrieval hash method based on metric learning, which comprises the following steps: 1) preprocessing a face video data set, namely video cropping, division into a face image training set, a face image test set and a face video training set, face alignment and yaw angle extraction; 2) training the constructed co-expression generation network, which consists of a feature extraction network and a co-expression mapping network; 3) using the trained co-expression generation network and ITQ iterative quantization to extract the hashes of the image co-expressions in the face image test set and of the video co-expressions in the face video training set, and ranking the retrieval items by the Hamming distance between the image hashes and the video hashes. The cross-modal face hash function of the present invention can generate image and video features that are robust and discriminative both within and between modalities. The invention is tested on the YTC face video data set and achieves high cross-modal retrieval accuracy.

Description

Cross-modal face retrieval Hash method based on metric learning
Technical Field
The invention relates to the technical field of multimedia processing, in particular to a cross-modal face retrieval hash method based on metric learning.
Background
Cross-modal face retrieval refers to retrieving the face videos related to a given face ID from a face image of that ID, or retrieving the corresponding face images from a face video of that ID. With the creation and spread of multimedia information becoming increasingly widespread, the diversity of users' retrieval needs has also grown considerably, so cross-modal face retrieval has broad application scenarios: for example, related films and television series can be retrieved on a video playback website from a character image, or particular video segments of a series can be jumped to according to the selected character.
Generally, an image contains only a two-dimensional spatial description, while a video contains spatial changes over a continuous period of time, so the two modalities differ greatly. The predominant way to reduce the inter-modality difference is to project the two kinds of features into a common space in which the image features and the video features are semantically consistent. However, for images and videos the inter-modality difference lies not only in the gap between the temporal and spatial domains but also in the difference of viewing angles: the faces in face images and face videos may show large view-angle differences involving complex changes such as occlusion, nonlinear distortion and position change, which makes the features of the different modalities more dispersed in the common space and in turn lowers the accuracy of cross-modal retrieval. This complex inter-modality difference makes cross-modal face retrieval a challenging problem.
To solve the above problems, the present invention provides a two-stage method: in the first stage, video features and image features robust to view-angle changes are extracted; in the second stage, the two feature spaces are mapped into a common space and the two heterogeneous features are aligned in the common space by metric learning; finally, the image hashes and video hashes are obtained by ITQ iterative quantization.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a cross-modal face retrieval hash method based on metric learning, which can extract hashes that are robust and discriminative both within and between modalities and improves the consistency and accuracy of cross-modal metric learning.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: the cross-modal face retrieval hashing method based on metric learning comprises the following steps:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction;
2) training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network; during training, training a feature extraction network by using a face image training set and a face video training set, updating parameters of the feature extraction network by using cross entropy loss, training a co-expression mapping network by using the face image training set and the face video training set, updating parameters of the co-expression mapping network by using cross entropy loss and metric learning loss, and finely adjusting parameters of the trained feature extraction network;
3) in order to simulate a real usage scenario, using the trained co-expression generation network and ITQ iterative quantization to extract the hashes of the image co-expressions in the face image test set and of the video co-expressions in the face video training set, and ranking the retrieval items by the Hamming distance between the image hashes and the video hashes.
Further, in step 1), each video in the face video data set is cut into several video segments of the same length; 70% of the videos in the face video data set form the face video training set, every frame of the videos in the face video training set is extracted to form the face image training set, and 1 frame is randomly extracted from each of the remaining 30% of the videos in the face video data set to serve as the face image test set; the faces in the video frames and images of the data set are aligned with a pre-trained MTCNN, a deep face region detection network, and the aligned video frames and images are scaled to H × W, where H is the image height and W is the image width; the Yaw angle parameter Yaw of the images in the face image training set and the face image test set is extracted with a pre-trained Hopenet, a keypoint-free deep face pose estimation network.
Further, in step 2), for the feature extraction network, the following operations are performed:
2.11) constructing a feature extraction network for images, which comprises a convolutional neural network φ(·) with a VGG structure, a yaw angle residual mapping σ and a classifier, where σ consists of several fully connected layers and nonlinear layers, and the classifier consists of a fully connected layer and a Softmax layer;
2.12) constructing a feature extraction network for videos, which comprises a convolutional neural network φ(·) sharing weights with the image feature extraction network, an attention network and a classifier, where the classifier consists of a fully connected layer and a Softmax layer;
2.13) initializing the parameters of the feature extraction network, pre-training it on the CASIA-WebFace data set, a large-scale public face image data set containing 494414 face images from 10575 classes collected from the Internet, and initializing the iteration counter i_1 = 0;
2.14) determining whether the iteration counter satisfies i_1 < e_1·N_1/N_batch, where e_1 denotes the number of epochs, N_1 the total number of samples in the face image training set, and N_batch the number of image samples per iteration; if the condition is not satisfied, the iteration ends; if it is satisfied, go to step 2.15);
2.15) randomly selecting N_c classes, and from the face image training set and the face video training set of each class randomly selecting N_batch/N_c images and N_batch/N_c videos as the data for one iteration;
2.16) inputting the i-th sample x_i^I of the face image training set: first, the original feature φ(x_i^I) is extracted with the VGG network; next, the product of the Yaw angle parameter Yaw and the yaw angle residual mapping feature σ(φ(x_i^I)) is added to the original feature φ(x_i^I) to obtain the image feature:
f_i^I = φ(x_i^I) + Yaw·σ(φ(x_i^I))
finally, the classifier is used to classify the image feature f_i^I;
2.17) inputting the i-th sample x_i^V of the face video training set: first, the frame features φ(x_{i,t}^V) are extracted from the video frames with the VGG network; next, the frame features are fused with a multi-layer attention mechanism. In the first layer, the inner products of m kernel vectors q_{j_1}^(1) with the video frame features φ(x_{i,t}^V) are computed and passed through a ReLU nonlinear mapping to obtain an importance coefficient vector for each frame, where the j_1-th importance coefficient vector e_{j_1} is expressed as:
e_{j_1,t} = ReLU(q_{j_1}^(1) · φ(x_{i,t}^V))
then e_{j_1} is normalized with the Softmax function to obtain m groups of weight vectors, where the j_1-th group of weights a_{j_1} is expressed as:
a_{j_1,t} = exp(e_{j_1,t}) / Σ_{t'} exp(e_{j_1,t'})
and the frame features weighted by a_{j_1} are aggregated into m candidate features v_{j_1} = Σ_t a_{j_1,t}·φ(x_{i,t}^V). In the second layer, the corresponding importance coefficients e_{j_2} are computed from the tensor obtained with the previous layer, where the j_2-th importance coefficient e_{j_2} is expressed as:
e_{j_2} = ReLU(q_{j_2}^(2) · v_{j_2})
where q_{j_2}^(2) denotes the j_2-th kernel vector of the second layer; then the e_{j_2} are normalized with the Softmax function to obtain m weight coefficients, where the j_2-th weight coefficient a_{j_2} is expressed as:
a_{j_2} = exp(e_{j_2}) / Σ_{j_2'} exp(e_{j_2'})
Through f_i^V = Σ_{j_2} a_{j_2}·v_{j_2} the video frame features are fused into a single video feature f_i^V; finally, the classifier is used to classify the video feature f_i^V;
2.18) using the cross-entropy loss as supervision for the obtained image features and video features: the cross-entropy loss L_CE^I of the image features f_i^I is expressed as:
L_CE^I = -(1/N) Σ_i log P(l_i^I | f_i^I)
P(l_i^I | f_i^I) = exp(W_{l_i^I}^I · f_i^I) / Σ_c exp(W_c^I · f_i^I)
where l_i^I denotes the label of the i-th image, P denotes the feature distribution and W^I denotes the weight vectors of the fully connected layer in the classifier; the cross-entropy loss L_CE^V of the video features f_i^V is expressed as:
L_CE^V = -(1/N) Σ_i log P(l_i^V | f_i^V)
P(l_i^V | f_i^V) = exp(W_{l_i^V}^V · f_i^V) / Σ_c exp(W_c^V · f_i^V)
where l_i^V denotes the label of the i-th video and W^V denotes the weight vectors of the fully connected layer in the classifier;
2.19) the overall loss function L_1 to be optimized is:
L_1 = L_CE^I + L_CE^V
and the parameters θ_f of the feature extraction network are updated by the back-propagation algorithm:
θ_f^(i_1+1) = θ_f^(i_1) - l_1 · dL_1/dθ_f^(i_1)
where θ_f^(i_1) are the feature extraction network parameters at the i_1-th iteration, θ_f^(i_1+1) the feature extraction network parameters at the (i_1+1)-th iteration, l_1 is the learning rate of the batch and d denotes the differential;
2.110) i_1 = i_1 + 1, go to step 2.14).
Further, in step 2), for the co-expression mapping network, the following operations are performed:
2.21) constructing a co-expression mapping network for images, which comprises two fully connected layers and a Softmax layer; it takes the image features as input, the first fully connected layer outputs the image co-expression, and the second fully connected layer and the Softmax layer form a classifier that classifies the image co-expression;
2.22) constructing a co-expression mapping network for videos, which comprises two fully connected layers and a Softmax layer; it takes the video features as input, the first fully connected layer outputs the video co-expression, and the second fully connected layer and the Softmax layer form a classifier that classifies the video co-expression;
2.23) initializing the parameters of the co-expression mapping network and the iteration counter i_2 = 0;
2.24) determining whether the iteration counter satisfies i_2 < e_2·N_1/N_batch, where e_2 denotes the number of epochs, N_1 the total number of samples in the face image training set, and N_batch the number of samples per iteration; if the condition is not satisfied, the iteration ends; if it is satisfied, go to step 2.25);
2.25) randomly selecting N_c classes, and from the face image training set and the face video training set of each class randomly selecting N_batch/N_c images and N_batch/N_c videos as the data for one iteration;
2.26) obtaining the image co-expression r_i^I and the video co-expression r_i^V with the co-expression generation network, where r_i^I denotes the co-expression of the i-th image and r_i^V the co-expression of the i-th video;
2.27) using the cross-entropy loss L_2 to preserve the intra-modal similarity relations of the co-expressions:
L_2 = -(1/N) Σ_i [log P(l_i^I | r_i^I) + log P(l_i^V | r_i^V)]
where l_i^I denotes the label of the i-th image and l_i^V the label of the i-th video;
2.28) constructing global semi-hard triplets and local semi-hard triplets to preserve the inter-modal feature relations, where a semi-hard triplet is one whose negative pair is farther apart than its positive pair but by less than the margin; the metric learning loss L_Triplet minimizes the Euclidean distance of the positive pairs and maximizes the Euclidean distance of the negative pairs, and the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:
L_local = (1/(N_I + N_V)) Σ max(||r_i - r_j||² - ||r_i - r_k||² + margin, 0)
L_global = (1/(N_I + N_V)) Σ max(||r_i - c_j||² - ||r_i - c_k||² + margin, 0)
where N_I denotes the number of semi-hard triplets anchored at r_i^I, N_V the number of semi-hard triplets anchored at r_i^V, r_i, r_j denote co-expressions of the same class, r_i, r_k denote co-expressions of different classes, c_j denotes the class center of the same class as r_i, and c_k the class center of a different class; when the Softmax function is used, minimizing the cross-entropy loss is equivalent to minimizing the distance between the co-expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layer of the co-expression mapping network and maximizing the distance to the weight vectors of the other classes, so the class centers c_j and c_k are represented by the weight vectors of the second fully connected layers of the image co-expression mapping network and the video co-expression mapping network;
2.29) the overall loss function to be optimized is:
L_3 = L_1 + L_2 + α(L_local + L_global)
where L_1 is the cross-entropy loss of the feature extraction network and α is the scale coefficient balancing the intra-modal and inter-modal losses; the parameters θ_g of the co-expression mapping network are updated by the back-propagation algorithm while the parameters θ_f of the feature extraction network are fine-tuned, with l_2 the learning rate of the batch:
θ_g^(i_2+1) = θ_g^(i_2) - l_2 · dL_3/dθ_g^(i_2)
θ_f^(i_2+1) = θ_f^(i_2) - l_2 · dL_3/dθ_f^(i_2)
where θ_g^(i_2) are the co-expression mapping network parameters at the i_2-th iteration, θ_g^(i_2+1) the co-expression mapping network parameters at the (i_2+1)-th iteration, θ_f^(i_2) the feature extraction network parameters at the i_2-th iteration, θ_f^(i_2+1) the feature extraction network parameters at the (i_2+1)-th iteration, and d denotes the differential;
2.210) i_2 = i_2 + 1, go to step 2.24).
Further, the step 3) comprises the following steps:
3.1) taking the face image test set as the query set and the face video training set as the retrieval set, and extracting the co-expression matrices r_query and r_retrieval of the query set and the retrieval set, respectively, with the trained co-expression generation network;
3.2) combining PCA dimensionality reduction with ITQ iterative quantization: first, the PCA projection matrix W of r_retrieval is obtained with PCA; then the rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional representation of r_retrieval and its binary hash, where R is used to balance the variances of the different dimensions;
3.3) obtaining the binary hash matrices B_query and B_retrieval of the query set and the retrieval set with W and R:
B_query = sgn(r_query·W·R)
B_retrieval = sgn(r_retrieval·W·R)
where sgn denotes the sign function;
3.4) computing the Hamming distances between the binary hash of each query item in the query set and all binary hashes in the retrieval set, ranking the retrieval set by Hamming distance, and returning the ranked retrieval set.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention provides a two-stage cross-modal face image-video hash generation method that can generate image and video hashes which are robust and discriminative both within and between modalities.
2. The method aligns image features of different yaw angles through a lightweight yaw angle residual mapping and obtains image features that are robust to yaw angle changes.
3. The method fuses video frame features through a multi-layer attention mechanism and obtains robust video features.
4. Unlike traditional metric learning methods, the method uses semantic class centers to construct global triplets between samples and class centers and combines them with local triplets to improve the consistency and accuracy of metric learning, while the screening of semi-hard sample pairs accelerates the convergence of the loss function.
Drawings
FIG. 1 is an architectural diagram of the method of the present invention.
Fig. 2 is a schematic diagram of metric learning proposed by the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 1, the present embodiment discloses a cross-modal face retrieval hashing method based on metric learning, which specifically includes the following main technical steps:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction; the specific situation is as follows:
the data set selected in this example is the YTC face video data set containing 1910 face videos from 47 celebrities collected from the Youtube video website, each video in the face video data set being cropped into a plurality of video segments of 30 frames in length.
And (3) forming a face video training set by 70% of videos in the YTC face video data set, extracting each frame of the videos in the face video training set to form a face image training set, and randomly extracting 1 frame from each of the rest 30% of videos in the face video data set to serve as a face image test set.
The pre-trained MTCNN, which is a deep face region detection network, is used to align the faces in the video frames and images in the data set, and the aligned video frames and images are scaled to 64 × 64.
And extracting the Yaw angle parameter Yaw of the images of the face image training set and the face image testing set by using the pre-trained Hopentet, wherein the Hopentet is a deep face pose estimation network without key points.
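To make step 1) concrete, the following is a minimal Python sketch of the preprocessing, assuming the clip length of 30 frames, the 70%/30% split and the 64 × 64 image size of this embodiment; detect_and_align_face and estimate_yaw are hypothetical placeholders standing in for the pre-trained MTCNN and Hopenet models, which are not reproduced here.

```python
import random
import cv2  # OpenCV, used here only to decode video frames


def detect_and_align_face(frame, size=64):
    """Placeholder for the pre-trained MTCNN alignment step (assumed available)."""
    raise NotImplementedError


def estimate_yaw(image):
    """Placeholder for the pre-trained Hopenet yaw-angle estimation (assumed available)."""
    raise NotImplementedError


def load_clips(video_path, clip_len=30):
    """Decode a video, align every frame, and cut it into clips of clip_len frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(detect_and_align_face(frame))
    cap.release()
    n = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n)]


def split_dataset(video_paths, train_ratio=0.7, seed=0):
    """70% of the videos form the video training set; every frame of a training
    video forms the image training set; 1 random frame of each remaining video
    forms the image test set; yaw angles are extracted for the training images."""
    rng = random.Random(seed)
    paths = list(video_paths)
    rng.shuffle(paths)
    n_train = int(train_ratio * len(paths))
    video_train, video_test = paths[:n_train], paths[n_train:]

    image_train = [f for p in video_train for clip in load_clips(p) for f in clip]
    image_test = [rng.choice([f for clip in load_clips(p) for f in clip])
                  for p in video_test]
    yaw_train = [estimate_yaw(img) for img in image_train]
    return video_train, image_train, image_test, yaw_train
```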
2) Training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network.
The method for training the feature extraction network by using the face image training set and the face video training set and updating the parameters of the feature extraction network by using cross entropy loss comprises the following steps:
2.11) constructing a feature extraction network for images, which comprises a convolutional neural network φ(·) with a VGG structure, a yaw angle residual mapping σ and a classifier, where σ consists of two fully connected layers and a nonlinear layer, and the classifier consists of a fully connected layer FC1 and a Softmax layer;
2.12) constructing a feature extraction network for videos, which comprises a convolutional neural network φ(·) sharing weights with the image feature extraction network, an attention network and a classifier, where the classifier consists of a fully connected layer FC4 and a Softmax layer;
2.13) initializing the parameters of the feature extraction network and pre-training it on CASIA-WebFace, a large-scale public face image data set containing 494414 face images from 10575 classes collected from the Internet; the iteration counter is initialized to i_1 = 0 and the initial learning rate to l_1 = 1e-3;
2.14) determining whether the iteration counter satisfies i_1 < e_1·N_1/N_batch, where e_1 = 1 denotes the number of epochs, N_1 the total number of samples in the face image training set, and N_batch = 60 the number of image samples per iteration; if the condition is not satisfied, the iteration ends; if it is satisfied, go to 2.15);
2.15) randomly selecting 12 classes, and then randomly selecting 5 images and 5 videos as data of one iteration in a face image training set and a face video training set of each class;
2.16) inputting the i-th sample x_i^I of the face image training set: first, the original feature φ(x_i^I) is extracted with the VGG network; next, the product of the Yaw angle parameter Yaw and the yaw angle residual mapping feature σ(φ(x_i^I)) is added to the original feature φ(x_i^I) to obtain the image feature:
f_i^I = φ(x_i^I) + Yaw·σ(φ(x_i^I))
where f_i^I has dimension 512 × 1; finally, the classifier is used to classify the image feature f_i^I, the classification output dimension is 47 × 1, and a Dropout layer is added after FC1 to reduce overfitting;
2.17) inputting the i-th sample x_i^V of the face video training set: first, the frame features φ(x_{i,t}^V) are extracted from the video frames with the VGG network; next, the frame features are fused with a multi-layer attention mechanism (an illustrative sketch of steps 2.16) and 2.17) is given after step 2.110) below). In the first layer, the inner products of m = 3 kernel vectors q_{j_1}^(1) with the video frame features φ(x_{i,t}^V) are computed and passed through a ReLU nonlinear mapping to obtain an importance coefficient vector for each frame, where the j_1-th importance coefficient vector e_{j_1} is expressed as:
e_{j_1,t} = ReLU(q_{j_1}^(1) · φ(x_{i,t}^V))
then e_{j_1} is normalized with the Softmax function to obtain m groups of weight vectors, where the j_1-th group of weights a_{j_1} is expressed as:
a_{j_1,t} = exp(e_{j_1,t}) / Σ_{t'} exp(e_{j_1,t'})
and the frame features weighted by a_{j_1} are aggregated into m candidate features v_{j_1} = Σ_t a_{j_1,t}·φ(x_{i,t}^V). In the second layer, the corresponding importance coefficients e_{j_2} are computed from the tensor obtained with the previous layer, where the j_2-th importance coefficient e_{j_2} is expressed as:
e_{j_2} = ReLU(q_{j_2}^(2) · v_{j_2})
where q_{j_2}^(2) denotes the j_2-th kernel vector of the second layer; then the e_{j_2} are normalized with the Softmax function to obtain m weight coefficients, where the j_2-th weight coefficient a_{j_2} is expressed as:
a_{j_2} = exp(e_{j_2}) / Σ_{j_2'} exp(e_{j_2'})
Through f_i^V = Σ_{j_2} a_{j_2}·v_{j_2} the video frame features are fused into a single video feature f_i^V of dimension 512 × 1; finally, the classifier is used to classify the video feature f_i^V, the classification output dimension is 47 × 1, and a Dropout layer is added after FC4 to reduce overfitting;
2.18) using the cross-entropy loss as supervision for the obtained image features and video features: the cross-entropy loss L_CE^I of the image features f_i^I is expressed as:
L_CE^I = -(1/N) Σ_i log P(l_i^I | f_i^I)
P(l_i^I | f_i^I) = exp(W_{l_i^I}^I · f_i^I) / Σ_c exp(W_c^I · f_i^I)
where l_i^I denotes the label of the i-th image, P denotes the feature distribution and W^I denotes the weight vectors of the fully connected layer in the classifier; the cross-entropy loss L_CE^V of the video features f_i^V is expressed as:
L_CE^V = -(1/N) Σ_i log P(l_i^V | f_i^V)
P(l_i^V | f_i^V) = exp(W_{l_i^V}^V · f_i^V) / Σ_c exp(W_c^V · f_i^V)
where l_i^V denotes the label of the i-th video and W^V denotes the weight vectors of the fully connected layer in the classifier;
2.19) the overall loss function to be optimized is:
L_1 = L_CE^I + L_CE^V
and the parameters θ_f of the feature extraction network are updated by the back-propagation algorithm:
θ_f^(i_1+1) = θ_f^(i_1) - l_1 · dL_1/dθ_f^(i_1)
where θ_f^(i_1) are the feature extraction network parameters at the i_1-th iteration, θ_f^(i_1+1) the feature extraction network parameters at the (i_1+1)-th iteration, l_1 the learning rate of the batch, and d denotes the differential; the gradients are applied with an Adam optimizer whose β_1 and β_2 are set to 0.9 and 0.99, respectively;
2.110) i_1 = i_1 + 1, go to 2.14).
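As a companion to steps 2.16) and 2.17), the following is a minimal PyTorch sketch of the image and video feature extraction under the reconstruction given above: the yaw residual term Yaw·σ(φ(x)) and the two-layer attention fusion with m = 3 kernel vectors. The 512-dimensional features, the 47 classes and the Dropout in the classifiers follow this embodiment, while the class and variable names and the exact attention formulas are assumptions rather than the original disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureExtractor(nn.Module):
    """Sketch of steps 2.16) and 2.17): shared VGG backbone phi(.), yaw residual
    mapping sigma, and two-layer attention fusion of video frame features."""

    def __init__(self, backbone, feat_dim=512, num_classes=47, m=3):
        super().__init__()
        self.backbone = backbone                          # VGG-style CNN returning feat_dim vectors
        self.sigma = nn.Sequential(                       # yaw angle residual mapping (two FC + nonlinearity)
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        self.img_classifier = nn.Sequential(              # FC1 classifier with Dropout (Softmax lives in the CE loss)
            nn.Dropout(0.5), nn.Linear(feat_dim, num_classes))
        self.vid_classifier = nn.Sequential(              # FC4 classifier with Dropout
            nn.Dropout(0.5), nn.Linear(feat_dim, num_classes))
        self.q1 = nn.Parameter(torch.randn(m, feat_dim))  # first-layer kernel vectors
        self.q2 = nn.Parameter(torch.randn(m, feat_dim))  # second-layer kernel vectors

    def image_forward(self, x_img, yaw):
        """f_i^I = phi(x) + Yaw * sigma(phi(x)); yaw has shape (B,)."""
        base = self.backbone(x_img)                                  # (B, D)
        feat = base + yaw.unsqueeze(1) * self.sigma(base)
        return feat, self.img_classifier(feat)

    def video_forward(self, x_vid):
        """x_vid has shape (B, T, C, H, W); returns the fused video feature f_i^V."""
        b, t = x_vid.shape[:2]
        frames = self.backbone(x_vid.flatten(0, 1)).view(b, t, -1)   # (B, T, D) frame features
        e1 = F.relu(torch.einsum('md,btd->bmt', self.q1, frames))    # per-frame importance
        a1 = F.softmax(e1, dim=-1)                                   # weights over frames
        v = torch.einsum('bmt,btd->bmd', a1, frames)                 # m aggregated candidates
        e2 = F.relu(torch.einsum('md,bmd->bm', self.q2, v))          # second-layer importance
        a2 = F.softmax(e2, dim=-1)                                   # weights over the m heads
        feat = torch.einsum('bm,bmd->bd', a2, v)                     # fused video feature
        return feat, self.vid_classifier(feat)
```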
The co-expression mapping network is trained with the face image training set and the face video training set, its parameters are updated with the cross-entropy loss and the metric learning loss, and the parameters of the trained feature extraction network are fine-tuned, as follows:
2.21) constructing a co-expression mapping network for images, which comprises two fully connected layers FC2 and FC3 and a Softmax layer; it takes the image features as input, FC2 outputs the image co-expression of dimension 48 × 1, and FC3 and the Softmax layer form a classifier that classifies the image co-expression;
2.22) constructing a co-expression mapping network for videos, which comprises two fully connected layers FC5 and FC6 and a Softmax layer; it takes the video features as input, FC5 outputs the video co-expression of dimension 48 × 1, and FC6 and the Softmax layer form a classifier that classifies the video co-expression;
2.23) initializing the parameters of the co-expression mapping network, the iteration counter i_2 = 0 and the initial learning rate l_2 = 1e-3;
2.24) determining whether the iteration counter satisfies i_2 < e_2·N_1/N_batch, where e_2 = 20 denotes the number of epochs, N_1 the total number of samples in the face image training set, and N_batch = 60 the number of samples per iteration; if the condition is not satisfied, the iteration ends; if it is satisfied, go to 2.25);
2.25) randomly selecting 12 classes, and then randomly selecting 5 images and 5 videos as data of one iteration in a face image training set and a face video training set of each class;
2.26) forming the co-expression generation network from the trained feature extraction network and the co-expression mapping network, and obtaining the image co-expression r_i^I and the video co-expression r_i^V with the co-expression generation network, where r_i^I denotes the co-expression of the i-th image and r_i^V the co-expression of the i-th video;
2.27) using the cross-entropy loss L_2 to preserve the intra-modal similarity relations of the co-expressions:
L_2 = -(1/N) Σ_i [log P(l_i^I | r_i^I) + log P(l_i^V | r_i^V)]
where l_i^I denotes the label of the i-th image and l_i^V the label of the i-th video;
2.28) constructing global and local semi-hard triplets, as shown in fig. 2, to preserve the inter-modal feature relations, where a semi-hard triplet is one whose negative pair is farther apart than its positive pair but by less than the margin, here margin = 2; the metric learning loss L_Triplet minimizes the Euclidean distance of the positive pairs and maximizes the Euclidean distance of the negative pairs, and the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:
L_local = (1/(N_I + N_V)) Σ max(||r_i - r_j||² - ||r_i - r_k||² + margin, 0)
L_global = (1/(N_I + N_V)) Σ max(||r_i - c_j||² - ||r_i - c_k||² + margin, 0)
where N_I denotes the number of semi-hard triplets anchored at r_i^I, N_V the number of semi-hard triplets anchored at r_i^V, r_i, r_j denote co-expressions of the same class, r_i, r_k denote co-expressions of different classes, c_j denotes the class center of the same class as r_i, and c_k the class center of a different class; when the Softmax function is used, minimizing the cross-entropy loss is equivalent to minimizing the distance between the co-expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layers FC3, FC6 of the co-expression mapping networks and maximizing the distance to the weight vectors of the other classes, so the class centers are represented by the weight vectors of FC3 in the image co-expression mapping network and FC6 in the video co-expression mapping network (an illustrative sketch of these losses is given after step 2.210) below);
2.29) the overall loss function to be optimized is:
L_3 = L_1 + L_2 + α(L_local + L_global)
where L_1 is the cross-entropy loss of the feature extraction network and α is the scale coefficient balancing the intra-modal and inter-modal losses; the parameters θ_g of the co-expression mapping network are updated by the back-propagation algorithm while the parameters θ_f of the feature extraction network are fine-tuned:
θ_g^(i_2+1) = θ_g^(i_2) - l_2 · dL_3/dθ_g^(i_2)
θ_f^(i_2+1) = θ_f^(i_2) - l_2 · dL_3/dθ_f^(i_2)
where θ_g^(i_2) are the co-expression mapping network parameters at the i_2-th iteration, θ_g^(i_2+1) the co-expression mapping network parameters at the (i_2+1)-th iteration, θ_f^(i_2) the feature extraction network parameters at the i_2-th iteration, θ_f^(i_2+1) the feature extraction network parameters at the (i_2+1)-th iteration, and d denotes the differential. The Adam optimizer and the SGD optimizer are combined for the back-propagated gradient updates, which ensures fast convergence of the loss function while preserving the generalization ability of the algorithm: in the early stage of training, 10 epochs are trained with the Adam optimizer, whose β_1 and β_2 are set to 0.9 and 0.99, respectively; in the later stage, 10 epochs are trained with the SGD optimizer with momentum set to 0.9, and L2 regularization with coefficient 0.0025 is added to the network parameters during SGD training to enhance the generalization ability of the network;
2.210) i_2 = i_2 + 1, go to 2.24).
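The following is a minimal PyTorch sketch of the losses of steps 2.27) and 2.28) as reconstructed above: cross-entropy on the 48-dimensional co-expressions plus semi-hard local triplets between co-expressions and semi-hard global triplets against class centers taken from the FC3/FC6 weight vectors, with margin = 2. The way the two weight matrices are combined into class centers and the normalization of the triplet terms are assumptions.

```python
import torch
import torch.nn.functional as F


def semi_hard_triplet(anchors, positives, negatives, margin=2.0):
    """Mean hinge loss over the semi-hard triplets in the batch: the negative is
    farther than the positive, but by less than the margin."""
    d_pos = (anchors - positives).pow(2).sum(dim=1)
    d_neg = (anchors - negatives).pow(2).sum(dim=1)
    semi_hard = (d_neg > d_pos) & (d_neg < d_pos + margin)
    if not semi_hard.any():
        return anchors.new_zeros(())
    return F.relu(d_pos - d_neg + margin)[semi_hard].mean()


def co_expression_losses(r_img, r_vid, labels, fc_img_weight, fc_vid_weight,
                         triplets_local, triplets_global, alpha=1.0, margin=2.0):
    """L_2 + alpha * (L_local + L_global) for one batch.

    r_img, r_vid: (B, 48) co-expressions; labels: (B,) class ids;
    fc_img_weight / fc_vid_weight: weights of FC3 / FC6, used here as class centers;
    triplets_*: (anchor, positive, negative) index tensors mined beforehand."""
    # intra-modal cross-entropy on both modalities (bias-free FC logits for brevity)
    logits_img = r_img @ fc_img_weight.t()
    logits_vid = r_vid @ fc_vid_weight.t()
    l2 = F.cross_entropy(logits_img, labels) + F.cross_entropy(logits_vid, labels)

    # local triplets: anchors drawn from both modalities, positives/negatives are co-expressions
    r_all = torch.cat([r_img, r_vid], dim=0)
    a, p, n = triplets_local
    l_local = semi_hard_triplet(r_all[a], r_all[p], r_all[n], margin)

    # global triplets: positives/negatives are class centers from the FC weights (assumed average)
    centers = 0.5 * (fc_img_weight + fc_vid_weight)
    a, cp, cn = triplets_global
    l_global = semi_hard_triplet(r_all[a], centers[cp], centers[cn], margin)

    return l2 + alpha * (l_local + l_global)
```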
3) In order to simulate a real usage scenario, the trained co-expression generation network and ITQ iterative quantization are used to extract the hashes of the image co-expressions in the face image test set and of the video co-expressions in the face video training set, and the retrieval items are ranked by the Hamming distance between the image hashes and the video hashes, as follows:
3.1) taking the face image test set as the query set and the face video training set as the retrieval set, and extracting the co-expression matrices r_query and r_retrieval of the query set and the retrieval set, respectively, with the trained co-expression generation network;
3.2) combining PCA dimensionality reduction with ITQ iterative quantization: first, the PCA projection matrix W of r_retrieval is obtained with PCA; then the rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional representation of r_retrieval and its binary hash, where R is used to balance the variances of the different dimensions and the hash code length is 48;
3.3) obtaining the binary hash matrices B_query and B_retrieval of the query set and the retrieval set with W and R:
B_query = sgn(r_query·W·R)
B_retrieval = sgn(r_retrieval·W·R)
where sgn denotes the sign function;
3.4) computing the Hamming distances between the binary hash of each query item in the query set and all binary hashes in the retrieval set, ranking the retrieval set by Hamming distance, and returning the ranked retrieval set. An illustrative sketch of steps 3.2) to 3.4) is given below.
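A minimal NumPy sketch of step 3) follows: PCA to 48 dimensions, a standard ITQ rotation obtained by alternating binarization with an orthogonal Procrustes update, and Hamming ranking of the query hashes against the retrieval hashes. The 48-bit code length and the query/retrieval roles follow this embodiment; the number of ITQ iterations and the mean-centering are implementation assumptions.

```python
import numpy as np


def pca_projection(x, n_bits=48):
    """Return the mean and the top-n_bits PCA projection matrix W of x."""
    mean = x.mean(axis=0, keepdims=True)
    xc = x - mean
    cov = xc.T @ xc / len(xc)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    w = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]
    return mean, w


def itq_rotation(v, n_iter=50, seed=0):
    """Learn the ITQ rotation R by alternating binarization and Procrustes steps."""
    rng = np.random.default_rng(seed)
    r = np.linalg.qr(rng.standard_normal((v.shape[1], v.shape[1])))[0]
    for _ in range(n_iter):
        b = np.sign(v @ r)                          # fix R, update the binary codes
        u, _, vt = np.linalg.svd(b.T @ v)           # fix B, update R (orthogonal Procrustes)
        r = (u @ vt).T
    return r


def hash_codes(x, mean, w, r):
    """B = sgn(x W R), returned as a boolean matrix."""
    return (x - mean) @ w @ r > 0


def rank_by_hamming(b_query, b_retrieval):
    """Return, for each query, the retrieval indices sorted by Hamming distance."""
    dist = (b_query[:, None, :] != b_retrieval[None, :, :]).sum(axis=2)
    return np.argsort(dist, axis=1)
```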
The experimental results are as follows:
On the YTC data set, the mAP of retrieving the face video training set with the face image test set is 0.6659, and the mAP of retrieving the face image test set with the face video training set is 0.6829.
In conclusion, the method can make full use of intra-modal information to extract single-modality features that are robust and discriminative, and at the same time uses the semantic centers of the different modalities to fully mine inter-modal similarity information, so it achieves good results in the cross-modal face retrieval task and is well worth popularizing.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (5)

1. The cross-modal face retrieval hashing method based on metric learning is characterized by comprising the following steps of:
1) preprocessing a face video data set, namely video cutting, dividing a face image training set, a face image testing set and a face video training set, and performing face alignment and yaw angle extraction;
2) training the constructed co-expression generation network, wherein the co-expression generation network consists of a feature extraction network and a co-expression mapping network; during training, training a feature extraction network by using a face image training set and a face video training set, updating parameters of the feature extraction network by using cross entropy loss, training a co-expression mapping network by using the face image training set and the face video training set, updating parameters of the co-expression mapping network by using cross entropy loss and metric learning loss, and finely adjusting parameters of the trained feature extraction network;
3) in order to simulate a real usage scenario, using the trained co-expression generation network and ITQ iterative quantization to extract the hashes of the image co-expressions in the face image test set and of the video co-expressions in the face video training set, and ranking the retrieval items by the Hamming distance between the image hashes and the video hashes.
2. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in step 1), each video in the face video data set is cut into several video segments of the same length; 70% of the videos in the face video data set form the face video training set, every frame of the videos in the face video training set is extracted to form the face image training set, and 1 frame is randomly extracted from each of the remaining 30% of the videos in the face video data set to serve as the face image test set; the faces in the video frames and images of the data set are aligned with a pre-trained MTCNN, a deep face region detection network, and the aligned video frames and images are scaled to H × W, where H is the image height and W is the image width; the Yaw angle parameter Yaw of the images in the face image training set and the face image test set is extracted with a pre-trained Hopenet, a keypoint-free deep face pose estimation network.
3. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in step 2), for the feature extraction network, the following operations are performed:
2.11) constructing a feature extraction network for images, which comprises a convolutional neural network φ(·) with a VGG structure, a yaw angle residual mapping σ and a classifier, where σ consists of several fully connected layers and nonlinear layers, and the classifier consists of a fully connected layer and a Softmax layer;
2.12) constructing a feature extraction network for videos, which comprises a convolutional neural network φ(·) sharing weights with the image feature extraction network, an attention network and a classifier, where the classifier consists of a fully connected layer and a Softmax layer;
2.13) initializing the parameters of the feature extraction network, pre-training it on the CASIA-WebFace data set, a large-scale public face image data set containing 494414 face images from 10575 classes collected from the Internet, and initializing the iteration counter i_1 = 0;
2.14) determining whether the iteration counter satisfies i_1 < e_1·N_1/N_batch, where e_1 denotes the number of epochs, N_1 the total number of samples in the face image training set, and N_batch the number of image samples per iteration; if the condition is not satisfied, the iteration ends; if it is satisfied, go to step 2.15);
2.15) randomly selecting N_c classes, and from the face image training set and the face video training set of each class randomly selecting N_batch/N_c images and N_batch/N_c videos as the data for one iteration;
2.16) inputting the i-th sample x_i^I of the face image training set: first, the original feature φ(x_i^I) is extracted with the VGG network; next, the product of the Yaw angle parameter Yaw and the yaw angle residual mapping feature σ(φ(x_i^I)) is added to the original feature φ(x_i^I) to obtain the image feature:
f_i^I = φ(x_i^I) + Yaw·σ(φ(x_i^I))
finally, the classifier is used to classify the image feature f_i^I;
2.17) inputting the i-th sample x_i^V of the face video training set: first, the frame features φ(x_{i,t}^V) are extracted from the video frames with the VGG network; next, the frame features are fused with a multi-layer attention mechanism. In the first layer, the inner products of m kernel vectors q_{j_1}^(1) with the video frame features φ(x_{i,t}^V) are computed and passed through a ReLU nonlinear mapping to obtain an importance coefficient vector for each frame, where the j_1-th importance coefficient vector e_{j_1} is expressed as:
e_{j_1,t} = ReLU(q_{j_1}^(1) · φ(x_{i,t}^V))
then e_{j_1} is normalized with the Softmax function to obtain m groups of weight vectors, where the j_1-th group of weights a_{j_1} is expressed as:
a_{j_1,t} = exp(e_{j_1,t}) / Σ_{t'} exp(e_{j_1,t'})
and the frame features weighted by a_{j_1} are aggregated into m candidate features v_{j_1} = Σ_t a_{j_1,t}·φ(x_{i,t}^V). In the second layer, the corresponding importance coefficients e_{j_2} are computed from the tensor obtained with the previous layer, where the j_2-th importance coefficient e_{j_2} is expressed as:
e_{j_2} = ReLU(q_{j_2}^(2) · v_{j_2})
where q_{j_2}^(2) denotes the j_2-th kernel vector of the second layer; then the e_{j_2} are normalized with the Softmax function to obtain m weight coefficients, where the j_2-th weight coefficient a_{j_2} is expressed as:
a_{j_2} = exp(e_{j_2}) / Σ_{j_2'} exp(e_{j_2'})
Through f_i^V = Σ_{j_2} a_{j_2}·v_{j_2} the video frame features are fused into a single video feature f_i^V; finally, the classifier is used to classify the video feature f_i^V;
2.18) using the cross-entropy loss as supervision for the obtained image features and video features: the cross-entropy loss L_CE^I of the image features f_i^I is expressed as:
L_CE^I = -(1/N) Σ_i log P(l_i^I | f_i^I)
P(l_i^I | f_i^I) = exp(W_{l_i^I}^I · f_i^I) / Σ_c exp(W_c^I · f_i^I)
where l_i^I denotes the label of the i-th image, P denotes the feature distribution and W^I denotes the weight vectors of the fully connected layer in the classifier; the cross-entropy loss L_CE^V of the video features f_i^V is expressed as:
L_CE^V = -(1/N) Σ_i log P(l_i^V | f_i^V)
P(l_i^V | f_i^V) = exp(W_{l_i^V}^V · f_i^V) / Σ_c exp(W_c^V · f_i^V)
where l_i^V denotes the label of the i-th video and W^V denotes the weight vectors of the fully connected layer in the classifier;
2.19) the overall loss function L_1 to be optimized is:
L_1 = L_CE^I + L_CE^V
and the parameters θ_f of the feature extraction network are updated by the back-propagation algorithm:
θ_f^(i_1+1) = θ_f^(i_1) - l_1 · dL_1/dθ_f^(i_1)
where θ_f^(i_1) are the feature extraction network parameters at the i_1-th iteration, θ_f^(i_1+1) the feature extraction network parameters at the (i_1+1)-th iteration, l_1 is the learning rate of the batch and d denotes the differential;
2.110) i_1 = i_1 + 1, go to step 2.14).
4. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: in step 2), for the co-expression mapping network, the following operations are performed:
2.21) constructing a co-expression mapping network for images, which comprises two fully connected layers and a Softmax layer; it takes the image features as input, the first fully connected layer outputs the image co-expression, and the second fully connected layer and the Softmax layer form a classifier that classifies the image co-expression;
2.22) constructing a co-expression mapping network for videos, which comprises two fully connected layers and a Softmax layer; it takes the video features as input, the first fully connected layer outputs the video co-expression, and the second fully connected layer and the Softmax layer form a classifier that classifies the video co-expression;
2.23) initializing the parameters of the co-expression mapping network and the iteration counter i_2 = 0;
2.24) determining whether the iteration counter satisfies i_2 < e_2·N_1/N_batch, where e_2 denotes the number of epochs, N_1 the total number of samples in the face image training set, and N_batch the number of samples per iteration; if the condition is not satisfied, the iteration ends; if it is satisfied, go to step 2.25);
2.25) randomly selecting N_c classes, and from the face image training set and the face video training set of each class randomly selecting N_batch/N_c images and N_batch/N_c videos as the data for one iteration;
2.26) obtaining the image co-expression r_i^I and the video co-expression r_i^V with the co-expression generation network, where r_i^I denotes the co-expression of the i-th image and r_i^V the co-expression of the i-th video;
2.27) using the cross-entropy loss L_2 to preserve the intra-modal similarity relations of the co-expressions:
L_2 = -(1/N) Σ_i [log P(l_i^I | r_i^I) + log P(l_i^V | r_i^V)]
where l_i^I denotes the label of the i-th image and l_i^V the label of the i-th video;
2.28) constructing global semi-hard triplets and local semi-hard triplets to preserve the inter-modal feature relations, where a semi-hard triplet is one whose negative pair is farther apart than its positive pair but by less than the margin; the metric learning loss L_Triplet minimizes the Euclidean distance of the positive pairs and maximizes the Euclidean distance of the negative pairs, and the semi-hard local triplet loss L_local and the semi-hard global triplet loss L_global are expressed as:
L_local = (1/(N_I + N_V)) Σ max(||r_i - r_j||² - ||r_i - r_k||² + margin, 0)
L_global = (1/(N_I + N_V)) Σ max(||r_i - c_j||² - ||r_i - c_k||² + margin, 0)
where N_I denotes the number of semi-hard triplets anchored at r_i^I, N_V the number of semi-hard triplets anchored at r_i^V, r_i, r_j denote co-expressions of the same class, r_i, r_k denote co-expressions of different classes, c_j denotes the class center of the same class as r_i, and c_k the class center of a different class; when the Softmax function is used, minimizing the cross-entropy loss is equivalent to minimizing the distance between the co-expressions r_i^I, r_i^V and the weight vector of the corresponding class in the second fully connected layer of the co-expression mapping network and maximizing the distance to the weight vectors of the other classes, so the class centers c_j and c_k are represented by the weight vectors of the second fully connected layers of the image co-expression mapping network and the video co-expression mapping network;
2.29) the overall loss function to be optimized is:
L_3 = L_1 + L_2 + α(L_local + L_global)
where L_1 is the cross-entropy loss of the feature extraction network and α is the scale coefficient balancing the intra-modal and inter-modal losses; the parameters θ_g of the co-expression mapping network are updated by the back-propagation algorithm while the parameters θ_f of the feature extraction network are fine-tuned, with l_2 the learning rate of the batch:
θ_g^(i_2+1) = θ_g^(i_2) - l_2 · dL_3/dθ_g^(i_2)
θ_f^(i_2+1) = θ_f^(i_2) - l_2 · dL_3/dθ_f^(i_2)
where θ_g^(i_2) are the co-expression mapping network parameters at the i_2-th iteration, θ_g^(i_2+1) the co-expression mapping network parameters at the (i_2+1)-th iteration, θ_f^(i_2) the feature extraction network parameters at the i_2-th iteration, θ_f^(i_2+1) the feature extraction network parameters at the (i_2+1)-th iteration, and d denotes the differential;
2.210) i_2 = i_2 + 1, go to step 2.24).
5. The cross-modal face retrieval hashing method based on metric learning of claim 1, wherein: the step 3) comprises the following steps:
3.1) taking the face image test set as the query set and the face video training set as the retrieval set, and extracting the co-expression matrices r_query and r_retrieval of the query set and the retrieval set, respectively, with the trained co-expression generation network;
3.2) combining PCA dimensionality reduction with ITQ iterative quantization: first, the PCA projection matrix W of r_retrieval is obtained with PCA; then the rotation matrix R is obtained by iteratively minimizing the distance between the low-dimensional representation of r_retrieval and its binary hash, where R is used to balance the variances of the different dimensions;
3.3) obtaining the binary hash matrices B_query and B_retrieval of the query set and the retrieval set with W and R:
B_query = sgn(r_query·W·R)
B_retrieval = sgn(r_retrieval·W·R)
where sgn denotes the sign function;
3.4) computing the Hamming distances between the binary hash of each query item in the query set and all binary hashes in the retrieval set, ranking the retrieval set by Hamming distance, and returning the ranked retrieval set.
CN202111175867.7A 2021-10-09 2021-10-09 Cross-modal face retrieval hash method based on metric learning Active CN114067385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111175867.7A CN114067385B (en) 2021-10-09 2021-10-09 Cross-modal face retrieval hash method based on metric learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111175867.7A CN114067385B (en) 2021-10-09 2021-10-09 Cross-modal face retrieval hash method based on metric learning

Publications (2)

Publication Number Publication Date
CN114067385A true CN114067385A (en) 2022-02-18
CN114067385B CN114067385B (en) 2024-05-31

Family

ID=80234405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111175867.7A Active CN114067385B (en) 2021-10-09 2021-10-09 Cross-modal face retrieval hash method based on metric learning

Country Status (1)

Country Link
CN (1) CN114067385B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360032A (en) * 2022-03-17 2022-04-15 北京启醒科技有限公司 Polymorphic invariance face recognition method and system
CN114638002A (en) * 2022-03-21 2022-06-17 华南理工大学 Compressed image encryption method supporting similarity retrieval
CN114866345A (en) * 2022-07-05 2022-08-05 支付宝(杭州)信息技术有限公司 Processing method, device and equipment for biological recognition
CN115063845A (en) * 2022-06-20 2022-09-16 华南理工大学 Finger vein identification method based on lightweight network and deep hash
CN115240249A (en) * 2022-07-07 2022-10-25 湖北大学 Feature extraction classification measurement learning method and system for face recognition and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175248A (en) * 2019-04-04 2019-08-27 中国科学院信息工程研究所 A kind of Research on face image retrieval and device encoded based on deep learning and Hash
CN111753190A (en) * 2020-05-29 2020-10-09 中山大学 Meta learning-based unsupervised cross-modal Hash retrieval method
CN112800292A (en) * 2021-01-15 2021-05-14 南京邮电大学 Cross-modal retrieval method based on modal specificity and shared feature learning
CN113190699A (en) * 2021-05-14 2021-07-30 华中科技大学 Remote sensing image retrieval method and device based on category-level semantic hash

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175248A (en) * 2019-04-04 2019-08-27 中国科学院信息工程研究所 A kind of Research on face image retrieval and device encoded based on deep learning and Hash
CN111753190A (en) * 2020-05-29 2020-10-09 中山大学 Meta learning-based unsupervised cross-modal Hash retrieval method
CN112800292A (en) * 2021-01-15 2021-05-14 南京邮电大学 Cross-modal retrieval method based on modal specificity and shared feature learning
CN113190699A (en) * 2021-05-14 2021-07-30 华中科技大学 Remote sensing image retrieval method and device based on category-level semantic hash

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360032A (en) * 2022-03-17 2022-04-15 北京启醒科技有限公司 Polymorphic invariance face recognition method and system
CN114638002A (en) * 2022-03-21 2022-06-17 华南理工大学 Compressed image encryption method supporting similarity retrieval
CN114638002B (en) * 2022-03-21 2023-04-28 华南理工大学 Compressed image encryption method supporting similarity retrieval
CN115063845A (en) * 2022-06-20 2022-09-16 华南理工大学 Finger vein identification method based on lightweight network and deep hash
CN115063845B (en) * 2022-06-20 2024-05-28 华南理工大学 Finger vein recognition method based on lightweight network and deep hash
CN114866345A (en) * 2022-07-05 2022-08-05 支付宝(杭州)信息技术有限公司 Processing method, device and equipment for biological recognition
CN115240249A (en) * 2022-07-07 2022-10-25 湖北大学 Feature extraction classification measurement learning method and system for face recognition and storage medium

Also Published As

Publication number Publication date
CN114067385B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN107122809B (en) Neural network feature learning method based on image self-coding
CN114067385A (en) Cross-modal face retrieval Hash method based on metric learning
Zheng et al. A deep and autoregressive approach for topic modeling of multimodal data
WO2021159769A1 (en) Image retrieval method and apparatus, storage medium, and device
CN108491430B (en) Unsupervised Hash retrieval method based on clustering characteristic directions
Zafar et al. Image classification by addition of spatial information based on histograms of orthogonal vectors
CN108427740B (en) Image emotion classification and retrieval algorithm based on depth metric learning
Hu et al. Learning dual-pooling graph neural networks for few-shot video classification
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN111104555A (en) Video hash retrieval method based on attention mechanism
CN112712127A (en) Image emotion polarity classification method combined with graph convolution neural network
Menaga et al. Deep learning: a recent computing platform for multimedia information retrieval
CN107220597B (en) Key frame selection method based on local features and bag-of-words model human body action recognition process
CN113344069B (en) Image classification method for unsupervised visual representation learning based on multi-dimensional relation alignment
Cao et al. Facial expression recognition algorithm based on the combination of CNN and K-Means
Yao Key frame extraction method of music and dance video based on multicore learning feature fusion
Liu et al. A multimodal approach for multiple-relation extraction in videos
Oussama et al. A fast weighted multi-view Bayesian learning scheme with deep learning for text-based image retrieval from unlabeled galleries
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
Bibi et al. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
An et al. Near duplicate product image detection based on binary hashing
Benuwa et al. Deep locality‐sensitive discriminative dictionary learning for semantic video analysis
CN105279489B (en) A kind of method for extracting video fingerprints based on sparse coding
Huang et al. Baggage image retrieval with attention-based network for security checks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant