CN111539263A - Video face recognition method based on aggregation countermeasure network - Google Patents

Video face recognition method based on aggregation countermeasure network

Info

Publication number
CN111539263A
CN111539263A (application number CN202010253595.7A)
Authority
CN
China
Prior art keywords
network
image
aggregation
video
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010253595.7A
Other languages
Chinese (zh)
Other versions
CN111539263B (en)
Inventor
陈莹
金炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202010253595.7A priority Critical patent/CN111539263B/en
Publication of CN111539263A publication Critical patent/CN111539263A/en
Application granted granted Critical
Publication of CN111539263B publication Critical patent/CN111539263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video face recognition method based on an aggregation countermeasure network, belonging to the technical field of video face recognition. The method adopts an aggregation countermeasure network built from an aggregation network, a discrimination network and a recognition network. The aggregation network and the discrimination network form adversarial learning, so that through competition the generated image becomes closer to the target-set static image; the recognition network computes a perceptual loss in a high-dimensional feature space, so that the generated image also becomes perceptually closer to the corresponding target-set static image, which improves the performance of the aggregation network. The discrimination network adopts a softmax multi-dimensional output, so it can judge not only whether an image is real or fake but also the identity category of the image; as a result, the identity of the generated image is closer to the ground truth, and subsequent recognition is more accurate and more efficient.

Description

Video face recognition method based on aggregation countermeasure network
Technical Field
The invention relates to a video face recognition method based on an aggregation countermeasure network, and belongs to the technical field of video face recognition.
Background
Video face recognition, as the name implies, is face recognition performed on video. With the continuing development of technology and growing demand, video face recognition has been applied in many fields, such as intelligent security, video surveillance and public security investigation.
Video face recognition differs from face recognition based on a single image: its query set is a video sequence, while its target set usually consists of high-definition face images; the identity of a person in the video is recognized by extracting face features from the video sequence and matching them against the target set.
However, in video surveillance, the most common scenario for video face recognition, the faces captured in video sequences often suffer from motion blur, noise, occlusion and the like, and therefore differ greatly from the target-set faces. Neither conventional methods nor current deep-learning-based methods handle this gap well, so the recognition performance is poor.
In addition, current video face recognition methods extract features from the video sequence frame by frame, which not only makes testing time excessively long but also leaves the recognition result vulnerable to interference from low-quality frames in the video sequence.
Disclosure of Invention
In order to solve the problems of low efficiency and low accuracy in existing video face recognition technology, the invention provides a video face recognition method in which an aggregation countermeasure network is adopted to aggregate multiple low-quality video sequences into a single high-quality frontal face image, and the quality of the generated frontal face image is improved during aggregation through adversarial learning, so that video face recognition can be carried out accurately;
the aggregation countermeasure network consists of an aggregation network, a discrimination network and a recognition network; the aggregation network and the discrimination network form adversarial learning, driving the generated image closer to the target-set static image through competition, and the recognition network computes a perceptual loss in a high-dimensional feature space, so that the generated frontal face image is perceptually closer to the corresponding target-set static image.
Optionally, the discrimination network outputs an (N+1)-dimensional vector in the form of a softmax multi-dimensional output, where N is the number of identity categories; the remaining dimension indicates whether the corresponding image is real or fake, "real" meaning that the corresponding image is a static image and "fake" meaning that it is a synthesized image.
Optionally, the method includes:
S1: construct the aggregation network G and pre-train it with the aggregation loss L_agg to obtain a pre-trained model of the aggregation network G;
S2: load the pre-trained model of the aggregation network G, construct the discrimination network D and the recognition network R, and compute the adversarial loss L_adv and the perceptual loss L_per;
S3: combine the aggregation loss L_agg, the adversarial loss L_adv and the perceptual loss L_per as a weighted sum to construct the final loss function L = L_agg + λ·L_adv + α·L_per, where λ and α are the weight coefficients of the adversarial loss L_adv and the perceptual loss L_per, so that different weights are assigned to L_agg, L_adv and L_per; train the aggregation network G, and save the model parameters after the pre-trained aggregation network G converges to obtain the aggregation countermeasure network video face recognition model;
S4: test the aggregation countermeasure network video face recognition model obtained in S3; after testing is finished, the model is used in practical applications of video face recognition.
Optionally, before S1 the method further includes:
acquiring a training video sequence data set, denoted V = {v_1, v_2, ..., v_i, ..., v_N}, where v_i denotes the video sequence of the i-th category, i = 1, 2, ..., N, and N is the number of categories of video sequences;
acquiring a static image data set corresponding to V, denoted S = {s_1, s_2, ..., s_i, ..., s_N}, where s_i denotes the static image corresponding to the i-th category.
Optionally, the S1 includes:
generating an image G(V_i^k) with the aggregation network: the input of the aggregation network G is k consecutive video frames belonging to the same category v_i, and its output is the generated image for category v_i, defined as G(V_i^k); k is a hyper-parameter denoting the number of input video frames of the aggregation network, and V_i^k denotes the video sequence of k consecutive frames of the i-th category;
computing the aggregation loss L_agg = || S_i - G(V_i^k) ||_2^2, where S_i denotes the static image of the same category as V_i^k; the parameters of the aggregation network G are updated with the gradient ∇L_agg, and L_agg is computed with a pixel-level L2 loss function;
after the aggregation network G converges, saving the network model parameters to obtain the pre-trained model of the aggregation network G.
Optionally, the S2 includes:
loading the pre-trained model of the aggregation network G to obtain the generated image G(V_i^k) and the corresponding static image S_i;
constructing the discrimination network D: two convolution layers with stride 1 convert the input image into feature maps, three combinations of a stride-2 convolution layer and a residual block further extract and down-sample the features, a pooling layer then down-samples these features again, and finally a fully connected layer outputs an (N+1)-dimensional vector representing the identity and the real/fake information of the corresponding image;
sending the generated image G(V_i^k) and its corresponding static image S_i into the discrimination network D to compute the adversarial loss L_adv (the formula is given as an image in the original publication), where D_i denotes the i-th dimension of the output of the discrimination network D;
constructing the recognition network R, which adopts the face recognition network LightCNN; sending the generated image G(V_i^k) and its corresponding static image S_i into the recognition network R to compute the perceptual loss L_per (the formula is given as an image in the original publication), where R(·) denotes the feature value of the penultimate pooling layer of the recognition network.
Optionally, in S3: λ is 0.01 and α is 0.003.
Optionally, the process of testing the aggregation countermeasure network video face recognition model in S4 includes:
during testing, the static images of the target set, denoted S = {s_1, s_2, ..., s_j, ..., s_M}, are sent into the recognition network R to obtain the last-layer feature values F = {f_1, f_2, ..., f_j, ..., f_M}, where M denotes the total number of identity categories and f_j denotes the feature of the target-set static image of the person with identity category j;
capturing face pictures in real time with a camera, denoting the captured face video sequence of unknown category as V, and feeding it into the aggregation network G to obtain a generated image G(V) of unknown category;
sending the generated image G(V) into R to obtain the query feature f_v, and computing the Euclidean distance between f_v and each target-set feature in F = {f_1, f_2, ..., f_j, ..., f_M}; the category with the smallest distance is the final recognition result.
The invention also provides application of the video face recognition method in the technical field of face recognition.
Optionally, the technical field of face recognition includes intelligent security, video monitoring and public security investigation.
The invention has the beneficial effects that:
the invention integrates the image generation technology into the video face recognition, and aggregates a plurality of low-quality video sequences into a single high-quality front face image through the aggregation network, thereby overcoming the defect of extracting image characteristics frame by frame in the current video face recognition technology and improving the video face recognition efficiency.
The aggregation countermeasure network constructed by the invention consists of three networks: an aggregation network, a discrimination network and a recognition network. The aggregation network and the discrimination network form adversarial learning, driving the generated image closer to the target-set static image through competition; the recognition network computes a perceptual loss in a high-dimensional feature space, so that the generated image is perceptually closer to the corresponding target-set static image, which improves the performance of the aggregation network.
The discrimination network designed by the invention adopts a softmax multi-dimensional output, so it can judge not only whether an image is real or fake but also its identity category. Through an adversarial loss that contains identity-category information, the identity category of the generated image is kept consistent with the target-set static image, so the identity of the generated image is closer to the ground truth and subsequent recognition is more accurate and more efficient.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a video face recognition technology based on an aggregation countermeasure network according to the present invention.
Fig. 2 is a network structure diagram of an aggregation countermeasure network used in the present invention.
Fig. 3A shows examples from the video sequence data set used by the invention.
Fig. 3B shows the ground-truth static images corresponding to Fig. 3A.
Fig. 3C shows the images finally synthesized from the video sequences by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The first embodiment is as follows:
the embodiment provides a video face recognition method based on an aggregation countermeasure network, and with reference to fig. 1, the method includes:
step 1, obtaining a training set, wherein the training set comprises a video sequence data set V and a corresponding static image data set S:
Step 1.1: acquire a training video sequence data set, denoted V = {v_1, v_2, ..., v_i, ..., v_N}, where v_i denotes the video sequence of the i-th category, i = 1, 2, ..., N, and N is the number of categories of video sequences;
in practice, N represents the number of different people present in V, and video sequences corresponding to the same person are referred to as a class.
Step 1.2: acquire the static image data set corresponding to V, denoted S = {s_1, s_2, ..., s_i, ..., s_N}, where s_i denotes the static image corresponding to the i-th category;
in practical applications, a high-definition camera can be used to capture S, and in some practical video monitoring scenes, the picture in S is usually a picture on an identification card or a specially-captured picture.
Examples from the video sequence data set V are shown in fig. 3A; the sequences may contain occlusion, motion blur, noise and non-frontal faces. As shown in fig. 3B, the static image data set S is captured under good conditions and consists of clear frontal face images.
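For illustration only, the following is a minimal PyTorch-style sketch of how a training pair (k consecutive frames V_i^k and the corresponding static image S_i) could be organized; the dataset class, tensor layout and random-window sampling are assumptions of this sketch and are not specified in the patent.

import random
import torch
from torch.utils.data import Dataset

class VideoFacePairs(Dataset):
    """Illustrative pairing of k consecutive video frames V_i^k with the static image S_i."""
    def __init__(self, video_frames, static_images, k=4):
        # video_frames: list of N lists of frame tensors (3 x H x W), one list per identity category
        # static_images: list of N static image tensors (3 x H x W), in the same category order
        self.video_frames = video_frames
        self.static_images = static_images
        self.k = k

    def __len__(self):
        return len(self.video_frames)

    def __getitem__(self, i):
        frames = self.video_frames[i]
        start = random.randint(0, len(frames) - self.k)        # random window of k consecutive frames
        v_ik = torch.cat(frames[start:start + self.k], dim=0)  # stack along channels: (3*k) x H x W
        return v_ik, self.static_images[i], i                  # input V_i^k, target S_i, identity label i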
Step 2: construct the aggregation network G and pre-train it with the aggregation loss L_agg:
the overall framework of the aggregation countermeasure network is shown in fig. 2, and in the present embodiment, the aggregation countermeasure network is composed of three networks: aggregation networks, discrimination networks, and identification networks.
Step 2.1: generate an image G(V_i^k) with the aggregation network;
the input of the aggregation network G is k consecutive video frames belonging to the same category v_i, and its output is the generated image for category v_i, defined as G(V_i^k); k is a hyper-parameter denoting the number of input video frames of the aggregation network, and V_i^k denotes the video sequence of k consecutive frames of the i-th category.
The aggregation network G adopts an encoding-decoding network structure, as shown in fig. 2: two convolution layers with stride 1 extract shallow features; three combinations of a stride-2 convolution and a residual block down-sample (encode) the shallow features; two combinations of a deconvolution and a residual block up-sample (decode) them to features of the same size as the original image; and two final convolution operations followed by a sigmoid function produce the final high-definition face image.
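A minimal PyTorch sketch of the encoding-decoding aggregation network described above; channel widths, the residual-block design and the (3*k)-channel stacked input are assumptions, and a symmetric three-stage decoder is assumed here so that the output matches the input resolution (the text above mentions three down-sampling and two up-sampling stages).

import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block assumed for illustration."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class AggregationNet(nn.Module):
    """Aggregation network G: k stacked frames -> one frontal face image (illustrative sketch)."""
    def __init__(self, k=4, ch=64):
        super().__init__()
        self.shallow = nn.Sequential(                     # two stride-1 convolutions extract shallow features
            nn.Conv2d(3 * k, ch, 3, 1, 1), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(True))
        self.encoder = nn.Sequential(                     # stride-2 convolution + residual block combinations (down-sampling)
            nn.Conv2d(ch, ch, 3, 2, 1), ResBlock(ch),
            nn.Conv2d(ch, ch, 3, 2, 1), ResBlock(ch),
            nn.Conv2d(ch, ch, 3, 2, 1), ResBlock(ch))
        self.decoder = nn.Sequential(                     # deconvolution + residual block combinations (up-sampling)
            nn.ConvTranspose2d(ch, ch, 4, 2, 1), ResBlock(ch),
            nn.ConvTranspose2d(ch, ch, 4, 2, 1), ResBlock(ch),
            nn.ConvTranspose2d(ch, ch, 4, 2, 1), ResBlock(ch))
        self.head = nn.Sequential(                        # two final convolutions and a sigmoid
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(True),
            nn.Conv2d(ch, 3, 3, 1, 1), nn.Sigmoid())

    def forward(self, v_ik):
        return self.head(self.decoder(self.encoder(self.shallow(v_ik))))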
Step 2.2: compute the aggregation loss L_agg = || S_i - G(V_i^k) ||_2^2, where S_i denotes the static image of the same category as V_i^k; the parameters of the aggregation network G are updated with the gradient ∇L_agg; computing L_agg with a pixel-level L2 loss function accelerates network convergence;
step 2.3, after the aggregation network G is converged, saving network model parameters for subsequent formal training;
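A minimal sketch of this pre-training stage, assuming the AggregationNet and VideoFacePairs classes sketched earlier; the pixel-level L2 loss follows the L_agg definition above, while the optimizer, batch size and checkpoint path are illustrative assumptions.

import torch
from torch.utils.data import DataLoader

def pretrain_aggregation(G, dataset, epochs=10, lr=1e-3, device="cpu"):
    """Pre-train the aggregation network G with the pixel-level L2 aggregation loss L_agg (sketch)."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    opt = torch.optim.SGD(G.parameters(), lr=lr, momentum=0.9)
    G.to(device).train()
    for _ in range(epochs):
        for v_ik, s_i, _ in loader:
            v_ik, s_i = v_ik.to(device), s_i.to(device)
            l_agg = torch.mean((s_i - G(v_ik)) ** 2)  # L_agg = ||S_i - G(V_i^k)||_2^2 averaged over pixels
            opt.zero_grad()
            l_agg.backward()                          # gradient of L_agg updates the parameters of G
            opt.step()
    torch.save(G.state_dict(), "aggregation_pretrained.pt")  # hypothetical checkpoint path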
Step 3: load the pre-trained model of the aggregation network G, construct the discrimination network D and the recognition network R, and jointly update the parameters of the aggregation network G with the adversarial loss L_adv and the perceptual loss L_per:
Step 3.1: load the pre-trained model of the aggregation network G to obtain the generated image G(V_i^k) and the corresponding static image S_i;
Step 3.2: construct the discrimination network D. Unlike the discrimination network in a traditional GAN (generative adversarial network), D can not only discriminate real from fake ("real" denoting a static image and "fake" denoting a synthesized image) but also predict the identity of the synthesized image.
Specifically, the output of the discrimination network D is an (N+1)-dimensional vector produced by a softmax function, where N is the number of identity categories; the identity information of the synthesized image is maximally preserved through adversarial learning, and the remaining dimension is used to judge whether the image is real or fake.
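A minimal sketch of the (N+1)-way discrimination network D described above, reusing the ResBlock from the earlier sketch; the stride-1 convolutions, the three stride-2 convolution + residual block stages, the pooling layer and the (N+1)-dimensional fully connected softmax output follow the description, while the channel widths are assumptions.

import torch
import torch.nn as nn

class DiscriminatorNet(nn.Module):
    """Discrimination network D: N identity dimensions plus one real/fake dimension (illustrative sketch)."""
    def __init__(self, num_ids, ch=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, ch, 3, 1, 1), nn.ReLU(True),   # two stride-1 convolutions
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, 2, 1), ResBlock(ch),   # three stride-2 convolution + residual block stages
            nn.Conv2d(ch, ch, 3, 2, 1), ResBlock(ch),
            nn.Conv2d(ch, ch, 3, 2, 1), ResBlock(ch),
            nn.AdaptiveAvgPool2d(1))                    # pooling layer before the classifier
        self.fc = nn.Linear(ch, num_ids + 1)            # (N+1)-dimensional output

    def forward(self, x):
        h = self.features(x).flatten(1)
        return torch.softmax(self.fc(h), dim=1)         # softmax over the N identity dimensions and the real/fake dimension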
Step 3.3: send the generated image G(V_i^k) and its corresponding static image S_i into the discrimination network D and compute the adversarial loss L_adv (the formula is given as an image in the original publication), where D_i denotes the i-th dimension of the output of the discrimination network D. The goal of the discrimination network D is to maximize the loss L_adv, while the goal of the aggregation network is to minimize L_adv;
in other words, when the input of D is a static image S_i, D wants both D_{N+1}(S_i) and D_i(S_i) to be maximized towards 1; when the input of D is a synthesized image G(V_i^k), D wants both D_{N+1}(G(V_i^k)) and D_i(G(V_i^k)) to be minimized towards 0, whereas G wants both to be maximized towards 1. The two networks therefore form adversarial learning both in judging the identity category and in judging real versus fake;
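The exact formula of L_adv is given only as an image in the original publication; the sketch below therefore assumes a standard GAN-style logarithmic form that reproduces the min/max behaviour described above (D pushes D_i and D_{N+1} towards 1 for static images and towards 0 for synthesized images, while G pushes them towards 1). The function name and the label encoding are assumptions.

import torch

def adversarial_losses(D, s_i, g_vik, labels, eps=1e-8):
    """Assumed GAN-style adversarial terms over the identity dimension D_i and the real/fake dimension D_{N+1}."""
    idx = torch.arange(labels.shape[0])
    d_real = D(s_i)                      # discriminator output on static images S_i
    d_fake = D(g_vik.detach())           # discriminator output on synthesized images (detached for the D update)
    rf = d_real.shape[1] - 1             # index of the (N+1)-th, real/fake dimension
    # D maximizes confidence on real images and minimizes it on synthesized images
    loss_D = -(torch.log(d_real[idx, labels] + eps).mean()
               + torch.log(d_real[:, rf] + eps).mean()
               + torch.log(1 - d_fake[idx, labels] + eps).mean()
               + torch.log(1 - d_fake[:, rf] + eps).mean())
    # G tries to make D output 1 on its synthesized images (gradient flows back into G here)
    d_gen = D(g_vik)
    loss_G = -(torch.log(d_gen[idx, labels] + eps).mean()
               + torch.log(d_gen[:, rf] + eps).mean())
    return loss_D, loss_G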
step 3.4, constructing an identification network R, wherein the R network adopts the existing face identification network LightCNN, and an image G (V) is generatedi k) And a static image S corresponding theretoiSending into the recognition network R, calculating the perception loss
Figure BDA0002436395930000061
Where R (-) represents the feature value identifying the penultimate pooling layer of the network, perceptual loss allows the generation of an image G (V)i k) And a still image SiThe method has the advantages that the method is closer to a high-dimensional feature space, the perception similarity is higher, and meanwhile, the most obvious human face details in the synthetic image are kept, so that the method is more beneficial to the recognition process;
the face recognition network LightCNN refers to the article "A Light Cn for Deep faceReposition with noise Labels" of Xiang Wu, published in 2018 IEEE Transactions on information forms and Security 2884-2896.
Step 3.5: combine the aggregation loss L_agg, the adversarial loss L_adv and the perceptual loss L_per as a weighted sum to construct the final loss function L = L_agg + λ·L_adv + α·L_per, with λ = 0.01 and α = 0.003, so that different weights are assigned to the different losses; train the network with the stochastic gradient descent (SGD) algorithm, and save the model parameters after the network model converges;
for the stochastic gradient descent algorithm, see Leon Bottou, "Stochastic Gradient Descent Tricks", 2012, pages 421-436.
Step 4, video face recognition testing process:
Step 4.1: first, the static images of the target set at test time, denoted S = {s_1, s_2, ..., s_j, ..., s_M}, are sent into the recognition network R to obtain the last-layer feature values F = {f_1, f_2, ..., f_j, ..., f_M}, where M denotes the total number of identity categories and f_j denotes the feature of the target-set static image of the person with identity category j;
Step 4.2: capture face pictures in real time with a camera, denote the captured face video sequence of unknown category as V, and feed it into the aggregation network G to obtain a generated image G(V) of unknown category, as shown in fig. 3C;
Step 4.3: send the generated image G(V) into R to obtain the query feature f_v, and compute the Euclidean distance between f_v and each target-set feature in F = {f_1, f_2, ..., f_j, ..., f_M}; the category with the smallest distance is the final recognition result.
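The test procedure above amounts to nearest-neighbour matching in the recognition network's feature space. A minimal sketch follows, assuming the aggregation network G and an R_features extractor that returns a flat feature vector per image; both names are carried over from the earlier sketches.

import torch

def build_gallery(R_features, target_images):
    """Extract the last-layer feature f_j of every target-set static image s_j (sketch)."""
    with torch.no_grad():
        return torch.stack([R_features(s.unsqueeze(0)).squeeze(0) for s in target_images])

def recognize(G, R_features, gallery, v_unknown):
    """Aggregate an unknown video sequence, extract its query feature f_v, and return the closest identity."""
    with torch.no_grad():
        g_v = G(v_unknown.unsqueeze(0))          # generated image G(V) of unknown category
        f_v = R_features(g_v).squeeze(0)         # query feature f_v
        dists = torch.norm(gallery - f_v, dim=1) # Euclidean distances to f_1 ... f_M
    return int(torch.argmin(dists))              # index of the target-set identity with the smallest distance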
Step 5: to demonstrate the performance of the aggregation countermeasure network, the invention is compared on the COX Face video face data set with current state-of-the-art methods such as VGG-Face, GERML, TBE-CNN and Haar-Net. COX Face contains three subsets, V1, V2 and V3; the image quality of the V1 and V2 subsets is much poorer than that of the V3 subset and thus closer to the surveillance scenario.
The comparison results are shown in Table 1. As can be seen from Table 1, the recognition accuracy of the invention on the V1 and V2 subsets is 89.6 and 88.5 respectively, exceeding the second-best algorithm by 0.3 and 0.6; on the V3 subset, whose image quality is better, the invention performs relatively worse and is inferior to the Haar-Net algorithm. Meanwhile, the parameter count and the number of layers of the aggregation countermeasure network constructed by the method are 7.6 M and 34 respectively, which is 5.5 M fewer parameters and 22 fewer layers than Haar-Net, so the aggregation countermeasure network processes more efficiently and computes faster in the same amount of time. Therefore, in the video surveillance scenario the aggregation countermeasure network of the invention is clearly superior to the other methods in terms of both recognition accuracy and computational complexity.
Table 1: comparison results of the application and VGG-Face, GERML, TBE-CNN and Haar-Net methods
(Table 1 is provided as an image in the original publication and is not reproduced here.)
The COX Face video face data set is described in Huang Zhiwu et al., "A Benchmark and Comparative Study of Video-based Face Recognition on COX Face Database", IEEE Transactions on Image Processing, 2015, pages 5967-5981.
VGG-Face is described in Omkar M. Parkhi et al., "Deep Face Recognition", British Machine Vision Conference, 2015, page 6.
GERML is described in Huang Zhiwu et al., "Cross Euclidean-to-Riemannian Metric Learning with Application to Face Recognition from Video", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, pages 2827-2840.
TBE-CNN is described in Changxing Ding et al., "Trunk-Branch Ensemble Convolutional Neural Networks for Video-based Face Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, pages 1002-1014.
Haar-Net is described in Parchami Mostafa et al., "Video-based Face Recognition Using Ensemble of Haar-like Deep Convolutional Neural Networks", International Joint Conference on Neural Networks, 2017, pages 4625-4632.
Some steps in the embodiments of the present invention may be implemented by software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A video face recognition method, characterized in that, in the recognition process, an aggregation countermeasure network is adopted to aggregate multiple low-quality video sequences into a single high-quality frontal face image, and the quality of the generated frontal face image is improved during aggregation through adversarial learning, so that video face recognition is carried out accurately;
the aggregation countermeasure network consists of an aggregation network, a discrimination network and a recognition network; the aggregation network and the discrimination network form adversarial learning, driving the generated image closer to the target-set static image through competition, and the recognition network computes a perceptual loss in a high-dimensional feature space, so that the generated frontal face image is perceptually closer to the corresponding target-set static image.
2. The method of claim 1, wherein the discrimination network outputs an (N+1)-dimensional vector in the form of a softmax multi-dimensional output, where N is the number of identity categories; the remaining dimension indicates whether the corresponding image is real or fake, "real" meaning that the corresponding image is a static image and "fake" meaning that it is a synthesized image.
3. The method of claim 2, wherein the method comprises:
S1: construct the aggregation network G and pre-train it with the aggregation loss L_agg to obtain a pre-trained model of the aggregation network G;
S2: load the pre-trained model of the aggregation network G, construct the discrimination network D and the recognition network R, and compute the adversarial loss L_adv and the perceptual loss L_per;
S3: combine the aggregation loss L_agg, the adversarial loss L_adv and the perceptual loss L_per as a weighted sum to construct the final loss function L = L_agg + λ·L_adv + α·L_per, where λ and α are the weight coefficients of the adversarial loss L_adv and the perceptual loss L_per, so that different weights are assigned to L_agg, L_adv and L_per; train the aggregation network G, and save the model parameters after the pre-trained aggregation network G converges to obtain the aggregation countermeasure network video face recognition model;
S4: test the aggregation countermeasure network video face recognition model obtained in S3; after testing is finished, the model is used in practical applications of video face recognition.
4. The method according to claim 3, wherein the S1 is preceded by:
acquiring a training video sequence data set, denoted V = {v_1, v_2, ..., v_i, ..., v_N}, where v_i denotes the video sequence of the i-th category, i = 1, 2, ..., N, and N is the number of categories of video sequences;
acquiring a static image data set corresponding to V, denoted S = {s_1, s_2, ..., s_i, ..., s_N}, where s_i denotes the static image corresponding to the i-th category.
5. The method according to claim 4, wherein the S1 includes:
generating an image G(V_i^k) with the aggregation network: the input of the aggregation network G is k consecutive video frames belonging to the same category v_i, and its output is the generated image for category v_i, defined as G(V_i^k); k is a hyper-parameter denoting the number of input video frames of the aggregation network, and V_i^k denotes the video sequence of k consecutive frames of the i-th category;
computing the aggregation loss L_agg = || S_i - G(V_i^k) ||_2^2, where S_i denotes the static image of the same category as V_i^k; updating the parameters of the aggregation network G with the gradient ∇L_agg; L_agg is computed with a pixel-level L2 loss function;
after the aggregation network G converges, saving the network model parameters to obtain the pre-trained model of the aggregation network G.
6. The method according to claim 5, wherein the S2 includes:
loading the pre-trained model of the aggregation network G to obtain the generated image G(V_i^k) and the corresponding static image S_i;
constructing the discrimination network D: two convolution layers with stride 1 convert the input image into feature maps, three combinations of a stride-2 convolution layer and a residual block further extract and down-sample the features, a pooling layer then down-samples these features again, and finally a fully connected layer outputs an (N+1)-dimensional vector representing the identity and the real/fake information of the corresponding image;
sending the generated image G(V_i^k) and its corresponding static image S_i into the discrimination network D to compute the adversarial loss L_adv (the formula is given as an image in the original publication), where D_i denotes the i-th dimension of the output of the discrimination network D;
constructing the recognition network R, which adopts the face recognition network LightCNN; sending the generated image G(V_i^k) and its corresponding static image S_i into the recognition network R to compute the perceptual loss L_per (the formula is given as an image in the original publication), where R(·) denotes the feature value of the penultimate pooling layer of the recognition network.
7. The method according to claim 6, wherein in the S3: λ is 0.01 and α is 0.003.
8. The method according to claim 6, wherein the step of testing the aggregation countermeasure network video face recognition model in S4 includes:
during testing, the static images of the target set, denoted S = {s_1, s_2, ..., s_j, ..., s_M}, are sent into the recognition network R to obtain the last-layer feature values F = {f_1, f_2, ..., f_j, ..., f_M}, where M denotes the total number of identity categories and f_j denotes the feature of the target-set static image of the person with identity category j;
capturing face pictures in real time with a camera, denoting the captured face video sequence of unknown category as V, and feeding it into the aggregation network G to obtain a generated image G(V) of unknown category;
sending the generated image G(V) into R to obtain the query feature f_v, and computing the Euclidean distance between f_v and each target-set feature in F = {f_1, f_2, ..., f_j, ..., f_M}; the category with the smallest distance is the final recognition result.
9. The application of the video face recognition method of any one of claims 1-8 in the technical field of face recognition.
10. The application method of claim 9, wherein the technical field of face recognition includes intelligent security, video surveillance and public security investigation.
CN202010253595.7A 2020-04-02 2020-04-02 Video face recognition method based on aggregation countermeasure network Active CN111539263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010253595.7A CN111539263B (en) 2020-04-02 2020-04-02 Video face recognition method based on aggregation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010253595.7A CN111539263B (en) 2020-04-02 2020-04-02 Video face recognition method based on aggregation countermeasure network

Publications (2)

Publication Number Publication Date
CN111539263A (en) 2020-08-14
CN111539263B (en) 2023-08-11

Family

ID=71974857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010253595.7A Active CN111539263B (en) 2020-04-02 2020-04-02 Video face recognition method based on aggregation countermeasure network

Country Status (1)

Country Link
CN (1) CN111539263B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN107977932A (en) * 2017-12-28 2018-05-01 北京工业大学 It is a kind of based on can differentiate attribute constraint generation confrontation network face image super-resolution reconstruction method
CN108537743A (en) * 2018-03-13 2018-09-14 杭州电子科技大学 A kind of face-image Enhancement Method based on generation confrontation network
CN109902546A (en) * 2018-05-28 2019-06-18 华为技术有限公司 Face identification method, device and computer-readable medium
CN108985168A (en) * 2018-06-15 2018-12-11 江南大学 A kind of video face identification method based on the study of minimum normalized cumulant
CN110222628A (en) * 2019-06-03 2019-09-10 电子科技大学 A kind of face restorative procedure based on production confrontation network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姜玉宁 (Jiang Yuning) et al.: "生成式对抗网络模型研究" (Research on Generative Adversarial Network Models) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266946A (en) * 2021-12-31 2022-04-01 智慧眼科技股份有限公司 Feature identification method and device under shielding condition, computer equipment and medium

Also Published As

Publication number Publication date
CN111539263B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN109615582B (en) Face image super-resolution reconstruction method for generating countermeasure network based on attribute description
Sabir et al. Recurrent convolutional strategies for face manipulation detection in videos
CN108537743B (en) Face image enhancement method based on generation countermeasure network
Liu et al. Robust video super-resolution with learned temporal dynamics
Singh et al. Muhavi: A multicamera human action video dataset for the evaluation of action recognition methods
Wang et al. Enhancing unsupervised video representation learning by decoupling the scene and the motion
Li et al. Beyond single reference for training: Underwater image enhancement via comparative learning
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN110889375B (en) Hidden-double-flow cooperative learning network and method for behavior recognition
CN110852152B (en) Deep hash pedestrian re-identification method based on data enhancement
JPH1055444A (en) Recognition of face using feature vector with dct as base
CN110827265B (en) Image anomaly detection method based on deep learning
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN113112519A (en) Key frame screening method based on interested target distribution
CN113112416A (en) Semantic-guided face image restoration method
Zhou et al. Transformer-based multi-scale feature integration network for video saliency prediction
Parde et al. Deep convolutional neural network features and the original image
CN115862103A (en) Method and system for identifying face of thumbnail
CN110188625B (en) Video fine structuring method based on multi-feature fusion
Zhou et al. Msflow: Multiscale flow-based framework for unsupervised anomaly detection
Luan et al. Learning unsupervised face normalization through frontal view reconstruction
CN111539263A (en) Video face recognition method based on aggregation countermeasure network
Revi et al. Gan-generated fake face image detection using opponent color local binary pattern and deep learning technique
CN111967331A (en) Face representation attack detection method and system based on fusion feature and dictionary learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant