CN111539263B - Video face recognition method based on aggregation countermeasure network - Google Patents

Video face recognition method based on aggregation countermeasure network

Info

Publication number
CN111539263B
CN111539263B (application CN202010253595.7A; publication CN111539263A)
Authority
CN
China
Prior art keywords
network
aggregation
image
video
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010253595.7A
Other languages
Chinese (zh)
Other versions
CN111539263A (en)
Inventor
陈莹
金炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202010253595.7A priority Critical patent/CN111539263B/en
Publication of CN111539263A publication Critical patent/CN111539263A/en
Application granted granted Critical
Publication of CN111539263B publication Critical patent/CN111539263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video face recognition method based on an aggregation countermeasure network, belonging to the technical field of video face recognition. The method uses an aggregation countermeasure network built from an aggregation network, a discrimination network and a recognition network. The aggregation network and the discrimination network form an adversarial pair, and the competition between them makes the generated image more similar to the corresponding static image of the target set; the recognition network computes a perceptual loss in a high-dimensional feature space, so that the generated image and the corresponding target-set static image also become closer perceptually, which further improves the aggregation network. The discrimination network uses a multidimensional softmax output: besides judging whether an image is real or synthesized, it also discriminates the identity category of the image, so the identity of the generated image stays close to the ground truth and subsequent recognition is both more accurate and more efficient.

Description

Video face recognition method based on aggregation countermeasure network
Technical Field
The application relates to a video face recognition method based on an aggregation countermeasure network, and belongs to the technical field of video face recognition.
Background
Video face recognition, as the name implies, performs face recognition on video. With the continual growth of both technology and demand, video face recognition has been applied in many fields, such as intelligent security, video surveillance and public security investigation.
Video face recognition differs from face recognition based on a single image: its query set is a video sequence while its target set usually consists of high-definition face images, and the identity of the person in the video is recognized by extracting face features from the video sequence and matching them against the target set.
However, in video surveillance, the most common scenario for video face recognition, the faces in the video sequence often suffer from motion blur, noise, occlusion and similar degradations, so they differ greatly from the target-set faces. Neither conventional methods nor current deep-learning-based methods handle these differences well, and the recognition results are therefore poor.
In addition, existing video face recognition methods extract features from the video sequence frame by frame, which not only makes testing slow but also leaves the recognition result easily disturbed by low-quality frames in the sequence.
Disclosure of Invention
In order to solve the low efficiency and low accuracy of existing video face recognition technology, the application provides a video face recognition method in which an aggregation countermeasure network aggregates multiple low-quality video frames into a single high-quality face image during recognition, and the quality of the generated face image is improved through adversarial learning during aggregation, so that video face recognition can be performed accurately.
The aggregation countermeasure network consists of an aggregation network, a discrimination network and a recognition network. The aggregation network and the discrimination network form an adversarial pair, and their competition makes the generated image more similar to the target-set static image; the recognition network computes a perceptual loss in a high-dimensional feature space, so that the generated frontal face image and the corresponding target-set static image also become more similar perceptually.
Optionally, the discrimination network uses a multidimensional softmax output and produces an (N+1)-dimensional vector, where N is the number of identity categories and the remaining dimension indicates whether the corresponding image is real or fake: "real" means the corresponding image is a static image, and "fake" means it is a synthesized image.
Optionally, the method includes:
S1, constructing an aggregation network G and pre-training it with the aggregation loss L_agg to obtain a pre-trained model of the aggregation network G;
S2, loading the pre-trained model of the aggregation network G, constructing a discrimination network D and a recognition network R, and computing the adversarial loss L_adv and the perceptual loss L_per;
S3, combining the aggregation loss L_agg, the adversarial loss L_adv and the perceptual loss L_per as a weighted sum to construct the final loss function L = L_agg + λL_adv + αL_per, where λ and α are the weight coefficients of the adversarial loss L_adv and the perceptual loss L_per respectively, so that the aggregation loss L_agg, the adversarial loss L_adv and the perceptual loss L_per receive different weights; training the aggregation network G with this loss and saving the model parameters after the pre-trained aggregation network G converges, which yields the aggregation countermeasure network video face recognition model;
S4, testing the aggregation countermeasure network video face recognition model obtained in S3 and, once testing is complete, using it for practical video face recognition.
Optionally, the step S1 further includes:
acquiring a training video sequence data set, denoted V = {v_1, v_2, ..., v_i, ..., v_N}, where v_i denotes the video sequence of the i-th category, i = 1, 2, ..., N, and N is the number of video sequence categories;
acquiring the static image data set corresponding to V, denoted S = {s_1, s_2, ..., s_i, ..., s_N}, where s_i denotes the static image of the i-th category.
Optionally, S1 includes:
generating an image G(V_i^k) with the aggregation network: the input to the aggregation network G is a set of video frames belonging to the same category v_i, and the output is a single high-quality face image of the corresponding category v_i; the generated image is defined as G(V_i^k), where k is a hyperparameter denoting the number of video frames fed to the aggregation network and V_i^k denotes a sequence of k consecutive frames of the i-th category;
computing the aggregation loss L_agg = ||G(V_i^k) - S_i||_2^2, where S_i denotes the static image of the same class as V_i^k, and updating the parameters of the aggregation network G by the gradient of L_agg; L_agg is computed as a pixel-level L2 loss;
after the aggregation network G converges, the network model parameters are saved, and the aggregation network G pre-training model is obtained.
Optionally, S2 includes:
loading the pre-trained model of the aggregation network G to obtain the generated image G(V_i^k) and the corresponding static image S_i;
constructing a discrimination network D: the original image is first converted into a feature map by two convolution layers with stride 1, the features then pass through three combinations of a stride-2 convolution layer and a residual block, the resulting features are downsampled by a pooling layer, and finally a fully connected layer outputs an (N+1)-dimensional vector representing the identity and the real/fake status of the corresponding image;
feeding the generated image G(V_i^k) and the corresponding static image S_i into the discrimination network D to compute the adversarial loss L_adv, where D_i denotes the i-th dimension of the output of the discrimination network D;
constructing a recognition network R, which uses the face recognition network LightCNN: the generated image G(V_i^k) and the corresponding static image S_i are fed into the recognition network R to compute the perceptual loss L_per, where R(·) denotes the feature values of the penultimate pooling layer of the recognition network.
Optionally, in S3: λ=0.01, α=0.003.
Optionally, the process of testing the aggregation countermeasure network video face recognition model in S4 includes:
the static images of the target set at test time are denoted S = {s_1, s_2, ..., s_j, ..., s_M}; they are fed into the recognition network R one by one to obtain the last-layer feature values F = {f_1, f_2, ..., f_j, ..., f_M}, where M denotes the total number of identity categories and f_j denotes the feature of the target-set static image of the person with identity category j;
capturing face images in real time with a camera, denoting the captured face video sequence of unknown category as V, and using it as the input of the aggregation network G to obtain a generated image G(V) of the unknown category;
feeding the generated image G(V) into R to obtain the query feature f_v, computing the Euclidean distance between the generated-image feature f_v and each target-set feature in F = {f_1, f_2, ..., f_j, ..., f_M}, and taking the category with the smallest distance as the final recognition result.
The application also provides the use of the above video face recognition method in the technical field of face recognition.
Optionally, the technical field of face recognition includes intelligent security, video surveillance and public security investigation.
The application has the beneficial effects that:
according to the application, the image generation technology is integrated into video face recognition, a plurality of low-quality video sequences are aggregated into a single high-quality front face image through the aggregation network, the defect of extracting image features frame by frame in the existing video face recognition technology is overcome, and the video face recognition efficiency is improved.
The aggregation countermeasure network constructed by the application consists of an aggregation network, a discrimination network and an identification network, wherein the aggregation network and the discrimination network form countermeasure learning, so that the generated image and the target set static image are more similar in a competition mode; the perception loss is calculated in the high-dimensional feature space through the identification network, so that the generated image and the corresponding target set static image are closer in perception performance, and the performance of the aggregation network is improved.
The discrimination network designed by the application adopts a mode of softmax multidimensional output, can judge the true and false of the image, can also distinguish the identity type of the image, ensures that the identity type of the generated image is consistent with the static image of the target set through the countermeasures containing the identity type information, ensures that the identity of the generated image is closer to the true value, and ensures that the subsequent recognition is more accurate and has higher recognition efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a video face recognition technology based on an aggregation countermeasure network provided by the application.
Fig. 2 is a network configuration diagram of an aggregation countermeasure network used in the present application.
Fig. 3A shows a partial subset of the video sequence data set used in the application.
Fig. 3B shows the ground-truth static images corresponding to the video sequences of Fig. 3A.
Fig. 3C shows the images finally synthesized by the application from video sequences.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Embodiment one:
the embodiment provides a video face recognition method based on an aggregation countermeasure network, referring to fig. 1, the method includes:
step 1, acquiring a training set comprising a video sequence data set V and a corresponding static image data set S:
Step 1.1, acquiring a training video sequence data set, denoted V = {v_1, v_2, ..., v_i, ..., v_N}, where v_i denotes the video sequence of the i-th category, i = 1, 2, ..., N, and N is the number of video sequence categories;
in practice, N represents the number of different people present in V, and video sequences corresponding to the same person are referred to as one class.
Step 1.2, acquiring the static image data set corresponding to V, denoted S = {s_1, s_2, ..., s_i, ..., s_N}, where s_i denotes the static image of the i-th category;
In practical applications, S may be captured with a high-definition camera; in some real video surveillance scenarios, the pictures in S are typically ID-card photos or specially captured photos.
The video sequence data set V is shown in Fig. 3A and contains cases of occlusion, motion blur, noise and side faces; the static image data set S, shown in Fig. 3B, is photographed under good conditions and consists of clear frontal face images.
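For concreteness, the following is a minimal PyTorch sketch of how such paired training data could be organized; the directory layout (video/<class_id>/<frame>.jpg and static/<class_id>.jpg), the image size and the default value of k are illustrative assumptions rather than part of the application.

```python
# Hypothetical layout (assumption): video/<class_id>/<frame>.jpg and static/<class_id>.jpg
import os
import random

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class VideoFacePairs(Dataset):
    """Returns (V_i^k, S_i, i): k consecutive frames of class i plus its static image."""

    def __init__(self, root, k=4, size=128):
        self.root, self.k = root, k
        self.classes = sorted(os.listdir(os.path.join(root, "video")))
        self.to_tensor = transforms.Compose(
            [transforms.Resize((size, size)), transforms.ToTensor()])

    def __len__(self):
        return len(self.classes)

    def __getitem__(self, i):
        cls = self.classes[i]
        frame_dir = os.path.join(self.root, "video", cls)
        frames = sorted(os.listdir(frame_dir))
        start = random.randint(0, len(frames) - self.k)      # pick k consecutive frames
        clip = [self.to_tensor(Image.open(os.path.join(frame_dir, f)).convert("RGB"))
                for f in frames[start:start + self.k]]
        v = torch.cat(clip, dim=0)                           # (3k, H, W), frames stacked on channels
        s = self.to_tensor(Image.open(
            os.path.join(self.root, "static", cls + ".jpg")).convert("RGB"))
        return v, s, i                                       # video frames, static image, class index
```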
Step 2, constructing an aggregation network G and pre-training it with the aggregation loss L_agg:
The overall framework of the aggregation countermeasure network is shown in Fig. 2; in this embodiment it is composed of three networks: the aggregation network, the discrimination network and the recognition network.
Step 2.1, generating an image G(V_i^k) with the aggregation network;
The input to the aggregation network G is a set of video frames belonging to the same category v_i, and the output is a single high-quality face image of the corresponding category v_i; the generated image is defined as G(V_i^k), where k is a hyperparameter denoting the number of video frames fed to the aggregation network and V_i^k denotes a sequence of k consecutive frames of the i-th category.
The aggregation network G uses an encoder-decoder structure. As shown in Fig. 2, it first extracts shallow features with two convolution layers of stride 1, then downsamples (encodes) the shallow features with three combinations of a stride-2 convolution and a residual block, then upsamples (decodes) with combinations of deconvolution and residual blocks to obtain features of the same size as the original image, and finally produces the final high-definition face image through two convolution operations and a sigmoid function.
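The following is a minimal PyTorch sketch of this encoder-decoder structure; the channel widths, the single residual block per stage and the choice of feeding the k frames stacked along the channel dimension are assumptions made for illustration and are not fixed by the application.

```python
# Minimal sketch of the encoder-decoder aggregation network G described above.
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)


class AggregationNet(nn.Module):
    def __init__(self, k=4, base=64):
        super().__init__()
        self.shallow = nn.Sequential(              # two stride-1 convolutions
            nn.Conv2d(3 * k, base, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base, 3, 1, 1), nn.ReLU(inplace=True))
        enc, c = [], base
        for _ in range(3):                         # three (stride-2 conv + residual block) stages
            enc += [nn.Conv2d(c, c * 2, 3, 2, 1), nn.ReLU(inplace=True), ResBlock(c * 2)]
            c *= 2
        self.encoder = nn.Sequential(*enc)
        dec = []
        for _ in range(3):                         # three (deconvolution + residual block) stages
            dec += [nn.ConvTranspose2d(c, c // 2, 4, 2, 1), nn.ReLU(inplace=True), ResBlock(c // 2)]
            c //= 2
        self.decoder = nn.Sequential(*dec)
        self.head = nn.Sequential(                 # two convolutions + sigmoid -> face image
            nn.Conv2d(c, base, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(base, 3, 3, 1, 1), nn.Sigmoid())

    def forward(self, v):                          # v: (B, 3k, H, W) stacked frames V_i^k
        return self.head(self.decoder(self.encoder(self.shallow(v))))
```

Stacking the frames on the channel axis is only one plausible way to present the k input frames to the network.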
Step 2.2, computing the aggregation loss L_agg = ||G(V_i^k) - S_i||_2^2, where S_i denotes the static image of the same class as V_i^k, and updating the parameters of the aggregation network G by the gradient of L_agg; the pixel-level L2 loss function speeds up the convergence of the network;
step 2.3, after convergence of the aggregation network G, saving network model parameters for subsequent formal training;
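A minimal sketch of this pre-training stage is given below, reusing the VideoFacePairs and AggregationNet sketches above; the optimizer, batch size, learning rate and epoch count are assumptions, since the description only specifies the pixel-level L2 loss L_agg.

```python
# Pre-training of G with the pixel-level L2 aggregation loss L_agg (step 2).
import torch
from torch.utils.data import DataLoader


def pretrain_aggregation(G, dataset, epochs=20, lr=1e-4, device="cuda"):
    loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=4)
    opt = torch.optim.Adam(G.parameters(), lr=lr)       # optimizer choice is an assumption
    G.to(device).train()
    for _ in range(epochs):
        for v, s, _ in loader:                           # V_i^k frames and static image S_i
            v, s = v.to(device), s.to(device)
            loss_agg = torch.mean((G(v) - s) ** 2)       # pixel-level L2 loss L_agg
            opt.zero_grad()
            loss_agg.backward()
            opt.step()
    torch.save(G.state_dict(), "g_pretrained.pth")       # saved for the adversarial stage
```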
Step 3, loading the pre-trained model of the aggregation network G, constructing a discrimination network D and a recognition network R, and adding the adversarial loss L_adv and the perceptual loss L_per to jointly update the parameters of the aggregation network G:
Step 3.1, loading the pre-trained model of the aggregation network G to obtain the generated image G(V_i^k) and the corresponding static image S_i;
Step 3.2, constructing a discrimination network D. Unlike the discriminator in a conventional GAN (Generative Adversarial Network), the discrimination network D of the application not only discriminates real from fake (real denotes a static image, fake denotes a synthesized image) but also predicts the identity of the synthesized image.
Specifically, the output of the discrimination network D is an (N+1)-dimensional vector produced by a softmax function, where N is the number of identity classes and the remaining dimension judges real versus fake; through adversarial learning, the synthesized image is pushed to retain as much identity information as possible.
Step 3.3, an image G (V i k ) Corresponding static image S i Into the discrimination network D to calculate the countermeasures against lossWherein D is i Ith dimension representing a discrimination network DAnd outputting. For discriminating network D, its goal is to maximize the countering loss L adv Whereas for an aggregation network it is to combat the loss L adv Minimizing;
in other words, when the input of D is a still image S i When D is desired D N+1 (S i ) And D i (S i ) All maximized to 1; when the input of D is a composite image G (V i k ) When D is desiredAnd D i (G(V i k ) All minimized to 0, while G expects D N+1 (G(V i k ) D) and D i (G(V i k ) 1) so that both form an countermeasure study in judging identity class and judging true or false;
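The exact expression of L_adv is not reproduced in this text. The sketch below therefore implements one assumed GAN-style log-likelihood form that matches the behaviour just described (the discrimination network pushes D_{N+1} and D_i toward 1 for static images and toward 0 for synthesized images, while the aggregation network pulls them back toward 1), together with a discriminator that follows the structure of step 3.2; the channel widths are arbitrary and the ResBlock module from the aggregation network sketch above is reused.

```python
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """(N+1)-dim softmax output: N identity classes plus one real/fake dimension.
    Reuses the ResBlock defined in the aggregation network sketch above."""

    def __init__(self, n_classes, base=64):
        super().__init__()
        layers = [nn.Conv2d(3, base, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True),
                  nn.Conv2d(base, base, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True)]
        c = base
        for _ in range(3):                                   # three stride-2 conv + residual stages
            layers += [nn.Conv2d(c, c * 2, 3, 2, 1),
                       nn.LeakyReLU(0.2, inplace=True), ResBlock(c * 2)]
            c *= 2
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)                  # pooling layer
        self.fc = nn.Linear(c, n_classes + 1)                # fully connected layer -> N+1 dims

    def forward(self, x):
        h = self.pool(self.features(x)).flatten(1)
        return torch.softmax(self.fc(h), dim=1)              # last dimension D_{N+1} = "real"


def adversarial_loss(D, fake, real, labels, eps=1e-8):
    """Assumed form of L_adv over the identity dimension D_i and the real/fake dimension
    D_{N+1}; the discrimination network maximizes it, the aggregation network minimizes it."""
    d_real, d_fake = D(real), D(fake)
    idx = labels.view(-1, 1)
    l_adv = (torch.log(d_real[:, -1:] + eps) + torch.log(d_real.gather(1, idx) + eps)
             + torch.log(1 - d_fake[:, -1:] + eps) + torch.log(1 - d_fake.gather(1, idx) + eps))
    return l_adv.mean()
```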
step 3.4, constructing a recognition network R, wherein the R network adopts the existing face recognition network LightCNN to generate an image G (V i k ) And a corresponding still image S i Into the recognition network R, calculating the perceived lossWhere R (-) represents a characteristic value identifying the penultimate pooling layer of the network, the perceived loss lets the image G (V i k ) And still image S i The method has the advantages that the method is closer in high-dimensional feature space, the perceived similarity is higher, and meanwhile, the most obvious face details in the synthesized image are reserved, so that the recognition process is facilitated;
the face recognition network LightCNN may refer to "A Light Cnn for Deep Face Representation with Noisy Labels" by Xiang Wu, which is published on pages 2884-2896 of IEEE Transactions on Information Forensics and Security in 2018.
Step 3.5, combining the aggregation loss L_agg, the adversarial loss L_adv and the perceptual loss L_per as a weighted sum to construct the final loss function L = L_agg + λL_adv + αL_per, with λ = 0.01 and α = 0.003, so that the different losses receive different weight coefficients; the network is trained with stochastic gradient descent (SGD), and the model parameters are saved after the network model converges;
The stochastic gradient descent algorithm is described in the document "Stochastic Gradient Descent Tricks" by Leon Bottou, published in 2012 on pages 421-436 of "Neural Networks: Tricks of the Trade".
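Putting the pieces together, one joint training iteration of step 3.5 might look as follows; the alternating update of D and G, the SGD hyperparameters and the use of a CUDA device are assumptions, while the weighted sum L = L_agg + λL_adv + αL_per with λ = 0.01 and α = 0.003 comes from the description.

```python
import torch


def train_step(G, D, perceptual, v, s, labels, opt_g, opt_d, lam=0.01, alpha=0.003):
    """One joint update: D ascends L_adv, then G descends L = L_agg + lam*L_adv + alpha*L_per."""
    v, s, labels = v.cuda(), s.cuda(), labels.cuda()

    # Update the discrimination network D (maximize L_adv, i.e. minimize -L_adv).
    fake = G(v).detach()
    opt_d.zero_grad()
    (-adversarial_loss(D, fake, s, labels)).backward()
    opt_d.step()

    # Update the aggregation network G with the weighted sum of the three losses.
    fake = G(v)
    l_agg = torch.mean((fake - s) ** 2)                      # pixel-level L2 loss L_agg
    l_adv = adversarial_loss(D, fake, s, labels)             # adversarial loss L_adv
    l_per = perceptual(fake, s)                              # perceptual loss L_per
    loss = l_agg + lam * l_adv + alpha * l_per               # L = L_agg + λL_adv + αL_per
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()


# Illustrative optimizers (SGD as in step 3.5; learning rates are assumptions):
# opt_g = torch.optim.SGD(G.parameters(), lr=1e-3, momentum=0.9)
# opt_d = torch.optim.SGD(D.parameters(), lr=1e-3, momentum=0.9)
```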
Step 4, a video face recognition testing process:
Step 4.1, the static images of the target set at test time are denoted S = {s_1, s_2, ..., s_j, ..., s_M}; they are fed into the recognition network R one by one to obtain the last-layer feature values F = {f_1, f_2, ..., f_j, ..., f_M}, where M denotes the total number of identity categories and f_j denotes the feature of the target-set static image of the person with identity category j;
Step 4.2, capturing face images in real time with a camera, denoting the captured face video sequence of unknown category as V, and feeding it to the aggregation network G to obtain a generated image G(V) of the unknown category, as shown in Fig. 3C;
Step 4.3, feeding the generated image G(V) into R to obtain the query feature f_v, computing the Euclidean distances between the generated-image feature f_v and the target-set features F = {f_1, f_2, ..., f_j, ..., f_M}, and taking the category with the smallest distance as the final recognition result.
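A minimal sketch of this test procedure, assuming the gallery features F have already been extracted with the recognition network R and stacked into a single tensor:

```python
import torch


@torch.no_grad()
def identify(G, R, gallery_feats, video_frames):
    """Aggregate the query frames, extract the query feature f_v and return the index
    of the gallery identity with the smallest Euclidean distance.
    gallery_feats: (M, d) tensor of last-layer features of the target-set static images.
    video_frames:  (3k, H, W) tensor of k stacked query frames of unknown identity."""
    G.eval()
    R.eval()
    generated = G(video_frames.unsqueeze(0))        # G(V): single high-quality face image
    f_v = R(generated)                              # query feature f_v, shape (1, d)
    dists = torch.cdist(f_v, gallery_feats)         # Euclidean distances to every f_j
    return int(dists.argmin(dim=1))                 # class with the smallest distance
```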
To demonstrate the benefit that aggregation brings to network performance, the application is compared with the current state-of-the-art methods VGG-Face, GERML, TBE-CNN and Haar-Net on the COX Face video face data set. COX Face contains three subsets, V1, V2 and V3; the image quality of V1 and V2 is much worse than that of V3, which makes them closer to a surveillance scenario.
As shown in Table 1, the recognition accuracy of the application reaches 89.6 and 88.5 on the V1 and V2 subsets respectively, exceeding the second-best algorithm by 0.3 and 0.6; on the V3 subset, whose image quality is better, the result is comparatively weaker but still second only to the Haar-Net algorithm. Meanwhile, the aggregation countermeasure network constructed by the application has 5.5M parameters and 22 network layers, compared with 7.6M parameters and 34 layers for Haar-Net, so the aggregation countermeasure network processes data more efficiently and computes more in the same amount of time. The aggregation countermeasure network of the application is therefore clearly superior to the other methods in both recognition accuracy and computational complexity in the surveillance video scenario.
Table 1: comparison results of the present application with VGG-Face, GERML, TBE-CNN, haar-Net method
The COX Face video Face dataset may be referred to Huang Zhiwu, "A Benchmark and Comparative Study of Video-based Face Recognition on Cox Face Database," which is published 2015 on pages 5967-5981 of IEEE Transactions on Image Processing.
VGG-Face can be referred to "Deep Face Recognition" by Omkar M.Parkhi, which is published on page 6 of British Machine Vision Conference at 2015.
GERML is described in Huang Zhiwu, "Cross euclidean-to-riemannian metric learning with application to face recognition from video", which is published in 2018 at IEEE Transactions on Pattern Analysis and Machine Intelligence on pages 2827-2840.
TBE-CNN can be referred to as "Trunk-branch Ensemble Convolutional Neural Networks for Video-based Face Recognition" by Changxing Ding, which is published in 2018 on pages 1002-1014 of IEEE Transactions on Pattern Analysis and Machine Intelligence.
Haar-Net can be referred to Parchami Mostafa's "Video-based Face Recognition Using Ensemble of Haar-like Deep Convolutional Neural Networks", which is published in 2017 at International Joint Conference on Neural Networks, pages 4625-4632.
Some steps in the embodiments of the present application may be implemented by using software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims (6)

1. A video face recognition method, characterized in that, during recognition, an aggregation countermeasure network aggregates a plurality of low-quality video frames into a single high-quality frontal face image, and the quality of the generated frontal face image is improved through adversarial learning during aggregation, so that video face recognition is performed accurately;
the aggregation countermeasure network consists of an aggregation network, a discrimination network and a recognition network, wherein the aggregation network and the discrimination network form an adversarial pair whose competition makes the generated image more similar to the target-set static image, and the recognition network computes a perceptual loss in a high-dimensional feature space so that the generated frontal face image and the corresponding target-set static image become more similar perceptually;
the method comprises the following steps:
S1, constructing an aggregation network G and pre-training it with the aggregation loss L_agg to obtain a pre-trained model of the aggregation network G;
S2, loading the pre-trained model of the aggregation network G, constructing a discrimination network D and a recognition network R, and computing the adversarial loss L_adv and the perceptual loss L_per;
S3, combining the aggregation loss L_agg, the adversarial loss L_adv and the perceptual loss L_per as a weighted sum to construct the final loss function L = L_agg + λL_adv + αL_per, where λ and α are the weight coefficients of the adversarial loss L_adv and the perceptual loss L_per respectively, so that the aggregation loss L_agg, the adversarial loss L_adv and the perceptual loss L_per receive different weights; training the aggregation network G with this loss and saving the model parameters after the pre-trained aggregation network G converges, which yields the aggregation countermeasure network video face recognition model;
S4, testing the aggregation countermeasure network video face recognition model obtained in S3 and, once testing is complete, using it for practical video face recognition;
the step S1 further includes:
acquiring a training video sequence data set, denoted V = {v_1, v_2, ..., v_i, ..., v_N}, where v_i denotes the video sequence of the i-th category, i = 1, 2, ..., N, and N is the number of video sequence categories;
acquiring the static image data set corresponding to V, denoted S = {s_1, s_2, ..., s_i, ..., s_N}, where s_i denotes the static image of the i-th category;
the S1 comprises the following steps:
generating an image G(V_i^k) with the aggregation network: the input to the aggregation network G is a set of video frames belonging to the same category v_i, and the output is a single high-quality face image of the corresponding category v_i; the generated image is defined as G(V_i^k), where k is a hyperparameter denoting the number of video frames fed to the aggregation network and V_i^k denotes a sequence of k consecutive frames of the i-th category;
computing the aggregation loss L_agg = ||G(V_i^k) - S_i||_2^2, where S_i denotes the static image of the same class as V_i^k, and updating the parameters of the aggregation network G by the gradient of L_agg; L_agg is computed as a pixel-level L2 loss;
after convergence of the aggregation network G, saving network model parameters to obtain an aggregation network G pre-training model;
the step S2 comprises the following steps:
loading the pre-trained model of the aggregation network G to obtain the generated image G(V_i^k) and the corresponding static image S_i;
constructing a discrimination network D: the original image is converted into a feature map by two convolution layers with stride 1, the features then pass through three combinations of a stride-2 convolution layer and a residual block, the resulting features are downsampled by a pooling layer, and finally a fully connected layer outputs an (N+1)-dimensional vector representing the identity and the real/fake status of the corresponding image;
feeding the generated image G(V_i^k) and the corresponding static image S_i into the discrimination network D to compute the adversarial loss L_adv, where D_i denotes the i-th dimension of the output of the discrimination network D;
constructing a recognition network R, which uses the face recognition network LightCNN: the generated image G(V_i^k) and the corresponding static image S_i are fed into the recognition network R to compute the perceptual loss L_per, where R(·) denotes the feature values of the penultimate pooling layer of the recognition network.
2. The method of claim 1, wherein the discrimination network uses a multidimensional softmax output to produce an (N+1)-dimensional vector; N is the number of identity categories, and the remaining dimension indicates whether the corresponding image is real or fake, where "real" means the corresponding image is a static image and "fake" means it is a synthesized image.
3. The method according to claim 2, wherein in S3: λ=0.01, α=0.003.
4. The method according to claim 3, wherein the step of testing the aggregation countermeasure network video face recognition model in S4 includes:
denoting the static images of the target set at test time as S = {s_1, s_2, ..., s_j, ..., s_M} and feeding them into the recognition network R one by one to obtain the last-layer feature values F = {f_1, f_2, ..., f_j, ..., f_M}, where M denotes the total number of identity categories and f_j denotes the feature of the target-set static image of the person with identity category j;
capturing face images in real time with a camera, denoting the captured face video sequence of unknown category as V, and using it as the input of the aggregation network G to obtain a generated image G(V) of the unknown category;
feeding the generated image G(V) into R to obtain the query feature f_v, computing the Euclidean distance between the generated-image feature f_v and each target-set feature in F = {f_1, f_2, ..., f_j, ..., f_M}, and taking the category with the smallest distance as the final recognition result.
5. Use of the video face recognition method of any one of claims 1-4 in the field of face recognition technology.
6. The use according to claim 5, wherein the technical field of face recognition includes intelligent security, video surveillance and public security investigation.
CN202010253595.7A 2020-04-02 2020-04-02 Video face recognition method based on aggregation countermeasure network Active CN111539263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010253595.7A CN111539263B (en) 2020-04-02 2020-04-02 Video face recognition method based on aggregation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010253595.7A CN111539263B (en) 2020-04-02 2020-04-02 Video face recognition method based on aggregation countermeasure network

Publications (2)

Publication Number Publication Date
CN111539263A CN111539263A (en) 2020-08-14
CN111539263B true CN111539263B (en) 2023-08-11

Family

ID=71974857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010253595.7A Active CN111539263B (en) 2020-04-02 2020-04-02 Video face recognition method based on aggregation countermeasure network

Country Status (1)

Country Link
CN (1) CN111539263B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266946A (en) * 2021-12-31 2022-04-01 智慧眼科技股份有限公司 Feature identification method and device under shielding condition, computer equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977932A (en) * 2017-12-28 2018-05-01 北京工业大学 It is a kind of based on can differentiate attribute constraint generation confrontation network face image super-resolution reconstruction method
CN108537743A (en) * 2018-03-13 2018-09-14 杭州电子科技大学 A kind of face-image Enhancement Method based on generation confrontation network
CN108985168A (en) * 2018-06-15 2018-12-11 江南大学 A kind of video face identification method based on the study of minimum normalized cumulant
CN109902546A (en) * 2018-05-28 2019-06-18 华为技术有限公司 Face identification method, device and computer-readable medium
CN110222628A (en) * 2019-06-03 2019-09-10 电子科技大学 A kind of face restorative procedure based on production confrontation network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11024009B2 (en) * 2016-09-15 2021-06-01 Twitter, Inc. Super resolution using a generative adversarial network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977932A (en) * 2017-12-28 2018-05-01 北京工业大学 It is a kind of based on can differentiate attribute constraint generation confrontation network face image super-resolution reconstruction method
CN108537743A (en) * 2018-03-13 2018-09-14 杭州电子科技大学 A kind of face-image Enhancement Method based on generation confrontation network
CN109902546A (en) * 2018-05-28 2019-06-18 华为技术有限公司 Face identification method, device and computer-readable medium
CN108985168A (en) * 2018-06-15 2018-12-11 江南大学 A kind of video face identification method based on the study of minimum normalized cumulant
CN110222628A (en) * 2019-06-03 2019-09-10 电子科技大学 A kind of face restorative procedure based on production confrontation network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Yuning et al., "Research on Generative Adversarial Network Models", Journal of Qingdao University (Natural Science Edition), 2019, Vol. 32, No. 3, full text. *

Also Published As

Publication number Publication date
CN111539263A (en) 2020-08-14

Similar Documents

Publication Publication Date Title
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
Sabir et al. Recurrent convolutional strategies for face manipulation detection in videos
CN109615582B (en) Face image super-resolution reconstruction method for generating countermeasure network based on attribute description
CN108537743B (en) Face image enhancement method based on generation countermeasure network
Liu et al. Robust video super-resolution with learned temporal dynamics
Kim et al. Fully deep blind image quality predictor
CN112818862B (en) Face tampering detection method and system based on multi-source clues and mixed attention
Lin et al. Real photographs denoising with noise domain adaptation and attentive generative adversarial network
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
Wen et al. VIDOSAT: High-dimensional sparsifying transform learning for online video denoising
CN110580472A (en) video foreground detection method based on full convolution network and conditional countermeasure network
Wang et al. SmsNet: A new deep convolutional neural network model for adversarial example detection
Parde et al. Face and image representation in deep CNN features
CN110827265A (en) Image anomaly detection method based on deep learning
CN113112416A (en) Semantic-guided face image restoration method
Parde et al. Deep convolutional neural network features and the original image
Krishnan et al. SwiftSRGAN-Rethinking super-resolution for efficient and real-time inference
CN116453232A (en) Face living body detection method, training method and device of face living body detection model
CN111310516B (en) Behavior recognition method and device
CN111539263B (en) Video face recognition method based on aggregation countermeasure network
CN110659679B (en) Image source identification method based on adaptive filtering and coupling coding
Revi et al. Gan-generated fake face image detection using opponent color local binary pattern and deep learning technique
Prajapati et al. Mri-gan: A generalized approach to detect deepfakes using perceptual image assessment
CN115294424A (en) Sample data enhancement method based on generation countermeasure network
Žižakić et al. Efficient local image descriptors learned with autoencoders

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant