CN111339988B

CN111339988B - Video face recognition method based on dynamic interval loss function and probability characteristic

Info

Publication number: CN111339988B
Application number: CN202010166807.8A
Authority: CN
Inventors: 柯逍; 郑毅腾; 朱敏琛
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2023-04-07
Anticipated expiration: 2040-03-11
Also published as: CN111339988A

Abstract

The invention relates to a video face recognition method based on a dynamic interval loss function and probability characteristics, which comprises the following steps: step S1: training a recognition network through a face recognition training set; step S2: adopting a trained recognition network as a feature extraction module, and training an uncertainty module through the same training set; and step S3: aggregating the input video feature set by using the learned uncertainty as the importance degree of the features to obtain aggregated features; and step S4: and comparing the aggregated features by using the mutual likelihood fraction to complete the final recognition. The method can effectively identify the face in the video.

Description

Video face recognition method based on dynamic interval loss function and probability characteristic

Technical Field

The invention relates to the field of pattern recognition and computer vision, in particular to a video face recognition method based on a dynamic interval loss function and probability characteristics.

Background

In recent years, deep convolutional neural networks have been successful in the field of computer vision, and the face recognition method based on deep learning also utilizes the advantages of the deep convolutional neural networks in the aspect of feature extraction, and continuously creates new records on public data sets and develops greatly. In addition, there are also an increasing number of researchers publishing papers related to face recognition in various computer vision meetings. Because face recognition has a wide application field and a great commercial value, new face recognition technology is continuously explored in both academic and industrial fields, and in recent years, with the help of a great breakthrough of deep learning and a convolutional neural network in the field of computer vision, face recognition algorithms continuously refresh records on various public reference data sets and generate a plurality of standard products in the industrial field.

Although the face recognition technology has made great progress, many challenges are faced in real environment, and many factors such as illumination, posture, shading, age and the like affect the performance of face recognition.

Disclosure of Invention

The invention aims to provide a video face recognition method based on a dynamic interval loss function and probability characteristics, which can effectively recognize faces in a video.

In order to realize the purpose, the invention adopts the technical scheme that: a video face recognition method based on a dynamic interval loss function and probability characteristics comprises the following steps:

step S1: training a recognition network through a face recognition training set;

step S2: adopting a trained recognition network as a feature extraction module, and training an uncertainty module through the same training set;

and step S3: aggregating the input video feature set by using the learned uncertainty as the importance degree of the features to obtain aggregated features;

and step S4: and comparing the aggregated features by using the mutual likelihood fraction to complete the final recognition.

Further, the step S1 specifically includes the following steps:

step S11: acquiring a public face recognition training set from a network, and acquiring related labels of training data;

step S12: outputting positions of a face bounding box and key points of a face by adopting a pre-trained Retina face detection model for face images in a face recognition training set, aligning the face by applying similarity transformation, subtracting a mean value from pixel values of all input face images, and normalizing;

step S13: adopting 18 layers of ResNet as a network model for extracting the depth features of the human face, and replacing the first 7 multiplied by 7 convolution kernels with 3 multiplied by 3 convolution kernels; meanwhile, the step size of the first convolutional layer is set to 1, so that the output size of the last feature map is kept to be 7 × 7; in addition, the path where the identity mapping is located is set as an average pooling with step size of 2 followed by a 1 × 1 convolution with step size of 1 to prevent information loss; finally, the convolution layer with the size of 7 multiplied by 7 is adopted to replace an average pooling layer, and the final face feature x is output _i ；

Step S14: let D = { D ₁ ,d ₂ ,...,d _N Is the face image in the test set, d _i For the ith human face image, E (-) is a deep convolutional neural network model for extracting depth features, x _i ＝E(d _i ) For the feature corresponding to the ith human face image, the depth feature x is used _i Taking dot product with the jth column of the last full connection layer W to obtain the fraction z of the jth category _i,j And inputting the classification probability P into a Softmax activation function _i,j The calculation formula is as follows:

wherein C is the total number of categories and k is the subscript of different categories;

step S15: let y _i Label corresponding to ith data，

As a depth feature x _i And the corresponding class weight vector->

Angle therebetween, by>

In connection with>

The point with the maximum rate of change in the function curve of (4) is taken as a reference point, and the point is compared with

Is correlated, i.e. when the dynamic interval parameter for the ith sample is set->

Thereafter, it is taken up>

In connection with>

Curve of function of (a) at theta _m The absolute value of the derivative reaches a maximum, where θ _m For reference points in which the derivative of the functional curve is greatest, a dynamic interval parameter>

The calculation formula of (a) is as follows:

where v is the corresponding scaling parameter used to prevent the classification probability from falling outside the desired range,

the total score of all other categories except the category of the user is obtained;

step S16: obtaining a classification probability P _i,j And dynamic interval parameter

Then, the predicted classification probability P is calculated by using a cross entropy loss function _i And true probability Q _i Difference between them and obtaining the loss value L _CE (x _i ) The calculation formula is as follows:

and then updating the network parameters by using a gradient descent and back propagation algorithm.

Further, the step S2 specifically includes the following steps:

step S21: taking the face recognition model trained in the step S1 as a feature extraction model, and extracting the depth feature x of each face image from the same training data set _i Outputting the corresponding last characteristic diagram as the input of the uncertainty module;

step S22: the uncertainty module is a shallow neural network model which comprises two full connection layers, relu is used as an activation function, a batch normalization layer is inserted between the full connection layers and the activation function for input normalization operation, and finally an exponential function is used as the activation function to output uncertainty sigma corresponding to each face image _i Which is related to the depth feature x _i Having the same dimension, representing the variance of the corresponding feature in the feature space;

step S23: calculating a mutual likelihood fraction s (x) between any two samples using a function _i ,x _j )：

Wherein

And &>

Respectively representing the values of the characteristic mean value mu and the characteristic variance sigma in the ith dimension, wherein h is the dimension of the human face characteristic; />

Step S24: calculating the final loss L by adopting the following function according to the distribution condition of the face images in one batch _pair ：

Where R is the set of face pairs of all the same person and s (-) is a computation function of mutual likelihood scores used to compute the mutual likelihood scores between two face pairs, the goal of the loss function being to maximize the mutual likelihood score values between all the face pairs of the same person.

Further, the specific method of step S3 is:

deep face feature x output by feature extraction network _i Reflects the most likely feature representation of the input face image, while the output σ of the uncertainty module _i Then it represents the uncertainty, σ, of the feature in each dimension _i Varies with the image quality, σ _i Reflecting the importance of the corresponding depth feature in the entire set of input video images, as a weight for depth feature x _i Performing weighted fusion, and obtaining the fused feature a _i The calculation is as follows:

wherein M is the number of samples in a batch;

and fusing the uncertainties corresponding to the features by adopting a minimum uncertainty method, namely, taking the minimum value of each dimension as a final vector for all uncertainty vectors in the set.

Further, in the step S4, for the input feature x _i And corresponding uncertainty σ _i Comparing by using the mutual likelihood scores, specifically comprising the following steps:

step S41: performing ten-fold cross validation on the trained model on a validation set to obtain final average accuracy, traversing possible thresholds on each fold, and taking the threshold which enables the final accuracy to be highest as a comparison threshold t;

step S42: let G = { G ₁ ,g ₂ ,...,g _M The feature x of a tested face image is taken as the face image in the database _i And the face image characteristics x of each person in G _j Comparing, and adopting a nearest neighbor and threshold value method as a judgment basis; for the face images in the database G and the test set D, extracting corresponding depth features x by using a trained feature extraction model and an uncertainty module _i And the corresponding uncertainty σ _i Calculating a mutual likelihood score, if the score is larger than a comparison threshold t, the person is regarded as the same person, otherwise, the person is regarded as a different person; and traversing each image in the database to obtain a final recognition result.

Compared with the prior art, the invention has the following beneficial effects:

1. the face recognition method and the face recognition device can effectively recognize the face in the video, improve the accuracy of face recognition and reduce the influence of image quality on face recognition.

2. The constraint can be gradually enhanced in the model training process, and the generalization of the features is improved.

3. Aiming at the problem that interval parameters are difficult to select in the traditional interval-based loss function, the loss function based on the dynamic interval is provided. The loss function does not need to adjust parameters of the interval, and can adaptively adjust the size of the interval according to different data sets and different network structures, so as to control the gradient size of each sample in a fine-grained manner. In addition, the constraint strength can be gradually increased in the training process along with the convergence of the model, so that the model can continuously receive effective gradients and update parameters, and the judgment of final characteristics is improved.

4. The method utilizes the uncertainty of the pre-training network learning characteristics, fuses set characteristics by the uncertainty, finally compares the fused characteristics by adopting mutual likelihood scores, and can effectively improve the face recognition effect in the non-limited scene.

Drawings

FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

As shown in fig. 1, the present invention provides a video face recognition method based on dynamic interval loss function and probability feature, comprising the following steps:

step S1: and training the recognition network through a face recognition training set. The method specifically comprises the following steps:

step S11: and acquiring a public face recognition training set from the network, and acquiring related labels of training data.

Step S12: outputting positions of a face bounding box and key points of a face by adopting a pre-trained Retina face detection model for face images in a face recognition training set, aligning the face by applying similarity transformation, subtracting a mean value of 127.5 from pixel values of all input face images, and dividing by 128 for normalization.

Step S13: adopting 18 layers of ResNet as a network model for extracting the depth features of the human face, and replacing the first 7 multiplied by 7 convolution kernels with 3 multiplied by 3 convolution kernels; meanwhile, the step size of the first convolutional layer is changed from 2 to 1, so that the output size of the last characteristic diagram is kept to be 7 × 7; in addition, the path where the identity mapping is located is changed into an average pooling with the step length of 2, and then is connected with a 1 × 1 convolution with the step length of 1, so that information loss is prevented; finally, the convolution layer with the size of 7 multiplied by 7 is adopted to replace an average pooling layer, and the final face feature x is output _i 。

Step S14: let D = { D ₁ ,d ₂ ,...,d _N The face image in the test set, d _i For the ith human face image, E (-) is a deep convolutional neural network model for extracting depth features, x _i ＝E(d _i ) For the feature corresponding to the ith human face image, the depth feature x is used _i Taking dot product with the jth column of the last full connection layer W to obtain the fraction z of the jth category _i,j And inputting the classification probability P into a Softmax activation function to generate a classification probability _i,j The calculation formula is as follows:

where C is the total number of categories and k is the subscript of the different categories.

Step S15: let y _i Is the label corresponding to the ith data,

as a depth feature x _i And the corresponding class weight vector->

Angle therebetween, by>

About>

Thereafter, it is taken up>

About>

Curve of function of (a) at theta _m The absolute value of the derivative reaches a maximum, where θ _m Reference point for maximizing the derivative of the functional curve, P (θ) _m ) Close to 0.5, in the early part of the training, is/are present>

Relatively large, in order to be able to provide suitable constraints on the optimization of the network, we limit the reference point θ _m Is less than pi/4, a dynamic interval parameter>

The calculation formula of (c) is as follows:

is the sum of scores of all other classes except the self classGenerally, one may be subtracted from the total number of categories.

Step S2: and training the uncertainty module by using the trained recognition network as a feature extraction module and through the same training set. The method specifically comprises the following steps:

step S21: taking the face recognition model trained in the step S1 as a feature extraction model, and extracting the depth feature x of each face image from the same training data set _i And outputting the corresponding last feature map as the input of the uncertainty module.

Step S22: the uncertainty module is a shallow neural network model and comprises two full connection layers, relu is used as an activation function, a batch normalization layer is inserted between the full connection layers and the activation function for input normalization operation, and finally an index function is used as the activation function to output uncertainty sigma corresponding to each face image _i Which is related to the depth feature x _i Having the same dimension, represent the variance of the corresponding feature in the feature space.

Wherein

And &>

Respectively representing the values of the characteristic mean value mu and the characteristic variance sigma in the ith dimension, and h is the dimension of the face characteristic; from the formula, it can be seen that if the depth feature x _i And x _j With a large uncertainty, the value of the mutual likelihood score will be low regardless of the distance between its features; the value of the mutual likelihood score will be high only if both inputs have little uncertainty and the corresponding means are very close.

Where R is the set of face pairs of all the same person and s (·,) is the computation function of the mutual likelihood scores that is used to compute the mutual likelihood scores between the two face pairs, the objective of the penalty function being to maximize the value of the mutual likelihood scores between all the face pairs of the same person.

And step S3: and aggregating the input video feature set by using the learned uncertainty as the importance degree of the features to obtain the aggregated features.

Deep face feature x output by feature extraction network _i Reflecting the most probable feature representation of the input face image, and the output sigma of the uncertainty module _i Then the uncertainty, σ, of the feature in each dimension is represented _i Varies with the image quality, σ _i Reflects the importance of the corresponding depth feature in the entire set of input video images and is therefore used as a weight for depth feature x _i Performing weighted fusion, the fused feature a _i Is calculated asThe following:

wherein M is the number of samples in a batch;

in order to compare the aggregated features in the testing stage, the uncertainty corresponding to the features is fused by adopting a minimum uncertainty method, namely, the minimum value of each dimension is taken as a final vector for all uncertainty vectors in the set.

And step S4: and comparing the aggregated features by adopting the mutual likelihood fraction instead of the cosine similarity to finish final identification.

In the testing phase, for the input feature x _i And corresponding uncertainty σ _i The mutual likelihood fraction is adopted to replace the cosine similarity for comparison, and the mutual likelihood fraction considers the influence of the quality of the input image on the characteristics at the same time, so that the influence of the poor image quality on the final recognition result can be more effectively inhibited; the method specifically comprises the following steps:

step S41: compared with cosine similarity, the value range of the mutual likelihood score is wider, so that the selection of the comparison threshold is more difficult. In order to effectively select the comparison threshold, the trained model is subjected to cross validation by ten folds on a validation set to obtain the final average accuracy, the possible thresholds are traversed on each fold, and the threshold which enables the final accuracy to be the highest is taken as the comparison threshold t.

Step S42: let G = { G ₁ ,g ₂ ,...,g _M The feature x of a tested face image is taken as the face image in the database _i And the face image characteristics x of each person in G _j Comparing, and adopting a nearest neighbor method and a threshold value method as a judgment basis; for the face images in the database G and the test set D, extracting corresponding depth features x by using a trained feature extraction model and an uncertainty module _i And the corresponding uncertainty σ _i Calculating a mutual likelihood score, if the score is larger than a comparison threshold t,the users are considered to be the same person, otherwise, the users are considered to be different persons; and traversing each image in the database to obtain a final identification result.

The above are preferred embodiments of the present invention, and all changes made according to the technical solutions of the present invention that produce functional effects do not exceed the scope of the technical solutions of the present invention belong to the protection scope of the present invention.

Claims

1. A video face recognition method based on dynamic interval loss function and probability characteristic is characterized by comprising the following steps:

step S1: training a face recognition network model through a face recognition training set;

step S2: adopting a trained face recognition network model as a feature extraction model, and training an uncertainty module through the same training set;

and step S3: aggregating the input video image set by using the learned uncertainty as the importance degree of the features to obtain the aggregated features;

and step S4: comparing the aggregated features by adopting a mutual likelihood score to complete final recognition;

the step S1 specifically includes the steps of:

step S13: adopting 18 layers of ResNet as a face recognition network model for face depth feature extraction, and replacing the first 7 x 7 convolution kernel by 3 x 3 convolution kernels; meanwhile, the step size of the first convolutional layer is set to 1, so that the output size of the last feature map is kept to be 7 × 7; the average pooling step size of the paths in which the identity maps are located is set to 21 × 1 convolution of 1 to prevent information loss; and finally, replacing the average pooling layer with the convolution layer with the size of 7 multiplied by 7, and outputting the final face depth feature x _i ；

Step S14: let D = { D ₁ ,d ₂ ,...,d _N The face image in the training set, d _i For the ith face image, E (-) is the face recognition network model for extracting depth features, x _i ＝E(d _i ) For the depth feature corresponding to the ith human face image, the depth feature x is used _i Taking dot product with the jth column of the last full connection layer W to obtain the fraction z of the jth category _i,j And inputting the classification probability P into a Softmax activation function to generate a classification probability _i,j The calculation formula is as follows:

step S15: let y _i The label corresponding to the ith human face image,

as a depth feature x _i And corresponding category weight vector>

Angle therebetween, by>

About>

The point with the maximum rate of change in the function curve of (2) is taken as a reference point, and the point is compared with

Correlation, i.e. when the dynamic interval parameter of the ith face image is set/>

Thereafter, it is taken up>

About>

Curve of function of (a) at theta _m The absolute value of the derivative reaches a maximum, where θ _m The dynamic spacing parameter @, which is the reference point at which the derivative of the functional curve is maximized>

The calculation formula of (a) is as follows:

Then, the predicted classification probability P is calculated by using a cross entropy loss function _i,j And true probability Q _i,j Difference between them and obtaining the loss value L _CE (x _i ) The calculation formula is as follows:

then updating network parameters by using a gradient descent and back propagation algorithm;

the step S2 specifically includes the following steps:

step S21: taking the face recognition network model trained in the step S1 as a feature extraction model, extracting the depth feature of each face image on the same training set, and outputting the last feature image corresponding to the depth feature of each face image as the input of an uncertainty module;

step S22: the uncertainty module is a shallow neural network model, comprises two full connection layers, adopts Relu as an activation function, inserts a batch normalization layer between the full connection layers and the activation function for input normalization operation, and finally adopts an exponential function as the activation function to output the uncertainty corresponding to each face image, wherein the uncertainty has the same dimensionality with the depth feature and represents the variance of the corresponding feature in a feature space;

step S23: calculating a mutual likelihood fraction s (x) between any two samples using a function _i ,x _b )：

Wherein

And &>

Respectively representing depth features x _i And uncertainty σ _i H is the dimension of the face feature;

2. The video face recognition method based on the dynamic interval loss function and the probability feature of claim 1, wherein the specific method of the step S3 is as follows:

depth feature x output by face recognition network model _k Reflecting the most probable feature representation of the input face image, and the output sigma of the uncertainty module _k Then the uncertainty, σ, of the feature in each dimension is represented _k Varies with the image quality, σ _k Reflecting the importance of the corresponding depth feature in the entire set of input video images, as a weight for depth feature x _k Carrying out weighted aggregation, feature a after aggregation _f The calculation is as follows:

wherein M is the number of samples in a batch;

and aggregating the uncertainties corresponding to the features by adopting a minimum uncertainty method, namely taking the minimum value of each dimension as a final vector for all uncertainty vectors in the set.

3. The video face recognition method based on the dynamic interval loss function and the probability feature of claim 2, wherein the step S4 specifically comprises the following steps:

step S42: let G = { G ₁ ,g ₂ ,...,g _Z Comparing a tested face depth feature with the face depth features of each person in the G, and adopting a nearest neighbor and threshold value method as a judgment basis; for the face images in the database G and the test set D, extracting corresponding depth features and corresponding uncertainties by using a trained feature extraction model and an uncertainty module, calculating mutual likelihood scores of the aggregated features, and if the scores are greater than a comparison threshold t, the features are considered as the same person, otherwise, the features are considered as different persons; and traversing each image in the database to obtain a final recognition result.