CN111339988A - Video face recognition method based on dynamic interval loss function and probability characteristic - Google Patents

Video face recognition method based on dynamic interval loss function and probability characteristic

Info

Publication number
CN111339988A
Authority
CN
China
Prior art keywords
feature
face
uncertainty
function
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010166807.8A
Other languages
Chinese (zh)
Other versions
CN111339988B (en)
Inventor
柯逍
郑毅腾
朱敏琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202010166807.8A priority Critical patent/CN111339988B/en
Publication of CN111339988A publication Critical patent/CN111339988A/en
Application granted granted Critical
Publication of CN111339988B publication Critical patent/CN111339988B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video face recognition method based on a dynamic interval loss function and probability features, which comprises the following steps: step S1: training a recognition network on a face recognition training set; step S2: using the trained recognition network as a feature extraction module and training an uncertainty module on the same training set; step S3: aggregating the feature set of the input video, using the learned uncertainty as the importance of each feature, to obtain aggregated features; step S4: comparing the aggregated features using the mutual likelihood score to complete the final recognition. The method can effectively recognize faces in video.

Description

Video face recognition method based on dynamic interval loss function and probability characteristic
Technical Field
The invention relates to the fields of pattern recognition and computer vision, and in particular to a video face recognition method based on a dynamic interval (margin) loss function and probability features.
Background
In recent years, deep convolutional neural networks have achieved great success in computer vision, and face recognition methods based on deep learning have exploited their strength in feature extraction, setting new records on public data sets and developing rapidly. An increasing number of papers related to face recognition are also published at the major computer vision conferences. Because face recognition has broad applications and great commercial value, new face recognition techniques are continuously explored in both academia and industry; in recent years, aided by the breakthroughs of deep learning and convolutional neural networks in computer vision, face recognition algorithms have repeatedly refreshed records on various public benchmark data sets and produced a number of standard products in industry.
Although face recognition technology has advanced greatly, it still faces many challenges in real environments, where factors such as illumination, pose, occlusion, and age affect recognition performance.
Disclosure of Invention
The invention aims to provide a video face recognition method based on a dynamic interval loss function and probability characteristics, which can effectively recognize faces in a video.
In order to achieve the purpose, the invention adopts the technical scheme that: a video face recognition method based on a dynamic interval loss function and probability characteristics comprises the following steps:
step S1: training a recognition network through a face recognition training set;
step S2: adopting a trained recognition network as a feature extraction module, and training an uncertainty module through the same training set;
step S3: aggregating the input video feature set by using the learned uncertainty as the importance degree of the features to obtain aggregated features;
step S4: comparing the aggregated features using the mutual likelihood score to complete the final recognition.
Further, the step S1 specifically includes the following steps:
step S11: acquiring a public face recognition training set from a network, and acquiring related labels of training data;
step S12: for the face images in the face recognition training set, outputting the face bounding box and the positions of the facial key points with a pre-trained RetinaFace detection model, aligning the faces by applying a similarity transformation, and normalizing all input face images by subtracting the mean from their pixel values;
step S13: adopting an 18-layer ResNet as the network model for extracting deep face features, replacing the first 7 × 7 convolution kernel with three 3 × 3 convolution kernels, setting the stride of the first convolution layer to 1 so that the last feature map keeps an output size of 7 × 7, setting the identity-mapping (shortcut) path to an average pooling of stride 2 followed by a 1 × 1 convolution of stride 1 to prevent information loss, and finally replacing the average pooling layer with a 7 × 7 convolution layer to output the final face feature x_i;
step S14: let D = {d_1, d_2, ..., d_N} be the face images in the test set, d_i the i-th face image, E(·) the deep convolutional neural network model used to extract depth features, and x_i = E(d_i) the feature of the i-th face image; the depth feature x_i is dot-multiplied with the j-th column of the last fully connected layer W to obtain the score z_{i,j} of the j-th category, which is fed into the Softmax activation function to produce the classification probability P_{i,j}, computed as

P_{i,j} = \frac{e^{z_{i,j}}}{\sum_{k=1}^{C} e^{z_{i,k}}}

wherein C is the total number of categories and k indexes the categories;
step S15: let y_i be the label of the i-th sample and θ_{y_i} the angle between the depth feature x_i and the corresponding class weight vector W_{y_i}; the point of maximum rate of change on the curve of P_{i,y_i} as a function of θ_{y_i} is taken as the reference point and tied to the dynamic interval parameter of the i-th sample (denoted m_i), i.e. once m_i is set, the absolute value of the derivative of the curve of P_{i,y_i} with respect to θ_{y_i} reaches its maximum at θ_m, where θ_m is the reference point that maximizes the derivative of the curve; the dynamic interval parameter m_i is computed by the formula given in the original publication (reproduced there only as an image), in which v is the corresponding scaling parameter used to prevent the classification probability from falling outside the desired range and the remaining term is the total score of all categories other than the sample's own category;
step S16: after obtaining the classification probability P_{i,j} and the dynamic interval parameter m_i, the difference between the predicted classification probability P_i and the true probability Q_i is computed with the cross-entropy loss function, yielding the loss value

L_{CE}(x_i) = -\sum_{j=1}^{C} Q_{i,j} \log P_{i,j}

where Q_{i,j} equals 1 when j = y_i and 0 otherwise;
and then updating the network parameters by using a gradient descent and back propagation algorithm.
Further, the step S2 specifically includes the following steps:
step S21: taking the face recognition model trained in step S1 as the feature extraction model, extracting the depth feature x_i of each face image from the same training data set, and outputting the corresponding last feature map as the input of the uncertainty module;
step S22: the uncertainty module is a shallow neural network comprising two fully connected layers with ReLU as the activation function; a batch normalization layer is inserted between each fully connected layer and the activation function to normalize its input, and an exponential function is used as the final activation to output the uncertainty σ_i corresponding to each face image, which has the same dimension as the depth feature x_i and represents the variance of the corresponding feature in the feature space;
step S23: computing the mutual likelihood score s(x_i, x_j) between any two samples as

s(x_i, x_j) = -\frac{1}{2} \sum_{l=1}^{h} \left( \frac{\left(\mu_i^{(l)} - \mu_j^{(l)}\right)^2}{\sigma_i^{(l)} + \sigma_j^{(l)}} + \log\left(\sigma_i^{(l)} + \sigma_j^{(l)}\right) \right) - \frac{h}{2}\log 2\pi

where \mu_i^{(l)} and \sigma_i^{(l)} denote the values of the feature mean μ and the feature variance σ of sample i in the l-th dimension, and h is the dimensionality of the face feature;
step S24: computing the final loss L_pair over the face images within one batch as

L_{pair} = -\frac{1}{|R|} \sum_{(i,j) \in R} s(x_i, x_j)

where R is the set of all face pairs belonging to the same person and s(·,·) is the mutual likelihood score function computed between the two faces of a pair; the goal of this loss function is to maximize the mutual likelihood scores between all face pairs of the same person.
Further, the specific method of step S3 is as follows:
the deep face feature x_i output by the feature extraction network reflects the most likely feature representation of the input face image, while the output σ_i of the uncertainty module represents the uncertainty of that feature in each dimension; σ_i varies with image quality and reflects the importance of the corresponding depth feature within the whole set of input video frames, and is therefore used as the weight with which the depth features x_i are fused into the aggregated feature a, according to the fusion formula given in the original publication (reproduced there only as an image), in which M is the number of samples in a batch;
and fusing the uncertainties corresponding to the features with a minimum-uncertainty method, i.e., taking, over all uncertainty vectors in the set, the minimum value in each dimension to form the final vector.
Further, in step S4, the input feature x_i and the corresponding uncertainty σ_i are compared using the mutual likelihood score, which specifically comprises the following steps:
step S41: performing ten-fold cross validation on the trained model on a validation set to obtain final average accuracy, traversing possible thresholds on each fold, and taking the threshold with the highest final accuracy as a comparison threshold t;
step S42: let G = {g_1, g_2, ..., g_M} be the face images in the database; the feature x_i of a probe face image is compared with the face image feature x_j of each person in G, with the nearest neighbor method and a threshold as the decision rule; for the face images in the database G and the test set D, the trained feature extraction model and uncertainty module are used to extract the corresponding depth features x_i and uncertainties σ_i and to compute the mutual likelihood score; if the score is greater than the comparison threshold t, the two images are considered to show the same person, otherwise different persons; each image in the database is traversed to obtain the final recognition result.
Compared with the prior art, the invention has the following beneficial effects:
1. The method can effectively recognize faces in video, improves the accuracy of face recognition, and reduces the influence of image quality on recognition.
2. The constraint can be gradually enhanced in the model training process, and the generalization of the features is improved.
3. To address the difficulty of selecting the interval (margin) parameter in traditional interval-based loss functions, a loss function based on a dynamic interval is provided. This loss function requires no tuning of the interval parameter and adaptively adjusts the size of the interval for different data sets and network structures, controlling the gradient magnitude of each sample in a fine-grained manner. In addition, the constraint strength gradually increases during training as the model converges, so that the model keeps receiving effective gradients and updating its parameters, improving the discriminability of the final features.
4. The method learns the uncertainty of the features on top of a pre-trained network, fuses the set features according to this uncertainty, and finally compares the fused features with the mutual likelihood score, which effectively improves face recognition in unconstrained scenarios.
Drawings
FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present invention provides a video face recognition method based on a dynamic interval loss function and probability features, which includes the following steps:
step S1: the recognition network is trained through a face recognition training set. The method specifically comprises the following steps:
step S11: and acquiring a public face recognition training set from the network, and acquiring related labels of training data.
Step S12: for the face images in the face recognition training set, the face bounding box and the positions of the facial key points are output by a pre-trained RetinaFace detection model, the faces are aligned by applying a similarity transformation, and all input face images are normalized by subtracting the mean value of 127.5 from their pixel values and dividing by 128.
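The preprocessing of step S12 can be sketched in Python as follows. The five landmarks are assumed to come from the face detector (e.g. a pre-trained RetinaFace model); the 112 × 112 crop size, the reference template coordinates, and the use of scikit-image and OpenCV are illustrative assumptions, since the patent does not specify them.

```python
import cv2
import numpy as np
from skimage.transform import SimilarityTransform

# Assumed 5-point reference template for a 112x112 crop (the widely used
# ArcFace template); the patent does not list its own coordinates.
TEMPLATE = np.array([[38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
                     [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)

def align_and_normalize(image, landmarks):
    """Align a detected face with a similarity transform and normalize pixels.

    `landmarks` is a (5, 2) array of facial key points from the detector.
    Pixel values are shifted by the mean 127.5 and divided by 128 (step S12).
    """
    tform = SimilarityTransform()
    tform.estimate(np.asarray(landmarks, dtype=np.float32), TEMPLATE)
    aligned = cv2.warpAffine(image, tform.params[:2], (112, 112))
    return (aligned.astype(np.float32) - 127.5) / 128.0
```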
Step S13: an 18-layer ResNet is adopted as the network model for extracting deep face features; the first 7 × 7 convolution kernel is replaced with three 3 × 3 convolution kernels, the stride of the first convolution layer is changed from 2 to 1 so that the last feature map keeps an output size of 7 × 7, the identity-mapping (shortcut) path is changed to an average pooling of stride 2 followed by a 1 × 1 convolution of stride 1 to prevent information loss, and finally the average pooling layer is replaced with a 7 × 7 convolution layer that outputs the final face feature x_i.
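A minimal PyTorch sketch of how these modifications could be applied to torchvision's ResNet-18 is given below; the intermediate channel widths of the stacked 3 × 3 convolutions and the 512-dimensional feature size are assumptions, as the patent does not specify them.

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_backbone(feat_dim=512):
    net = resnet18()
    # Replace the first 7x7 convolution with three stacked 3x3 convolutions of
    # stride 1, so a 112x112 input still yields a 7x7 map after layer4.
    net.conv1 = nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(32), nn.ReLU(inplace=True),
        nn.Conv2d(32, 32, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(32), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, 3, stride=1, padding=1, bias=False),
    )
    # Identity-mapping (shortcut) path: average pooling of stride 2 followed by
    # a stride-1 1x1 convolution instead of a stride-2 1x1 convolution.
    for layer in (net.layer2, net.layer3, net.layer4):
        block = layer[0]
        old_conv, old_bn = block.downsample[0], block.downsample[1]
        block.downsample = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(old_conv.in_channels, old_conv.out_channels,
                      kernel_size=1, stride=1, bias=False),
            old_bn,
        )
    # Replace global average pooling with a 7x7 convolution that maps the
    # 512x7x7 map to the final face feature x_i.
    net.avgpool = nn.Sequential(
        nn.Conv2d(512, feat_dim, kernel_size=7, bias=False),
        nn.BatchNorm2d(feat_dim),
    )
    net.fc = nn.Identity()   # flatten in the parent forward yields (N, feat_dim)
    return net
```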
Step S14: let D = {d_1, d_2, ..., d_N} be the face images in the test set, d_i the i-th face image, E(·) the deep convolutional neural network model used to extract depth features, and x_i = E(d_i) the feature of the i-th face image. The depth feature x_i is dot-multiplied with the j-th column of the last fully connected layer W to obtain the score z_{i,j} of the j-th category, which is fed into the Softmax activation function to produce the classification probability P_{i,j}:

P_{i,j} = \frac{e^{z_{i,j}}}{\sum_{k=1}^{C} e^{z_{i,k}}}

where C is the total number of categories and k indexes the categories.
Step S15: let y_i be the label of the i-th sample and θ_{y_i} the angle between the depth feature x_i and the corresponding class weight vector W_{y_i}. The point of maximum rate of change on the curve of P_{i,y_i} as a function of θ_{y_i} is taken as the reference point and tied to the dynamic interval parameter of the i-th sample (denoted m_i): once m_i is set, the absolute value of the derivative of P_{i,y_i} with respect to θ_{y_i} reaches its maximum at θ_m, where θ_m is the reference point that maximizes the derivative of the curve and P(θ_m) is close to 0.5. In the early stages of training the angle θ_{y_i} is relatively large, so to provide a suitable constraint on the optimization of the network the reference point θ_m is limited to be smaller than π/4. The dynamic interval parameter m_i is then computed by the formula given in the original publication (reproduced there only as an image), in which v is the corresponding scaling parameter used to prevent the classification probability from falling outside the desired range and the remaining term is the sum of the scores of all categories other than the sample's own category, which can generally be taken as the total number of categories minus one.
Step S16: after obtaining the classification probability P_{i,j} and the dynamic interval parameter m_i, the difference between the predicted classification probability P_i and the true probability Q_i is computed with the cross-entropy loss function, yielding the loss value

L_{CE}(x_i) = -\sum_{j=1}^{C} Q_{i,j} \log P_{i,j}

where Q_{i,j} equals 1 when j = y_i and 0 otherwise.
and then updating the network parameters by using a gradient descent and back propagation algorithm.
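The structure of steps S14–S16 can be sketched as one margin-based softmax training step. The sketch below follows the common angular-margin formulation (normalized features and weights, a per-sample margin added to the target-class angle, scaling by v, then cross-entropy); the patent's dynamic-margin formula is reproduced only as an image, so `margin_fn` is a hypothetical callback standing in for it, and the whole block should be read as an assumption-laden illustration rather than the patented formula itself.

```python
import torch
import torch.nn.functional as F

def dynamic_margin_softmax_loss(features, weight, labels, scale, margin_fn):
    """Cross-entropy over scaled cosine logits with a per-sample angular margin.

    features: (N, h) backbone outputs x_i; weight: (C, h) last fully connected
    layer W; scale: scaling parameter v; margin_fn: placeholder returning one
    dynamic margin m_i per sample (the patent's formula is not reproduced here).
    """
    x = F.normalize(features, dim=1)
    w = F.normalize(weight, dim=1)
    cos_theta = x @ w.t()                                    # (N, C) cosine scores
    target_cos = cos_theta.gather(1, labels[:, None]).clamp(-1 + 1e-7, 1 - 1e-7)
    theta_y = torch.acos(target_cos).squeeze(1)              # angle to own class
    m = margin_fn(theta_y, cos_theta, labels)                # dynamic margin m_i
    one_hot = F.one_hot(labels, num_classes=cos_theta.size(1)).bool()
    target_logit = scale * torch.cos(theta_y + m)            # margin on own class
    logits = torch.where(one_hot, target_logit[:, None], scale * cos_theta)
    return F.cross_entropy(logits, labels)                   # L_CE over the batch
```

Calling `loss.backward()` on the returned value and then `optimizer.step()` realizes the gradient-descent and back-propagation update described in step S16.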
Step S2: and training the uncertainty module by using the trained recognition network as a feature extraction module and through the same training set. The method specifically comprises the following steps:
Step S21: the face recognition model trained in step S1 is taken as the feature extraction model, the depth feature x_i of each face image is extracted from the same training data set, and the corresponding last feature map is output as the input of the uncertainty module.
Step S22: the uncertainty module is a shallow neural network comprising two fully connected layers with ReLU as the activation function; a batch normalization layer is inserted between each fully connected layer and the activation function to normalize its input, and an exponential function is used as the final activation to output the uncertainty σ_i corresponding to each face image, which has the same dimension as the depth feature x_i and represents the variance of the corresponding feature in the feature space.
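A minimal PyTorch sketch of such an uncertainty head is given below; the layer widths and the flattened 512 × 7 × 7 input size are assumptions, since the patent only fixes the two fully connected layers, the batch normalization, the ReLU activation, and the exponential output.

```python
import torch
import torch.nn as nn

class UncertaintyModule(nn.Module):
    """Shallow head predicting a per-dimension variance for each face feature."""

    def __init__(self, in_dim=512 * 7 * 7, feat_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, feat_dim)
        self.bn1 = nn.BatchNorm1d(feat_dim)   # batch norm between FC and activation
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(feat_dim, feat_dim)
        self.bn2 = nn.BatchNorm1d(feat_dim)

    def forward(self, feature_map):
        h = feature_map.flatten(1)            # last conv feature map, flattened
        h = self.relu(self.bn1(self.fc1(h)))
        log_var = self.bn2(self.fc2(h))
        return torch.exp(log_var)             # uncertainty sigma_i, same size as x_i
```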
Step S23: the mutual likelihood score s(x_i, x_j) between any two samples is computed as

s(x_i, x_j) = -\frac{1}{2} \sum_{l=1}^{h} \left( \frac{\left(\mu_i^{(l)} - \mu_j^{(l)}\right)^2}{\sigma_i^{(l)} + \sigma_j^{(l)}} + \log\left(\sigma_i^{(l)} + \sigma_j^{(l)}\right) \right) - \frac{h}{2}\log 2\pi

where \mu_i^{(l)} and \sigma_i^{(l)} denote the values of the feature mean μ and the feature variance σ of sample i in the l-th dimension, and h is the dimensionality of the face feature. As the formula shows, if the depth features x_i and x_j carry large uncertainty, the mutual likelihood score will be low regardless of the distance between the features; the score is high only when both inputs have little uncertainty and the corresponding means are very close.
Step S24: according to the face images within one batch, the final loss L_pair is computed as

L_{pair} = -\frac{1}{|R|} \sum_{(i,j) \in R} s(x_i, x_j)

where R is the set of all face pairs belonging to the same person and s(·,·) is the mutual likelihood score function computed between the two faces of a pair; the goal of this loss function is to maximize the mutual likelihood scores between all face pairs of the same person.
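Steps S23 and S24 can be sketched in PyTorch as follows. Enumerating the genuine (same-person) pairs inside a batch via the labels is an assumption about one reasonable implementation, and the constant term of the mutual likelihood score is dropped because it does not affect comparisons.

```python
import torch

def mutual_likelihood_score(mu_1, var_1, mu_2, var_2):
    """Mutual likelihood score between probabilistic embeddings (constant dropped).

    mu_*: (N, h) feature means; var_*: (N, h) per-dimension variances.
    """
    s2 = var_1 + var_2
    return -0.5 * (((mu_1 - mu_2) ** 2) / s2 + torch.log(s2)).sum(dim=1)

def pair_loss(mu, var, labels):
    """L_pair: negative mean mutual likelihood score over all genuine pairs."""
    i, j = torch.triu_indices(len(labels), len(labels), offset=1)
    genuine = labels[i] == labels[j]                  # keep same-person pairs only
    i, j = i[genuine], j[genuine]
    scores = mutual_likelihood_score(mu[i], var[i], mu[j], var[j])
    return -scores.mean()
```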
Step S3: and aggregating the input video feature set by using the learned uncertainty as the importance degree of the features to obtain the aggregated features.
The deep face feature x_i output by the feature extraction network reflects the most likely feature representation of the input face image, while the output σ_i of the uncertainty module represents the uncertainty of that feature in each dimension. σ_i varies with image quality and reflects the importance of the corresponding depth feature within the whole set of input video frames, and is therefore used as the weight with which the depth features x_i are fused into the aggregated feature a, according to the fusion formula given in the original publication (reproduced there only as an image), in which M is the number of samples in a batch.
in order to compare the aggregated features in the testing stage, the uncertainty corresponding to the features is fused by adopting a minimum uncertainty method, namely, the minimum value of each dimension is taken as a final vector for all uncertainty vectors in the set.
Step S4: the aggregated features are compared using the mutual likelihood score instead of the cosine similarity to complete the final recognition.
In the testing phase, the input feature x_i and the corresponding uncertainty σ_i are compared using the mutual likelihood score instead of the cosine similarity; because the mutual likelihood score also accounts for the influence of input image quality on the features, it suppresses the effect of poor image quality on the final recognition result more effectively. The comparison specifically comprises the following steps:
Step S41: compared with the cosine similarity, the mutual likelihood score has a wider range of values, which makes selecting the comparison threshold more difficult. To select it effectively, the trained model is evaluated with ten-fold cross validation on a validation set to obtain the final average accuracy; the possible thresholds are traversed on each fold, and the threshold that yields the highest final accuracy is taken as the comparison threshold t.
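A small NumPy sketch of this threshold search is given below; the exact cross-validation protocol, in particular averaging the per-fold best thresholds into the final t, is an assumption, since the patent only states that the candidate thresholds are traversed on each fold.

```python
import numpy as np

def choose_threshold(scores, same_person, n_folds=10, seed=0):
    """Ten-fold search for the comparison threshold t (step S41).

    scores: (P,) mutual likelihood scores of verification pairs;
    same_person: (P,) boolean ground truth for those pairs.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(scores)), n_folds)
    best_ts, accs = [], []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        cands = np.unique(scores[train])                      # thresholds to traverse
        acc = [np.mean((scores[train] > t) == same_person[train]) for t in cands]
        t_best = cands[int(np.argmax(acc))]
        best_ts.append(t_best)
        accs.append(np.mean((scores[test] > t_best) == same_person[test]))
    return float(np.mean(best_ts)), float(np.mean(accs))      # threshold t, mean accuracy
```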
Step S42: let G be { G ═ G1,g2,...,gMThe feature x of a tested face image is taken as the face image in the databaseiAnd the facial image characteristics x of each person in GjComparing, and adopting a nearest neighbor method and a threshold value method as a judgment basis; for the face images in the database G and the test set D, extracting corresponding depth features x by using a trained feature extraction model and an uncertainty moduleiAnd the corresponding uncertainty σiCalculating a mutual likelihood score, and if the score is greater than a comparison threshold t, the person is considered to be the same person, otherwise, the person is considered to be different; and traversing each image in the database to obtain a final recognition result.
The above are preferred embodiments of the present invention; all changes made according to the technical scheme of the present invention that produce its functional effects without exceeding the scope of the technical scheme belong to the protection scope of the present invention.

Claims (5)

1. A video face recognition method based on a dynamic interval loss function and probability features is characterized by comprising the following steps:
step S1: training a recognition network through a face recognition training set;
step S2: adopting a trained recognition network as a feature extraction module, and training an uncertainty module through the same training set;
step S3: aggregating the input video feature set by using the learned uncertainty as the importance degree of the features to obtain aggregated features;
step S4: comparing the aggregated features using the mutual likelihood score to complete the final recognition.
2. The video face recognition method based on the dynamic interval loss function and the probability feature of claim 1, wherein the step S1 specifically includes the following steps:
step S11: acquiring a public face recognition training set from a network, and acquiring related labels of training data;
step S12: for the face images in the face recognition training set, outputting the face bounding box and the positions of the facial key points with a pre-trained RetinaFace detection model, aligning the faces by applying a similarity transformation, and normalizing all input face images by subtracting the mean from their pixel values;
step S13: adopting an 18-layer ResNet as the network model for extracting deep face features, replacing the first 7 × 7 convolution kernel with three 3 × 3 convolution kernels, setting the stride of the first convolution layer to 1 so that the last feature map keeps an output size of 7 × 7, setting the identity-mapping (shortcut) path to an average pooling of stride 2 followed by a 1 × 1 convolution of stride 1 to prevent information loss, and finally replacing the average pooling layer with a 7 × 7 convolution layer to output the final face feature x_i;
step S14: let D = {d_1, d_2, ..., d_N} be the face images in the test set, d_i the i-th face image, E(·) the deep convolutional neural network model used to extract depth features, and x_i = E(d_i) the feature of the i-th face image; the depth feature x_i is dot-multiplied with the j-th column of the last fully connected layer W to obtain the score z_{i,j} of the j-th category, which is fed into the Softmax activation function to produce the classification probability P_{i,j}, computed as

P_{i,j} = \frac{e^{z_{i,j}}}{\sum_{k=1}^{C} e^{z_{i,k}}}

wherein C is the total number of categories and k indexes the categories;
step S15: let y_i be the label of the i-th sample and θ_{y_i} the angle between the depth feature x_i and the corresponding class weight vector W_{y_i}; the point of maximum rate of change on the curve of P_{i,y_i} as a function of θ_{y_i} is taken as the reference point and tied to the dynamic interval parameter of the i-th sample (denoted m_i), i.e. once m_i is set, the absolute value of the derivative of the curve of P_{i,y_i} with respect to θ_{y_i} reaches its maximum at θ_m, where θ_m is the reference point that maximizes the derivative of the curve; the dynamic interval parameter m_i is computed by the formula given in the original publication (reproduced there only as an image), in which v is the corresponding scaling parameter used to prevent the classification probability from falling outside the desired range and the remaining term is the total score of all categories other than the sample's own category;
step S16: after obtaining the classification probability P_{i,j} and the dynamic interval parameter m_i, the difference between the predicted classification probability P_i and the true probability Q_i is computed with the cross-entropy loss function, yielding the loss value

L_{CE}(x_i) = -\sum_{j=1}^{C} Q_{i,j} \log P_{i,j}

where Q_{i,j} equals 1 when j = y_i and 0 otherwise;
and then updating the network parameters by using a gradient descent and back propagation algorithm.
3. The video face recognition method based on the dynamic interval loss function and the probability feature of claim 2, wherein the step S2 specifically includes the following steps:
step S21: taking the face recognition model trained in step S1 as the feature extraction model, extracting the depth feature x_i of each face image from the same training data set, and outputting the corresponding last feature map as the input of the uncertainty module;
step S22: the uncertainty module is a shallow neural network comprising two fully connected layers with ReLU as the activation function; a batch normalization layer is inserted between each fully connected layer and the activation function to normalize its input, and an exponential function is used as the final activation to output the uncertainty σ_i corresponding to each face image, which has the same dimension as the depth feature x_i and represents the variance of the corresponding feature in the feature space;
step S23: computing the mutual likelihood score s(x_i, x_j) between any two samples as

s(x_i, x_j) = -\frac{1}{2} \sum_{l=1}^{h} \left( \frac{\left(\mu_i^{(l)} - \mu_j^{(l)}\right)^2}{\sigma_i^{(l)} + \sigma_j^{(l)}} + \log\left(\sigma_i^{(l)} + \sigma_j^{(l)}\right) \right) - \frac{h}{2}\log 2\pi

where \mu_i^{(l)} and \sigma_i^{(l)} denote the values of the feature mean μ and the feature variance σ of sample i in the l-th dimension, and h is the dimensionality of the face feature;
step S24: computing the final loss L_pair over the face images within one batch as

L_{pair} = -\frac{1}{|R|} \sum_{(i,j) \in R} s(x_i, x_j)

where R is the set of all face pairs belonging to the same person and s(·,·) is the mutual likelihood score function computed between the two faces of a pair; the goal of this loss function is to maximize the mutual likelihood scores between all face pairs of the same person.
4. The video face recognition method based on the dynamic interval loss function and the probability feature of claim 3, wherein the specific method of the step S3 is as follows:
the deep face feature x_i output by the feature extraction network reflects the most likely feature representation of the input face image, while the output σ_i of the uncertainty module represents the uncertainty of that feature in each dimension; σ_i varies with image quality and reflects the importance of the corresponding depth feature within the whole set of input video frames, and is therefore used as the weight with which the depth features x_i are fused into the aggregated feature a, according to the fusion formula given in the original publication (reproduced there only as an image), in which M is the number of samples in a batch;
and fusing the uncertainties corresponding to the features with a minimum-uncertainty method, i.e., taking, over all uncertainty vectors in the set, the minimum value in each dimension to form the final vector.
5. The video face recognition method based on the dynamic interval loss function and the probability feature of claim 4, wherein in step S4 the input feature x_i and the corresponding uncertainty σ_i are compared using the mutual likelihood score, which specifically comprises the following steps:
step S41: performing ten-fold cross validation on the trained model on a validation set to obtain final average accuracy, traversing possible thresholds on each fold, and taking the threshold with the highest final accuracy as a comparison threshold t;
step S42: let G = {g_1, g_2, ..., g_M} be the face images in the database; the feature x_i of a probe face image is compared with the face image feature x_j of each person in G, with the nearest neighbor method and a threshold as the decision rule; for the face images in the database G and the test set D, the trained feature extraction model and uncertainty module are used to extract the corresponding depth features x_i and uncertainties σ_i and to compute the mutual likelihood score; if the score is greater than the comparison threshold t, the two images are considered to show the same person, otherwise different persons; each image in the database is traversed to obtain the final recognition result.
CN202010166807.8A 2020-03-11 2020-03-11 Video face recognition method based on dynamic interval loss function and probability characteristic Active CN111339988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010166807.8A CN111339988B (en) 2020-03-11 2020-03-11 Video face recognition method based on dynamic interval loss function and probability characteristic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010166807.8A CN111339988B (en) 2020-03-11 2020-03-11 Video face recognition method based on dynamic interval loss function and probability characteristic

Publications (2)

Publication Number Publication Date
CN111339988A true CN111339988A (en) 2020-06-26
CN111339988B CN111339988B (en) 2023-04-07

Family

ID=71182200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010166807.8A Active CN111339988B (en) 2020-03-11 2020-03-11 Video face recognition method based on dynamic interval loss function and probability characteristic

Country Status (1)

Country Link
CN (1) CN111339988B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116547A (en) * 2020-08-19 2020-12-22 南京航空航天大学 Feature map aggregation method for unconstrained video face recognition
CN112906810A (en) * 2021-03-08 2021-06-04 共达地创新技术(深圳)有限公司 Object detection method, electronic device, and storage medium
CN113033345A (en) * 2021-03-10 2021-06-25 南京航空航天大学 V2V video face recognition method based on public feature subspace
CN113205082A (en) * 2021-06-22 2021-08-03 中国科学院自动化研究所 Robust iris identification method based on acquisition uncertainty decoupling
CN113239866A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN113378660A (en) * 2021-05-25 2021-09-10 广州紫为云科技有限公司 Low-data-cost face recognition method and device
CN113688708A (en) * 2021-08-12 2021-11-23 北京数美时代科技有限公司 Face recognition method, system and storage medium based on probability characteristics
CN113705647A (en) * 2021-08-19 2021-11-26 电子科技大学 Dynamic interval-based dual semantic feature extraction method
CN113792701A (en) * 2021-09-24 2021-12-14 北京市商汤科技开发有限公司 Living body detection method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103281A (en) * 2017-03-10 2017-08-29 中山大学 Face identification method based on aggregation Damage degree metric learning
CN109815801A (en) * 2018-12-18 2019-05-28 北京英索科技发展有限公司 Face identification method and device based on deep learning
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103281A (en) * 2017-03-10 2017-08-29 中山大学 Face identification method based on aggregation Damage degree metric learning
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN109815801A (en) * 2018-12-18 2019-05-28 北京英索科技发展有限公司 Face identification method and device based on deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MENGYI LIU ET AL.: "Learning Expressionlets via Universal Manifold Model for Dynamic Facial Expression Recognition", 《 IEEE TRANSACTIONS ON IMAGE PROCESSING》 *
王锟朋 et al.: "Face clustering algorithm based on additive margin Softmax features", Computer Applications and Software *
章东平 et al.: "Deep learning face recognition based on an improved additive cosine margin loss function", Chinese Journal of Sensors and Actuators *
罗瑜: "Research on the application of support vector machines in machine learning", China Doctoral Dissertations Full-text Database, Information Science and Technology series *
高翔 et al.: "Person semantic recognition model based on deep learning of video scenes", Computer Technology and Development *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116547A (en) * 2020-08-19 2020-12-22 南京航空航天大学 Feature map aggregation method for unconstrained video face recognition
CN112906810A (en) * 2021-03-08 2021-06-04 共达地创新技术(深圳)有限公司 Object detection method, electronic device, and storage medium
CN112906810B (en) * 2021-03-08 2024-04-16 共达地创新技术(深圳)有限公司 Target detection method, electronic device, and storage medium
CN113033345A (en) * 2021-03-10 2021-06-25 南京航空航天大学 V2V video face recognition method based on public feature subspace
CN113033345B (en) * 2021-03-10 2024-02-20 南京航空航天大学 V2V video face recognition method based on public feature subspace
CN113378660B (en) * 2021-05-25 2023-11-07 广州紫为云科技有限公司 Face recognition method and device with low data cost
CN113378660A (en) * 2021-05-25 2021-09-10 广州紫为云科技有限公司 Low-data-cost face recognition method and device
CN113239866A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN113205082A (en) * 2021-06-22 2021-08-03 中国科学院自动化研究所 Robust iris identification method based on acquisition uncertainty decoupling
CN113688708A (en) * 2021-08-12 2021-11-23 北京数美时代科技有限公司 Face recognition method, system and storage medium based on probability characteristics
CN113705647B (en) * 2021-08-19 2023-04-28 电子科技大学 Dual semantic feature extraction method based on dynamic interval
CN113705647A (en) * 2021-08-19 2021-11-26 电子科技大学 Dynamic interval-based dual semantic feature extraction method
CN113792701A (en) * 2021-09-24 2021-12-14 北京市商汤科技开发有限公司 Living body detection method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111339988B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111339988B (en) Video face recognition method based on dynamic interval loss function and probability characteristic
CN108647583B (en) Face recognition algorithm training method based on multi-target learning
CN103605972B (en) Non-restricted environment face verification method based on block depth neural network
CN107529650B (en) Closed loop detection method and device and computer equipment
CN105138973B (en) The method and apparatus of face authentication
CN106295694B (en) Face recognition method for iterative re-constrained group sparse representation classification
CN108427921A (en) A kind of face identification method based on convolutional neural networks
US7295687B2 (en) Face recognition method using artificial neural network and apparatus thereof
US7711156B2 (en) Apparatus and method for generating shape model of object and apparatus and method for automatically searching for feature points of object employing the same
CN109598268A (en) A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN108520213B (en) Face beauty prediction method based on multi-scale depth
CN110378208B (en) Behavior identification method based on deep residual error network
CN112232184B (en) Multi-angle face recognition method based on deep learning and space conversion network
CN112800876A (en) Method and system for embedding hypersphere features for re-identification
CN102867191A (en) Dimension reducing method based on manifold sub-space study
CN109614866A (en) Method for detecting human face based on cascade deep convolutional neural networks
CN108229432A (en) Face calibration method and device
CN112084895A (en) Pedestrian re-identification method based on deep learning
CN110490028A (en) Recognition of face network training method, equipment and storage medium based on deep learning
Zuobin et al. Feature regrouping for cca-based feature fusion and extraction through normalized cut
Wang et al. Occluded person re-identification via defending against attacks from obstacles
KR20060089376A (en) A method of face recognition using pca and back-propagation algorithms
CN116258938A (en) Image retrieval and identification method based on autonomous evolution loss
CN114155572A (en) Facial expression recognition method and system
CN112836629A (en) Image classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant