CN110674730A - Monocular-based face silence living body detection method - Google Patents


Info

Publication number
CN110674730A
CN110674730A (application CN201910893676.0A)
Authority
CN
China
Prior art keywords
layer
neural network
convolutional neural
output
convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910893676.0A
Other languages
Chinese (zh)
Inventor
谢巍
周延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910893676.0A priority Critical patent/CN110674730A/en
Publication of CN110674730A publication Critical patent/CN110674730A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a monocular-based face silence living body detection method, which comprises the following steps: S1, obtaining a training data set and applying data enhancement to it; S2, training the improved convolutional neural network on the images, and saving the convolutional neural network model obtained after training; S3, capturing a single frame containing a face with a camera and invoking the trained model to perform liveness detection, realizing real-time, high-accuracy face liveness recognition. The method builds a multilayer convolutional neural network, trains it on genuine and fake face samples to obtain a classification model, and thereby determines whether an image comes from a live face.

Description

Monocular-based face silence living body detection method
Technical Field
The invention relates to the field of image processing technology, computer vision and pattern recognition, in particular to a monocular-based face silence living body detection method.
Background
With the increasing maturity of image processing and computer vision algorithms, face recognition technology has developed rapidly, and face anti-spoofing has become an important research topic. Living body (liveness) detection is a method for confirming the real physiological presence of a subject in identity-verification scenarios. In face recognition applications, liveness detection can verify whether the user is a real, live person by combining actions such as blinking, mouth opening, head shaking and nodding with technologies such as facial key-point localization and face tracking. It can effectively resist common attacks such as photos, face swapping, masks, occlusion and screen replay, thereby helping users screen out fraudulent behavior and safeguarding their interests. Existing liveness detection methods include:
Silent liveness detection: compared with action-based (dynamic) liveness detection, silent liveness detection requires no user action; the user simply faces the camera naturally for 3 to 4 seconds. Because a real face is never absolutely still, it exhibits micro-movements such as eyelid and eyeball motion, blinking, and slight stretching of the lips and surrounding cheeks, and these cues can be used for anti-spoofing.
Infrared liveness detection: this approach requires an additional infrared camera. Both visible light and infrared light are electromagnetic waves in nature, and the final appearance of the image depends on the reflective properties of the material surface. Real faces and attack media such as paper, screens and 3D masks have different reflectance characteristics and therefore image differently, and the difference is even more pronounced under infrared reflection.
Optical flow method: the temporal variation and correlation of pixel intensities in an image sequence are used to determine the motion at each pixel position, extracting motion information for every pixel from the sequence; a difference-of-Gaussians filter, LBP features and a support vector machine are then used for statistical analysis of the data. The optical flow field is sensitive to object motion, so eye movement and blinking can both be detected with it. This form of liveness detection can be performed unobtrusively, without user cooperation.
3D camera: a face is captured to obtain 3D data of the corresponding face region, which is further analyzed to decide whether the face comes from a living body or a non-living source. Non-living sources are diverse, including photos and videos displayed on media such as phones and tablets, and printed photos on various materials (including bent, folded, cut or hollowed-out variants). The key is to select the most discriminative features from the 3D face data of living and non-living samples to train a classifier, which is then used to distinguish the two.
Disclosure of Invention
In order to solve the above problems, the present invention provides a monocular-based face silence living body detection method. A convolutional neural network is a feed-forward neural network that contains convolution operations and has a deep structure; it can perform translation-invariant classification of input information and is widely used in image recognition, natural language processing, audio processing and other fields. The proposed algorithm comprises three steps: first, a data-rich training set is obtained through data enhancement; then an improved deep neural network consisting of several convolutional layers, pooling layers and BN layers is trained on the images and the model is saved; finally, a single frame containing a face is captured with a camera and the trained model is invoked to perform liveness detection, achieving real-time, high-accuracy face liveness recognition.
The invention is realized by at least one of the following technical schemes.
A monocular-based face silence living body detection method comprises the following steps:
S1, obtaining a training data set and applying data enhancement to it;
S2, training the improved convolutional neural network on the images, and saving the convolutional neural network model obtained after training;
and S3, capturing a single frame containing a face with a camera and performing liveness detection with the convolutional neural network model, realizing real-time, high-accuracy face liveness recognition.
Further, the training set of step S1 is obtained by:
according to the videos in the CASIA-FASD data set, faces are cropped from the images using a Haar classifier, and these images form one part of the training data set; sample pictures of genuine and fake faces shot in different scenes form the other part, and the training images undergo data enhancement consisting of random adjustment of brightness and saturation and random rotation. A genuine face is a real, physical face; a fake face is a face in a photo or a face image displayed on the screen of some device.
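For illustration, the face-cropping step can be sketched in Python with OpenCV's Haar cascade as follows; the 200 × 200 output size follows the embodiment described later, while the cascade file and detector parameters are illustrative assumptions rather than values taken from the patent.

```python
# Sketch of the Haar-cascade face-cropping step (illustrative, not the
# patent's own code). Requires opencv-python.
import cv2

# Frontal-face Haar classifier shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(image_path, out_size=200):
    """Detect faces in one image and return out_size x out_size crops."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for (x, y, w, h) in faces:
        face = img[y:y + h, x:x + w]
        # Resize to the 200x200x3 input size used by the network.
        crops.append(cv2.resize(face, (out_size, out_size)))
    return crops
```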
Further, the improved convolutional neural network is an improved VGG11 network comprising eight convolutional layers and three fully connected layers (eleven weight layers in total). Each convolutional layer is followed by a ReLU layer (convolutional layer + ReLU layer); every two convolutional layer + ReLU pairs are followed by a max pooling layer and a random deactivation (dropout) layer; the last three dropout layers are each followed by a fully connected layer; each fully connected layer is followed by a ReLU layer; and the last ReLU layer is connected to a softmax layer. In the output of the first two convolutional layers, each convolutional layer is connected to a BN layer, which is connected to a max pooling layer, which in turn is connected to a dropout layer.
Further, the training of the improved VGG11 network is specifically as follows:
1) Batch Normalization is applied to the outputs of the first two convolutional layers so that the intermediate output values of the convolutional neural network remain stable and gradients do not vanish; the batch normalization principle formula is:

\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}   (1)

wherein x^{(k)} is the k-th dimension of the input, E[x^{(k)}] is the mean of x^{(k)}, and Var[x^{(k)}] is the variance of x^{(k)};
2) dropout is applied to the output of each convolutional layer, so that the activation value of a neuron stops working with a certain probability during forward propagation, preventing overfitting;
3) the learning rate is a decayed learning rate, and the learning rate is used to control the parameter update speed when training the improved convolutional neural network.
Further, the softmax layer is expressed as:

y_j = \frac{e^{v_j}}{\sum_k e^{v_k}}   (2)

where v_j is the j-th output of the layer preceding the last layer of the improved convolutional neural network, j represents the class index, and y_j represents the ratio of the exponential of the current element to the sum of the exponentials of all elements; the layer contains two neurons, corresponding to the probability distribution of the binary classification into genuine-face and fake-face images.
Further, the formulas of the VGG11 network structure adopting dropout are as follows:

r_j^{(l)} \sim \mathrm{Bernoulli}(p)   (3)
\tilde{y}^{(l)} = r^{(l)} \ast y^{(l)}   (4)
z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}   (5)
y_i^{(l+1)} = f(z_i^{(l+1)})   (6)

wherein z_i^{(l+1)} is the output of layer l+1 of the improved convolutional neural network, y_i^{(l+1)} is its activated output, \tilde{y}^{(l)} is the output of the layer-l neurons after the dropout operation, the Bernoulli function randomly generates a vector r_j^{(l)} of 0s and 1s, y^{(l)} is the output of layer l of the improved convolutional neural network, w_i^{(l+1)} is the weight of layer l+1, b_i^{(l+1)} is the bias of layer l+1, \ast denotes the dropout (element-wise masking) operation, r^{(l)} is a vector of 0s and 1s, i denotes the i-th dimension, and p is the activation probability of a neuron.
Further, the BN (batch normalization) layer normalizes the data to a standard Gaussian distribution with mean 0 and variance 1, as follows:
consider a mini-batch B of size m, B = {x_1, ..., x_m}, where x_i is an element of the batch, and let γ and β be the two parameters to be learned, used to maintain the expressive power of the improved convolutional neural network; the output after the BN layer is y_i = BN_{γ,β}(x_i):

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i   (7)
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2   (8)
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}   (9)
y_i = \gamma\hat{x}_i + \beta   (10)

wherein μ_B is the mini-batch mean, σ_B^2 is the mini-batch variance, x̂_i is the normalized x_i, ε is a constant set to 1, y_i is the output of the BN layer, and BN_{γ,β} is the BN normalization function with parameters γ and β.
Compared with the prior art, the invention has the following beneficial effects: existing technologies are limited to anti-spoofing detection of faces displayed on a screen or of printed faces, whereas the invention can perform liveness detection against both video replays and printed photos, and achieves more accurate liveness detection without requiring user cooperation.
Drawings
Fig. 1 is a flowchart of a monocular-based face silence live detection method according to the present embodiment;
FIG. 2 is a block diagram of a convolutional neural network of the present embodiment;
FIG. 3 is a block diagram of a convolutional neural network without dropout in the present embodiment;
FIG. 4 is a block diagram of a convolutional neural network using dropout in the present embodiment;
fig. 5 is a network structure diagram of the convolutional neural network test of the embodiment.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
As shown in fig. 1, a monocular-based face silence live-body detection method includes the following steps:
s1, obtaining a training data set in a data enhancement mode, wherein the specific obtaining process is as follows:
Using the videos in the face anti-spoofing database of the Chinese Academy of Sciences (CASIA-FASD), a cascade classifier is used to crop faces from the images, and these images form one part of the training data set; this embodiment also uses images from the face anti-spoofing database of Nanjing University of Aeronautics and Astronautics. Sample pictures of genuine and fake faces in different real scenarios (actual scenario) are shot as additional training samples, and the training data set undergoes data enhancement through random adjustment of image brightness and saturation and random rotation. The CASIA-FASD data set consists of videos, each containing 100 to 200 frames; 30 frames (equally spaced) are captured from each video.
The face images in the Nanjing University of Aeronautics and Astronautics face anti-spoofing database (NUAA database) can also be used as training data; the images in the NUAA database were shot of different people under different illumination conditions. Random brightness adjustment, random saturation adjustment, random contrast adjustment and random flipping are applied to the face images to increase the generalization capability of the model, as sketched below;
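A minimal sketch of the frame sampling and augmentation just described, assuming OpenCV and NumPy; the jitter ranges are illustrative assumptions (saturation jitter would follow the same pattern in HSV space):

```python
# Sketch: 30 evenly spaced frames per video, then random brightness/
# contrast jitter and horizontal flip (parameter ranges are assumptions).
import cv2
import numpy as np

def sample_frames(video_path, n_frames=30):
    """Grab n_frames at equal intervals from a CASIA-FASD video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def augment(img, rng=np.random):
    """Random brightness/contrast jitter and random horizontal flip."""
    img = img.astype(np.float32)
    img *= rng.uniform(0.8, 1.2)                                   # brightness
    img = (img - img.mean()) * rng.uniform(0.8, 1.2) + img.mean()  # contrast
    if rng.rand() < 0.5:
        img = img[:, ::-1]                                         # flip
    return np.clip(img, 0, 255).astype(np.uint8)
```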
s2, training the image by using the improved convolutional neural network, and storing the convolutional neural network model obtained after training;
as shown in fig. 2, the convolutional neural network is a modified VGG11 structure, and the modified VGG11 network is used to classify genuine and fake faces. Based on the original VGG11 network, the improved structure comprises eight convolutional layers and three fully connected layers (eleven weight layers in total): a ReLU layer is added after each convolutional layer (Conv) (convolutional layer + ReLU layer); a max pooling layer (Max pooling) and a random deactivation layer (dropout) are added after every two convolutional layer + ReLU pairs; a fully connected layer is added after each of the last three dropout layers; a linear rectification function (ReLU) layer follows each fully connected layer; and the last ReLU layer is connected to a softmax layer. In the output of the first two convolutional layers, each convolutional layer is connected to a BN layer (batch normalization layer), which is connected to a max pooling layer, which in turn is connected to a dropout layer.
The pooling kernels in the convolutional neural network are 2 × 2 with stride 2. The network comprises an input layer, eight convolutional layers, three fully connected layers and a normalized exponential function (softmax) layer. The first and second convolutional layers contain 64 and 128 convolution kernels respectively, with kernel sizes 7 × 7 and 5 × 5, and each is followed by a 2 × 2 max pooling layer. The third and fourth convolutional layers share weights and each contains 256 kernels of size 3 × 3; the fifth and sixth convolutional layers share weights and each contains 512 kernels of size 3 × 3; the seventh and eighth convolutional layers share weights and each contains 512 kernels of size 3 × 3; the fully connected layers are fully connected to the eighth convolutional layer. The input image is 200 × 200 × 3 pixels with three RGB channels; after preprocessing, it can be fed to the convolutional neural network.
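A PyTorch sketch of this architecture follows, assuming the layer sequence described above. The fully connected widths and the exact ReLU/BN ordering in the first two blocks are assumptions, the dropout rate is derived from the 95% retention rate given later in the text, and the weight sharing between paired layers is omitted for simplicity.

```python
# Sketch of the modified VGG11-style network: eight conv layers, BN only
# after the first two, max-pool + dropout after each block, three FC
# layers; softmax is applied at the loss. Illustrative, not the patent's code.
import torch
import torch.nn as nn

def block(in_ch, out_ch, k, use_bn, n_convs=1, p=0.05):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, k, padding=k // 2)]
        if use_bn:
            layers += [nn.BatchNorm2d(out_ch)]   # BN only in the first two blocks
        layers += [nn.ReLU(inplace=True)]
    layers += [nn.MaxPool2d(2, 2), nn.Dropout(p)]  # p = 0.05 (95% retention)
    return layers

class LivenessVGG(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            *block(3, 64, 7, use_bn=True),                  # conv1, 64 @ 7x7
            *block(64, 128, 5, use_bn=True),                # conv2, 128 @ 5x5
            *block(128, 256, 3, use_bn=False, n_convs=2),   # conv3-4, 256 @ 3x3
            *block(256, 512, 3, use_bn=False, n_convs=2),   # conv5-6, 512 @ 3x3
            *block(512, 512, 3, use_bn=False, n_convs=2),   # conv7-8, 512 @ 3x3
        )
        # 200x200 input -> five 2x2 poolings -> 6x6 feature maps.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 6 * 6, 512), nn.ReLU(inplace=True), nn.Dropout(0.05),
            nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Dropout(0.05),
            nn.Linear(512, 2),   # genuine vs. fake face logits
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```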
The convolutional neural network structure adopted by the invention is shown in Table 3: eight convolutional layers, three fully connected layers and one softmax layer; the intermediate activation functions are ReLU, and the pooling layers use max pooling. Training this network on the face images yields a genuine/fake face discrimination model, realizing monocular silent liveness detection.
TABLE 3 network architecture
wherein Conv denotes a convolutional layer, Pool denotes a pooling layer, and Fully connected denotes a fully connected layer.
The last layer is the softmax layer, which is expressed as:

y_j = \frac{e^{v_j}}{\sum_k e^{v_k}}   (2)

where v_j is the j-th output of the layer preceding the last layer of the network, j represents the class index, and y_j represents the ratio of the exponential of the current element to the sum of the exponentials of all elements; the layer contains two neurons, corresponding to the probability distribution of the binary classification into genuine-face and fake-face images.
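As a numerical illustration of equation (2), a small sketch follows (the logits are made-up example values):

```python
# Numerically stable softmax of equation (2): each score v_j is
# exponentiated and divided by the sum over both classes.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())      # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.3, -1.1])   # example outputs for [genuine, fake]
print(softmax(logits))           # probabilities summing to 1
```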
The training of the improved VGG11 network is specifically as follows:
1) the outputs of the first two convolutional layers are subjected to Batch Normalization, which normalizes the input data so that the intermediate output values of the convolutional neural network remain stable and gradients do not vanish; the batch normalization principle formula is:

\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}   (1)

wherein x^{(k)} is the k-th dimension of the input, E[x^{(k)}] is the mean of x^{(k)}, and Var[x^{(k)}] is the variance of x^{(k)}.
2) Dropout is applied to the output of each convolutional layer, so that the activation value of a neuron stops working with a certain probability during forward propagation, preventing overfitting;
3) a decayed learning rate is adopted; the learning rate controls the parameter update speed during training. When the learning rate is small, parameter updates slow down considerably; when it is large, oscillation occurs during the search and the parameters linger around an extremum. Adopting a decayed learning rate alleviates both problems.
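A sketch of such a staircase decay schedule follows; the concrete factors (decay to 90% every 800 steps) are taken from the embodiment described later, while the base learning rate is an illustrative assumption:

```python
# Staircase-decayed learning rate: multiply by decay_rate every decay_steps.
def decayed_lr(base_lr, step, decay_steps=800, decay_rate=0.9):
    """Learning rate after `step` training steps."""
    return base_lr * decay_rate ** (step // decay_steps)

# Large steps early for fast training, small steps later so the global
# optimum is not overshot:
for step in (0, 800, 1600, 8000):
    print(step, decayed_lr(0.01, step))
```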
The random deactivation (dropout) method randomly selects part of the network's nodes to "forget". No model can separate the data 100% perfectly; when anomalous samples of some class appear, the network may learn them as a rule, which also leads to overfitting. Because anomalous data occur with much lower probability than mainstream data, actively ignoring the data of some nodes during each optimization step further reduces the probability of learning from anomalies and thus strengthens the generalization capability of the network.
FIG. 3 shows the workflow of dropout; without dropout, the forward computation is:

z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}   (3)
y_i^{(l+1)} = f(z_i^{(l+1)})   (4)
as shown in fig. 4, the computation of the VGG11 network structure adopting dropout is as follows:

r_j^{(l)} \sim \mathrm{Bernoulli}(p)   (5)
\tilde{y}^{(l)} = r^{(l)} \ast y^{(l)}   (6)
z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}   (7)
y_i^{(l+1)} = f(z_i^{(l+1)})   (8)

wherein z_i^{(l+1)} is the output of layer l+1 of the improved convolutional neural network, y_i^{(l+1)} is its activated output, \tilde{y}^{(l)} is the output of the layer-l neurons after the dropout operation, the Bernoulli function randomly generates a vector r_j^{(l)} of 0s and 1s, y^{(l)} is the output of layer l of the improved convolutional neural network, w_i^{(l+1)} is the weight of layer l+1, b_i^{(l+1)} is the bias of layer l+1, and p is the activation probability of a neuron.
It is worth noting that dropout is used only during training and does not need to be added at test time. Therefore keep_prob is set to 1 during testing, i.e. the activation rate of the neurons is one hundred percent, meaning no neuron is discarded. The structure of the network under test is shown in fig. 5; compared with the network structure of fig. 2, it lacks the dropout (random deactivation) layers.
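The dropout forward pass of equations (5)-(8) and the test-time behaviour can be sketched as follows. The 1/p rescale during training is not part of equations (5)-(7); it is added here (the inverted-dropout convention) so that no scaling is needed at test time, matching the keep_prob = 1 behaviour described above.

```python
# Sketch of one dropout layer per equations (5)-(8), with a train/test flag.
import numpy as np

def dropout_layer(y, W, b, p=0.95, train=True, rng=np.random):
    """z = W @ (r * y) + b with r_j ~ Bernoulli(p); f = ReLU."""
    if train:
        r = rng.binomial(1, p, size=y.shape)  # eq. (5): r_j ~ Bernoulli(p)
        y_tilde = r * y / p                   # eq. (6), plus inverted-dropout rescale
    else:
        y_tilde = y                           # test time: keep_prob = 1, no mask
    z = W @ y_tilde + b                       # eq. (7)
    return np.maximum(z, 0.0)                 # eq. (8) with f = ReLU
```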
The BN (batch normalization) layer is arranged so that, as far as possible, the outputs of each forward pass follow the same distribution, preventing gradient dispersion. Data passing through the BN layer are normalized to a standard Gaussian distribution with mean 0 and variance 1; the batch normalization principle is as follows.
Consider a mini-batch B of size m, B = {x_1, ..., x_m}, where x_i is an element of the batch, and let γ and β be the two parameters to be learned, used to maintain the expressive power of the model; the output after the BN layer is y_i = BN_{γ,β}(x_i):

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i   (9)
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2   (10)
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}   (11)
y_i = \gamma\hat{x}_i + \beta   (12)

wherein μ_B is the mini-batch mean, σ_B^2 is the mini-batch variance, x̂_i is the normalized x_i, and ε is a constant set to 1 in this embodiment.
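Equations (9)-(12) translate directly into the following sketch; γ and β would be learned parameters in practice, and ε is kept as an argument (defaulted to 1 as in this embodiment, though common practice uses a much smaller value):

```python
# Batch normalization over one mini-batch, per equations (9)-(12).
import numpy as np

def batch_norm(x, gamma, beta, eps=1.0):
    """BN for x of shape (m, features); gamma/beta are learned elsewhere."""
    mu = x.mean(axis=0)                      # eq. (9): mini-batch mean
    var = x.var(axis=0)                      # eq. (10): mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # eq. (11): normalize
    return gamma * x_hat + beta              # eq. (12): scale and shift
```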
After the BN layers are added, a higher learning rate can be used, and dropout can be removed or used at a lower rate, which increases the training speed.
Adding BN layers improves the training speed and the generalization capability of the model to a certain extent, but different placements of the BN layer perform differently. Under identical hardware conditions, experiments were run on the CASIA data set with (a) no BN layer, (b) a BN layer on the output of every convolutional layer, and (c) BN layers on the outputs of only some convolutional layers, and the equal error rate and training time of each network were recorded, as shown in Table 1.

Table 1. Performance of different BN-layer placements on the CASIA-FASD data set

As can be seen from Table 1, the position at which BN layers are added affects the performance of the convolutional neural network model: adding a BN layer to the output of every convolutional layer seriously degrades performance, with a higher equal error rate and slower training than the model without BN layers, whereas adding BN layers to only some of the convolutional layers (the first two) reduces both the equal error rate and the training time.
Dropout improves the generalization capability of the model. The table below shows the behaviour of the three models (a), (b) and (c) after dropout is added between the pooling-layer outputs and the fully connected layers.

Table 2. Performance of the three classes of models after dropout processing

As can be seen from Table 2, the equal error rate is further reduced compared with Table 1, but dropout increases the training time of the convolutional neural network model; in combination with batch normalization, the retention rate can therefore be set high (95%) to mitigate the increase in training time.
The genuine face samples, fake face samples and data-enhanced samples are labelled and then used for training. The loss function is the cross-entropy function (cross entropy); the learning rate is a decayed learning rate, decaying to ninety percent of its value every 800 steps, so that training proceeds quickly with large steps at the beginning and the global optimum is not easily missed with small steps later. The liveness detection method has been successfully integrated into a face recognition system.
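These training details can be sketched as follows (PyTorch, assuming the LivenessVGG sketch above). The optimizer choice, base learning rate and batch handling are assumptions; the cross-entropy loss and the decay to 90% every 800 steps come from the text.

```python
# Training-loop sketch: cross-entropy loss with a x0.9-every-800-steps
# staircase learning-rate decay (illustrative, not the patent's code).
import torch
import torch.nn as nn

model = LivenessVGG()
criterion = nn.CrossEntropyLoss()               # cross-entropy on two classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=800, gamma=0.9)

def train(loader, epochs=10):
    model.train()                               # dropout/BN in training mode
    for _ in range(epochs):
        for images, labels in loader:           # images: Nx3x200x200, labels: 0/1
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()                    # decay every 800 steps
```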
And S3, a single frame containing a face is captured with a camera, and the trained model is invoked to perform liveness detection, realizing real-time, high-accuracy face liveness recognition.
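Step S3 can be sketched as follows; the model file name, the class ordering (index 0 = genuine) and the 0.5 decision threshold are illustrative assumptions, and LivenessVGG is the sketch shown earlier.

```python
# Inference sketch: grab one camera frame, crop the face with the Haar
# cascade, classify with the saved model (illustrative assumptions).
import cv2
import torch

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

model = LivenessVGG()
model.load_state_dict(torch.load("liveness_vgg.pth"))  # assumed file name
model.eval()                                   # dropout off (keep_prob = 1)

cap = cv2.VideoCapture(0)                      # monocular camera
ok, frame = cap.read()                         # capture a single frame
cap.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        face = cv2.resize(frame[y:y + h, x:x + w], (200, 200))
        face = cv2.cvtColor(face, cv2.COLOR_BGR2RGB)
        t = torch.from_numpy(face).permute(2, 0, 1).float().unsqueeze(0) / 255
        with torch.no_grad():
            prob_real = torch.softmax(model(t), dim=1)[0, 0].item()
        print("live" if prob_real > 0.5 else "spoof", prob_real)
```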
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (7)

1. A monocular-based face silence living body detection method is characterized by comprising the following steps:
s1, obtaining a training data set and applying data enhancement to it;
s2, training the improved convolutional neural network on the images, and saving the convolutional neural network model obtained after training;
and S3, capturing a single frame containing a face with a camera, and performing living body detection with the convolutional neural network model to realize real-time face living body identification.
2. The monocular based face silence live detection method of claim 1, wherein the training set of step S1 is obtained by:
according to the videos in the CASIA-FASD data set, faces are cropped from the images using a Haar classifier, and these images form one part of the training data set; sample pictures of genuine and fake faces shot in different scenes form the other part, and the training data set undergoes data enhancement through random adjustment of image brightness, contrast and saturation and random rotation.
3. The monocular-based face silence live detection method of claim 1, wherein the improved convolutional neural network is an improved VGG11 network comprising eight convolutional layers and three fully connected layers (eleven weight layers in total); each convolutional layer is followed by a ReLU layer, namely convolutional layer + ReLU layer; every two convolutional layer + ReLU pairs are followed by a max pooling layer and a random deactivation layer, namely dropout; the last three dropout layers are each followed by a fully connected layer; each fully connected layer is followed by a ReLU layer; and the last ReLU layer is connected to a softmax layer; in the output of the first two convolutional layers, each convolutional layer is connected to a BN (Batch Normalization) layer, which is connected to a max pooling layer, which in turn is connected to a random deactivation layer.
4. The monocular based face silence live detection method of claim 3, wherein the training of the improved VGG11 network is specifically as follows:
1) performing Batch Normalization on the outputs of the first two convolutional layers so that the intermediate output values of the convolutional neural network remain stable and gradients do not vanish, the batch normalization principle formula being:

\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}   (1)

wherein x^{(k)} is the k-th dimension of the input, E[x^{(k)}] is the mean of x^{(k)}, and Var[x^{(k)}] is the variance of x^{(k)};
2) applying dropout to the output of each convolutional layer, so that the activation value of a neuron stops working with a certain probability during forward propagation, preventing overfitting;
3) the learning rate is a decayed learning rate, and the learning rate is used to control the parameter update speed when training the improved convolutional neural network.
5. The monocular-based face silence live detection method of claim 3, wherein the softmax layer is expressed as:

y_j = \frac{e^{v_j}}{\sum_k e^{v_k}}   (2)

wherein v_j is the j-th output of the layer preceding the last layer of the convolutional neural network, j represents the class index, y_j represents the ratio of the exponential of the current element to the sum of the exponentials of all elements, and the layer contains two neurons corresponding to the probability distribution of the binary classification into genuine-face and fake-face images.
6. The monocular-based face silence living body detection method of claim 3, wherein the formulas of the VGG11 network structure adopting dropout are as follows:

r_j^{(l)} \sim \mathrm{Bernoulli}(p)   (3)
\tilde{y}^{(l)} = r^{(l)} \ast y^{(l)}   (4)
z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}   (5)
y_i^{(l+1)} = f(z_i^{(l+1)})   (6)

wherein z_i^{(l+1)} is the output of layer l+1 of the improved convolutional neural network, y_i^{(l+1)} is its activated output, \tilde{y}^{(l)} is the output of the layer-l neurons after the dropout operation, the Bernoulli function randomly generates a vector r_j^{(l)} of 0s and 1s, y^{(l)} is the output of layer l of the improved convolutional neural network, w_i^{(l+1)} is the weight of layer l+1, b_i^{(l+1)} is the bias of layer l+1; \ast represents the dropout processing operation, r^{(l)} is a vector of 0s and 1s, i denotes the i-th dimension, and p is the activation probability of a neuron.
7. The monocular-based face silence live detection method of claim 3, wherein the BN layer normalizes the data to a standard Gaussian distribution with mean 0 and variance 1, as follows:
consider a mini-batch B of size m, B = {x_1, ..., x_m}, where x_i is an element of the batch, and let γ and β be the two parameters to be learned, used to maintain the expressive power of the improved convolutional neural network; the output after the BN layer is y_i = BN_{γ,β}(x_i):

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i   (7)
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2   (8)
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}   (9)
y_i = \gamma\hat{x}_i + \beta   (10)

wherein μ_B is the mini-batch mean, σ_B^2 is the mini-batch variance, x̂_i is the normalized x_i, ε is a constant set to 1, y_i is the output of the BN layer, and BN_{γ,β} is the BN layer normalization function with parameters γ and β.
CN201910893676.0A 2019-09-20 2019-09-20 Monocular-based face silence living body detection method Pending CN110674730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910893676.0A CN110674730A (en) 2019-09-20 2019-09-20 Monocular-based face silence living body detection method


Publications (1)

Publication Number Publication Date
CN110674730A true CN110674730A (en) 2020-01-10

Family

ID=69077027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910893676.0A Pending CN110674730A (en) 2019-09-20 2019-09-20 Monocular-based face silence living body detection method

Country Status (1)

Country Link
CN (1) CN110674730A (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066942A (en) * 2017-03-03 2017-08-18 上海斐讯数据通信技术有限公司 A kind of living body faces recognition methods and system
CN107194376A (en) * 2017-06-21 2017-09-22 北京市威富安防科技有限公司 Mask fraud convolutional neural networks training method and human face in-vivo detection method
CN107220635A (en) * 2017-06-21 2017-09-29 北京市威富安防科技有限公司 Human face in-vivo detection method based on many fraud modes
CN107292267A (en) * 2017-06-21 2017-10-24 北京市威富安防科技有限公司 Photo fraud convolutional neural networks training method and human face in-vivo detection method
CN107301396A (en) * 2017-06-21 2017-10-27 北京市威富安防科技有限公司 Video fraud convolutional neural networks training method and human face in-vivo detection method
CN107944416A (en) * 2017-12-06 2018-04-20 成都睿码科技有限责任公司 A kind of method that true man's verification is carried out by video
CN108549854A (en) * 2018-03-28 2018-09-18 中科博宏(北京)科技有限公司 A kind of human face in-vivo detection method
CN109886087A (en) * 2019-01-04 2019-06-14 平安科技(深圳)有限公司 A kind of biopsy method neural network based and terminal device
CN109886121A (en) * 2019-01-23 2019-06-14 浙江大学 A kind of face key independent positioning method blocking robust

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
M. Grochowski et al., "Selected technical issues of deep neural networks for image classification purposes", Bulletin of the Polish Academy of Sciences: Technical Sciences *
Shuren Zhou et al., "Improved VGG Model for Road Traffic Sign Recognition", Computers, Materials & Continua *
Tong Yueyang, "Research on Live Face Detection Algorithms Based on Convolutional Neural Networks", China Master's Theses Full-text Database, Information Science and Technology Series *
Jiang Xinkui, "Research on Liveness Recognition Methods Combined with Face Detection", China Master's Theses Full-text Database, Information Science and Technology Series *
Zhang Huichu et al., "Artificial Intelligence in Practice: Build Your Own AI", Shanghai Science and Technology Education Press, 31 August 2019 *
Hu Yubing, "Research on Tomato Disease Identification Based on Convolutional Neural Networks", China Master's Theses Full-text Database, Agricultural Science and Technology Series *
Xu Xiao, "Research on Live Face Detection Algorithms Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology Series *
Chen Yingyi et al., "Fish Species Recognition Method Based on the FTVGG16 Convolutional Neural Network", Transactions of the Chinese Society for Agricultural Machinery *
Long Min et al., "Research on Face Liveness Detection Algorithms Using Convolutional Neural Networks", Journal of Frontiers of Computer Science and Technology *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368731A (en) * 2020-03-04 2020-07-03 上海东普信息科技有限公司 Silent in-vivo detection method, silent in-vivo detection device, silent in-vivo detection equipment and storage medium
CN111368731B (en) * 2020-03-04 2023-06-09 上海东普信息科技有限公司 Silence living body detection method, silence living body detection device, silence living body detection equipment and storage medium
CN112001240A (en) * 2020-07-15 2020-11-27 浙江大华技术股份有限公司 Living body detection method, living body detection device, computer equipment and storage medium
CN112464864A (en) * 2020-12-08 2021-03-09 上海交通大学 Face living body detection method based on tree-shaped neural network structure
CN112818782A (en) * 2021-01-22 2021-05-18 电子科技大学 Generalized silence living body detection method based on medium sensing
CN112906508A (en) * 2021-02-01 2021-06-04 四川观想科技股份有限公司 Face living body detection method based on convolutional neural network
CN112906508B (en) * 2021-02-01 2024-05-28 四川观想科技股份有限公司 Face living body detection method based on convolutional neural network
CN113033487A (en) * 2021-04-21 2021-06-25 南方电网科学研究院有限责任公司 Unstructured data detection method based on sensitive data fingerprint feature library
CN115439691A (en) * 2022-09-05 2022-12-06 哈尔滨市科佳通用机电股份有限公司 TVDS fault automatic identification system
CN115439691B (en) * 2022-09-05 2023-04-21 哈尔滨市科佳通用机电股份有限公司 TVDS fault automatic identification system

Similar Documents

Publication Publication Date Title
CN110674730A (en) Monocular-based face silence living body detection method
CN106951867B (en) Face identification method, device, system and equipment based on convolutional neural networks
Han et al. Two-stage learning to predict human eye fixations via SDAEs
Chakka et al. Competition on counter measures to 2-d facial spoofing attacks
US8995725B2 (en) On-site composition and aesthetics feedback through exemplars for photographers
Negi et al. Face mask detection classifier and model pruning with keras-surgeon
Do et al. Deep neural network-based fusion model for emotion recognition using visual data
CN108563999A (en) A kind of piece identity's recognition methods and device towards low quality video image
WO2021196721A1 (en) Cabin interior environment adjustment method and apparatus
Jiang et al. Multilevel fusing paired visible light and near-infrared spectral images for face anti-spoofing
Damer et al. Deep learning-based face recognition and the robustness to perspective distortion
Afifi et al. Can we boost the power of the Viola–Jones face detector using preprocessing? An empirical study
CN112464864A (en) Face living body detection method based on tree-shaped neural network structure
Hashemi A survey of visual attention models
Mr et al. Developing a novel technique to match composite sketches with images captured by unmanned aerial vehicle
Bonetto et al. Image processing issues in a social assistive system for the blind
Singla et al. Age and gender detection using Deep Learning
Unnikrishnan et al. Texture-based estimation of age and gender from wild conditions
Wang et al. A Learning Analytics Model Based on Expression Recognition and Affective Computing: Review of Techniques and Survey of Acceptance
Cöster et al. Human Attention: The possibility of measuring human attention using OpenCV and the Viola-Jones face detection algorithm
CN112906668B (en) Face information identification method based on convolutional neural network
TWI844284B (en) Method and electrical device for training cross-domain classifier
Tupe et al. Diabetic retinopathy detection using image processing techniques: a study
Una et al. Classification technique for face-spoof detection in artificial neural networks using concepts of machine learning
Vinay et al. Unconstrained face recognition using ASURF and cloud-forest classifier optimized with VLAD

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200110