CN112200057B - Face living body detection method and device, electronic equipment and storage medium - Google Patents

Face living body detection method and device, electronic equipment and storage medium

Info

Publication number
CN112200057B
Authority
CN
China
Prior art keywords
face
network
image
sample
living body
Prior art date
Legal status
Active
Application number
CN202011063444.1A
Other languages
Chinese (zh)
Other versions
CN112200057A (en)
Inventor
冯思博
陈莹
黄磊
彭菲
Current Assignee
Hanwang Technology Co Ltd
Original Assignee
Hanwang Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hanwang Technology Co Ltd filed Critical Hanwang Technology Co Ltd
Priority to CN202011063444.1A priority Critical patent/CN112200057B/en
Publication of CN112200057A publication Critical patent/CN112200057A/en
Application granted granted Critical
Publication of CN112200057B publication Critical patent/CN112200057B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/172 Classification, e.g. identification
    • G06V 40/40 Spoof detection, e.g. liveness detection
    • G06V 40/45 Detection of the body part being alive

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a face living body detection method, which belongs to the technical field of face detection and helps improve the speed and accuracy of face living body detection. The method comprises the following steps: acquiring a first face image and a second face image synchronously acquired of a target face by a first image acquisition device and a second image acquisition device; respectively performing face positioning on the first face image and the second face image, cutting a first face image to be detected from the first face image according to the face positioning result, and cutting a second face image to be detected from the second face image; inputting the first face image to be detected and the second face image to be detected obtained through cutting in parallel into a pre-trained living body detection model, and performing classification mapping on the target face through the living body detection model according to plane characteristics and depth characteristics in the two input face images; and determining whether the target face is a living face according to the classification mapping result.

Description

Face living body detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of face detection technologies, and in particular, to a method and apparatus for detecting a living body of a face, an electronic device, and a computer readable storage medium.
Background
In order to improve the security of face recognition technology in practical applications, performing living body detection on the face images to be recognized, so as to resist photo or video attacks on face recognition applications, has become increasingly important. In the prior art, in order to improve the accuracy of face recognition, face recognition technology based on binocular cameras is increasingly widely used, and face living body detection technology based on binocular cameras is also continuously improving. A common face living body detection technique at present is based on binocular visible light and proceeds as follows: face key point detection is performed separately on two images of a target face acquired by a binocular visible light camera, a three-dimensional sparse point cloud is then constructed from the face key point data, the sparse point cloud is interpolated to generate a dense point cloud, and classification is performed based on the dense point cloud. This scheme has long computation time, high complexity, large computation error and limited usage scenarios.
As can be seen, the face living body detection methods in the prior art still need to be improved.
Disclosure of Invention
The application provides a human face living body detection method which is beneficial to improving the speed and accuracy of human face living body detection.
In order to solve the above problems, in a first aspect, an embodiment of the present application provides a method for detecting a living body of a face, including:
acquiring a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target face;
respectively carrying out face positioning on the first face image and the second face image to obtain a corresponding face positioning result;
cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image respectively;
the first face image to be detected and the second face image to be detected which are obtained through clipping are input to a pre-trained living body detection model in parallel, and the target face is subjected to classified mapping according to plane characteristics and depth characteristics in the first face image to be detected and the second face image to be detected through the living body detection model; the living body detection model is a classification model trained based on face key point constraint and depth feature constraint of a training sample;
And determining whether the target face is a living face according to the classification mapping result.
In a second aspect, an embodiment of the present application provides a face living body detection apparatus, including:
the face image acquisition module is used for acquiring a first face image and a second face image which are synchronously acquired by the first image acquisition device and the second image acquisition device aiming at a target face;
the face positioning module is used for respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results;
the face image clipping module is used for clipping a first face image to be detected from the first face image and clipping a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image respectively;
the image classification module is used for inputting the first face image to be detected and the second face image to be detected which are obtained through cutting into a pre-trained living body detection model in parallel, and carrying out classification mapping on the target face according to the plane characteristics and the depth characteristics in the first face image to be detected and the second face image to be detected through the living body detection model; the living body detection model is a classification model trained based on face key point constraint and depth feature constraint of a training sample;
And the human face living body detection result determining module is used for determining whether the target human face is a living body human face according to the classification mapping result.
In a third aspect, the embodiment of the application also discloses an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the method for detecting the human face living body according to the embodiment of the application when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the face in-vivo detection method disclosed in the embodiment of the present application.
According to the face living body detection method disclosed by the embodiment of the application, a first face image and a second face image synchronously acquired of a target face by a first image acquisition device and a second image acquisition device are acquired; face positioning is respectively performed on the first face image and the second face image to obtain corresponding face positioning results; then, a first face image to be detected is cut from the first face image and a second face image to be detected is cut from the second face image according to the face positioning results in the first face image and the second face image respectively; the first face image to be detected and the second face image to be detected obtained through clipping are input in parallel to a pre-trained living body detection model, and the target face is subjected to classification mapping according to plane characteristics and depth characteristics in the first face image to be detected and the second face image to be detected through the living body detection model, the living body detection model being a classification model trained based on face key point constraints and depth feature constraints of training samples; and whether the target face is a living face is determined according to the classification mapping result, so that the speed of face living body detection is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a face living body detection method according to a first embodiment of the present application;
Fig. 2 is a schematic diagram of a multi-task model according to the first embodiment of the present application;
Fig. 3 is a schematic diagram of a living body detection model according to the first embodiment of the present application;
Fig. 4 is a schematic structural diagram of a face living body detection apparatus according to a second embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Example 1
The embodiment of the application discloses a face living body detection method, as shown in fig. 1, which comprises steps 110 to 150.
Step 110, acquiring a first face image and a second face image synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target face.
In the embodiment of the application, the first image acquisition device and the second image acquisition device are two synchronized image acquisition devices arranged on the same electronic equipment, such as binocular synchronous face recognition equipment. The first image acquisition device and the second image acquisition device synchronously acquire images of a target object (such as a human face) under the control of the electronic equipment. In some embodiments of the application, the relative positions of the first and second image acquisition devices in the vertical and horizontal directions remain unchanged, and a certain distance is maintained between them in the horizontal direction (e.g., typically a distance greater than 60 mm). The imaging light sources of the first image acquisition device and the second image acquisition device can be the same or different. For example, the first image acquisition device and the second image acquisition device may both be visible light image acquisition devices, or both be infrared image acquisition devices, or one may be an infrared image acquisition device and the other a visible light image acquisition device, which is not limited in the present application.
In some embodiments of the present application, the first image acquisition device and the second image acquisition device need to be calibrated in advance, so as to obtain the calibration matrices of the first image acquisition device and the second image acquisition device.
For the specific implementation of calibrating the image acquisition devices, refer to the prior art, for example obtaining the camera intrinsic matrix and the camera extrinsic matrix by means of the Zhang Zhengyou checkerboard calibration method; details are not repeated in the embodiments of the present application.
Taking the first image acquisition device and the second image acquisition device as binocular cameras of electronic equipment as an example, the calibration matrices of the binocular cameras are calibrated and determined when the cameras leave the factory. In some embodiments of the present application, after two face images of a target face are simultaneously acquired by the binocular synchronous cameras of the electronic equipment, for example a first face image A and a second face image B, the first face image A and the second face image B are further rectified using the calibration matrices of the binocular synchronous cameras, so as to obtain a first face image A' and a second face image B' respectively.
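As a concrete illustration of this rectification step, the following is a minimal sketch assuming OpenCV is used and that the intrinsic matrices K1/K2, distortion coefficients dist1/dist2 and the rotation R and translation T between the two cameras are already available from the factory calibration; the library choice and all variable names are assumptions and are not prescribed by the application.

import cv2

def rectify_pair(img_a, img_b, K1, dist1, K2, dist2, R, T):
    # Stereo-rectify the synchronously captured pair so that the two views are
    # row-aligned before face positioning and cropping.
    h, w = img_a.shape[:2]
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, dist1, K2, dist2, (w, h), R, T)
    map1a, map2a = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, (w, h), cv2.CV_32FC1)
    map1b, map2b = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, (w, h), cv2.CV_32FC1)
    img_a_rect = cv2.remap(img_a, map1a, map2a, cv2.INTER_LINEAR)  # first face image A'
    img_b_rect = cv2.remap(img_b, map1b, map2b, cv2.INTER_LINEAR)  # second face image B'
    return img_a_rect, img_b_rect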
Step 120, respectively performing face positioning on the first face image and the second face image to obtain a corresponding face positioning result.
The face positioning result in some embodiments of the present application includes a face positioning frame. In specific implementation, face positioning methods in the prior art can be adopted to perform face positioning on the first face image A' and the second face image B' respectively, so as to obtain a face positioning frame in the first face image A' and a face positioning frame in the second face image B'. The application does not limit the specific implementation of respectively performing face positioning on the first face image and the second face image and respectively determining the face positioning frames in the first face image and the second face image.
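For illustration only, a minimal sketch of obtaining such a face positioning frame with an off-the-shelf detector is given below; the OpenCV Haar cascade is merely one possible prior-art detector and is an assumption, since the application does not prescribe any particular face positioning method.

import cv2

_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def locate_face(image_bgr):
    # Return the largest detected face as the face positioning frame (x, y, w, h),
    # or None if no face is found.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    boxes = _detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])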
Step 130, cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image, respectively.
In some embodiments of the present application, the face positioning result includes a face positioning frame. After the face positioning frame in the first face image and the face positioning frame in the second face image are determined, cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image respectively further comprises: cutting the first face image to be detected from the first face image according to the face positioning frame in the first face image; and cutting the second face image to be detected from the second face image according to the face positioning frame in the second face image.
In order to acquire richer information, it is necessary to determine, from the face positioning frame, an image of a larger area containing the face positioning frame for face living body detection. In some embodiments of the present application, cutting a first face image to be detected from the first face image according to the face positioning frame in the first face image, and cutting a second face image to be detected from the second face image according to the face positioning frame in the second face image, includes: expanding the face positioning frame in the first face image by a preset size, and cutting the first face image to be detected from the first face image according to the expanded face positioning frame; and expanding the face positioning frame in the second face image by a preset size, and cutting the second face image to be detected from the second face image according to the expanded face positioning frame. For example, the face positioning frame S_A of the first face image A' is expanded outward by a factor of 2 to obtain a face positioning frame S_A'; then the image of the area covered by S_A' is cut from the first face image A' as the first face image to be detected. Similarly, the face positioning frame S_B of the second face image B' is expanded outward by a factor of 2 to obtain a face positioning frame S_B'; then the image of the area covered by S_B' is cut from the second face image B' as the second face image to be detected.
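A minimal sketch of this expansion and cutting step is shown below; the 2x factor follows the example above, while the clamping to the image border and the helper name are illustrative assumptions.

def expand_and_crop(image, box, scale=2.0):
    # Expand the face positioning frame (x, y, w, h) about its centre by the
    # preset factor and cut the covered area out of the image, clamping the
    # expanded frame to the image border.
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    x0 = int(max(0, cx - new_w / 2.0))
    y0 = int(max(0, cy - new_h / 2.0))
    x1 = int(min(image.shape[1], cx + new_w / 2.0))
    y1 = int(min(image.shape[0], cy + new_h / 2.0))
    return image[y0:y1, x0:x1]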
Step 140, inputting the first face image to be detected and the second face image to be detected which are obtained through clipping into a pre-trained living body detection model in parallel, and carrying out classification mapping on the target face according to the plane characteristics and the depth characteristics in the first face image to be detected and the second face image to be detected through the living body detection model.
The living body detection model is a classification model trained based on face key point constraint and depth feature constraint of a training sample;
Then, the first face image to be detected and the second face image to be detected obtained through clipping are input in parallel to a pre-trained living body detection model, and living body detection is performed on the target face based on the two input images through the living body detection model. In practice, the living body detection model needs to be trained first.
In some embodiments of the present application, before the step of inputting the first face image to be detected and the second face image to be detected obtained through clipping in parallel to the pre-trained living body detection model and performing classification mapping on the target face according to the plane features and depth features in the first face image to be detected and the second face image to be detected through the living body detection model, the method further includes: training the living body detection model.
In some embodiments of the present application, the living body detection model is obtained by clipping a preset multitasking model, as shown in fig. 2, where the multitasking model includes: a first task network consisting of a first convolutional network 210 and a first fully-connected network 220, the first task network being configured to learn facial key point features in images input to the first convolutional network; a second task network consisting of a second convolutional network 230 and a second fully-connected network 240, the second task network for learning facial key point features in images input to the second convolutional network; a third task network consisting of the first convolutional network 210, the second convolutional network 230, a residual network 250, and a depth regression network 260 for learning depth features in images input to the first convolutional network and the second convolutional network; and a fourth task network composed of the first convolutional network 210, the second convolutional network 230, the residual network 250, and the classification network 270, for learning living and non-living information in images input to the first convolutional network and the second convolutional network, wherein the first convolutional network and the second convolutional network are arranged in parallel, and the residual network is connected with outputs of the first convolutional network and the second convolutional network, respectively.
In the embodiment of the present application, as shown in fig. 3, the trained living body detection model includes: the first convolutional network 210, the second convolutional network 230, the residual network 250, and the classification network 270. Accordingly, training the living body detection model comprises: training the multitasking model; the network parameters of the living body detection model consisting of the first convolutional network 210, the second convolutional network 230, the residual network 250, and the classification network 270 are obtained by training the multitasking model.
Wherein the first convolutional network 210 and the second convolutional network 230 are arranged in parallel, the residual network 250 is connected to the outputs of the first convolutional network 210 and the second convolutional network 230, and the classification network 270 is connected to the outputs of the residual network 250.
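To make the topology of fig. 2 and fig. 3 concrete, the following is a minimal PyTorch sketch of the multi-task model; the framework, layer sizes, channel counts and class/function names are all assumptions, since the application only fixes the branch structure and uses 81 face key points in its example.

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class BranchConvNet(nn.Module):
    # Plays the role of the first / second convolutional network (210 / 230).
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                      conv_block(64, 128), nn.AdaptiveAvgPool2d(4))

    def forward(self, x):
        return self.features(x)

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
                                  nn.ReLU(inplace=True), nn.Conv2d(c, c, 3, padding=1),
                                  nn.BatchNorm2d(c))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class MultiTaskModel(nn.Module):
    def __init__(self, num_keypoints=81):
        super().__init__()
        self.conv1 = BranchConvNet()                        # first convolutional network 210
        self.conv2 = BranchConvNet()                        # second convolutional network 230
        feat_dim = 128 * 4 * 4
        self.fc1 = nn.Linear(feat_dim, num_keypoints * 2)   # first fully-connected network 220
        self.fc2 = nn.Linear(feat_dim, num_keypoints * 2)   # second fully-connected network 240
        self.residual = nn.Sequential(ResidualBlock(256),   # residual network 250
                                      ResidualBlock(256))
        self.depth_head = nn.Linear(256 * 4 * 4, num_keypoints)  # depth regression network 260
        self.cls_head = nn.Linear(256 * 4 * 4, 2)                # classification network 270

    def forward(self, img_left, img_right):
        f1, f2 = self.conv1(img_left), self.conv2(img_right)
        kpt_left = self.fc1(f1.flatten(1))                  # key point prediction, first image
        kpt_right = self.fc2(f2.flatten(1))                 # key point prediction, second image
        fused = self.residual(torch.cat([f1, f2], dim=1))   # fused features ("third vector")
        depth = self.depth_head(fused.flatten(1))           # depth values of the key points
        logits = self.cls_head(fused.flatten(1))            # living / non-living scores
        return kpt_left, kpt_right, depth, logits

In this sketch, the cropped living body detection model of fig. 3 corresponds to keeping conv1, conv2, residual and cls_head after training and discarding the two key point heads and the depth head.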
In specific implementation of the application, the multi-task model is trained first. The multi-task model comprises four learning tasks, namely two network tasks that respectively learn face key point features in the input images of the two channels, one network task that learns depth features in the input images, and one network task that learns living and non-living features in the input images. Each learning task is realized through a different task network, the four task networks share the first convolution network and the second convolution network, and the learning of depth features and of living and non-living features is based on the learning of face key point features.
Before the multitasking model is trained, a training sample set comprising a number of training samples is first acquired. The sample data of each training sample comprises: a first sample image and a second sample image. The sample label of each training sample comprises: the face key point true values corresponding to the first sample image and the second sample image respectively, the depth value true value, and the living body category true value.
The first sample image and the second sample image are a pair of images determined according to the mode of determining the first face image to be detected and the second face image to be detected; the real values of the key points of the human faces corresponding to the first sample image and the second sample image are obtained by respectively carrying out human face detection on the first sample image and the second sample image through the human face detection technology in the prior art. In particular, the number of face key points obtained by different face detection techniques may be different. The depth value true value is a depth value of a face key point obtained after calculation according to the face key point coordinates in the first sample image and the face key point coordinates in the second sample image and the calibration matrixes of the first image acquisition device and the second image acquisition device.
In some embodiments of the application, the sample data for each training sample used to train the multitasking model comprises: a first sample image and a second sample image, the sample label of each of the training samples comprising: the first sample image and the second sample image each correspond to: the face key point true value, the depth value true value and the face living body category true value.
The multitasking model is trained by the following method: for each of the training samples in the sample set, the following coding mapping operation is performed respectively:
inputting the first sample image included in the training sample into the first convolution network of the multitasking model, and simultaneously inputting the second sample image included in the training sample into the second convolution network of the multitasking model;
performing operation processing on the first sample image through the first task network to obtain a face key point predicted value of the first sample image in the training sample, and performing operation processing on the second sample image through the second task network to obtain a face key point predicted value of the second sample image in the training sample;
performing operation processing on the first sample image and the second sample image through the third task network to obtain a depth value predicted value of the training sample;
performing operation processing on the first sample image and the second sample image through the fourth task network to obtain a face living body category predicted value of the training sample.
Then, according to each predicted value obtained by executing the coding mapping operation (namely, the face key point predicted value of the first sample image, the face key point predicted value of the second sample image, the depth value predicted value and the face living body category predicted value), the predicted loss values of the first task network, the second task network, the third task network and the fourth task network are determined; the predicted loss values of the first task network, the second task network, the third task network and the fourth task network are weighted and summed to determine the model predictive total loss value of the multi-task model; and the network parameters of the multi-task model are optimized, and execution jumps back to the coding mapping operation, until the model predictive total loss value converges to meet a preset condition.
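A minimal sketch of one such training step, reusing the MultiTaskModel sketch above, is given below; the mean squared error and cross-entropy loss terms and the per-task weights are assumptions standing in for the loss formulas of the application.

import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch, weights=(1.0, 1.0, 1.0, 1.0)):
    # batch: left/right sample images, key point true values for both images,
    # depth value true values and living body category true values.
    img_l, img_r, kpt_l_gt, kpt_r_gt, depth_gt, label = batch
    kpt_l, kpt_r, depth, logits = model(img_l, img_r)    # coding mapping operation
    loss_kpt_l = F.mse_loss(kpt_l, kpt_l_gt)             # first task network loss
    loss_kpt_r = F.mse_loss(kpt_r, kpt_r_gt)             # second task network loss
    loss_depth = F.mse_loss(depth, depth_gt)             # third task network loss
    loss_cls = F.cross_entropy(logits, label)            # fourth task network loss
    total = (weights[0] * loss_kpt_l + weights[1] * loss_kpt_r +
             weights[2] * loss_depth + weights[3] * loss_cls)   # weighted sum
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()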
First, for each training sample, the first sample image included in the sample data of the training sample is input to the first convolution network of the multitasking model, and at the same time the second sample image included in the sample data of the training sample is input to the second convolution network of the multitasking model; then the computing device executes the code corresponding to each network in the multitasking model and learns the face key point features, depth features, and living and non-living face features of all training samples in the training sample set. In the multi-task model structure shown in fig. 2, the first convolution network 210 and the second convolution network 230 are respectively used to learn the face key point features in the input images; the residual network 250 is used to simultaneously learn depth features and living and non-living features of the input images based on the learning of the face key point features.
In some embodiments of the present application, the computing processing is performed on the first sample image through the first task network to obtain a face key point predicted value of the first sample image in the training sample; and performing operation processing on the second sample image through the second task network to obtain a face key point predicted value of the second sample image in the training sample, wherein the face key point predicted value comprises: performing convolution processing on the first sample image in the training sample through the first convolution network to obtain a first vector; then, carrying out coding mapping on the first vector through the first fully-connected network to obtain a face key point predicted value corresponding to the first sample image; and performing convolution processing on the second sample image in the training sample through the second convolution network to obtain a second vector; and then, carrying out coding mapping on the second vector through the second full-connection network to obtain a face key point predicted value corresponding to the second sample image.
In the implementation of the application, the processing of the first sample image and the processing of the second sample image are performed synchronously by two network tasks. The coding mapping process for the first sample image and the second sample image in a training sample is described below.
As shown in fig. 2, the first convolutional network 210 includes a plurality of convolutional layers which convolve the input image (e.g., denoted P_{L,i}) to extract a feature vector, e.g., denoted as the first vector e_{L,i}; thereafter, the first fully-connected network 220 flattens and maps the first vector e_{L,i} to obtain the face key point predicted values corresponding to the input image P_{L,i}, for example, the predicted values of 81 face key points are expressed as (x_{L,i,0}, x_{L,i,1}, …, x_{L,i,80}), (y_{L,i,0}, y_{L,i,1}, …, y_{L,i,80}).
Similarly, the second convolutional network 230 includes a plurality of convolutional layers which convolve the input image (e.g., denoted P_{R,i}) to extract a feature vector, e.g., denoted as the second vector e_{R,i}; thereafter, the second fully-connected network 240 flattens and maps the second vector e_{R,i} to obtain the face key point predicted values corresponding to the input image P_{R,i}, for example, the predicted values of 81 face key points are expressed as (x_{R,i,0}, x_{R,i,1}, …, x_{R,i,80}), (y_{R,i,0}, y_{R,i,1}, …, y_{R,i,80}).
In some embodiments of the present application, the depth value true value is determined according to a face key point in a first sample image and a second sample image in sample data of the training sample, and the computing, through the third task network, the first sample image and the second sample image to obtain a depth prediction value of the training sample includes: convolving the first vector and the second vector through the residual error network to obtain a third vector; and carrying out coding mapping on the third vector through the depth regression network to obtain a depth value predicted value of the face key point corresponding to the training sample.
The specific implementation manner of determining the face key points in the first sample image and the second sample image refers to the prior art, and is not repeated in the embodiment of the present application. Furthermore, by adopting the method in the prior art, the real depth value of the training sample can be determined according to the face key points in the first sample image and the second sample image and the calibration matrixes of the first image acquisition device and the second image acquisition device for acquiring the first sample image and the second sample image.
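As one possible concrete realization of this prior-art computation, the following sketch triangulates matched face key points from the two rectified images using the projection matrices obtained from calibration; the use of OpenCV's triangulatePoints and the function name are assumptions, since the application leaves the exact method to the prior art.

import cv2
import numpy as np

def keypoint_depths(P1, P2, pts_left, pts_right):
    # P1, P2: 3x4 projection matrices of the first and second image acquisition
    # devices; pts_left / pts_right: (num_keypoints, 2) pixel coordinates of the
    # matched face key points in the two sample images.
    pts4d = cv2.triangulatePoints(P1, P2,
                                  pts_left.T.astype(np.float64),
                                  pts_right.T.astype(np.float64))
    pts3d = (pts4d[:3] / pts4d[3]).T   # homogeneous -> Euclidean (X, Y, Z)
    return pts3d[:, 2]                 # Z coordinates serve as depth value true values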
As shown in fig. 2, the residual network 250 performs convolution processing on the first vector e_{L,i} and the second vector e_{R,i} to obtain the third vector; the depth regression network 260 then performs coding mapping on the third vector to obtain the depth value predicted values of the face key points corresponding to the training sample, for example, the predicted depth values of 81 face key points are expressed as (z_{i,0}, z_{i,1}, …, z_{i,80}).
In some embodiments of the present application, the performing, by using the fourth task network, the operation processing on the first sample image and the second sample image to obtain the face living body class prediction value of the training sample includes: convolving the first vector and the second vector through the residual error network to obtain a third vector; and carrying out coding mapping on the third vector through the classification network to obtain a face living body category predicted value corresponding to the training sample.
As shown in fig. 2, the residual network 250 performs convolution processing on the first vector e_{L,i} and the second vector e_{R,i} to obtain the third vector; the classification network 270 then performs coding mapping on the third vector to obtain the face living body category predicted value corresponding to the training sample. In some embodiments of the present application, after the third vector is coded and mapped by the classification network 270, a two-dimensional vector may be output, where each dimension of the two-dimensional vector represents the probability that the training sample belongs to a different face living body category.
In some embodiments of the present application, determining predicted loss values of the first task network, the second task network, the third task network, and the fourth task network according to each predicted value obtained by performing the code mapping operation includes: determining a predicted loss value of the first task network according to the difference value between the face key point predicted value and the face key point true value of the first sample image in all the training samples in the sample set; determining a predicted loss value of the second task network according to the difference value between the face key point predicted value and the face key point true value of the second sample image in all the training samples in the sample set; determining a predicted loss value of the third task network according to the difference value between the predicted value of the depth value and the true value of the depth value of all the training samples in the sample set; and determining a predicted loss value of the fourth task network according to the difference value between the human face living body type predicted value and the human face living body type true value of all the training samples in the sample set.
For example, according to the coding mapping results of the first sample images in all training samples in the sample set, the prediction error of the first convolution network 210 and the first fully-connected network 220 is calculated, that is, the predicted loss value of the first task network, which is also the first face key point predicted loss value. In some embodiments of the present application, the first face key point predicted loss value of the first task network may be calculated by the following formula:
where L_landmark_left represents the predicted loss value of the first task network, x_{L,i} is the first sample image in the i-th training sample, f(x_{L,i}) is the face key point predicted value of the first sample image in the i-th training sample, y_{L,i} represents the face key point true value of the first sample image in the i-th training sample, N represents the number of samples in the training sample set, λ represents the weight of each network layer, w_j denotes the network parameters, and n represents the number of network layers with weights.
For another example, according to the coding mapping results of the second sample images in all training samples in the sample set, the prediction error of the second convolutional network 230 and the second fully-connected network 240 is calculated, that is, the predicted loss value of the second task network, which is also the second face key point predicted loss value. In some embodiments of the present application, the predicted loss value of the second task network may be calculated by the following formula:
where L_landmark_right represents the predicted loss value of the second task network, x_{R,i} is the second sample image in the i-th training sample, f(x_{R,i}) represents the face key point predicted value of the second sample image in the i-th training sample, y_{R,i} represents the face key point true value of the second sample image in the i-th training sample, N represents the number of samples in the training sample set, λ represents the weight of each network layer, w_j denotes the network parameters, and n represents the number of network layers with weights.
For another example, according to the coding mapping results of all training samples in the sample set, the prediction error of the first convolution network 210, the second convolution network 230, the residual network 250, and the depth regression network 260 is calculated, that is, the predicted loss value of the third task network, which is also the depth value predicted loss value. In some embodiments of the present application, the predicted loss value of the third task network may be calculated by the following formula:
where L_depth represents the predicted loss value of the third task network, f(x_i) represents the depth value predicted values of the face key points in the i-th training sample, y_i represents the depth value true values of the face key points in the i-th training sample, N represents the number of samples in the training sample set, λ represents the weight of each network layer, w_j denotes the network parameters, and n represents the number of network layers with weights.
For another example, according to the coding mapping results of all training samples in the sample set, the prediction error of the first convolution network 210, the second convolution network 230, the residual network 250, and the classification network 270 is calculated, that is, the predicted loss value of the fourth task network, which is also the face living body category predicted loss value. In some embodiments of the present application, the predicted loss value of the fourth task network may be calculated by the following formula:
where L_face_liveness represents the predicted loss value of the fourth task network, f(x_i) represents the face living body category predicted value in the i-th training sample, y_i represents the face living body category true value in the i-th training sample, N represents the number of samples in the training sample set, λ represents the weight of each network layer, w_j denotes the network parameters, and n represents the number of network layers with weights.
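Based on the variable definitions above, a hedged reconstruction of the four task losses in LaTeX, assuming a squared-error data term and an L2 weight regularization term for each task (the exact form used by the application may differ, for example a cross-entropy term for the liveness branch), is:

% Hedged reconstruction from the variable definitions; not the literal formulas of the application.
L_{\mathrm{landmark\_left}}  = \frac{1}{N}\sum_{i=1}^{N}\bigl\|f(x_{L,i}) - y_{L,i}\bigr\|_2^2 + \lambda\sum_{j=1}^{n}\|w_j\|_2^2
L_{\mathrm{landmark\_right}} = \frac{1}{N}\sum_{i=1}^{N}\bigl\|f(x_{R,i}) - y_{R,i}\bigr\|_2^2 + \lambda\sum_{j=1}^{n}\|w_j\|_2^2
L_{\mathrm{depth}}           = \frac{1}{N}\sum_{i=1}^{N}\bigl\|f(x_i) - y_i\bigr\|_2^2 + \lambda\sum_{j=1}^{n}\|w_j\|_2^2
L_{\mathrm{face\_liveness}}  = \frac{1}{N}\sum_{i=1}^{N}\bigl\|f(x_i) - y_i\bigr\|_2^2 + \lambda\sum_{j=1}^{n}\|w_j\|_2^2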
After the predicted loss value of each branch network is determined, the model predictive total loss value of the multi-task model is further calculated according to the predicted loss values of the branch networks. In some embodiments of the application, the model predictive total loss value L_total of the multi-task model may be determined as follows:
L_total = λ_1 L_landmark_left + λ_2 L_landmark_right + λ_3 L_depth + λ_4 L_face_liveness;
where the values of λ_1, λ_2, λ_3 and λ_4 may be set according to practical experience.
In the training process, the model predictive total loss value L_total of the multi-task model can be adjusted by continuously optimizing the network parameters of each network included in the multi-task model, until the model predictive total loss value L_total meets the preset condition (for example, the loss value L_total converges to less than a preset value), at which point the training of the multitasking model is completed.
In the prediction stage, the first face image to be detected and the second face image to be detected which are obtained through clipping are input to a pre-trained living body detection model in parallel, and the target face is subjected to classified mapping according to plane characteristics and depth characteristics in the first face image to be detected and the second face image to be detected through the living body detection model, wherein the method comprises the following steps: carrying out convolution processing on the first face image to be detected through the first convolution network to obtain a fourth vector; carrying out convolution processing on the second face image to be detected through the second convolution network to obtain a fifth vector; convolving the fourth vector and the fifth vector through the residual error network to obtain a sixth vector; and performing coding mapping on the sixth vector through the classification network to obtain the face living body category corresponding to the target face.
Specifically, the specific implementation of performing convolution processing on the first face image to be detected through the first convolution network to obtain the fourth vector is the same as that of performing convolution processing on the first sample image of a training sample through the first convolution network in the training stage to obtain the first vector, and is not repeated here. The specific implementation of performing convolution processing on the second face image to be detected through the second convolution network to obtain the fifth vector is the same as that of performing convolution processing on the second sample image of a training sample through the second convolution network in the training stage to obtain the second vector, and is not repeated here. The specific implementation of performing convolution processing on the fourth vector and the fifth vector through the residual network to obtain the sixth vector is the same as that of performing convolution processing on the first vector and the second vector through the residual network in the training stage to obtain the third vector, and is not repeated here. The specific implementation of performing coding mapping on the sixth vector through the classification network to obtain the face living body category corresponding to the target face is the same as that of performing coding mapping on the third vector through the classification network in the training stage to obtain the face living body category predicted value corresponding to the training sample, and is not repeated here.
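A minimal prediction-stage sketch, reusing the MultiTaskModel sketch above and thresholding the live-class probability as described in step 150, is given below; the softmax readout and the threshold value are illustrative assumptions.

import torch

@torch.no_grad()
def detect_live_face(model, img_left, img_right, threshold=0.5):
    # Cropped living body detection model: only the two convolutional networks,
    # the residual network and the classification network are used.
    model.eval()
    f1, f2 = model.conv1(img_left), model.conv2(img_right)   # fourth / fifth vector
    fused = model.residual(torch.cat([f1, f2], dim=1))       # sixth vector
    logits = model.cls_head(fused.flatten(1))
    prob_live = torch.softmax(logits, dim=1)[:, 1]           # probability of the living face category
    return (prob_live > threshold).tolist()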
Step 150, determining whether the target face is a living face according to the classification mapping result.
The classification mapping result in the embodiment of the application comprises the probability that the input image is identified as different face living body categories. Further, when the probability of the input image being identified as the living face category is greater than a preset probability threshold, the target face can be determined to be the living face, otherwise, the target face can be determined to be the non-living face.
According to the face living body detection method disclosed by the embodiment of the application, the first face image and the second face image which are synchronously acquired by the first image acquisition device and the second image acquisition device aiming at the target face are acquired; respectively carrying out face positioning on the first face image and the second face image to obtain a corresponding face positioning result; then, cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image respectively; the first face image to be detected and the second face image to be detected which are obtained through clipping are input to a pre-trained living body detection model in parallel, and the target face is subjected to classified mapping according to plane characteristics and depth characteristics in the first face image to be detected and the second face image to be detected through the living body detection model; the living body detection model is a classification model trained based on face key point constraint and depth feature constraint of a training sample; and determining whether the target face is a living face according to the classification mapping result, so that the speed of face living detection is improved.
According to the face living body detection method disclosed in the embodiment of the application, in the training process of the living body detection model, the two face images acquired by the binocular image acquisition equipment are learned under the constraint of the face key points and of the depth information learning result, so that living body detection of the target face is performed based on the plane information and the depth information of the image pair of the target face acquired by the binocular image acquisition equipment, without generating a three-dimensional spatial point cloud; the computational complexity is low, the operation speed is fast, and the efficiency of face living body detection is high.
In the model training process, the depth information of the images is taken into account, so that plane non-living faces such as photos and videos can be accurately classified in the prediction stage, and attacks with plane images can be rapidly detected. Because face key point information is fully considered by the respective networks (for example, in the training of the first task network and the second task network in fig. 2, i.e., the training of the first convolution network and the second convolution network), the living body detection model can also detect attack faces such as a photo bent in a complex way at the nose, a three-dimensional head model, a simulation mask, or a mask worn by a person, further improving the accuracy of face living body detection.
Specifically, the first sample image and the second sample image (namely, the images acquired by the two image acquisition devices of the binocular image acquisition equipment) are respectively input to the multi-task model shown in fig. 2. First, the first convolution network of the first task network and the second convolution network of the second task network respectively learn the face features of the first sample image and the second sample image; a fully-connected layer is added behind each convolution network to regress the features, and the face key point constraint is added, so that the convolution network branches learn the face features in the input images. The features extracted by the two convolution network branches are combined, and the depth features and living body features are then learned through a plurality of residual modules of the residual network. Then two network branches are led out: one branch is connected to a fully-connected layer to regress the depth values of the feature points from the fused features, and the other branch adds a convolution network, stretches the convolution features into a one-dimensional vector, splices the depth values onto the one-dimensional vector, and obtains the living body classification result through two fully-connected layers. The living body detection model trained with this structure and method has face key point constraints and depth constraints corresponding to the key points, can learn face plane information (such as two-dimensional texture information) and depth information, and helps improve the accuracy and reliability of living body detection.
On the other hand, the living body detection model is obtained based on the multi-task model cutting, the four branch networks are synchronously trained in a training stage by combining plane information and depth information, only one branch network is used in a prediction stage, the network used in the prediction stage has a simple structure, and the operation efficiency is higher.
Example two
Corresponding to the method embodiment, another embodiment of the present application discloses a device for detecting a living body of a human face, as shown in fig. 4, the device includes:
a face image acquisition module 410, configured to acquire a first face image and a second face image that are synchronously acquired by the first image acquisition device and the second image acquisition device for a target face;
the face positioning module 420 is configured to perform face positioning on the first face image and the second face image, respectively, to obtain a corresponding face positioning result;
a face image clipping module 430, configured to clip a first face image to be detected from the first face image and clip a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image, respectively;
The image classification module 440 is configured to input the first to-be-detected face image and the second to-be-detected face image obtained by clipping in parallel to a pre-trained living body detection model, and perform classification mapping on the target face according to planar features and depth features in the first to-be-detected face image and the second to-be-detected face image through the living body detection model; the living body detection model is a classification model trained based on face key point constraint and depth feature constraint of a training sample;
the face living body detection result determining module 450 is configured to determine whether the target face is a living body face according to the classification mapping result.
In some embodiments of the present application, the living body detection model is obtained by clipping a preset multitasking model, and the multitasking model includes:
the system comprises a first task network, a second task network and a third task network, wherein the first task network is composed of a first convolution network and a first full-connection network and is used for learning face key point characteristics in an image input to the first convolution network;
the second task network is composed of a second convolution network and a second full-connection network and is used for learning the key point characteristics of the face in the image input to the second convolution network;
A third task network consisting of the first convolutional network, the second convolutional network, a residual network, and a depth regression network, the third task network for learning depth features in images input to the first convolutional network and the second convolutional network; the method comprises the steps of,
a fourth task network composed of the first convolutional network, the second convolutional network, the residual network, and a classification network, the fourth task network being for learning living and non-living information in images input to the first convolutional network and the second convolutional network;
obtaining network parameters of the living body detection model consisting of the first convolution network, the second convolution network, the residual error network and the classification network by training the multitasking model;
the first convolution network and the second convolution network are arranged in parallel, and the residual error network is connected with the outputs of the first convolution network and the second convolution network respectively.
In some embodiments of the application, the sample data for each training sample used to train the multitasking model comprises: a first sample image and a second sample image, the sample label of each of the training samples comprising: the first sample image and the second sample image each correspond to: a face key point true value, a depth value true value, and a face living body category true value;
The multitasking model is trained by the following method:
for each of the training samples in the sample set, performing the following code mapping operations respectively:
inputting a first sample image included in the training sample into the first convolution network of the multitasking model, and simultaneously inputting a second sample image included in the training sample into the second convolution network of the multitasking model;
performing operation processing on the first sample image through the first task network to obtain a face key point predicted value of the first sample image in the training sample; performing operation processing on the second sample image through the second task network to obtain a face key point predicted value of the second sample image in the training sample;
performing operation processing on the first sample image and the second sample image through the third task network to obtain a depth value predicted value of the training sample;
performing operation processing on the first sample image and the second sample image through the fourth task network to obtain a face living body category predicted value of the training sample;
according to each predicted value obtained by executing the coding mapping operation, determining predicted loss values of the first task network, the second task network, the third task network and the fourth task network;
Carrying out weighted summation on the predicted loss values of the first task network, the second task network, the third task network and the fourth task network, and determining a model predicted total loss value of the multi-task model;
and optimizing network parameters of the multi-task model, and jumping to execute the coding mapping operation until the model predictive total loss value converges to meet a preset condition.
In some embodiments of the present application, determining the predicted loss values of the first task network, the second task network, the third task network, and the fourth task network according to the predicted values obtained by executing the coding mapping operation includes (the loss terms are written out after this list):
determining the predicted loss value of the first task network according to the difference between the face key point predicted values and the face key point true values of the first sample images of all the training samples in the sample set;
determining the predicted loss value of the second task network according to the difference between the face key point predicted values and the face key point true values of the second sample images of all the training samples in the sample set;
determining the predicted loss value of the third task network according to the difference between the depth value predicted values and the depth value true values of all the training samples in the sample set;
and determining the predicted loss value of the fourth task network according to the difference between the face living body category predicted values and the face living body category true values of all the training samples in the sample set.
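Written out explicitly, the per-task loss values above can take the following form; using mean-squared error for the two key point tasks and the depth task, and cross-entropy for the living body classification task, is an assumption, since the application only requires difference-based losses:

\[
\begin{aligned}
L_{1} &= \frac{1}{N}\sum_{i=1}^{N}\bigl\lVert \hat{k}^{(1)}_{i}-k^{(1)}_{i}\bigr\rVert_{2}^{2}, \qquad
L_{2} = \frac{1}{N}\sum_{i=1}^{N}\bigl\lVert \hat{k}^{(2)}_{i}-k^{(2)}_{i}\bigr\rVert_{2}^{2}, \\
L_{3} &= \frac{1}{N}\sum_{i=1}^{N}\bigl\lVert \hat{d}_{i}-d_{i}\bigr\rVert_{2}^{2}, \qquad\;\;
L_{4} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c\in\{\mathrm{live},\,\mathrm{spoof}\}} y_{i,c}\log \hat{p}_{i,c},
\end{aligned}
\]

where N is the number of training samples in the sample set, \(\hat{k}^{(1)}_{i}\) and \(\hat{k}^{(2)}_{i}\) are the face key point predicted values of the first and second sample images, \(\hat{d}_{i}\) is the depth value predicted value, \(\hat{p}_{i,c}\) is the predicted face living body category probability, and the quantities without hats are the corresponding true values; the total predicted loss of the model is then the weighted summation \(L = w_{1}L_{1} + w_{2}L_{2} + w_{3}L_{3} + w_{4}L_{4}\) described above.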
In some embodiments of the present application, the step of performing operation processing on the first sample image through the first task network to obtain the face key point predicted value of the first sample image in the training sample, and performing operation processing on the second sample image through the second task network to obtain the face key point predicted value of the second sample image in the training sample, comprises:
performing convolution processing on the first sample image in the training sample through the first convolution network to obtain a first vector, and then carrying out coding mapping on the first vector through the first fully-connected network to obtain the face key point predicted value corresponding to the first sample image; and
performing convolution processing on the second sample image in the training sample through the second convolution network to obtain a second vector, and then carrying out coding mapping on the second vector through the second fully-connected network to obtain the face key point predicted value corresponding to the second sample image.
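A minimal sketch of one key point branch (a convolution network followed by a fully-connected network) is given below; the layer sizes, the three-channel input, and the number of key points are assumptions and are not specified by the application:

    import torch
    import torch.nn as nn

    class KeypointBranch(nn.Module):
        """One convolution network plus its fully-connected network, which code-maps
        the convolution features to face key point coordinates."""
        def __init__(self, num_keypoints=68):                    # 68 key points is an assumption
            super().__init__()
            self.conv = nn.Sequential(                           # the (first or second) convolution network
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((8, 8)),
            )
            self.fc = nn.Linear(64 * 8 * 8, num_keypoints * 2)   # the fully-connected network

        def forward(self, x):
            f = self.conv(x)                                     # convolution features ("first vector")
            keypoints = self.fc(torch.flatten(f, 1))             # face key point predicted value
            return f, keypoints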
In some embodiments of the present application, the depth value true value is determined according to the face key points in the first sample image and the second sample image in the sample data of the training sample, and the step of performing operation processing on the first sample image and the second sample image through the third task network to obtain the depth value predicted value of the training sample comprises:
convolving the first vector and the second vector through the residual network to obtain a third vector;
and carrying out coding mapping on the third vector through the depth regression network to obtain the depth value predicted values of the face key points corresponding to the training sample.
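The third task tail can be sketched as follows; the fusion of the two vectors by channel concatenation, the residual block design, and the per-key-point depth output are assumptions used to illustrate the described processing:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """One residual module of the residual network (internal design assumed)."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )

        def forward(self, x):
            return torch.relu(x + self.body(x))

    class DepthHead(nn.Module):
        """Residual network plus depth regression network: fuses the first and second
        vectors into a third vector and code-maps it to one depth value per key point."""
        def __init__(self, channels=128, num_keypoints=68):
            super().__init__()
            self.residual = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
            self.regress = nn.Linear(channels * 8 * 8, num_keypoints)  # depth regression network

        def forward(self, f1, f2):
            third = self.residual(torch.cat([f1, f2], dim=1))   # "third vector"
            depth = self.regress(torch.flatten(third, 1))       # depth value predicted values
            return third, depth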
In some embodiments of the present application, the step of performing operation processing on the first sample image and the second sample image through the fourth task network to obtain the face living body category predicted value of the training sample comprises:
convolving the first vector and the second vector through the residual network to obtain a third vector;
and carrying out coding mapping on the third vector through the classification network to obtain the face living body category predicted value corresponding to the training sample.
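Correspondingly, the classification network can be sketched as follows; the extra convolution layer, the two fully-connected layers, and the two-class output follow the structure described further below, while the layer sizes remain assumptions (the optional splicing of the depth values into the flattened vector is used in the assembled model sketch later on):

    import torch
    import torch.nn as nn

    class LivenessHead(nn.Module):
        """Classification network: code-maps the fused third vector (optionally spliced
        with the predicted depth values) to the living / non-living category."""
        def __init__(self, channels=128, depth_dim=0):
            super().__init__()
            self.conv = nn.Conv2d(channels, 64, 3, padding=1)    # convolution applied to the third vector
            self.fc = nn.Sequential(                             # two fully-connected layers
                nn.Linear(64 * 8 * 8 + depth_dim, 128), nn.ReLU(),
                nn.Linear(128, 2),                               # living / non-living logits
            )

        def forward(self, third, depth=None):
            v = torch.flatten(torch.relu(self.conv(third)), 1)   # stretch features into a 1-D vector
            if depth is not None:
                v = torch.cat([v, depth], dim=1)                 # splice in the depth values
            return self.fc(v)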
The embodiment of the application discloses a face living body detection device for implementing the face living body detection method of the first embodiment of the application. The specific implementation of each module of the device is not repeated here; reference may be made to the specific implementation of the corresponding steps in the method embodiment.
The face living body detection device disclosed by the embodiment of the application acquires a first face image and a second face image synchronously acquired by a first image acquisition device and a second image acquisition device for a target face; performs face positioning on the first face image and the second face image respectively to obtain the corresponding face positioning results; cuts a first face image to be detected from the first face image and a second face image to be detected from the second face image according to the respective face positioning results; inputs the cut first face image to be detected and second face image to be detected in parallel to a pre-trained living body detection model, which performs classification mapping on the target face according to the plane features and depth features in the two images (the living body detection model being a classification model trained based on the face key point constraint and depth feature constraint of the training samples); and determines whether the target face is a living face according to the classification mapping result, thereby improving the speed of face living body detection. A prediction-stage sketch is given below.
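For illustration only, the prediction stage described above can be sketched as follows; the face-positioning function, the preset expansion margin, the input resolution, and the decision threshold are assumptions introduced for this sketch:

    import cv2
    import torch

    def expand_and_crop(image, box, margin=0.2, size=(128, 128)):
        """Expand the face positioning frame by a preset size, then crop and resize."""
        x, y, w, h = box
        dx, dy = int(w * margin), int(h * margin)
        x0, y0 = max(x - dx, 0), max(y - dy, 0)
        x1, y1 = min(x + w + dx, image.shape[1]), min(y + h + dy, image.shape[0])
        return cv2.resize(image[y0:y1, x0:x1], size)

    def is_living_face(img1, img2, locate_face, model):
        """img1 / img2: first and second face images captured synchronously by the
        binocular image acquisition equipment; locate_face: hypothetical face
        positioning function returning a box (x, y, w, h); model: the trained
        living body detection model."""
        crop1 = expand_and_crop(img1, locate_face(img1))   # first face image to be detected
        crop2 = expand_and_crop(img2, locate_face(img2))   # second face image to be detected
        to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            logits = model(to_tensor(crop1), to_tensor(crop2))   # classification mapping
        return logits.softmax(dim=1)[0, 1].item() > 0.5          # index 1 assumed to be "living"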
According to the face living body detection device disclosed by the embodiment of the application, the two face images acquired by the binocular image acquisition equipment are used during training of the living body detection model, and the living and non-living features of the face are learned under the constraint of the face key points and of the depth information learning result. Living body detection of the target face can therefore be performed directly from the plane information and depth information of the image pair of the target face acquired by the binocular image acquisition equipment, without generating a three-dimensional space point cloud, so the computational complexity is low, the operation speed is high, and the face living body detection efficiency is high.
In the model training process, the depth information of the images is taken into account, so that planar non-living faces such as photos and videos can be accurately classified in the prediction stage, and attacks using planar images can be rapidly detected. Because the face key point information is fully considered by the respective networks (for example, in the training of the first task network and the second task network in fig. 2, i.e. of the first convolution network and the second convolution network), the living body detection model can also detect attack faces with complex three-dimensional structure, such as a held photo bent at the nose region, a three-dimensional head model, a simulation mask, or a mask worn by a person, further improving the accuracy of face living body detection.
Specifically, the first sample image and the second sample image (namely, the images captured by the two image acquisition devices of the binocular image acquisition equipment) are input into the multi-task model shown in fig. 2. First, the first convolution network of the first task network and the second convolution network of the second task network learn the face features of the first sample image and the second sample image respectively; a fully-connected layer is added after each convolution network to regress these features, and the face key point constraint is imposed so that each convolution network branch learns the face features in its input image. The features extracted by the two convolution network branches are then combined, and depth features and living body features are learned through several residual modules of the residual network. Two network branches are then led out: one branch is connected to a fully-connected layer that regresses the depth values of the feature points from the fused features; the other branch adds a convolution network, stretches the convolution features into a one-dimensional vector, splices the depth values onto this vector, and obtains the living body classification result through two fully-connected layers. The living body detection model trained with this structure and method carries the face key point constraint and the depth constraint corresponding to the key points, can learn face plane information (such as two-dimensional texture information) as well as depth information, and is therefore beneficial to improving the accuracy and reliability of living body detection. A sketch assembling these components is given below.
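The wiring described in this paragraph can be assembled from the sketches above as follows; the KeypointBranch, DepthHead, and LivenessHead classes are the ones sketched earlier (assumed to be in scope), and all layer sizes remain assumptions:

    import torch
    import torch.nn as nn

    class MultiTaskModel(nn.Module):
        """Four task networks sharing the two convolution branches and the residual trunk."""
        def __init__(self, num_keypoints=68):
            super().__init__()
            self.branch1 = KeypointBranch(num_keypoints)                # first task network
            self.branch2 = KeypointBranch(num_keypoints)                # second task network
            self.depth_head = DepthHead(128, num_keypoints)             # residual network + depth regression
            self.cls_head = LivenessHead(128, depth_dim=num_keypoints)  # classification network

        def forward(self, img1, img2):
            f1, kpt1 = self.branch1(img1)           # first vector + key points of the first image
            f2, kpt2 = self.branch2(img2)           # second vector + key points of the second image
            third, depth = self.depth_head(f1, f2)  # third vector + depth values of the key points
            logits = self.cls_head(third, depth)    # depth values spliced into the 1-D vector
            return kpt1, kpt2, depth, logits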
On the other hand, the living body detection model is obtained by cutting the trained multi-task model: the four branch networks are trained synchronously in the training stage by combining plane information and depth information, while only one branch network is used in the prediction stage, so the network used for prediction has a simple structure and a higher operation efficiency. A sketch of this cutting step follows.
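One way this cutting could be realized, following the sketches above, is shown below; the module names and the retention of the depth regression layer (which the classification tail in the sketch above consumes) are assumptions:

    import torch
    import torch.nn as nn

    class LivenessModel(nn.Module):
        """Prediction-stage network cut from the trained multi-task model: the key point
        fully-connected heads are dropped, and only the parts needed for living body
        classification are kept."""
        def __init__(self, trained):                 # trained: a MultiTaskModel instance
            super().__init__()
            self.conv1 = trained.branch1.conv        # first convolution network
            self.conv2 = trained.branch2.conv        # second convolution network
            self.depth_head = trained.depth_head     # residual network (+ depth regression layer)
            self.cls_head = trained.cls_head         # classification network

        def forward(self, img1, img2):
            third, depth = self.depth_head(self.conv1(img1), self.conv2(img2))
            return self.cls_head(third, depth)       # living / non-living logits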
Correspondingly, the application also discloses an electronic device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the face living body detection method according to the first embodiment of the application when executing the computer program. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, or the like.
The application also discloses a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the face living body detection method according to the first embodiment of the application.
In this specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the other embodiments, and the identical and similar parts of the embodiments may be referred to each other. The device embodiments are substantially similar to the method embodiments, so their description is relatively brief; for relevant details, reference may be made to the description of the method embodiments.
The face living body detection method and device provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the application, and the above description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application; in view of the above, the contents of this specification should not be construed as limiting the present application.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or by hardware. Based on this understanding, the essence of the foregoing technical solution, or the part thereof contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in certain parts of the embodiments.

Claims (10)

1. A face living body detection method, characterized by comprising:
acquiring a first face image and a second face image which are synchronously acquired by a first image acquisition device and a second image acquisition device aiming at a target face;
respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results, wherein the face positioning results comprise a face positioning frame;
cutting a first face image to be detected from the first face image and cutting a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image respectively, including: expanding the face positioning frame in the first face image by a preset size and cutting out the image region within the face positioning frame obtained after the expansion to obtain the first face image to be detected, and expanding the face positioning frame in the second face image by a preset size and cutting out the image region within the face positioning frame obtained after the expansion to obtain the second face image to be detected;
the first face image to be detected and the second face image to be detected which are obtained through clipping are input in parallel to a pre-trained living body detection model, and the target face is subjected to classification mapping according to the plane features and depth features in the first face image to be detected and the second face image to be detected through the living body detection model; the living body detection model is a classification model trained based on the face key point constraint and depth feature constraint of training samples, and consists of a first convolution network, a second convolution network, a residual network, and a classification network;
determining whether the target face is a living face according to the classification mapping result;
the step of inputting the cut first face image to be detected and the second face image to be detected into a pre-trained living body detection model in parallel, and performing classification mapping on the target face according to plane features and depth features in the first face image to be detected and the second face image to be detected through the living body detection model, wherein the step of performing classification mapping comprises the following steps:
the first convolution network and the second convolution network are arranged in parallel and are used for respectively carrying out convolution processing on the first face image to be detected and the second face image to be detected to obtain a fourth vector and a fifth vector;
and carrying out convolution processing and coding mapping on the fourth vector and the fifth vector through the residual network and the classification network to obtain the face living body category corresponding to the target face.
2. The method of claim 1, wherein the living body detection model is cut from a preset multi-task model, the multi-task model comprising:
a first task network composed of a first convolution network and a first fully-connected network, the first task network being used to learn face key point features in the image input to the first convolution network;
a second task network composed of a second convolution network and a second fully-connected network, the second task network being used to learn face key point features in the image input to the second convolution network;
a third task network composed of the first convolution network, the second convolution network, a residual network, and a depth regression network, the third task network being used to learn depth features in the images input to the first convolution network and the second convolution network; and
a fourth task network composed of the first convolution network, the second convolution network, the residual network, and a classification network, the fourth task network being used to learn living and non-living information in the images input to the first convolution network and the second convolution network;
wherein the network parameters of the living body detection model, which consists of the first convolution network, the second convolution network, the residual network, and the classification network, are obtained by training the multi-task model; and
the first convolution network and the second convolution network are arranged in parallel, and the residual network is connected to the outputs of the first convolution network and the second convolution network respectively.
3. The method of claim 2, wherein the sample data of each training sample used to train the multi-task model comprises a first sample image and a second sample image, and the sample label of each training sample comprises, for each of the first sample image and the second sample image, a face key point true value, a depth value true value, and a face living body category true value;
the multi-task model is trained by the following method:
for each training sample in the sample set, performing the following coding mapping operation:
inputting the first sample image included in the training sample into the first convolution network of the multi-task model, and simultaneously inputting the second sample image included in the training sample into the second convolution network of the multi-task model;
performing operation processing on the first sample image through the first task network to obtain a face key point predicted value of the first sample image in the training sample, and performing operation processing on the second sample image through the second task network to obtain a face key point predicted value of the second sample image in the training sample;
performing operation processing on the first sample image and the second sample image through the third task network to obtain a depth value predicted value of the training sample;
performing operation processing on the first sample image and the second sample image through the fourth task network to obtain a face living body category predicted value of the training sample;
determining the predicted loss values of the first task network, the second task network, the third task network, and the fourth task network according to the predicted values obtained by executing the coding mapping operation;
carrying out weighted summation of the predicted loss values of the first task network, the second task network, the third task network, and the fourth task network to determine the total predicted loss value of the multi-task model;
and optimizing the network parameters of the multi-task model and returning to execute the coding mapping operation until the total predicted loss value of the model converges to meet a preset condition.
4. The method according to claim 3, wherein determining the predicted loss values of the first task network, the second task network, the third task network, and the fourth task network according to the predicted values obtained by executing the coding mapping operation comprises:
determining the predicted loss value of the first task network according to the difference between the face key point predicted values and the face key point true values of the first sample images of all the training samples in the sample set;
determining the predicted loss value of the second task network according to the difference between the face key point predicted values and the face key point true values of the second sample images of all the training samples in the sample set;
determining the predicted loss value of the third task network according to the difference between the depth value predicted values and the depth value true values of all the training samples in the sample set;
and determining the predicted loss value of the fourth task network according to the difference between the face living body category predicted values and the face living body category true values of all the training samples in the sample set.
5. The method according to claim 3, wherein the step of performing operation processing on the first sample image through the first task network to obtain the face key point predicted value of the first sample image in the training sample, and performing operation processing on the second sample image through the second task network to obtain the face key point predicted value of the second sample image in the training sample, comprises:
performing convolution processing on the first sample image in the training sample through the first convolution network to obtain a first vector, and then carrying out coding mapping on the first vector through the first fully-connected network to obtain the face key point predicted value corresponding to the first sample image; and
performing convolution processing on the second sample image in the training sample through the second convolution network to obtain a second vector, and then carrying out coding mapping on the second vector through the second fully-connected network to obtain the face key point predicted value corresponding to the second sample image.
6. The method according to claim 5, wherein the depth value true value is determined according to the face key points in the first sample image and the second sample image in the sample data of the training sample, and the step of performing operation processing on the first sample image and the second sample image through the third task network to obtain the depth value predicted value of the training sample comprises:
convolving the first vector and the second vector through the residual network to obtain a third vector;
and carrying out coding mapping on the third vector through the depth regression network to obtain the depth value predicted values of the face key points corresponding to the training sample.
7. The method according to claim 5, wherein the step of performing operation processing on the first sample image and the second sample image through the fourth task network to obtain the face living body category predicted value of the training sample comprises:
convolving the first vector and the second vector through the residual network to obtain a third vector;
and carrying out coding mapping on the third vector through the classification network to obtain the face living body category predicted value corresponding to the training sample.
8. A face living body detection device, characterized by comprising:
the face image acquisition module is used for acquiring a first face image and a second face image which are synchronously acquired by the first image acquisition device and the second image acquisition device aiming at a target face;
the face positioning module is used for respectively carrying out face positioning on the first face image and the second face image to obtain corresponding face positioning results, wherein the face positioning results comprise a face positioning frame;
the face image clipping module is configured to clip a first face image to be detected from the first face image and clip a second face image to be detected from the second face image according to the face positioning results in the first face image and the second face image respectively, including: expanding the face positioning frame in the first face image by a preset size and cutting out the image region within the face positioning frame obtained after the expansion to obtain the first face image to be detected, and expanding the face positioning frame in the second face image by a preset size and cutting out the image region within the face positioning frame obtained after the expansion to obtain the second face image to be detected;
the image classification module is used for inputting the first face image to be detected and the second face image to be detected which are obtained through clipping in parallel to a pre-trained living body detection model, and performing classification mapping on the target face according to the plane features and depth features in the first face image to be detected and the second face image to be detected through the living body detection model; the living body detection model is a classification model trained based on the face key point constraint and depth feature constraint of training samples, and consists of a first convolution network, a second convolution network, a residual network, and a classification network;
and the face living body detection result determining module is used for determining whether the target face is a living face according to the classification mapping result;
wherein the image classification module performs the classification mapping on the target face by executing the following processing:
the first convolution network and the second convolution network are arranged in parallel and are used for respectively carrying out convolution processing on the first face image to be detected and the second face image to be detected to obtain a fourth vector and a fifth vector;
and carrying out convolution processing and coding mapping on the fourth vector and the fifth vector through the residual network and the classification network to obtain the face living body category corresponding to the target face.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the face living body detection method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the face living body detection method according to any one of claims 1 to 7.
CN202011063444.1A 2020-09-30 2020-09-30 Face living body detection method and device, electronic equipment and storage medium Active CN112200057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011063444.1A CN112200057B (en) 2020-09-30 2020-09-30 Face living body detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112200057A CN112200057A (en) 2021-01-08
CN112200057B (en) 2023-10-31

Family

ID=74012933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011063444.1A Active CN112200057B (en) 2020-09-30 2020-09-30 Face living body detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112200057B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052035B (en) * 2021-03-15 2024-07-12 上海商汤智能科技有限公司 Living body detection method, living body detection device, electronic equipment and storage medium
CN113052034B (en) * 2021-03-15 2024-07-16 上海商汤智能科技有限公司 Living body detection method based on binocular camera and related device
CN112926489A (en) * 2021-03-17 2021-06-08 北京市商汤科技开发有限公司 Living body detection method, living body detection device, living body detection equipment, living body detection medium, living body detection system and transportation means
CN113128429A (en) * 2021-04-24 2021-07-16 新疆爱华盈通信息技术有限公司 Stereo vision based living body detection method and related equipment
CN113128428B (en) * 2021-04-24 2023-04-07 新疆爱华盈通信息技术有限公司 Depth map prediction-based in vivo detection method and related equipment
CN114333078B (en) * 2021-12-01 2024-07-23 马上消费金融股份有限公司 Living body detection method, living body detection device, electronic equipment and storage medium
CN116844198B (en) * 2023-05-24 2024-03-19 北京优创新港科技股份有限公司 Method and system for detecting face attack


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764091A (en) * 2018-05-18 2018-11-06 北京市商汤科技开发有限公司 Biopsy method and device, electronic equipment and storage medium
WO2020125623A1 (en) * 2018-12-20 2020-06-25 上海瑾盛通信科技有限公司 Method and device for live body detection, storage medium, and electronic device
CN111444744A (en) * 2018-12-29 2020-07-24 北京市商汤科技开发有限公司 Living body detection method, living body detection device, and storage medium
CN110942032A (en) * 2019-11-27 2020-03-31 深圳市商汤科技有限公司 Living body detection method and device, and storage medium
CN111046845A (en) * 2019-12-25 2020-04-21 上海骏聿数码科技有限公司 Living body detection method, device and system
CN111680588A (en) * 2020-05-26 2020-09-18 广州多益网络股份有限公司 Human face gate living body detection method based on visible light and infrared light

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Spoofing detection in face recognition: A review; Manpreet Bagga et al.; 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom); entire document *
Face recognition based on three-dimensional face depth images reconstructed from two-dimensional texture; Li Rui et al.; Modern Computer (Professional Edition) (10); entire document *



Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant