WO2022246612A1 - Liveness detection method, training method of liveness detection model, and apparatus and *** thereof - Google Patents

Liveness detection method, training method of liveness detection model, and apparatus and *** thereof

Info

Publication number
WO2022246612A1
Authority
WO
WIPO (PCT)
Prior art keywords: neural network, convolutional neural, training, living body, data
Prior art date
Application number
PCT/CN2021/095597
Other languages: English (en), French (fr)
Inventor
赵亚西
徐文康
黄为
王振阳
科特瓦勒·科坦
马塞尔·塞巴斯蒂安
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN202180040878.6A
Priority to PCT/CN2021/095597
Publication of WO2022246612A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and more specifically, relate to a living body detection method, a training method of a living body detection model, and an apparatus and system thereof.
  • Existing liveness detection methods are suitable for open scenes, but these methods are easily broken by data that are not real faces (such as a pre-recorded video, 2D masks, 3D masks, etc.).
  • Embodiments of the present application provide a living body detection method, a living body detection model training method, an apparatus and a system thereof, which can improve the accuracy of living body detection.
  • In a first aspect, a living body detection method is provided, comprising: acquiring a face image, and inputting the face image into a target living body detection model to obtain a living body detection result, where the living body detection result indicates whether the person in the face image is a living body; the target living body detection model includes a first convolutional neural network, the first convolutional neural network includes a second convolutional neural network and a fully connected layer, the second convolutional neural network is used to obtain a category feature vector from the face image, and the fully connected layer is used to perform liveness discrimination based on the category feature vector to obtain the liveness detection result.
  • In the technical solution of the present application, the target liveness detection model used for liveness detection includes a neural network capable of obtaining category feature vectors and a fully connected layer capable of liveness discrimination, so it has both a strong ability to extract face features and the ability to discriminate living bodies, which can effectively improve the accuracy of the liveness detection result (a structural sketch is given below). It should be understood that the accuracy of face feature extraction directly affects the accuracy of the subsequent liveness discrimination, and the technical solution of the present application takes both capabilities into account, so the accuracy of liveness detection can be effectively improved.
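  • To make the structure described above concrete, the sketch below shows one possible way such a model could be assembled. It is a minimal illustration assuming a PyTorch-style implementation; the toy backbone, the feature dimension of 256, the single-channel 112×112 input, and all names are illustrative assumptions, not details taken from this application.

```python
# Minimal structural sketch (not the patent's actual implementation); the backbone
# architecture, feature dimension, and input size are illustrative assumptions.
import torch
import torch.nn as nn

class LivenessDetector(nn.Module):
    """First convolutional neural network: backbone + fully connected liveness head."""
    def __init__(self, backbone: nn.Module, feature_dim: int = 256):
        super().__init__()
        self.backbone = backbone              # second convolutional neural network (face features)
        self.fc = nn.Linear(feature_dim, 2)   # fully connected layer: live vs. non-live

    def forward(self, face_image: torch.Tensor) -> torch.Tensor:
        category_feature = self.backbone(face_image)   # category feature vector
        return self.fc(category_feature)               # liveness logits

# Example with a toy backbone standing in for a face-recognition CNN.
backbone = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 256),
)
model = LivenessDetector(backbone)
logits = model(torch.randn(1, 1, 112, 112))   # one single-channel (e.g. near-infrared) face image
```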
  • Face images may include real face images and non-real face images; they may be still images or video frames of real and non-real faces, and they may also be feature vectors obtained after feature extraction from real and non-real face images.
  • Non-real faces are divided into two-dimensional (2 dimension, 2D) and three-dimensional (3 dimension, 3D) categories.
  • the 2D category mainly includes printed photos of faces and photos or videos played back on screens such as tablets, while the 3D category mainly includes 3D masks and 3D head models.
  • the above-mentioned first convolutional neural network (i.e., the living body detection neural network below) includes two parts, the second convolutional neural network (i.e., the basic neural network below) and a fully connected layer, and the first convolutional neural network can be understood as being obtained by transforming the second convolutional neural network.
  • the training data sets of the first convolutional neural network and the second convolutional neural network may be different, and the training process may be completed in stages.
  • Optionally, the target living body detection model is obtained by using the first training data to update the parameters of the first convolutional neural network, and the second convolutional neural network is obtained by pre-training with the second training data, where the first training data includes real face data and non-real face data, and the second training data includes real face data.
  • The second convolutional neural network can be trained with the various publicly available, data-rich real face data sets (that is, the second training data can be data in a public real face data set), which allows the second convolutional neural network to be trained more fully and gives it a better ability to extract facial features.
  • Optionally, some non-real face data can also be mixed in without affecting the overall effect. It is even possible to directly use one of the publicly available convolutional neural networks for face recognition as the second convolutional neural network.
  • This also reduces the training cost: when training the first convolutional neural network (i.e., updating its parameters), the requirements for training data and training equipment are relatively low, because the second convolutional neural network has already been sufficiently trained and does not need a large amount of further training.
  • The first training data is the training data used to improve the ability to distinguish living bodies.
  • the training phase of the first convolutional neural network is equivalent to the process of fine-tuning the parameters, so that it has the ability of living body discrimination.
  • Optionally, the target living body detection model is obtained by using the first training data to update the parameters of the shallow layers of the first convolutional neural network and the parameters of the fully connected layer of the first convolutional neural network.
  • the calculation amount of the training process can be greatly reduced, and the training of the first convolutional neural network can be further simplified and the training cost can be reduced.
  • Optionally, the parameters of the middle layers of the first convolutional neural network remain unchanged. This is equivalent to freezing the parameters of the middle layers while updating the parameters of the first convolutional neural network, or in other words, the middle-layer parameters are not updated. This takes advantage of the domain independence of the middle layers. In this way, without reducing the training effect, the calculation amount of the training process can be greatly reduced, and the training of the first convolutional neural network can be further simplified and the training cost reduced (see the fine-tuning sketch below).
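  • The following sketch illustrates the staged fine-tuning idea described above, assuming the LivenessDetector sketch shown earlier. Which parameter names count as "shallow" layers versus frozen "middle" layers is an assumption made purely for illustration.

```python
# Hedged sketch of staged fine-tuning: keep shallow backbone layers and the fully
# connected head trainable, freeze the middle layers; names are illustrative only.
import torch

def freeze_middle_layers(model, shallow_prefixes=("backbone.0",)):
    """Leave shallow backbone layers and the fc head trainable; freeze everything else."""
    for name, param in model.named_parameters():
        is_shallow = any(name.startswith(p) for p in shallow_prefixes)
        is_fc_head = name.startswith("fc")
        param.requires_grad = is_shallow or is_fc_head   # middle-layer parameters stay frozen

def fine_tune_step(model, images, labels, optimizer, loss_fn=torch.nn.CrossEntropyLoss()):
    """One update step with the first training data (real and non-real face labels)."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()           # gradients flow only into the unfrozen parameters
    optimizer.step()
    return loss.item()
```

  • An optimizer would typically be built only over the parameters left trainable, e.g. torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3).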
  • the second convolutional neural network is a lightweight neural network for face recognition. This can further reduce computation and storage pressure. That is to use a neural network model with a simple structure, fewer parameters, and less storage space, which is conducive to deployment in application scenarios with limited computing and storage capabilities, such as vehicle scenarios.
  • the human face image includes images under multiple lighting scenarios.
  • For example, the lighting scenarios may include: sunny outdoors, cloudy or overcast outdoors, dim indoor light, and bright indoor light.
  • the target living body detection model of the present application can perform well on human face images under different lighting scenarios. This can be obtained by enriching the training data of the target detection model, that is, the target detection model is trained on the training data under such different lighting scenarios, so that it has the ability to detect face images under different lighting scenarios.
  • the aforementioned face image is captured by one or more cameras installed in the vehicle.
  • the acquired face images are clear enough and less affected by the background.
  • the target liveness detection model of this application can still have a good performance on the face image of the scene in the car. This can be obtained by enriching the training data of the target detection model, that is, the target detection model is trained with the training data in this kind of in-vehicle scene, so that it has the ability to detect face images in the in-vehicle scene.
  • Optionally, when there are multiple cameras, the multiple cameras are set at different positions of the vehicle to obtain face images from different angles and/or different distances.
  • the target living body detection model of the present application can perform well on face images from different angles and/or different distances.
  • the advantage of this is that the detected person does not need to act in order to cooperate with a camera at a certain position, and can still complete face image collection for liveness detection. For example, if there is only a camera on the control panel, the driver needs to turn right and lower his head to allow the camera to collect face images. The interaction process is not friendly enough and may cause interference to the detected person.
  • the aforementioned camera is a near-infrared camera.
  • the face image may be acquired by using a camera, a video camera, or the like.
  • Common cameras can be used to acquire face images, such as RGB cameras and near-infrared cameras.
  • The RGB camera is greatly affected by light, and a face image obtained with an RGB camera needs to be converted into a grayscale image first; subsequent liveness detection is then performed on the grayscale image.
  • The near-infrared camera is less affected by light and has a wider range of applications, and screen-type images (including photos displayed on a screen or videos being played) cannot be imaged by a near-infrared camera, because screen-type images cannot be imaged within the near-infrared wavelength band; using a near-infrared camera is therefore equivalent to filtering out screen-type non-real face images. In other words, the near-infrared camera has the advantages of being less disturbed by light and of shielding screen-based unreal faces, so if a near-infrared camera is used, screen-type attacks are automatically filtered out (a simple preprocessing sketch is given below).
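  • As a simple illustration of the preprocessing mentioned above, the sketch below converts an RGB (BGR) frame to grayscale before liveness detection, while leaving a single-channel near-infrared image unchanged. The function name and the use of OpenCV are assumptions for the example only.

```python
# Illustrative preprocessing only: RGB camera frames are converted to grayscale,
# a near-infrared image is already single-channel and is returned as-is.
import cv2
import numpy as np

def prepare_face_image(image: np.ndarray) -> np.ndarray:
    if image.ndim == 3 and image.shape[2] == 3:          # 3-channel RGB/BGR frame
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # grayscale for downstream detection
    return image
```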
  • the above method further includes: sending a face image, and the face image is used to train a target living body detection model. That is to say, the above-mentioned face image is used for the training of the target detection model, that is, the parameters of the target living body detection model are updated. In this way, the purpose of online update can be realized, and the accuracy of the target living body detection model can be further improved.
  • Optionally, the above method further includes: when the living body detection result indicates that the person in the face image is a living body, executing a target task decision, where the target task includes at least one of the following: unlocking, account login, authority permission, or payment confirmation.
  • Taking unlocking as an example, when the living body detection result is a living body, it is further judged whether the person has the unlocking authority; if the person has the unlocking authority, unlocking is performed, and if not, unlocking is not performed.
  • the decision-making task can also be performed first, and then the living body detection is performed, that is, firstly, it is judged whether the person in the face image has the task authorization, and when the authorization is granted, it is further judged whether the face image is a living body.
  • In a second aspect, a training method for a living body detection model is provided, including: obtaining first training data, the first training data including real face data and non-real face data; and updating the parameters of the first convolutional neural network according to the first training data to obtain the target living body detection model.
  • The first convolutional neural network includes the second convolutional neural network and a fully connected layer; the second convolutional neural network is used to obtain a category feature vector of the face from the training data, and the fully connected layer is used for liveness discrimination based on the category feature vector.
  • the training method of the technical solution of the present application has the advantages of relatively simple training, relatively small demand for training data, and relatively higher accuracy of the model obtained through training.
  • the data of real faces in the live detection data is relatively easy to obtain, and the types are diverse and sufficient in quantity, while the data of non-real faces such as 2D/3D mask data are difficult to obtain, and the types are limited and the quantity is limited.
  • This unbalanced distribution of training data is not conducive to training the binary classifier, and the insufficient quantity also causes existing liveness detection to fail to achieve the desired accuracy.
  • the living body detection in the prior art only pays attention to the ability of living body discrimination but ignores the ability of face recognition (that is, the ability to extract face features).
  • In addition, the equipment for training the model (referred to as the training equipment) needs to store a large amount of training data and model parameters and to perform the training computations.
  • The living body detection model in the prior art is large in scale, complex to train, and computationally heavy. It places high demands on the storage and computing capabilities of the training equipment and on the storage and inference capabilities of the inference equipment, which makes it unsuitable for scenarios with weak computing and/or storage capabilities; for example, it is difficult for in-vehicle devices to undertake complex calculations and large-scale model storage, so existing liveness detection models are not suitable for in-vehicle scenarios.
  • the second convolutional neural network is pre-trained using second training data, and the second training data includes real face data.
  • the accuracy of the model can be further improved, thereby improving the accuracy of the living body detection, and at the same time, the training process of the target living body detection model can be simplified and the training cost of the target detection model can be reduced.
  • Optionally, the parameters of the shallow layers of the first convolutional neural network and the parameters of the fully connected layer of the first convolutional neural network are updated. In this way, without reducing the training effect, the calculation amount of the training process can be greatly reduced, and the training of the first convolutional neural network can be further simplified and the training cost reduced.
  • the parameters of the middle layer of the first convolutional neural network remain unchanged. It is equivalent to freezing the parameters of the middle layer during the process of updating the parameters of the first convolutional neural network, or understanding that the parameters of the middle layer are not updated. This is to take advantage of the domain independence of the middle layer. In this way, under the premise of not reducing the training effect, the calculation amount of the training process can be greatly reduced, and the training of the first convolutional neural network can be further simplified and the training cost can be reduced.
  • the second convolutional neural network is a lightweight neural network for face recognition. This can further reduce computation and storage pressure. That is to use a neural network model with a simple structure, fewer parameters, and less storage space, which is conducive to deployment in application scenarios with limited computing and storage capabilities, such as vehicle scenarios.
  • the first training data includes data under multiple lighting scenarios.
  • For example, the lighting scenarios may include: sunny outdoors, cloudy or overcast outdoors, dim indoor light, and bright indoor light. This increases the richness of the training data, thereby improving the training effect and making it possible to train a living body detection model applicable to more complex lighting scenarios.
  • the first training data is captured by one or more cameras arranged in the vehicle.
  • the liveness detection model trained in this way can have a good performance in the car scene.
  • the multiple cameras are set at different positions of the vehicle to obtain the first training at different angles and/or different distances. data. This can increase the richness of the training data, thereby improving the training effect, and enabling the training of a living body detection model that can be applied to more complex lighting scenarios.
  • the aforementioned camera is a near-infrared camera.
  • In this case, screen-type non-real face data is not required as training data; that is, the collection of and training on screen-type non-real face data can be omitted.
  • The target detection model then does not need the ability to detect screen-type non-real face data, and only the other training data is needed to train the target detection model.
  • Therefore, the collection, processing, and training of screen data can be omitted, effectively reducing training costs.
  • In a third aspect, a living body detection device is provided, which includes a unit for performing the method in any one implementation manner of the first aspect above.
  • a training device for a living body detection model includes a unit for executing the training method in any one of the implementation manners of the second aspect above.
  • a living body detection device which includes: a memory for storing programs; a processor for executing the programs stored in the memory, and when the programs stored in the memory are executed, the processor uses To execute the method in any one of the implementation manners in the first aspect.
  • the device can be installed in various equipment or systems that require liveness detection, such as vehicle terminals, smart screens, and access control systems.
  • the device can also be a chip.
  • a training device for a living body detection model includes: a memory for storing programs; a processor for executing the programs stored in the memory, and when the programs stored in the memory are executed, The processor is configured to execute the training method in any one implementation manner in the second aspect.
  • the training device can be a host, a computer, a server, a cloud device and the like that can perform model training.
  • the training device can also be a chip.
  • A computer-readable medium is provided, which stores program code for execution by a device, and the program code includes instructions for executing the method in any one of the implementation manners of the first aspect or the second aspect.
  • a computer program product containing instructions is provided, and when the computer program product is run on a computer, the computer is made to execute the method in any one of the above-mentioned first aspect or the second aspect.
  • The chip includes a processor and a data interface, and the processor reads, through the data interface, the instructions stored in the memory and executes the method in any one of the implementation manners of the above-mentioned first aspect or second aspect.
  • the chip may further include a memory, the memory stores instructions, the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the The processor is configured to execute the method in any one of the implementation manners in the first aspect.
  • Fig. 1 is a schematic diagram of an artificial intelligence subject framework according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application scenario of a living body detection solution.
  • FIG. 3 is a schematic diagram of a system architecture of an embodiment of the present application.
  • Figure 4 is a schematic diagram of the structure of a convolutional neural network.
  • Figure 5 is a schematic diagram of the structure of another convolutional neural network.
  • FIG. 6 is a schematic diagram of a hardware structure of a chip according to an embodiment of the present application.
  • Fig. 7 is a schematic structural diagram of the basic neural network of the embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a neural network for live detection according to an embodiment of the present application.
  • Fig. 9 is a schematic flowchart of a living body detection method according to an embodiment of the present application.
  • FIG. 10 is a schematic layout diagram of an in-vehicle camera according to an embodiment of the present application.
  • FIG. 11 is a schematic layout diagram of another in-vehicle camera according to an embodiment of the present application.
  • Fig. 12 is a schematic flowchart of a training method for a living body detection model according to an embodiment of the present application.
  • Fig. 13 is a schematic block diagram of a living body detection device according to an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a hardware structure of a living body detection device provided by an embodiment of the present application.
  • Fig. 15 is a schematic block diagram of a training device for a living body detection network according to an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a hardware structure of a training device for a living body detection network provided by an embodiment of the present application.
  • the embodiment of the present application involves a neural network.
  • the relevant terms and concepts of the neural network are firstly introduced below.
  • Neural network (neural network, NN)
  • A neural network may be composed of neural units, and a neural unit may refer to an operation unit that takes x_s and an intercept of 1 as inputs; the output of the operation unit may be as shown in formula (1), for example.
  • Here, W_s is the weight of x_s, which can also be called a parameter or coefficient of the neural network, x_s is an input of the neural network, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to perform a nonlinear transformation on the features in the neural network, thereby converting the input signal of the neural unit into an output signal.
  • The output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
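  • The formula itself is not reproduced in this text. A plausible reconstruction of formula (1) from the definitions above (weights W_s, inputs x_s, bias b, activation f) is:

```latex
% Reconstruction of formula (1) from the surrounding definitions (not copied from the original figure)
h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right),
\qquad \text{e.g. } f(z) = \frac{1}{1 + e^{-z}} \ \text{(sigmoid)}
```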
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multilayer neural network, can be understood as a neural network with multiple hidden layers.
  • DNN is divided according to the position of different layers, and the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • The layers are fully connected, that is, every neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
  • Although a DNN looks very complicated, the work of each layer is actually not complicated.
  • Simply put, each layer implements the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is an offset (bias) vector, $W$ is a weight, which can also be called a coefficient or parameter and can take the form of a weight matrix, and $\alpha()$ is an activation function.
  • Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of weight matrices $W$ and offset vectors $\vec{b}$ is also large.
  • The weights of each layer in a DNN (for convenience of description, called coefficients) are defined as follows. Taking the coefficient $W$ as an example, suppose that in a three-layer DNN, the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as $W^{3}_{24}$.
  • Here, the superscript 3 represents the layer number where the coefficient $W$ is located, and the subscript corresponds to the output index 2 of the third layer and the input index 4 of the second layer.
  • In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W^{L}_{jk}$.
  • It should be noted that the input layer has no weights $W$.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more weights has a higher complexity and a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is also the process of learning weights, and its ultimate goal is to obtain the weights of all layers of the trained deep neural network (for example, including the weight matrix formed by the coefficients W of multiple layers).
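  • The toy example below illustrates the per-layer relationship described above, applying y = α(Wx + b) layer by layer. The layer sizes, random weights, and sigmoid activation are arbitrary choices for the illustration and are not taken from the application.

```python
# Toy forward pass of a small DNN: each layer computes alpha(Wx + b).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                               # input vector (input layer has no weights)
layers = [(rng.standard_normal((8, 4)), np.zeros(8)),    # hidden layer 1: W (8x4), b
          (rng.standard_normal((8, 8)), np.zeros(8)),    # hidden layer 2
          (rng.standard_normal((2, 8)), np.zeros(2))]    # output layer: 2 outputs
for W, b in layers:
    x = sigmoid(W @ x + b)                               # each layer applies alpha(Wx + b)
print(x)                                                 # final output vector
```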
  • a convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • In the convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons in the adjacent layer.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as a way to extract image information that is independent of location.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the fully connected layer constitutes a binary classification classifier, which is used to distinguish whether the face in the image is a real person or a non-real person, that is, to determine whether the image is a living image.
  • This classifier is used to classify objects in the image.
  • the classifier can include a fully connected layer and a softmax function (which can be called a normalized exponential function), which can output probabilities of different categories according to the input.
  • The loss function (or objective function) is an important equation for measuring the difference between the predicted value and the target value.
  • Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference; training the deep neural network then becomes a process of reducing this loss as much as possible.
  • the neural network can use the error back propagation algorithm to correct the value of the weight in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the forward transmission of the input signal until the output will generate an error loss, and the weights in the initial neural network model are updated by backpropagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal weight of the neural network model, such as the weight matrix.
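  • The sketch below ties together the classifier (fully connected layer plus softmax), the loss function, and backpropagation described above in a single training step. It assumes the earlier LivenessDetector sketch and PyTorch; the tensors and names are placeholders, not the application's actual training code.

```python
# One illustrative training step: forward pass, softmax probabilities,
# cross-entropy loss, backpropagation, and a weight update.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, labels):
    logits = model(images)                       # output of the fully connected layer
    probs = F.softmax(logits, dim=1)             # per-class probabilities (live / non-live)
    loss = F.cross_entropy(logits, labels)       # measures predicted vs. target difference
    optimizer.zero_grad()
    loss.backward()                              # error backpropagation
    optimizer.step()                             # weight update to reduce the loss
    return loss.item(), probs
```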
  • In existing solutions, the liveness detection model is often trained using liveness detection data (including real face data and non-real face data), but the traditional scheme only focuses on distinguishing whether a face is live and ignores the face recognition ability (i.e., the ability to extract face features); as a result, the liveness detection model's ability to extract face features is poor, and it is difficult to achieve high accuracy in liveness discrimination based on such poor face features.
  • the embodiment of the present application proposes a living body detection scheme.
  • In this scheme, the target living body detection model used for living body detection includes a neural network capable of obtaining category feature vectors and a fully connected layer capable of living body discrimination, so it has both a strong ability to extract face features and the ability to discriminate living bodies, thereby effectively improving the accuracy of living body detection results.
  • the solutions of the embodiments of the present application can be applied to various usage scenarios of liveness detection such as screen unlocking, device unlocking, account login, authority permission (such as access permission), and secure payment.
  • FIG. 1 is a schematic diagram of an artificial intelligence main framework according to an embodiment of the present application.
  • the main framework describes the overall workflow of an artificial intelligence system and is applicable to general artificial intelligence field requirements.
  • Intelligent information chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensed process of "data-information-knowledge-wisdom".
  • IT value chain reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of artificial intelligence, information (provided and processed by technology) to the systematic industrial ecological process.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • the infrastructure can communicate with the outside through sensors, and the computing power of the infrastructure can be provided by smart chips.
  • The smart chip here can be a central processing unit (central processing unit, CPU), a neural network processor (neural-network processing unit, NPU), a graphics processing unit (graphics processing unit, GPU), an application specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA), or another hardware acceleration chip.
  • the basic platform of infrastructure can include related platform guarantees and supports such as distributed computing framework and network, and can include cloud storage and computing, interconnection and interworking network, etc.
  • data can be obtained through sensors and external communication, and then these data can be provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves at least one of graphics, images, voice, text and other information.
  • This data is different in different application areas and can have different representations.
  • the content of the data is related to a specific connected terminal of the Internet of Things, for example, may include sensory data such as force, displacement, liquid level, temperature, or humidity.
  • the data is, for example, living body data, which includes real face data and non-real face data.
  • These data can be in the form of images or graphics, or in the form of feature vectors or matrices.
  • the above data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other processing methods.
  • machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image processing identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. It is the packaging of the overall solution of artificial intelligence, which commercializes intelligent information decision-making and realizes landing applications. Its application fields mainly include: intelligent manufacturing, intelligent transportation, Smart home, smart medical care, smart security, automatic driving, safe city, or smart terminals, etc.
  • The embodiments of the present application can be applied in many fields of artificial intelligence, for example, intelligent manufacturing, intelligent transportation, smart home, intelligent medical care, intelligent security, automatic driving, or safe city, and specifically in the branches of these fields that require liveness detection.
  • For example, in the field of intelligent security, only when liveness detection confirms that a person is a real person can a person with access authority be further allowed access, so as to prevent someone from breaking into the security system using tools such as fingerprint gloves and masks.
  • In the field of autonomous driving, only a person confirmed by liveness detection to be a real person and having the corresponding access rights is further allowed to log in to and enable vehicle-mounted devices, so as to prevent vehicles from being stolen or used by means of non-real faces.
  • In the unlocking scenario shown in Figure 2, the input face image can be classified into one of two categories (living body, non-living body), that is, it is judged whether the person in the face image is a real person (i.e., whether the category is living body); unlocking is allowed when the category is living body and not allowed otherwise. That is to say, when a face image is input to the living body detection model, the input image is classified into one of the above two categories (the detection result is living or non-living), and the unlocking decision-making module then determines whether to allow unlocking.
  • the aforementioned unlocking may be screen unlocking, access control unlocking, device unlocking, vehicle unlocking, and the like.
  • In Figure 2, A and B are images of real faces; C is an image of a 2D printed photo and is therefore a non-real face image; and D is an image of a 3D head model, which is also a non-real face image.
  • For a face image judged to be a living body, the unlocking decision-making module further determines whether the person has the unlocking authority; if so, unlocking is performed, and if not, unlocking is not performed.
  • For face images judged to be non-living, the unlocking decision-making module directly determines that they do not have the unlocking authority and does not unlock. That is to say, in Figure 2, the liveness detection link improves the security of the unlocking task and effectively prevents someone from stealing the unlocking authority by using non-real face data of a person who has that authority.
  • For other permission scenarios, the unlocking decision-making module in Figure 2 can be directly replaced with a permission decision-making module; that is, after judging whether the person in the face image is a living body (i.e., whether the person is a real person), the permission decision-making module decides whether to grant permission. The liveness detection link thus improves the security of permission-granting tasks and effectively prevents someone from fraudulently obtaining permission through non-real face data of a person who has that permission. For example, in the in-vehicle payment scenario, when the user initiates a payment request, the in-vehicle system can collect images of the user's face area through the camera and perform liveness detection and face detection to determine whether the current user has payment authority. A schematic of this two-step decision is sketched below.
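  • The two-step decision described above can be summarized schematically as follows; the function and the permission lookup are placeholders invented for illustration, not an interface defined by this application.

```python
# Schematic of the decision flow in Figure 2: liveness check first, then authority check.
def unlock_decision(liveness_result: str, person_id: str, authorized_ids: set) -> bool:
    if liveness_result != "living":       # non-living faces are rejected outright
        return False
    return person_id in authorized_ids    # only living faces proceed to the authority check
```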
  • FIG. 3 is a schematic diagram of a system architecture of an embodiment of the present application, which can be used to train neural network models, such as face recognition models and living body detection models.
  • the data collection device 160 is used to collect training data.
  • the training data may include training images and classification results corresponding to the training images, wherein the results of the training images may be manually pre-labeled results.
  • the training images include images of real human faces and images of non-real human faces.
  • the training images include images of real human faces.
  • After collecting the training data, the data collection device 160 stores the training data in the database 130, and the training device 120 obtains the target model/rule 101 based on the training data maintained in the database 130.
  • “A/B” describes the association relationship of associated objects, indicating that there may be three types of relationships. For example, A/B may indicate: A exists alone, A and B exist simultaneously, and B exists independently.
  • the training device 120 processes the input original image, and compares the output image with the original image until the difference between the output image of the training device 120 and the original image is less than a certain threshold, thereby completing the target model/rule 101 training.
  • In this way, the face recognition model of the embodiment of the present application, that is, the trained basic neural network (i.e., the second convolutional neural network), can be obtained, so that the living body detection model can subsequently be obtained by further training the basic neural network.
  • The training device 120 processes the category feature vector of the input face image and compares the output category with the label category until the accuracy of the category output by the training device 120 is greater than or equal to a certain threshold, thereby completing the training of the target model/rule 101.
  • the living body detection model of the embodiment of the present application can be trained, that is, the living body detection model can be obtained by further training on the basis of the above-mentioned basic neural network.
  • the above target model/rule 101 can be used to implement the method of the embodiment of the present application.
  • the target model/rule 101 in the embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 130 may not all be collected by the data collection device 160, but may also be received from other devices.
  • The training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.
  • The target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 3, which may be a terminal device such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and may also be a server, a cloud device, or the like.
  • The execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and the user can input data to the I/O interface 112 through the client device 140; the input data in this embodiment of the application may include a face image input by the client device.
  • the preprocessing module 113 and the preprocessing module 114 are used to perform preprocessing according to the input data (such as face images) received by the I/O interface 112.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call data, code, etc. in the data storage system 150 for the corresponding processing, and the data and instructions produced by such processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result to the client device 140, thereby providing it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete above tasks, thereby providing the desired result to the user.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send input data to the I/O interface 112 . If the client device 140 is required to automatically send the input data, the authorization of the user can be obtained in advance, and the user can set corresponding permissions in the client device 140 .
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be specific ways such as display, sound, or action.
  • The client device 140 can also serve as a data collection terminal, collecting the input data to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure as new sample data and storing them in the database 130.
  • Of course, the client device 140 may not be used for collection; instead, the I/O interface 112 may directly store the input data to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure as new sample data in the database 130.
  • FIG. 3 is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in FIG. 3 the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the target model/rule 101 is obtained according to the training device 120.
  • the target model/rule 101 may be a neural network obtained by using the method of the embodiment of the present application.
  • the neural network of the embodiment of the present application may be CNN that can be used for live detection, or deep convolutional neural networks (DCNN) and so on.
  • Since CNN is a very common neural network and is the neural network that the embodiments of the present application focus on, the structure of a CNN is introduced in detail below with reference to FIG. 4.
  • A convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture, in which learning is performed at multiple levels of abstraction.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images input into it.
  • the structure of the neural network specifically adopted by the basic neural network in the living body detection method of the embodiment of the present application may be shown in FIG. 4 .
  • FIG. 4 is a schematic diagram of the structure of a convolutional neural network.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a layer 220 (layer 220 may include a convolutional layer and a pooling layer, or, layer 220 may include a convolutional layer without a pooling layer) , and a fully connected layer 230 .
  • the input layer 210 can obtain the face image to be processed, and hand over the acquired face image to be processed by the layer 220 and the following fully connected layer 230 for processing, and the processing result of the image can be obtained.
  • the internal layer structure of CNN 200 in Fig. 4 will be described in detail below.
  • The layer 220 may include, as an example, layers 221-226. In one example, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer. In another example, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the number and positions of convolutional layers and pooling layers are only examples here, and there may be more or fewer convolutional layers and pooling layers, and no pooling layers may be included.
  • the convolution layer 221 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter for extracting specific information from the input image matrix.
  • A convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • The weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, where the depth can be understood as being determined by the number of weight matrices described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image.
  • The multiple weight matrices have the same size (rows × columns), and the convolutional feature maps extracted by these weight matrices of the same size also have the same size; the extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions .
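  • As a small illustration of the point above (under the assumption of a PyTorch-style convolution), each of several same-sized kernels produces one feature map, and stacking them forms the depth of the convolution output; the channel counts and kernel size are arbitrary example values.

```python
# Several kernels of the same size each produce one feature map; stacking the
# feature maps forms the depth dimension of the convolution output.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 32, 32)      # one 3-channel input image
y = conv(x)
print(y.shape)                     # torch.Size([1, 8, 32, 32]): 8 kernels -> output depth 8
```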
  • the initial convolutional layer (such as 221) often extracts more general features, which can also be referred to as low-level features;
  • the features extracted by the later convolutional layers (such as 226) become more and more complex, such as features such as high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
  • a pooling layer can be periodically introduced after the convolutional layer.
  • That is, a convolutional layer can be followed by a pooling layer, or multiple convolutional layers can be followed by one or more pooling layers.
  • the purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the average value of the pixel values in the image within a specific range as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. Also, just like the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
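  • A small example of the pooling behaviour described above, assuming a PyTorch-style max-pooling operator: a 2×2 max pool halves the spatial size, and each output pixel is the maximum of the corresponding input sub-region.

```python
# 2x2 max pooling on a 4x4 "image": output is 2x2, each value the max of one 2x2 block.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.arange(16.0).reshape(1, 1, 4, 4)   # 4x4 single-channel input
print(pool(x))                               # 2x2 output of block-wise maxima
```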
  • After the processing by the layer 220, the convolutional neural network 200 is not yet able to output the required output information. In order to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 further utilizes the fully connected layer 230 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 4) and an output layer 240, and the parameters contained in these hidden layers may be obtained by pre-training based on training data related to the specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • the output layer 240 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error.
  • Backpropagation (as shown in Fig. 4, propagation in the direction from 240 to 210 is backpropagation) then starts to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • a convolutional neural network (CNN) 300 may include an input layer 310 , a layer 320 (the layer 320 may include a convolutional layer and a pooling layer, wherein the pooling layer is optional), and a fully connected layer 330 .
  • Compared with FIG. 4, the multiple convolutional layers or pooling layers in the layer 320 in FIG. 5 are arranged in parallel, and the features extracted by each of them are input to the fully connected layer 330 for processing.
  • The convolutional neural networks shown in Figure 4 and Figure 5 are only two examples of possible convolutional neural networks for the basic neural network of the living body detection method of the embodiment of the present application.
  • the convolutional neural network used in the basic neural network of the living body detection method in the embodiment of the present application may also exist in the form of other network models.
  • In the embodiment of the present application, the living body detection model can be a CNN (the trained first convolutional neural network), which is obtained simply by adding one or more fully connected layers for binary classification to the basic neural network that can be used for face recognition (the basic neural network also has a CNN structure); it can therefore be seen as adding one or more binary-classification fully connected layers after the output layer of the structure shown in Figure 4 or Figure 5. Accordingly, the processing result of the face image to be processed that is output by the output layer in FIG. 4 and FIG. 5 can be called a category feature vector, that is, a face feature vector that can be used for classification.
  • In addition, the above-mentioned living body detection neural network (i.e., the first convolutional neural network obtained on the basis of the basic neural network) still has a CNN structure, so the living body detection model of the embodiment of the present application (the CNN used for liveness detection) can also be represented by the architectures shown in Figure 4 and Figure 5; in that case, however, the processing result of the face image to be processed is the classification result of whether it is a living body.
  • FIG. 6 is a schematic diagram of a hardware structure of a chip according to an embodiment of the present application.
  • The chip includes a neural network processing unit (the NPU 600 shown in the figure).
  • the chip can be set in the execution device 110 shown in FIG. 3 to complete the computing work of the computing module 111 .
  • the chip can also be set in the training device 120 shown in FIG. 3 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 4 and Figure 5 can be implemented in the chip shown in Figure 6 .
  • the NPU600 is mounted on the main central processing unit (CPU) (host CPU) as a coprocessor, and the main CPU assigns tasks.
  • The core part of the NPU is the operation circuit 603; the controller 604 controls the operation circuit 603 to fetch data from the memory (weight memory or input memory) and perform operations.
  • the operation circuit 603 includes multiple processing units (process engine, PE).
  • arithmetic circuit 603 is a two-dimensional systolic array.
  • the arithmetic circuit 603 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 603 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 602, and caches it in each PE in the operation circuit.
  • the operation circuit fetches the data of matrix A from the input memory 601 and performs matrix operation with matrix B, and the obtained partial or final results of the matrix are stored in an accumulator (accumulator) 608 .
  • the vector calculation unit 607 can perform further processing on the output of the calculation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 607 can be used for network calculations of non-convolution/non-FC layers in neural networks, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), etc. .
• the vector computation unit 607 can store the vector of processed outputs to the unified buffer 606 .
  • the vector calculation unit 607 may apply a non-linear function to the output of the operation circuit 603, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 607 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as an activation input to arithmetic circuitry 603, for example for use in subsequent layers in a neural network.
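Only to illustrate the division of labor described above (this is not the NPU's actual dataflow), a minimal NumPy sketch: a matrix product and its accumulation stand in for the operation circuit, and a non-linear activation stands in for the vector calculation unit; array sizes are arbitrary.

```python
import numpy as np

A = np.random.rand(4, 8)    # input data (conceptually from input memory 601)
B = np.random.rand(8, 16)   # weight data (conceptually from weight memory 602)

partial = A @ B             # matrix operation performed by the operation circuit
acc = np.zeros_like(partial)
acc += partial              # partial results accumulated (conceptually accumulator 608)

activated = np.maximum(acc, 0.0)  # non-linear function applied by the vector calculation unit
```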
  • the unified memory 606 is used to store input data and output data.
• the storage unit access controller 605 (direct memory access controller, DMAC) transfers input data from the external memory to the input memory 601 and/or the unified memory 606, stores weight data from the external memory into the weight memory 602, and stores data from the unified memory 606 into the external memory.
  • a bus interface unit (bus interface unit, BIU) 610 is configured to implement interaction between the main CPU, DMAC and instruction fetch memory 609 through the bus.
  • An instruction fetch buffer (instruction fetch buffer) 609 connected to the controller 604 is used to store instructions used by the controller 604;
  • the controller 604 is configured to invoke instructions cached in the memory 609 to control the operation process of the computing accelerator.
  • the unified memory 606, the input memory 601, the weight memory 602, and the instruction fetch memory 609 are all on-chip memory
  • the external memory is a memory outside the NPU
• the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM) or another readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 4 and FIG. 5 can be performed by the operation circuit 603 or the vector calculation unit 607 .
  • the executing device 110 in FIG. 3 described above can execute each step of the living body detection method or the training method of the living body detection model in the embodiment of the present application.
• the CNN models shown in FIG. 4 and FIG. 5 and the chip shown in FIG. 6 can also be used to execute various steps of the living body detection method of the embodiment of the present application.
  • the living body detection model can be further obtained on the basis of a basic neural network.
  • the following two neural networks are respectively introduced in conjunction with FIG. 7 and FIG. 8 .
  • Fig. 7 is a schematic structural diagram of the basic neural network of the embodiment of the present application.
  • the basic neural network (that is, the second convolutional neural network) may be an existing neural network for face recognition, or a training library may be used to train a basic neural network for face recognition. Since the basic neural network is used for face recognition rather than live detection, a large amount of real face data can be used for training, and optionally, non-real face data can also be used for training.
  • the basic neural network may be a lightweight neural network, that is, a neural network with a simple structure, fewer parameters, and less storage space required.
• the basic neural network can be an existing lightweight convolutional neural network (LightCNN) face recognition (FR) model (also known as a face recognition neural network); for example, the 9-layer version of the LightCNN FR model (hereinafter referred to as LightCNN-9) can be used.
  • the LightCNN-9 is one of the public FR CNNs with high accuracy. Compared with other FR CNNs, LightCNN-9 can have very good performance with a smaller parameter set. For the current live detection, such a small network scale is especially suitable for scenarios with limited computing and storage capabilities such as vehicles.
• LightCNN-9 is just an example of a lightweight convolutional neural network for face recognition; other LightCNNs such as LightCNN-4, LightCNN-29, etc. can also be used.
• any neural network used for face recognition can be used in the scheme of this application; for example, various types of networks such as DeepFace, Webface, FaceNet, or visual geometry group (VGG) networks can also be used.
• these neural networks are not listed one by one for the sake of brevity.
• Figure 7 shows the LightCNN-9 convolutional layer (conv1), the maximum feature map (max-feature-map, MFM) layer (MFM1), pooling layers (pool1-pool4), convolutional layer combination layers (group2-group5) and a fully connected layer (MFM_fc1).
• a 128x128 input image is input into LightCNN-9, and the output layer (here, the fully connected layer MFM_fc1 is used as the output layer) can output a 256-dimensional feature vector (or a feature vector of another dimension, such as 128 dimensions).
  • These 256-dimensional feature vectors are vectors of some discriminative features, which can distinguish some features of different faces, such as the shape and position of the nose, the shape and position of the eyes, etc., which are called category feature vectors.
  • the LightCNN-9 can be trained by using the existing data sets with a large amount of data. During training, the pictures in the data set can be randomly flipped, randomly cropped, etc., and then converted into grayscale images for training.
  • the loss function commonly used in the field of face recognition such as arcface, cosineface, or sphereface is used for training, which is not limited in this application.
• the size of the input image and the dimension of the output feature vector are determined by LightCNN-9; if another lightweight neural network is used, other input image sizes and output dimensions may also be used, which are not limited here.
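As a hedged sketch of the training recipe just described (random flipping, random cropping and grayscale conversion, followed by a face-recognition loss such as ArcFace), the following torchvision-style pipeline is illustrative only; the margin-based loss module is assumed to be supplied elsewhere and is not part of this snippet.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),               # random flipping
    transforms.RandomResizedCrop(128),               # random cropping to the 128x128 input size
    transforms.Grayscale(num_output_channels=1),     # convert to a grayscale image for training
    transforms.ToTensor(),
])
# A margin-based face-recognition loss (e.g. an ArcFace implementation) would then be
# applied to the embeddings produced by the backbone; that part is omitted here.
```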
  • FIG. 8 is a schematic structural diagram of a neural network for live detection according to an embodiment of the present application.
  • the living body detection neural network is also called the living body detection model, and after it is trained (that is, its parameters are updated), the target living body detection model can be obtained.
• the basic neural network in FIG. 8 can be any neural network used for face recognition; it is drawn with the same structure as the network shown in FIG. 7 only for ease of illustration, and in practice there is no structural limitation on the basic neural network, as long as it is a CNN for face recognition.
  • parameters of the neural network can also be called coefficients or weights, which can be in the form of a matrix.
  • the living body detection neural network (ie, the first convolutional neural network) is used for living body detection, and the trained living body detection neural network can be used as a target living body detection model for living body detection.
• the liveness detection neural network can be divided into three parts: the shallow part, the middle layer part and the fully connected layer part; which specific layers constitute the shallow part is not limited.
  • the middle layer can be understood as all non-fully connected layers except the shallow layer.
  • the fully connected layer part in Figure 8 includes the fully connected layer of the basic neural network and the added fully connected layer.
  • the new fully connected layer can be one layer or multiple layers.
  • Figure 8 can be understood as adding one or more fully connected layers for binary classification after the last layer of the basic neural network (that is, the fully connected layer as the output layer).
  • the added fully connected layer is used for living body discrimination, that is, judging whether it is a living body (that is, judging whether it is a real person or a non-real person).
  • the output of the living body detection neural network shown in Figure 8 is the detection result, which can be understood as the result of judging whether it is a living body, or can be understood as dividing the input image into one of two categories: "living body", " non-living”.
• the shallow layers of a CNN learn domain specific units (DSU), whereas the layers above the shallow part (corresponding to the middle layer part in Figure 8), that is, the features of the middle layer, can be shared between different imaging domains, and the learned features are domain independent.
• the middle layer therefore has a more robust performance across data sets, that is, the intermediate layer can share parameters between different data sets.
  • Fully connected layers are highly task and dataset specific.
  • Figure 8 adds a regression-based classifier (ie, FC2) to the basic neural network, which is specific to the binary classification task, that is, the live detection task.
• for LightCNN-9, the shallow layer can include group2 and the layers before it
• the middle layer can include the pool2 to pool4 layers
• the fully connected layer part can include MFM_fc1 (MFM_fc1 is FC1 shown in Figure 8) and the FC2 layer.
  • the FC2 layer takes the 256-dimensional feature vector output by MFM_fc1 as input, and outputs the binary classification result of whether it is a living body or not. It can be understood that the newly added FC2 can distinguish whether the person in the image is a living body by learning a 256-dimensional category feature vector with rich discriminative features of the face.
  • the middle layer is domain-independent, that is, it can be shared in different data sets, so the computational load of training can be further reduced based on this.
  • the parameters (weights) of the middle layer can be frozen. Freezing can also be understood as keeping, not updating, and not training.
  • During training only the parameters of the shallow network and the fully connected layer are updated, while the parameters of the middle layer remain unchanged.
  • the calculation amount of the training process is greatly reduced. It is equivalent to that when training the liveness detection neural network shown in Figure 8, the parameters of the shallow layer and the fully connected layer can be updated, while the parameters of the middle layer remain unchanged.
  • the parameters of the middle layer of the basic neural network are used, not that the parameters of the middle layer have never been trained.
  • the parameters of the middle layer are obtained in the training phase of the basic neural network (for ease of understanding, it can be called the first training phase), and in the training phase of the living body detection neural network (for ease of understanding, it can be Call it the second training stage), the parameters of the middle layer remain unchanged (ie freeze), that is to say, the second training stage only updates the parameters of the shallow layer and the fully connected layer of the living body detection neural network.
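A minimal sketch, assuming a PyTorch module that exposes hypothetical sub-modules named shallow, middle and fc, of how the second training stage could freeze the middle layer while leaving only the shallow layers and fully connected layers trainable:

```python
import torch
import torch.nn as nn

def freeze_middle_layers(model: nn.Module, middle_attr: str = "middle") -> list:
    """Freeze the (domain-independent) middle layers; return the still-trainable parameters."""
    for p in getattr(model, middle_attr).parameters():
        p.requires_grad = False            # middle-layer parameters are kept, not updated
    return [p for p in model.parameters() if p.requires_grad]

# usage sketch (model is assumed to expose .shallow, .middle and .fc sub-modules):
# trainable = freeze_middle_layers(model)
# optimizer = torch.optim.SGD(trainable, lr=1e-3)
```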
  • the basic neural network can also be divided into a shallow layer, an intermediate layer, and a fully connected layer.
• the shallow layer is the first few layers of the basic neural network, but exactly which layers it comprises is not limited, and the middle layer is all non-fully-connected layers except the shallow layers.
• the middle layer of the basic neural network and the middle layer of the live detection neural network are the same part, and the shallow layer of the basic neural network and the shallow layer of the live detection neural network are also the same part, but the fully connected layer parts of the two networks are different, because the fully connected layer part of the live detection neural network additionally includes the added fully connected layer for binary classification (FC2 in Figure 8), which the basic neural network does not include (the basic neural network has only FC1 in Figure 8).
  • FIG. 9 is a schematic flowchart of a living body detection method according to an embodiment of the present application, and each step in FIG. 9 will be introduced below.
  • the living body detection method can be implemented by a device or system deployed with a living body detection model, such as a mobile terminal, a vehicle-mounted terminal, a computer, a smart screen, or an intelligent control system and the like.
• face images can include real face images and non-real face images; they can be images or video frames of real faces and non-real faces, or feature vectors obtained after feature extraction from real face images and non-real face images.
  • Non-real faces are divided into one or both of 2D and 3D.
  • 2D mainly includes printed photos of faces, photos and video playback on screens such as tablets, etc.
  • 3D mainly includes 3D masks and 3D head models.
  • the face image can be acquired by using the acquisition unit of the living body detection device, and the acquisition unit can be an image acquisition device, a communication interface, an interface circuit, and the like.
• when the acquisition unit is an image acquisition device, this is equivalent to integrating an image acquisition device into the living body detection device.
  • a smartphone with a camera can be regarded as a living body detection device, and the acquisition unit can be a camera of a smart phone.
  • the smart phone executes the living body detection method of the embodiment of the present application, the camera captures the above-mentioned face image, and transmits the above-mentioned face image to the mobile phone processor, and the mobile phone processor executes subsequent steps.
• when the acquisition unit is a device with a transceiver function, such as a communication interface or an interface circuit, the connection mode and communication mode adopted may be any mode such as circuit connection, wired communication, or wireless communication, and there is no limitation.
  • the control system of the vehicle can be used to implement the living body detection method of the embodiment of the present application, then when the living body detection method is executed, the image acquisition device can collect face images, and the collected face images Send/transmit to the control system, and the acquisition unit in the control system executes step 901 to acquire the face image.
  • Image acquisition devices may include cameras, cameras, and the like.
• common cameras can be used to acquire face images, such as RGB cameras and near infrared (NIR) cameras.
  • the RGB camera is greatly affected by light, and the face image obtained by using RGB needs to be processed in grayscale first, converted into a grayscale image, and then the subsequent liveness detection is performed on the grayscale image.
• the near-infrared camera is less affected by light and has a wider range of applications; screen-type images (including photos displayed on a screen or videos played) cannot be imaged on a near-infrared camera, because screen-type images cannot be imaged within the wavelength band of the near-infrared camera, so using a near-infrared camera is equivalent to filtering out screen-type non-real face images. In other words, the near-infrared camera has the advantages of being less disturbed by light and of shielding screen-type non-real faces.
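For the RGB-camera path mentioned above, a hedged OpenCV sketch of the grayscale preprocessing step before the image is fed to liveness detection; the file name and the 128x128 size are placeholders, not values prescribed by this application.

```python
import cv2

frame = cv2.imread("face_rgb.jpg")               # frame captured by an RGB camera (placeholder path)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # convert to a grayscale image first
gray = cv2.resize(gray, (128, 128))              # match the model's expected input size (assumed)
# `gray` would then be passed to the target liveness detection model
```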
• the face images can include images under multiple lighting scenarios, for example the following lighting scenarios: sunny outdoors, cloudy or overcast outdoors, dim indoor light, and bright indoor light.
• for some simple application scenarios, the acquired face images are clear enough and less affected by the background.
• for the special in-vehicle scenario, interference from interior decoration, glass reflection, the enclosed space and so on is unavoidable, yet the target liveness detection model of this application can still perform well on face images of the in-vehicle scene.
  • one or more cameras installed in the vehicle can be used to capture the above-mentioned face image.
  • the face image will also vary with the position and angle of the camera, and the target liveness detection model of the present application can perform well for face images of different angles and/or different distances, that is, the human Face images may include images from different angles and/or different distances.
  • a camera can be arranged on both sides of the front window of the vehicle, and a camera can be arranged at the instrument position, and these cameras can capture the face image of the driver's seat.
  • FIG. 10 is a schematic layout diagram of an in-vehicle camera according to an embodiment of the present application. Figure 10 mainly introduces the shooting of the driving position.
  • the camera facing the driving position can be set at any number of positions in the left front A, right front B, steering wheel C, control panel D, etc. In this way, images of different distances and angles of the driving position can be captured.
  • the shape of the camera at A in FIG. 10 is different from that of the other three cameras, which is to show that different types of cameras can be used simultaneously for layout.
  • FIG. 10 is only an example, and there may be many layouts in practice. The advantage of this is that the person in the driving position does not need to act in order to cooperate with the camera at a certain position, and can still complete face image collection for liveness detection even while concentrating on driving.
  • FIG. 10 is to explain that multiple cameras can be used to acquire face images for the same position, but the position is not limited to the driver's position, and can also be the co-pilot position, rear seat, etc.
  • FIG. 11 is a schematic layout diagram of another in-vehicle camera according to an embodiment of the present application.
  • cameras can be installed at A in front of the driver's seat, B in front of the co-pilot's seat, C behind the driver's seat, and D behind the co-pilot's seat.
  • the camera at B is used to capture the face image of the passenger seat
  • the camera at C is used to capture the face image of the left rear seat
  • the camera at D is used to capture the face image of the right rear seat .
  • the living body detection of personnel at any position of the vehicle can be realized.
• (a) in Figure 11 is a top view of the vehicle cockpit, (b) in Figure 11 is a rear view inside the vehicle, and the layout of the cameras in the different views is marked in (a) and (b) of Figure 11.
  • the camera in this application can be an independent camera, or a camera of a device with a shooting function, for example, in (b) of FIG. 11 , C and D are the cameras of the display device installed on the front seats.
• Fig. 11 is also just an example of the layout of the cameras; in practice, many layouts can be designed according to requirements, and they will not be listed one by one.
• for ease of understanding, a practical application scenario of the camera layout shown in Figure 11 is introduced. Assume that the person in the co-pilot seat is shopping through the vehicle intelligent system. When payment is required, the person in the co-pilot seat finds that he does not have payment authority, while the person in the left rear seat has payment authority. At this time, the camera at C is started to capture the face image of the left rear seat, the liveness detection can be completed, and the payment then succeeds on the premise that the liveness detection is passed. Suppose that in this scenario the person in the co-pilot seat wears a 3D mask of the person in the left rear seat, and the camera at B collects the face image of the co-pilot seat; if it is judged to be non-living, the payment cannot be made successfully.
  • the target live detection model includes a first convolutional neural network, the first convolutional neural network includes a second convolutional neural network and a fully connected layer, and the second convolutional neural network is used to obtain the category feature vector of the face according to the face image , the fully connected layer is used to perform live body discrimination according to the category feature vector, and obtain the live body detection result.
• the above-mentioned first convolutional neural network (that is, the above-mentioned living body detection neural network) includes two parts, the second convolutional neural network (that is, the above-mentioned basic neural network) and a fully connected layer, and the first convolutional neural network can be understood as being obtained by modifying the second convolutional neural network (i.e., the basic neural network above).
  • the explanation of the first convolutional neural network and the second convolutional neural network can refer to the introduction of the living body detection neural network and the basic neural network above.
• for the above-mentioned category feature vector, reference can be made to the explanation of the vector of discriminative features above, which is not repeated here.
  • the training data sets of the first convolutional neural network and the second convolutional neural network may be different, and the training process may be completed in stages.
  • the above target living body detection model is obtained by using the first training data to update the parameters of the first convolutional neural network, the second convolutional neural network is pre-trained using the second training data, and the first training data Including real face data and non-real face data, the second training data includes real face data.
  • Such an implementation method can further improve the accuracy of the model, thereby improving the accuracy of the liveness detection, and can also simplify the training process of the target liveness detection model and reduce the training cost of the target detection model.
• the second convolutional neural network can be trained using various kinds of public, data-rich real face data sets (that is, the second training data can be data from a public real face data set), which allows the second convolutional neural network to be trained more fully and to have a better ability to extract facial features.
• some non-real face data can also be mixed in without affecting the overall effect; it is even possible to directly select one of the publicly available CNNs for face recognition as the second convolutional neural network, such as the LightCNN-9 above.
• this can effectively reduce the training cost; that is, when training the first convolutional neural network (updating the parameters of the first convolutional neural network), the requirements on training data and training equipment are relatively low, because a large amount of additional training of the second convolutional neural network is no longer needed.
• the first training data (that is, the training data used to improve the ability to distinguish living bodies) does not need to be large in quantity, i.e., the demand for non-real face data is small, and a good effect can be reached without very many training iterations, because the training phase of the first convolutional neural network is equivalent to a fine-tuning of the parameters so that it acquires the ability of living body discrimination.
• the target living body detection model can be obtained by using the first training data to update the parameters of the shallow network of the second convolutional neural network, the parameters of the fully connected layer of the second convolutional neural network, and the parameters of the fully connected layer of the first convolutional neural network.
  • the parameters of the intermediate layers of the first convolutional neural network may remain unchanged. It is equivalent to freezing the parameters of the middle layer during the process of updating the parameters of the first convolutional neural network, or understanding that the parameters of the middle layer are not updated. This takes advantage of the domain independence of the middle layer. For details, please refer to the relevant content above. In this way, under the premise of not reducing the training effect, the calculation amount of the training process can be greatly reduced, and the training of the first convolutional neural network can be further simplified and the training cost can be reduced.
  • the second convolutional neural network may be a lightweight neural network for face recognition. That is to use a neural network model with a simple structure, fewer parameters, and less storage space, which is conducive to deployment in application scenarios with limited computing and storage capabilities, such as vehicle scenarios.
  • the above-mentioned face images can also be used for training the target detection model.
  • the aforementioned face image may be sent, and the sent face image is used to train the target living body detection model, that is, to update the parameters of the target living body detection model.
  • the purpose of online update can be realized, and the accuracy of the target living body detection model can be further improved.
  • the above-mentioned sending can be sent to a local device, or sent to a cloud device, that is, sent to a device that can update the parameters of the target detection model, and there is no limitation.
• the above-mentioned living body detection method may also include: when the living body detection result indicates that the person in the face image is a living body, executing a decision on a target task, where the target task includes at least one of the following: unlocking, account login, permission granting or payment confirmation. That is, it is decided whether to unlock, log in, grant permission, or confirm payment.
• for the unlocking task, for example, when the living body detection result is a living body, it is further judged whether the person has the unlocking authority; if the person has the unlocking authority, unlocking is performed, and if not, unlocking is not performed.
  • the decision-making task can also be performed first, and then the living body detection is performed, that is, firstly, it is judged whether the person in the face image has the task authorization, and when the authorization is granted, it is further judged whether the face image is a living body.
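A sketch of the decision order described above, under the assumption of two hypothetical helper functions (is_live standing in for the target liveness detection model and has_permission for the authority check); the order can be swapped as the preceding bullet notes.

```python
def decide_task(face_image, task: str) -> bool:
    """Return True only if the subject is a living body AND holds the required authority."""
    if not is_live(face_image):                   # hypothetical call to the liveness detection model
        return False                              # non-living body: refuse unlock / login / payment
    return has_permission(face_image, task)       # hypothetical permission check for the target task
```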
  • FIG. 12 is a schematic flow chart of a training method for a living body detection model according to an embodiment of the present application. Each step in FIG. 12 will be introduced below.
  • the training data can be referred to as live body detection data, including real face data and non-real face data, which can be real face and non-real face images, video frames, or Feature vectors after feature extraction of real faces and non-real faces.
  • the training data can be collected by using a camera, a video camera, etc., or can be read from a storage device.
  • the first training data refers to the training data used to update the parameters of the first convolutional neural network.
• the first training data includes real face data and non-real face data; that is, the training data of the first convolutional neural network, which is to have the ability to discriminate living bodies, needs to include both real face data and non-real face data.
  • the first training data may include data under multiple lighting scenarios.
  • it may be the following lighting scenarios: sunny outside, cloudy or cloudy outside, dim indoor light, and bright indoor light. That is, it enables the training of a living body detection model that can be applied to more complex lighting scenarios.
  • the obtained living data can be shared, that is to say, the data set will not take too much into account the different acquisition scenarios.
• the first training data can be captured by one or more cameras installed in the vehicle, so that the trained live body detection model can have a good performance in the in-vehicle scene.
• real face data is obtained simply by capturing it directly, so this is not introduced again; the following mainly introduces how to obtain non-real face data in the vehicle.
  • Printed photos, 3D masks and 3D head models can be worn by people sitting in the car, and then use the camera to capture images.
  • the live data to be trained may include live data from multiple angles.
  • the multiple cameras are set at different positions of the vehicle to obtain first training data from different angles and/or different distances.
  • a camera can be arranged on both sides of the front window of the vehicle, and a camera can be arranged at the instrument position, and these cameras can capture the data of the driving position.
  • the aforementioned camera may be a near-infrared camera, which is applicable to scenarios that do not require screens as training data, that is, the collection and training of non-real face data of screens is omitted.
  • the target detection model does not need to have the ability to detect non-real face data on the screen. At this time, it only needs to use other training data to train the target detection model.
  • the collection, processing and training of screen data can be omitted, thereby effectively reducing training costs.
  • a data set of living body detection data can be established, so that some living body detection data can be selected as the above-mentioned training data or used to test the effect of the living body detection model.
  • an in-vehicle living body detection data set can be established.
  • the data set can include images of real faces and images of non-real faces.
• the images of real faces can include images of real faces with different accessories (with or without a hat or glasses, different types of glasses), different angles (head up, head down, head-on) and different illuminations.
  • the images of non-real human faces may include images of non-real human faces such as 2D printed photos, 3D head models and 3D masks.
  • the images in the data set can also be numbered, for example, they can be numbered by row and column.
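One possible (illustrative only, not prescribed by this application) way to index such a data set is a small CSV file that records the row/column number of each image together with the attributes mentioned above; all field names and values below are hypothetical.

```python
import csv

header = ("id", "label", "attack_type", "accessory", "angle", "lighting")
rows = [
    ("r01c01", "real", "none",    "glasses", "head-on",   "indoor-bright"),
    ("r02c03", "fake", "3d_mask", "none",    "head-down", "outdoor-sunny"),
]
with open("liveness_dataset_index.csv", "w", newline="") as f:
    csv.writer(f).writerows([header, *rows])   # one metadata row per collected image
```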
  • the target live detection model is the updated first convolutional neural network.
  • the first convolutional neural network includes a second convolutional neural network and a fully connected layer, the second convolutional neural network is used to obtain the category feature vector of the face according to the training data, and the fully connected layer is used to perform living body discrimination according to the category feature vector.
  • the first convolutional neural network may be any of the above-mentioned live body detection neural networks, for example, it may be the live body detection neural network as shown in FIG. 8 .
  • the second convolutional neural network may be a real-scene network for face recognition, that is, the basic neural network above.
  • the second convolutional neural network is pre-trained using second training data, and the second training data includes real face data. That is to say, the first convolutional neural network and the second convolutional neural network can be trained with different training data sets (first training data and second training data) respectively. Such an implementation can further improve the accuracy of the model, thereby improving the accuracy of liveness detection, and can also simplify the training process of the target liveness detection model and reduce the training cost of the target detection model.
• the second convolutional neural network can be trained using various kinds of public, data-rich real face data sets (that is, the second training data can be data from a public real face data set), which allows the second convolutional neural network to be trained more fully and to have a better ability to extract facial features.
  • some non-real face data can also be mixed in, without affecting the overall effect. It is even possible to directly select one of the publicly available CNNs for face recognition as the second convolutional neural network, such as the LightCNN-9 above.
• this can effectively reduce the training cost; that is, when training the first convolutional neural network (updating the parameters of the first convolutional neural network), the requirements on training data and training equipment are relatively low, because a large amount of additional training of the second convolutional neural network is no longer needed.
• the first training data (that is, the training data used to improve the ability to distinguish living bodies) does not need to be large in quantity, and a good effect can be reached without very many training iterations, because the training phase of the first convolutional neural network is equivalent to a fine-tuning of the parameters so that it acquires the ability of living body discrimination.
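A hedged sketch of this fine-tuning stage: the pre-trained second network is reused inside the first network, and the first training data (real and non-real faces) drives a short binary-classification fine-tune. The model and data loader are assumed to be defined as in the earlier sketches; hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def finetune(model: nn.Module, loader, epochs: int = 3):
    criterion = nn.CrossEntropyLoss()             # binary living / non-living classification
    params = [p for p in model.parameters() if p.requires_grad]
    optim = torch.optim.Adam(params, lr=1e-4)     # small learning rate: fine-tuning, not full training
    for _ in range(epochs):
        for images, labels in loader:             # labels: 1 = real face, 0 = non-real face (assumed)
            optim.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optim.step()
```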
• the first convolutional neural network is obtained by adding one or more fully connected layers after the output layer of the second convolutional neural network; the added fully connected layer is used for living body discrimination, and the second convolutional neural network is pre-trained and used to obtain the category feature vector of the face.
  • the parameters of the shallow layer network of the first convolutional neural network and the parameters of the fully connected layer of the first convolutional neural network may be updated.
  • the computational load of the training process can be greatly reduced without reducing the training effect, further simplifying the training of the first convolutional neural network and reducing the training cost
  • the parameters of the intermediate layers of the first convolutional neural network may remain unchanged. It is equivalent to freezing the parameters of the middle layer during the process of updating the parameters of the first convolutional neural network, or understanding that the parameters of the middle layer are not updated. This takes advantage of the domain independence of the middle layer. For details, please refer to the relevant content above. In this way, under the premise of not reducing the training effect, the calculation amount of the training process can be greatly reduced, and the training of the first convolutional neural network can be further simplified and the training cost can be reduced.
  • the second convolutional neural network may be a lightweight neural network for face recognition. That is to use a neural network model with a simple structure, fewer parameters, and less storage space, which is conducive to deployment in application scenarios with limited computing and storage capabilities, such as vehicle scenarios.
  • the training method shown in FIG. 12 has the advantages of relatively simple training, relatively small training data requirement, and relatively higher accuracy of the model obtained through training.
• real face data in the live detection data is relatively easy to obtain and is diverse in type and sufficient in quantity, while non-real face data such as 2D/3D mask data is difficult to obtain and limited in type and quantity.
• the resulting unbalanced distribution of the training data is not conducive to the training of a binary classifier.
• the insufficient quantity will also cause existing living body detection to fail to achieve the desired accuracy.
  • the living body detection in the prior art only pays attention to the ability of living body discrimination but ignores the ability of face recognition (that is, the ability to extract face features).
• in the training phase, the equipment that trains the model (referred to as the training equipment) needs to store a large amount of training data and model parameters and perform the operations of the training process;
• in the execution phase, the device that deploys the model (which can be called the inference device) needs to store the model, the data to be processed and its intermediate data, and perform the calculations of the inference process; therefore, both kinds of devices need sufficient computing power and storage capacity.
  • the living body detection model in the prior art has a large scale, complex training, and a large amount of calculation. It has high requirements for the storage and computing capabilities of the training equipment and the storage and execution capabilities of the inference equipment, which makes it not suitable for computing. For scenarios with weak capabilities and/or storage capabilities, such as in-vehicle scenarios, it is difficult for in-vehicle devices to undertake complex calculations and large-scale model storage, so the existing live detection models are not suitable for in-vehicle scenarios.
  • Fig. 13 is a schematic block diagram of a living body detection device according to an embodiment of the present application.
  • the living body detection device 2000 shown in FIG. 13 includes an acquisition unit 2001 and a processing unit 2002 .
  • the acquisition unit 2001 and the processing unit 2002 may be used to execute the living body detection method of the embodiment of the present application. Specifically, the acquisition unit 2001 may perform the above step 901, and the processing unit 2002 may perform the above step 902.
  • the processing unit 2002 can realize the function of the living body detection neural network shown in FIG. 8 .
  • processing unit 2002 in the above device 2000 may be equivalent to the processor 3002 in the device 3000 hereinafter.
  • FIG. 14 is a schematic diagram of a hardware structure of a living body detection device provided by an embodiment of the present application.
  • the living body detection apparatus 3000 shown in FIG. 14 includes a memory 3001 , a processor 3002 , a communication interface 3003 and a bus 3004 .
  • the memory 3001 , the processor 3002 , and the communication interface 3003 are connected to each other through a bus 3004 .
  • the memory 3001 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM).
  • the memory 3001 can store programs, and when the programs stored in the memory 3001 are executed by the processor 3002, the processor 3002 and the communication interface 3003 are used to execute various steps of the living body detection method of the embodiment of the present application.
  • the processor 3002 may adopt a general-purpose CPU, a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits for executing related programs, In order to realize the functions required by the units in the living body detection device of the embodiment of the present application, or execute the living body detection method of the method embodiment of the present application.
  • the processor 3002 may also be an integrated circuit chip with signal processing capabilities. During implementation, each step of the living body detection method of the present application may be completed by an integrated logic circuit of hardware in the processor 3002 or instructions in the form of software.
  • the above-mentioned processor 3002 can also be a general-purpose processor, a digital signal processor (digital signal processing, DSP), an ASIC, an off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gates or transistors Logic devices, discrete hardware components.
  • Various methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 3001, and the processor 3002 reads the information in the memory 3001, and combines its hardware to complete the functions required by the units included in the living body detection device of the embodiment of the present application, or execute the living body detection of the method embodiment of the present application method.
  • the communication interface 3003 implements communication between the apparatus 3000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the aforementioned face image can be acquired through the communication interface 3003 .
  • Bus 3004 may include pathways for transferring information between various components of device 3000 (e.g., memory 3001, processor 3002, communication interface 3003).
  • Fig. 15 is a schematic block diagram of a training device for a living body detection network according to an embodiment of the present application.
  • the training device 4000 of the living body detection network shown in FIG. 15 includes an acquisition unit 4001 and a training unit 4002 .
  • the acquisition unit 4001 and the training unit 4002 can be used to execute the training method of the living body detection model in the embodiment of the present application. Specifically, the acquisition unit 4001 can perform the above step 1201, and the training unit 4002 can perform the above step 1202.
  • training unit 4002 in the above device 4000 may be equivalent to the processor 5002 in the device 5000 hereinafter.
  • FIG. 16 is a schematic diagram of a hardware structure of a training device for a living body detection network provided by an embodiment of the present application.
• the training device 5000 of the living body detection network shown in FIG. 16 includes a memory 5001, a processor 5002, a communication interface 5003 and a bus 5004, where the memory 5001, the processor 5002, and the communication interface 5003 are connected to each other through the bus 5004.
  • the memory 5001 may be a ROM, a static storage device, a dynamic storage device or a RAM.
  • the memory 5001 can store a program, and when the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 and the communication interface 5003 are used to execute each step of the training method of the living body detection network in the embodiment of the present application.
  • the processor 5002 may use a CPU, a microprocessor, an ASIC, a GPU or one or more integrated circuits for executing related programs, so as to realize the functions required by the units in the training device of the living body detection network of the embodiment of the present application, Or execute the training method of the living body detection network in the method embodiment of the present application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the training method of the living body detection network of the present application can be completed by an integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the aforementioned processor 5002 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. Various methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001, and combines its hardware to complete the functions required by the units included in the training device of the living body detection network of the embodiment of the application, or execute the method embodiment of the application The training method of the liveness detection network.
  • the communication interface 5003 implements communication between the apparatus 5000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the above-mentioned first training data may be obtained through the communication interface 5003 .
  • the bus 5004 may include a pathway for transferring information between various components of the device 5000 (eg, memory 5001, processor 5002, communication interface 5003).
• although the device 3000 shown in FIG. 14 and the device 5000 shown in FIG. 16 only show a memory, a processor and a communication interface, in the specific implementation process those skilled in the art should understand that the device 3000 and the device 5000 also include other devices necessary for proper operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 3000 and the apparatus 5000 may also include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the device 3000 and the device 5000 may include only the devices necessary to realize the embodiment of the present application, instead of all the devices shown in Fig. 14 and Fig. 16.
  • the disclosed systems, methods and devices can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components can be combined or May be integrated into another system, or some features may be ignored, or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the functions described above are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
• the aforementioned storage media include: a Universal Serial Bus flash disk (UFD; a UFD can also be referred to as a U disk or a USB flash drive), a mobile hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or various other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

This application provides a living body detection method, a training method of a living body detection model, and an apparatus thereof, relating to the field of artificial intelligence. The living body detection method includes: acquiring a face image, and inputting the face image into a target living body detection model to obtain a living body detection result, where the living body detection result is used to indicate whether the person in the face image is a living body; the target living body detection model includes a first convolutional neural network, the first convolutional neural network includes a second convolutional neural network and a fully connected layer, the second convolutional neural network is used to obtain a category feature vector of the face from the face image, and the fully connected layer is used to perform living body discrimination according to the category feature vector to obtain the living body detection result. In this solution, the target living body detection model has both a high ability to extract facial features and a high ability to discriminate living bodies, so that the accuracy of the living body detection result can be effectively improved.

Description

Living body detection method, training method of living body detection model, and apparatus and system thereof — Technical Field
The embodiments of this application relate to the field of artificial intelligence, and more specifically, to a living body detection method, a training method of a living body detection model, and an apparatus and system thereof.
Background
With the rapid development of face recognition (FR) technology, face recognition systems have surpassed human-level recognition accuracy, and many current authentication systems are based on face recognition. However, face recognition systems are also vulnerable to attacks by illegal users, for example printed photos, video replay, or masks used to break the face recognition system; living body detection technology has emerged to deal with these problems.
Existing living body detection methods are suitable for open scenes, but these methods are easily broken by non-real-face data (for example a pre-recorded video, a 2D mask, a 3D mask, and so on).
Therefore, how to improve the accuracy of living body detection is a technical problem that needs to be solved urgently.
Summary of the Invention
The embodiments of this application provide a living body detection method, a training method of a living body detection model, and an apparatus and system thereof, which can improve the accuracy of living body detection.
According to a first aspect, a living body detection method is provided, including: acquiring a face image, and inputting the face image into a target living body detection model to obtain a living body detection result, where the living body detection result is used to indicate whether the person in the face image is a living body; the target living body detection model includes a first convolutional neural network, the first convolutional neural network includes a second convolutional neural network and a fully connected layer, the second convolutional neural network is used to obtain a category feature vector of the face from the face image, and the fully connected layer is used to perform living body discrimination according to the category feature vector to obtain the living body detection result.
In the technical solution of this application, the target living body detection model used for living body detection includes a neural network capable of obtaining category feature vectors and a fully connected layer capable of living body discrimination, so it has both a high ability to extract facial features and a high ability to discriminate living bodies, which can effectively improve the accuracy of the living body detection result. It should be understood that the accuracy of facial feature extraction directly affects the accuracy of the subsequent living body discrimination; the technical solution of this application fully takes both capabilities into account, so that the accuracy of living body detection is effectively improved.
The face image may include real face images and non-real face images; it may be images or video frames of real faces and non-real faces, or feature vectors obtained after feature extraction from real face images and non-real face images.
Non-real faces are divided into a two-dimensional (2D) category and a three-dimensional (3D) category. The 2D category mainly includes printed photos of faces and photos or video playback on screens such as tablets; the 3D category mainly includes 3D masks and 3D head models.
The above first convolutional neural network (i.e., the living body detection neural network below) includes two parts, the second convolutional neural network (i.e., the basic neural network below) and a fully connected layer; the first convolutional neural network can be understood as being obtained by modifying the second convolutional neural network.
The training data sets of the first convolutional neural network and the second convolutional neural network may be different, and the training process may be completed in stages. With reference to the first aspect, in some implementations of the first aspect, the target living body detection model is obtained by using first training data to update the parameters of the first convolutional neural network, and the second convolutional neural network is pre-trained using second training data; the first training data includes real face data and non-real face data, and the second training data includes real face data.
Such an implementation can further improve the accuracy of the model, thereby improving the accuracy of living body detection, and can also simplify the training process of the target living body detection model and reduce its training cost. First, the second convolutional neural network can be trained with various public, data-rich real face data sets (that is, the second training data can be data from a public real face data set), so that the second convolutional neural network can be trained more fully and has a better ability to extract facial features. Some non-real face data can of course also be mixed in without affecting the overall effect. It is even possible to directly select one of the publicly available convolutional neural networks for face recognition as the second convolutional neural network. This can effectively reduce the training cost: when training the first convolutional neural network (updating its parameters), the requirements on training data and training equipment are relatively low, because a large amount of additional training of the second convolutional neural network is no longer needed. Second, the first training data (i.e., the training data used to improve the living body discrimination ability) does not need to be large in quantity, i.e., the demand for non-real face data is small, and a good effect can be reached without very many training iterations, because the training stage of the first convolutional neural network is equivalent to a fine-tuning of the parameters so that the network acquires the living body discrimination ability.
With reference to the first aspect, in some implementations of the first aspect, the target living body detection model is obtained by using the first training data to update the parameters of the shallow network of the second convolutional neural network, the parameters of the fully connected layer of the second convolutional neural network, and the parameters of the fully connected layer of the first convolutional neural network. In this way, the computational load of the training process can be greatly reduced without reducing the training effect, which further simplifies the training of the first convolutional neural network and reduces the training cost.
With reference to the first aspect, in some implementations of the first aspect, the parameters of the middle layer of the first convolutional neural network (which is also the middle layer of the second convolutional neural network) remain unchanged. This is equivalent to freezing the parameters of the middle layer during the process of updating the parameters of the first convolutional neural network, or can be understood as not updating the parameters of the middle layer. This takes advantage of the domain independence of the middle layer. In this way, the computational load of the training process can be greatly reduced without reducing the training effect, which further simplifies the training of the first convolutional neural network and reduces the training cost.
With reference to the first aspect, in some implementations of the first aspect, the second convolutional neural network is a lightweight neural network for face recognition. This further reduces the computation and storage pressure; that is, a neural network model with a simple structure, fewer parameters and a smaller storage footprint is used, which is conducive to deployment in application scenarios with limited computing and storage capabilities, such as in-vehicle scenarios.
With reference to the first aspect, in some implementations of the first aspect, the face image includes images under multiple lighting scenarios, for example: sunny outdoors, cloudy or overcast outdoors, dim indoor light, and bright indoor light. Since a face appears differently under different lighting environments, the target living body detection model of this application can perform well on face images under different lighting scenarios. This can be achieved by enriching the training data of the target detection model; that is, the target detection model is trained with training data under such different lighting scenarios and therefore has the ability to detect face images under different lighting scenarios.
With reference to the first aspect, in some implementations of the first aspect, the face image is captured by one or more cameras installed in the vehicle. For some simple application scenarios, the acquired face images are clear enough and little affected by the background; however, the special scenario of a vehicle is inevitably affected by interior decoration, glass reflection, the enclosed space and so on, and the target living body detection model of this application can still perform well on face images of in-vehicle scenes. This can be achieved by enriching the training data of the target detection model; that is, the target detection model is trained with training data from such in-vehicle scenes and therefore has the ability to detect face images in in-vehicle scenes.
With reference to the first aspect, in some implementations of the first aspect, when there are multiple cameras, the multiple cameras are arranged at different positions of the vehicle to obtain face images of different angles and/or different distances. The target living body detection model of this application can perform well on face images of different angles and/or different distances.
The benefit of this is that the person being detected does not need to move to cooperate with a camera at a certain position, and the face image collection for living body detection can still be completed. For example, if there is only a camera at the control panel, the person in the driver's seat would need to turn right and lower the head before that camera can capture the face image; the interaction is not friendly enough and may disturb the person being detected.
In addition, by arranging multiple cameras in the vehicle, living body detection can be completed for all occupants of the vehicle.
With reference to the first aspect, in some implementations of the first aspect, the camera is a near-infrared camera. In the embodiments of this application, the face image can be collected by a camera, a video camera, and the like. Common cameras can be used to acquire face images, such as RGB cameras and near-infrared cameras. The RGB camera is greatly affected by light, and a face image acquired with RGB needs to be converted into a grayscale image first, after which the subsequent living body detection is performed on the grayscale image. The near-infrared camera is less affected by light and has a wider range of application, and screen-type images (including photos displayed on a screen or videos played) cannot be imaged on the near-infrared camera, because screen-type images cannot be imaged within the wavelength band of the near-infrared camera; therefore using a near-infrared camera is equivalent to filtering out screen-type non-real face images. In other words, the near-infrared camera has the advantages of being less disturbed by light and of shielding screen-type non-real faces. Thus, if a near-infrared camera is used, screen-type attacks are filtered out automatically.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: sending the face image, where the face image is used to train the target living body detection model; that is, the face image is used for training of the target detection model, i.e., updating the parameters of the target living body detection model. This achieves the purpose of online updating, so that the accuracy of the target living body detection model is further improved.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: when the living body detection result indicates that the person in the face image is a living body, executing a decision on a target task, where the target task includes at least one of the following: unlocking, account login, permission granting or payment confirmation. For example, for the unlocking task, when the living body detection result is a living body, it is further judged whether the person has the unlocking permission; if so, unlocking is performed, and if not, unlocking is not performed.
That is to say, whether the person is a living body is judged before the target task is executed; if it has already been determined to be a non-living body, there is no need to execute the subsequent task decision, which improves the security of the decision. Of course, the task decision can also be made first and the living body detection performed afterwards; that is, it is first judged whether the person in the face image has the task permission, and only when the permission exists is it further judged whether the face image is a living body.
According to a second aspect, a training method of a living body detection model is provided, including: acquiring first training data, where the first training data includes real face data and non-real face data; and updating the parameters of a first convolutional neural network according to the first training data to obtain a target living body detection model, where the first convolutional neural network includes a second convolutional neural network and a fully connected layer, the second convolutional neural network is used to obtain a category feature vector of the face from the training data, and the fully connected layer is used to perform living body discrimination according to the category feature vector.
The training method of the technical solution of this application has the advantages that the training is relatively simple, the amount of training data required is relatively small, and the accuracy of the trained model is relatively higher.
First, real face data in living body detection data is relatively easy to obtain and is diverse and sufficient in quantity, while non-real face data such as 2D/3D mask data is difficult to obtain and limited in type and quantity, which leads to an unbalanced distribution of the training data (which is not conducive to training a binary classifier) and an insufficient quantity, and likewise causes existing living body detection to fail to reach the desired accuracy. Second, as described above, living body detection in the prior art focuses only on the living body discrimination ability and neglects the face recognition ability (i.e., the ability to extract facial features); in fact, the accuracy of facial feature extraction directly affects the accuracy of the subsequent living body discrimination, so the prior art cannot reach a high accuracy of living body detection, whereas the solution of the embodiments of this application fully takes both capabilities into account, so that the accuracy of living body detection is effectively improved.
In addition, in the training stage, the device that trains the model (the training device for short) needs to store a large amount of training data and model parameters and perform the computations of the training process; in the execution stage, the device that deploys the model (i.e., the device that uses the model to perform the living body detection task, which can be called the inference device) needs to store the model, the data to be processed and its intermediate data, and perform the computations of the inference process; therefore both devices need sufficient computing and storage capabilities. The living body detection models of the prior art are large in scale, complex to train and computation-intensive, and place high requirements on the storage and computing capabilities of the training device and on the storage and execution capabilities of the inference device, so they are not suitable for scenarios with weak computing and/or storage capabilities; for example, in the in-vehicle scenario it is difficult for in-vehicle devices to undertake complex computation and large-model storage, so the living body detection models of the prior art are not suitable for the in-vehicle scenario.
With reference to the second aspect, in some implementations of the second aspect, the second convolutional neural network is pre-trained using second training data, and the second training data includes real face data. This can further improve the accuracy of the model and thus the accuracy of living body detection, and can also simplify the training process of the target living body detection model and reduce the training cost of the target detection model.
With reference to the second aspect, in some implementations of the second aspect, when updating the parameters of the first convolutional neural network, the parameters of the shallow network of the first convolutional neural network and the parameters of the fully connected layer of the first convolutional neural network are updated. In this way, the computational load of the training process can be greatly reduced without reducing the training effect, further simplifying the training of the first convolutional neural network and reducing the training cost.
With reference to the second aspect, in some implementations of the second aspect, the parameters of the middle layer of the first convolutional neural network (which is also the middle layer of the second convolutional neural network) remain unchanged. This is equivalent to freezing the parameters of the middle layer during the process of updating the parameters of the first convolutional neural network, or can be understood as not updating the parameters of the middle layer. This takes advantage of the domain independence of the middle layer. In this way, the computational load of the training process can be greatly reduced without reducing the training effect, further simplifying the training of the first convolutional neural network and reducing the training cost.
With reference to the second aspect, in some implementations of the second aspect, the second convolutional neural network is a lightweight neural network for face recognition. This further reduces the computation and storage pressure; that is, a neural network model with a simple structure, fewer parameters and a smaller storage footprint is used, which is conducive to deployment in application scenarios with limited computing and storage capabilities, such as in-vehicle scenarios.
With reference to the second aspect, in some implementations of the second aspect, the first training data includes data under multiple lighting scenarios, for example: sunny outdoors, cloudy or overcast outdoors, dim indoor light, and bright indoor light. This increases the richness of the training data and thus improves the training effect, so that a living body detection model applicable to more complex lighting scenarios can be trained.
With reference to the second aspect, in some implementations of the second aspect, the first training data is captured by one or more cameras installed in the vehicle. A living body detection model trained in this way can perform well in in-vehicle scenes.
With reference to the second aspect, in some implementations of the second aspect, when there are multiple cameras, the multiple cameras are arranged at different positions of the vehicle to obtain first training data of different angles and/or different distances. This increases the richness of the training data and thus improves the training effect, so that a living body detection model applicable to more complex lighting scenarios can be trained.
With reference to the second aspect, in some implementations of the second aspect, the camera is a near-infrared camera. This is applicable to scenarios that do not require screen-type data as training data, i.e., the collection and training of screen-type non-real face data is omitted. In other words, if the application scenario uses a near-infrared camera, the target detection model does not need to have the ability to detect screen-type non-real face data, and at this time the target detection model only needs to be trained with the other training data; the collection, processing and training of screen-type data can be omitted, thereby effectively reducing the training cost.
According to a third aspect, a living body detection apparatus is provided, including units for executing the method in any implementation of the first aspect.
According to a fourth aspect, a training apparatus for a living body detection model is provided, including units for executing the training method in any implementation of the second aspect.
According to a fifth aspect, a living body detection apparatus is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to execute the method in any implementation of the first aspect. The apparatus can be deployed in various devices or systems that need living body detection, such as vehicle-mounted terminals, smart screens and access control systems. The apparatus can also be a chip.
According to a sixth aspect, a training apparatus for a living body detection model is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to execute the training method in any implementation of the second aspect. The training apparatus can be a host, a computer, a server, a cloud device or another device capable of model training. The training apparatus can also be a chip.
According to a seventh aspect, a computer-readable medium is provided, which stores program code for execution by a device, where the program code includes instructions for executing the method in any implementation of the first aspect or the second aspect.
According to an eighth aspect, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to execute the method in any implementation of the first aspect or the second aspect.
According to a ninth aspect, a chip is provided, including a processor and a data interface, where the processor reads, through the data interface, instructions stored on a memory and executes the method in any implementation of the first aspect or the second aspect.
Optionally, as an implementation, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored on the memory; when the instructions are executed, the processor is configured to execute the method in any implementation of the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an artificial intelligence main framework according to an embodiment of this application.
FIG. 2 is a schematic diagram of an application scenario of a living body detection solution.
FIG. 3 is a schematic diagram of a system architecture according to an embodiment of this application.
FIG. 4 is a schematic structural diagram of a convolutional neural network.
FIG. 5 is a schematic structural diagram of a convolutional neural network.
FIG. 6 is a schematic diagram of a hardware structure of a chip according to an embodiment of this application.
FIG. 7 is a schematic structural diagram of a basic neural network according to an embodiment of this application.
FIG. 8 is a schematic structural diagram of a living body detection neural network according to an embodiment of this application.
FIG. 9 is a schematic flowchart of a living body detection method according to an embodiment of this application.
FIG. 10 is a schematic layout diagram of an in-vehicle camera according to an embodiment of this application.
FIG. 11 is a schematic layout diagram of another in-vehicle camera according to an embodiment of this application.
FIG. 12 is a schematic flowchart of a training method of a living body detection model according to an embodiment of this application.
FIG. 13 is a schematic block diagram of a living body detection apparatus according to an embodiment of this application.
FIG. 14 is a schematic diagram of a hardware structure of a living body detection apparatus according to an embodiment of this application.
FIG. 15 is a schematic block diagram of a training apparatus for a living body detection network according to an embodiment of this application.
FIG. 16 is a schematic diagram of a hardware structure of a training apparatus for a living body detection network according to an embodiment of this application.
Detailed Description of Embodiments
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
The embodiments of this application involve neural networks. For a better understanding of the methods of the embodiments of this application, related terms and concepts of neural networks are introduced first.
(1) Neural network (NN)
A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept 1 as inputs, and the output of the operation unit may be, for example, as shown in formula (1):

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right) \tag{1}$$

where s = 1, 2, ..., n, n is a natural number greater than 1 representing the number of layers of the neural network, $W_s$ is the weight of $x_s$, which may also be called a parameter or coefficient of the neural network, $x_s$ is an input of the neural network, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to perform a non-linear transformation on the features in the neural network, thereby converting the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
(2)深度神经网络(deep neural network,DNN)
深度神经网络,也称多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的每个神经元与第i+1层的神经元相连。
虽然DNN看起来很复杂，但是就每一层的工作来说，其实并不复杂，简单来说就是如下线性关系表达式：

$$\vec{y}=\alpha\left(W\cdot\vec{x}+\vec{b}\right)$$

其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重，也可以称为系数或参数；该权重可以是权重矩阵的形式，α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过简单的操作得到输出向量$\vec{y}$。
由于DNN层数多，权重W和偏移向量$\vec{b}$的数量也比较多。权重（为了方便描述，称为系数）在DNN中每一层的含义如下所述：以系数W为例，假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。综上，第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。
输入层是没有权重W的。在深度神经网络中，更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言，权重越多的模型复杂度越高，“容量”也就越大，也就意味着它能完成更复杂的学习任务。训练深度神经网络的过程也就是学习权重的过程，其最终目的是得到训练好的深度神经网络的所有层的权重（例如，包括多层的系数W形成的权重矩阵）。
(3)卷积神经网络(convolutional neuron network,CNN)
卷积神经网络是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)分类器
在本申请实施例中，在预训练好的基础神经网络（即第二卷积神经网络）的基础上，在其最后一层全连接层（fully connected layer，通常为输出层）之后增加一层全连接层，构成二分类的分类器，用于区分图像中的人脸是真人还是非真人，即判别该图像是否为活体图像。该分类器用于对图像中的物体进行分类，可以包括全连接层和softmax函数（可以称为归一化指数函数），能够根据输入而输出不同类别的概率。
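作为一个便于理解的示意性草图（假设采用PyTorch实现，其中的结构和数值均为示例性假设，并非对分类器实现方式的限定），在类别特征向量之后增加一层全连接层并经softmax归一化，即可得到“活体/非活体”两个类别的概率：

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """在类别特征向量之后增加的二分类分类器（全连接层 + softmax）。"""
    def __init__(self, feature_dim=256, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)   # 新增的全连接层

    def forward(self, feature_vector):
        logits = self.fc(feature_vector)
        return torch.softmax(logits, dim=-1)            # 输出“活体/非活体”两类的概率

# 示例：假设基础神经网络输出了一个256维的类别特征向量
classifier = BinaryClassifier()
probs = classifier(torch.randn(1, 256))
print(probs)   # 形如 [[p_活体, p_非活体]]
```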
(5)损失函数(loss function)
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置权重),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(6)反向传播(back propagation,BP)
神经网络可以采用误差反向传播算法在训练过程中修正初始的神经网络模型中权重的数值,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中权重,从而 使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的权重,例如权重矩阵。
在传统方案中,活体检测模型往往是利用活体检测数据(包括真实人脸的数据和非真实人脸的数据)来训练得到,但传统方案只把重点放在了区分是否为活体上,而忽视了对于人脸识别能力的提升,所以该活体检测模型提取人脸特征的能力是较差的,在这种较差的人脸特征提取能力基础上进行活体判别很难达到较高的准确性。
针对上述问题,本申请实施例提出一种活体检测方案,在该方案中,用来进行活体检测的目标活体检测模型包括能够得到类别特征向量的神经网络和能够进行活体判别的全连接层,因此兼具有较高的提取人脸特征的能力和活体判别能力,从而能够有效提高活体检测结果的准确性。
本申请实施例的方案能够应用于屏幕解锁、设备解锁、账号登录、权限许可(例如访问许可)、安全支付等各类活体检测的使用场景。
图1是本申请实施例一种人工智能主体框架示意图,该主体框架描述了人工智能***总体工作流程,适用于通用的人工智能领域需求。
下面从“智能信息链”(水平轴)和“信息技术(information technology,IT)价值链”(垂直轴)两个维度对上述人工智能主题框架进行详细的阐述。
“智能信息链”反映从数据的获取到处理的一系列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。
“IT价值链”从人工智能的底层基础设施、信息（提供和处理技术实现）到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。
(1)基础设施:
基础设施为人工智能***提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。
基础设施可以通过传感器与外部沟通,基础设施的计算能力可以由智能芯片提供。
这里的智能芯片可以是中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专门应用的集成电路(application specific integrated circuit,ASIC)或现场可编程门阵列(field programmable gate array,FPGA)等硬件加速芯片。
基础设施的基础平台可以包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。
例如,对于基础设施来说,可以通过传感器和外部沟通获取数据,然后将这些数据提供给基础平台提供的分布式计算***中的智能芯片进行计算。
(2)数据:
基础设施的上一层的数据用于表示人工智能领域的数据来源。该数据涉及到图形、图像、语音、文本等信息的至少一种。该数据在不同应用领域不同,且可以有不同的表现形式。例如,涉及到物联网领域时,该数据的内容与具体的物联网连接终端有关,例如可以包括力、位移、液位、温度、或湿度等感知数据。
在本申请实施例中该数据例如为活体数据,活体数据包括真实人脸的数据和非真实人 脸的数据,这些数据可以是图像、或图形的形式,也可以是特征向量或矩阵的形式。
(3)数据处理:
上述数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等处理方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能***中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力:
对数据进行上面提到的数据处理后，进一步基于数据处理的结果可以形成一些通用的能力，比如可以是算法或者一个通用系统，例如，翻译，文本的分析，计算机视觉的处理，语音识别，图像的识别等等。
(5)智能产品及行业应用:
智能产品及行业应用指人工智能***在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市,或智能终端等。
本申请实施例可以应用在人工智能中的很多领域，例如，智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶，或平安城市等领域，具体而言，应用在这些人工智能领域中需要活体检测的分支部分。例如在智能安防领域，只有活体检测确认是真实的人，才进一步允许有访问权限的人进行访问，这样就能避免有人利用指纹手套、假面具等各类工具攻入安防系统。又例如，在自动驾驶领域，只有活体检测确认是真实的人，才进一步允许有使用权限的人登录和启用车载设备，这样就能避免有人利用各类非真实人脸的手段窃取或使用车辆。
下面对解锁和权限许可这两种应用场景进行简单的介绍。
应用场景一、解锁
在解锁场景中,可以将输入的人脸图像分为两个类别(活体、非活体)中的一种,即判别该人脸图像中的人是否是真实的人(即类别为活体),当类别为活体时允许解锁,否则不允许解锁,如图2所示。也就是说,当把一些人脸图像输入到活体检测模型的时候就可以将输入图像分类到上述两个类中的一个类别(即得到检测结果为活体或非活体),然后解锁决策模块判定是否允许解锁。上述解锁可以是屏幕解锁、门禁解锁、设备解锁、或车辆解锁等等。如图2中的人脸图像示出了A、B、C、D四个,其中A和B为真实人脸的图像;C是2D打印照片的图像,是非真实人脸的图像;D是3D头模的图像,也是非真实人脸的图像。当将这四个图像分别输入到活体检测模型,就可以得到相应的检测结果,其中,A、B检测结果为活体,C、D检测结果为非活体,这些检测结果输入到解锁决策模块。对于A、B,解锁决策模块进一步判定是否具备解锁权限,如果具备解锁权限就进行解锁,如果不具备解锁权限就不解锁。而对于C、D则解锁决策模块直接判定不具备解锁权限,不解锁。也就是说,图2是利用活体检测环节来提高了解锁任务的安全性,有效防止有人通过具有解锁权限的人的非真实人脸的数据来盗用解锁权限。
应用场景二、权限许可
在权限场景中,可以判定是否开通权限,可以直接将图2中的解锁决策模块替换为权限决策模块,也就是说,在判断得到人脸图像中的人是否为活体(即人脸图像中的人是否为真实的人)之后,就可以利用权限决策模块决定是否给予权限许可。是利用活体检测环节来提高了权限许可任务的安全性,有效防止有人通过具有权限许可的人的非真实人脸的数据来骗取许可。例如,针对于车内支付场景,当用户发起了支付请求,车机可以通过摄像头采集用户的人脸区域图像,进行活体检测和人脸检测,以确定当前用户是否具有支付权限。
图3是本申请实施例的一种***架构示意图,可以用于训练神经网络模型,例如人脸识别模型、活体检测模型。如图3所示,数据采集设备160用于采集训练数据。针对本申请实施例的方法来说,训练数据可以包括训练图像以及训练图像对应的分类结果,其中,训练图像的结果可以是人工预先标注的结果。对于第一卷积神经网络的训练来说,该训练图像包括真实人脸的图像和非真实人脸的图像。对于第二卷积神经网络的训练来说,训练图像则包括真实人脸的图像。
在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。“A/B”描述关联对象的关联关系,表示可以存在三种关系,例如,A/B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。
下面对训练设备120基于训练数据得到目标模型/规则101进行描述。一种情况下,训练设备120对输入的原始图像进行处理,将输出的图像与原始图像进行对比,直到训练设备120输出的图像与原始图像的差值小于一定的阈值,从而完成目标模型/规则101的训练。这种情况下,可以训练得到本申请实施例的人脸识别模型,也就是得到训练好的基础神经网络(即第二卷积神经网络),以便于后续继续利用基础神经网络进一步训练得到活体检测模型。另一种情况下,训练设备120对输入的人脸图像的类别特征向量进行处理,将输出的类别与标签类别进行对比,直到训练设备120输出的类别的准确率大于或等于一定的阈值,从而完成目标模型/规则101的训练。这种情况下,可以训练得到本申请实施例的活体检测模型,也就是在上述基础神经网络的基础上进一步训练得到活体检测模型。
上述目标模型/规则101能够用于实现本申请实施例的方法。本申请实施例中的目标模型/规则101具体可以为神经网络。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的***或设备中,如应用于图3所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在图3中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:客户设备输入的人 脸图像。
预处理模块113和预处理模块114用于根据I/O接口112接收到的输入数据(如人脸图像)进行预处理,在本申请实施例中,也可以没有预处理模块113和预处理模块114(也可以只有其中的一个预处理模块),而直接采用计算模块111对输入数据进行处理。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储***150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储***150中。
最后,I/O接口112将处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图3中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据可以预先获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、或动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图3仅是本申请实施例提供的一种***架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图3中,数据存储***150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储***150置于执行设备110中。
如图3所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101可以是利用本申请实施例的方法得到的神经网络,具体的,本申请实施例的神经网络可以是能够用于活体检测的CNN,或深度卷积神经网络(deep convolutional neural networks,DCNN)等等。
由于CNN是一种非常常见的神经网络,且是本申请实施例重点关注的神经网络,下面结合图4重点对CNN的结构进行详细的介绍。如上文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
在一种实现中,本申请实施例的活体检测方法中的基础神经网络具体采用的神经网络的结构可以如图4所示。
图4是卷积神经网络的结构示意图。在图4中,卷积神经网络(CNN)200可以包括输入层210,层220(层220可以包括卷积层和池化层,或者,层220可以包括卷积层而不包括池化层),以及全连接层230。其中,输入层210可以获取待处理人脸图像,并将 获取到的待处理人脸图像交由层220以及后面的全连接层230进行处理,可以得到图像的处理结果。下面对图4中的CNN 200中内部的层结构进行详细的介绍。
层220:
卷积层:
以图4为例,如图4所示层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。这里对卷积层和池化层的数量和位置仅为举例,可以有更多或更少的卷积层和池化层,且也可以不包括池化层。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个尺寸”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的卷积特征图的尺寸也相同,再将提取到的多个尺寸相同的卷积特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后可以周期性的引入池化层,在如图4中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的目的是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值 作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
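下面给出一个卷积层与池化层运算的示意性草图（假设采用PyTorch实现，其中的通道数、卷积核尺寸、步长等均为示例性假设），用于说明多个卷积核堆叠形成纵深维度、以及池化缩小空间尺寸的效果：

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 128, 128)            # 假设输入为1张128x128的灰度图

conv = nn.Conv2d(in_channels=1, out_channels=16,
                 kernel_size=3, stride=1, padding=1)   # 16个权重矩阵（卷积核）
pool = nn.MaxPool2d(kernel_size=2)         # 最大池化，将空间尺寸减半

feat = conv(x)                             # 输出纵深维度为16：[1, 16, 128, 128]
feat = pool(feat)                          # 空间尺寸缩小为一半：[1, 16, 64, 64]
print(feat.shape)
```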
全连接层(fully connected)230:
在经过层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200进一步利用全连接层230来生成一个或者一组所需要的类的数量的输出。因此,在全连接层230中可以包括多层隐含层(如图4所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在全连接层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图4由210至240方向的传播为前向传播)完成,反向传播(如图4由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
本申请实施例的活体检测方法中的基础神经网络具体采用的神经网络的结构可以如图5所示。在图5中,卷积神经网络(CNN)300可以包括输入层310,层320(层320可以包括卷积层和池化层,其中池化层为可选的),以及全连接层330。与图4相比,图5中的层320中的多个卷积层或池化层并行,将分别提取的特征均输入给全连接层330进行处理。
需要说明的是,图4和图5所示的卷积神经网络仅作为一种本申请实施例的活体检测方法的基础神经网络的两种可能的卷积神经网络的示例,在具体的应用中,本申请实施例的活体检测方法的基础神经网络所采用的卷积神经网络还可以以其他网络模型的形式存在。
需要说明的是,在本申请实施例中,活体检测模型可以是CNN(经过训练之后的第一卷积神经网络),该CNN只是在能够用于人脸识别的基础神经网络(基础神经网络也是CNN结构)的基础上,增加一层或多层用于二分类的全连接层,所以,可以看作是在图4或图5所示结构的基础上,在输出层之后增加一层或多层二分类全连接层。所以图4和图5的输出层输出的待处理人脸图像的处理结果可以称之为类别特征向量,即能够用于分类的人脸的特征向量。
还应理解,经过上述在基础神经网络的基础上得到的用于活体检测的活体检测神经网络(即第一卷积神经网络)依然是CNN结构,所以,本申请实施例的活体检测模型(用于活体检测的CNN)同样可以用图4和图5所示架构表示,只是此时,待处理人脸图像的处理结果是是否为活体的分类结果。
图6是本申请实施例的一种芯片的硬件结构示意图。该芯片包括神经网络处理器(图示NPU600)。该芯片可以被设置在如图3所示的执行设备110中,用以完成计算模块111 的计算工作。该芯片也可以被设置在如图3所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图4、图5所示的卷积神经网络中各层的算法均可在如图6所示的芯片中得以实现。
NPU600作为协处理器挂载到主中央处理器（central processing unit，CPU）（host CPU）上，由主CPU分配任务。NPU的核心部分为运算电路603，控制器604控制运算电路603提取存储器（权重存储器或输入存储器）中的数据并进行运算。
在一些实现中,运算电路603内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路603是二维脉动阵列。运算电路603还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路603是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器602中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器601中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)608中。
向量计算单元607可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元607可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中，向量计算单元607能将经处理的输出的向量存储到统一存储器606。例如，向量计算单元607可以将非线性函数应用到运算电路603的输出，例如累加值的向量，用以生成激活值。在一些实现中，向量计算单元607生成归一化的值、合并值，或二者均有。在一些实现中，处理过的输出的向量能够用作到运算电路603的激活输入，例如用于在神经网络中的后续层中的使用。
统一存储器606用于存放输入数据以及输出数据。
存储单元访问控制器605（direct memory access controller，DMAC）用于将外部存储器中的输入数据搬运到输入存储器601和/或统一存储器606、将外部存储器中的权重数据直接存入权重存储器602，以及将统一存储器606中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)610,用于通过总线实现主CPU、DMAC和取指存储器609之间进行交互。
与控制器604连接的取指存储器(instruction fetch buffer)609,用于存储控制器604使用的指令;
控制器604，用于调用取指存储器609中缓存的指令，实现控制该运算加速器的工作过程。
应理解，这里的输入数据可以根据实际应用场景确定，例如可以为拍摄到的人脸图像等。
可选地,统一存储器606,输入存储器601,权重存储器602以及取指存储器609均为片上(on-chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图4和图5所示的卷积神经网络中各层的运算可以由运算电路603或向量计算单元607执行。
上文中介绍的图3中的执行设备110能够执行本申请实施例的活体检测方法或活体检测模型的训练方法的各个步骤,图4和图5所示的CNN模型和图6所示的芯片也可以用于执行本申请实施例的活体检测方法的各个步骤。
在本申请实施例中,活体检测模型可以是在一个基础神经网络的基础上进一步得到,下面结合图7和图8分别介绍一下两个神经网络。
图7是本申请实施例的基础神经网络的一个示意性结构图。
可选地,基础神经网络(即第二卷积神经网络)可以是已有的用于人脸识别的神经网络,也可以利用训练库来训练得到一个能够用于人脸识别的基础神经网络。由于该基础神经网络是用于人脸识别而不是活体检测,所以可以采用大量的真实人脸的数据来进行训练,可选地,也可以利用非真实人脸的数据来训练。
在一些实现方式中,为了进一步减少运算和存储压力,基础神经网络可以为轻量级神经网络,也就是结构简单、参数较少、需要存储空间较小的神经网络。例如,基础神经网络可以为现有的轻量级卷积神经网络(LightCNN)的人脸识别(face recognition)模型(也可以称之为人脸识别神经网络),例如可以采用9层版本的LightCNN FR模型(以下称之为LightCNN-9)。该LightCNN-9是准确率很高的公开的FR CNN之一,LightCNN-9与其他FR CNN相比,使用更小的参数集就可以具有非常优秀的性能。对于目前的活体检测来说,这样的小网络规模特别适用于车辆等各类运算和存储能力有限的场景。
应理解,上述LightCNN-9只是一个用于人脸识别的轻量级卷积神经网络的一个示例,还可以采用其他LightCNN,例如LightCNN-4、LightCNN-29等。如上文所述,只要是用于人脸识别的神经网络都可以用于本申请的方案,例如还可以采用DeepFace、Webface、FaceNet、或可视几何组(visual geometry group,VGG)网络等各类神经网络,为了简洁不再一一列举。
图7示出了LightCNN-9的卷积层（conv1）、最大特征映射（max-feature-map，MFM）层（MFM1）、池化层（pool1-pool4）、卷积层的组合层（group2-group5）和全连接层（MFM_fc1）。如图7所示，将128x128的输入图像输入到LightCNN-9中，输出层（此处为全连接层MFM_fc1作为输出层）就可以输出256维的特征向量（也可以是128维等其它维数的特征向量）。这些256维的特征向量是一些鉴别性的特征的向量，可以区分开不同人脸的一些特征，例如鼻子的形状位置、眼睛的形状位置等等的特征，称之为类别特征向量。该LightCNN-9可以利用已有的数据量庞大的数据集来训练得到，训练时还可以将数据集中的图片进行随机翻转、随机裁剪等增强后，转换为灰度图再来进行训练，训练时可以采用arcface、cosineface、或sphereface等人脸识别领域常用的损失函数来进行训练，本申请对此不做限定。
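作为一个示意性草图（假设采用PyTorch实现，其中backbone仅用一个占位结构代表已预训练好的、类似LightCNN-9的轻量级人脸识别网络，具体结构和加载方式均为示例性假设），基础神经网络对输入图像的处理可以理解为如下的特征提取接口：

```python
import torch
import torch.nn as nn

# 假设 backbone 为已预训练好的轻量级人脸识别网络（例如LightCNN-9风格的网络），
# 其最后一层全连接层输出256维的类别特征向量；此处仅用占位结构示意输入/输出尺寸
backbone = nn.Sequential(
    nn.Flatten(),
    nn.Linear(128 * 128, 256),
)

face = torch.randn(1, 1, 128, 128)     # 128x128的灰度人脸图像
feature = backbone(face)               # 输出256维类别特征向量
print(feature.shape)                   # torch.Size([1, 256])
```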
需要说明的是,上述输入图像的尺寸和输出的特征向量的维度是由LightCNN-9决定的,如果采用其他的轻量级神经网络,则也可能是其他的输入图像尺寸和输出维度,不存在限定。
在基础神经网络的基础上可以得到图8所示的活体检测神经网络。图8是本申请实施例的活体检测神经网络的示意性结构图。在本申请实施例中活体检测神经网络也称之为活 体检测模型,对其进行训练(即更新其参数)之后就可以得到目标活体检测模型。但应理解,图8中的基础神经网络可以是用于人脸识别的任意神经网络,并不是指局限于图7所示的神经网络,只是为了便于理解方案,采用了与图7所示的相同结构,而在实际中,不存在对于基础神经网络的结构限定,只要是用于人脸识别的CNN均可。
此外,神经网络的参数又可以称为系数或权重,其可以为矩阵的形式。
活体检测神经网络(即第一卷积神经网络)是用于进行活体检测的,经过训练后的活体检测神经网络可以作为目标活体检测模型,用于活体检测。如图8所示,该活体检测神经网络可以划分为三个部分:浅层部分、中间层部分和全连接层部分,其中浅层可以为神经网络较浅的层,但具体前几层作为浅层部分并不存在限定。而中间层则可以理解为除浅层以外的所有非全连接层的层。图8中的全连接层部分包括基础神经网络的全连接层和增加的全连接层,为了便于区分,在图8中分别以FC1和FC2表示基础神经网络的全连接层和新的、不是基础神经网络的组成部分的全连接层。新的全连接层可以是一层也可以是多层。
图8可以理解为是在基础神经网络的最后一层(即作为输出层的全连接层)之后,增加了一层或多层用于二分类的全连接层。该增加的全连接层,用于进行活体判别,也就是判断是否为活体(即判断是真实的人还是非真实的人)。图8所示的活体检测神经网络的输出为检测结果,该检测结果可以理解为对于是否为活体的判定结果,或者可以理解为将输入图像分到两个类别中的一个:“活体”、“非活体”。
研究表明，CNN浅层学习到的信息特定于任务和数据集，因此，可以将CNN中这种浅层的、用于特定领域的特征命名为领域特定单元（domain specific units，DSU），DSU即可以对应于图8所示的浅层部分。CNN中比浅层更高的层（对应于图8中的中间层部分），即中间层的特征则可以在不同的成像域之间共享，学习到的特征是域独立的，在不同任务和数据集中具有较为鲁棒的表现，也就是说，中间层可以在不同的数据集之间参数共享。全连接层则高度特定于任务和数据集。图8在基础神经网络上增加了一个基于回归的分类器（即FC2），该分类器特定于二分类任务，即活体检测任务。
假设图8的基础神经网络是图7所示LightCNN-9，则浅层部分可以包括LightCNN-9中group2及其之前的层，中间层部分可以包括pool2至pool4层，全连接层可以包括MFM_fc1（MFM_fc1即为图8所示的FC1）和FC2层。FC2层将MFM_fc1输出的256维特征向量作为输入，输出是否为活体的二分类结果。可以这样理解，新增的FC2，通过学习具有丰富的人脸的鉴别性特征的256维类别特征向量，来区分图像中的人是否为活体。
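结合上文，活体检测神经网络可以示意性地表示为“基础神经网络 + 新增全连接层FC2”的组合。下面给出一个假设性的PyTorch代码草图（其中backbone代表图7所示的基础神经网络，结构细节省略，类名、维度等均为示例性假设，并非对本申请实施例的限定）：

```python
import torch
import torch.nn as nn

class LivenessNet(nn.Module):
    """活体检测神经网络：基础神经网络（输出256维特征）+ 新增的二分类全连接层FC2。"""
    def __init__(self, backbone, feature_dim=256):
        super().__init__()
        self.backbone = backbone                  # 预训练好的基础神经网络（含FC1/MFM_fc1）
        self.fc2 = nn.Linear(feature_dim, 1)      # 新增的FC2，输出是否为活体的判别分数

    def forward(self, face_image):
        feature = self.backbone(face_image)       # 256维类别特征向量
        score = torch.sigmoid(self.fc2(feature))  # 真实人脸（活体）的预测概率p
        return score
```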
在对活体检测神经网络进行训练时，可以利用二分类交叉熵（binary cross entropy，BCE）损失函数来进行训练，BCE损失函数可以用式子 $L_{BCE}=-(y\log(p)+(1-y)\log(1-p))$ 表示，其中，$L_{BCE}$表示该损失函数的值，y表示是否为活体，取值为1或0，p表示真实人脸的预测概率。
在训练过程中,可以采用随机梯度下降(stochastic gradient descent,SGD)的方法,可以设定学习率learning rate=1e-2,权值衰减weight decay=1e-4,动量momentum=0.90。
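下面给出上述训练过程的一个示意性代码草图（假设采用PyTorch的BCELoss与SGD优化器，其中liveness_net可以理解为前文LivenessNet草图的一个实例，train_loader为假设的训练数据加载器，各名称均为示例性假设，并非对训练实现方式的限定）：

```python
import torch

criterion = torch.nn.BCELoss()                        # 二分类交叉熵(BCE)损失函数
optimizer = torch.optim.SGD(liveness_net.parameters(),
                            lr=1e-2,                  # 学习率 learning rate = 1e-2
                            weight_decay=1e-4,        # 权值衰减 weight decay = 1e-4
                            momentum=0.90)            # 动量 momentum = 0.90

for images, labels in train_loader:                   # labels：1表示活体，0表示非活体
    p = liveness_net(images).squeeze(1)               # 真实人脸的预测概率p
    loss = criterion(p, labels.float())               # L_BCE = -(y*log(p)+(1-y)*log(1-p))
    optimizer.zero_grad()
    loss.backward()                                   # 反向传播
    optimizer.step()                                  # 随机梯度下降更新参数
```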
如上文分析所述,中间层具有域独立性,即可以在不同数据集共享,所以可以基于此来进一步降低训练的运算量。可以冻结中间层的参数(权重),冻结也可以理解为保持、 不更新、不训练,在训练的时候,只更新浅层网络和全连接层的参数,而中间层的参数保持不变。则在不降低训练效果的前提下,大大减少训练过程的运算量。相当于在对图8所示活体检测神经网络进行训练时,可以更新浅层部分和全连接层部分的参数,而中间层的参数则保持不变。
应理解,所谓的保持不变,是说基础神经网络是已经训练好的,所以沿用了基础神经网络的中间层的参数,并不是说中间层的参数从来没被训练过。换而言之,中间层的参数是在基础神经网络的训练阶段(为了便于理解,可以称之为第一个训练阶段)得到的,而在活体检测神经网络的训练阶段(为了便于理解,可以称之为第二个训练阶段),中间层的参数保持不变(即冻结),就是说,第二个训练阶段只更新活体检测神经网络的浅层和全连接层的参数。
还应理解，基础神经网络同样可以划分为浅层部分、中间层部分、全连接层部分，浅层为基础神经网络的前几层，但具体前几层不存在限定，中间层则是除浅层之外的所有非全连接层。基础神经网络的中间层和活体检测神经网络的中间层是同一部分，基础神经网络的浅层和活体检测神经网络的浅层也是同一部分，基础神经网络的全连接层和活体检测神经网络的全连接层则不同，因为活体检测神经网络的全连接层包括增加的用于二分类的全连接层（如图8的FC2），不包括基础神经网络的全连接层（如图8的FC1）。
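对于上述“冻结中间层、只更新浅层和全连接层”的训练方式，下面给出一个示意性的代码草图（假设采用PyTorch实现，其中liveness_net.middle_layers等属性划分仅为示例性假设，实际划分方式取决于所采用的基础神经网络结构，并非对本申请实施例的限定）：

```python
import torch

# 假设 liveness_net.middle_layers 对应图8中的中间层部分
for param in liveness_net.middle_layers.parameters():
    param.requires_grad = False            # 冻结中间层参数：不更新、不训练

# 只把浅层网络和全连接层（FC1/FC2）的参数交给优化器更新
trainable_params = [p for p in liveness_net.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=1e-2,
                            weight_decay=1e-4, momentum=0.90)
```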
图9是本申请实施例的活体检测方法的示意性流程图，下面对图9的各个步骤进行介绍。该活体检测方法可以利用部署了活体检测模型的装置或系统执行，例如可以是移动终端、车载终端、电脑、智慧屏、或智能控制系统等等。
901、获取人脸图像。
人脸图像可以包括真实人脸图像和非真实人脸图像,可以是真实人脸和非真实人脸的图像、录像帧,也可以是真实人脸图像和非真实人脸图像的特征提取之后的特征向量。
非真实人脸可以分为2D类和3D类中的一类或两类，2D类主要包括人脸的打印照片、平板等屏幕类的照片和视频回放等，3D类主要包括3D面具和3D头模等。
人脸图像可以是利用活体检测装置的获取单元获取,该获取单元可以是图像采集设备、通信接口、接口电路等。当该获取单元为图像采集设备时,相当于活体检测装置中集成了图像采集设备,例如,具有摄像头的智能手机就可以看作是一个活体检测装置,则获取单元可以是智能手机的摄像头,当该智能手机执行本申请实施例的活体检测方法时,摄像头获取上述人脸图像,并将上述人脸图像传输给手机处理器,手机处理器再执行后续的步骤。当该获取单元为通信接口或接口电路等具有收发功能的设备时,相当于是通过获取单元从外部的图像采集设备处获取人脸图像。具体采用的连接方式和通信方式可以是电路连接、有线通信、无线通信等任意方式,不存在限定。例如对于车载场景,车辆的控制***就可以用于执行本申请实施例的活体检测方法,则当执行活体检测方法时,就可以是图像采集设备采集人脸图像,并将采集到的人脸图像发送/传输给控制***,控制***中的获取单元执行步骤901,获取人脸图像。
图像采集设备可以包括相机、摄像头等。
常见的摄像头都可以用于获取人脸图像,例如RGB摄像头和近红外(near infra-red,NIR)摄像头。RGB摄像头受光线影响较大,且利用RGB获取的人脸图像需要先进行灰度处理,转换成灰度图,再对灰度图执行后续的活体检测。而近红外摄像头受光线影响较 小,适用范围更广,且屏幕类的图像(包括屏幕上显示照片或播放的视频)无法在近红外摄像头上成像,因为屏幕类的图像无法在近红外摄像头的波长波段内成像,所以采用近红外摄像头,相当于可以过滤掉屏幕类的非真实人脸的图像。也就是说,近红外摄像头具有受光线干扰小、屏蔽屏幕类的非真实人脸的优点。
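对于采用RGB摄像头的情形，如上文所述需要先将图像转换为灰度图再进行后续的活体检测，一个示意性的预处理草图如下（假设采用OpenCV实现，其中的文件名、尺寸等均为示例性假设）：

```python
import cv2

bgr = cv2.imread("face.jpg")                      # RGB摄像头采集的彩色人脸图像
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)      # 先转换为灰度图
gray = cv2.resize(gray, (128, 128))               # 调整为模型输入尺寸（例如128x128）
```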
由于人脸在不同光照环境下的呈现也是不同的,而本申请的目标活体检测模型对于不同光照场景下的人脸图像都能够有良好的表现,也就是说,人脸图像可以包括多个光照场景下的图像。例如可以为以下光照场景:室外晴天、室外多云或阴天、室内光线昏暗、室内光线明亮。
对于一些简单的应用场景,获取的人脸图像足够清晰,受背景影响较小。但对于车辆这一特殊场景,不可避免会受到车内装饰、玻璃反光、密闭等的影响,而本申请的目标活体检测模型对于车内场景的人脸图像依然能够有良好的表现。
对于车辆场景,可以在车辆内设置的一个或多个摄像头来拍摄得到上述人脸图像。
此外,人脸图像也会随着摄像头的位置和角度不同而不同,而本申请的目标活体检测模型对于不同角度和/或不同距离的人脸图像都能够有良好的表现,也就是说,人脸图像可以包括不同角度和/或不同距离的图像。例如,可以在车辆前车窗的两侧分别布置一个摄像头,在仪表位置布局一个摄像头,这些摄像头都可以拍摄到驾驶位的人脸图像。图10是本申请实施例的一种车内摄像头的布局示意图。图10主要介绍的是对驾驶位的拍摄,如图10所示,可以在左前方A、右前方B、方向盘处C、控制面板处D等中的任意多个位置设置朝向驾驶位的摄像头,这样就可以拍摄到驾驶位的不同距离、不同角度的图像。可以看出,图10中的A处摄像头的形状和其他三处摄像头不同,这是为了示出可以同时采用不同种类的摄像头进行布局。但应理解,图10只是一个示例,在实际中可以有很多种布局方式。这样的好处是,驾驶位的人不需要为了配合某个位置的摄像头而动作,即使在专注于驾驶的过程中依然可以完成活体检测的人脸图像采集。例如,如果只有控制面板D处有摄像头,则驾驶位需要右转低头才能让该摄像头采集到人脸图像,交互过程不够友好,还可能对驾驶员造成干扰。但应理解,图10的目的是为了解释对于同一位置可以用多个摄像头来获取人脸图像,但该位置并不只局限于驾驶位,还可以是副驾驶位、后乘坐位等。
可选地,还可以通过在车内布局多个摄像头,实现整车所有人员都能够完成活体检测。图11是本申请实施例的另一种车内摄像头的布局示意图。如图11所示,可以分别在驾驶位前的A处、副驾驶位前的B处、驾驶位座椅后的C处、副驾驶位座椅后的D处设置摄像头,其中A处摄像头用于拍摄驾驶位的人脸图像,B处摄像头用于拍摄副驾驶位的人脸图像,C处摄像头用于拍左后位的人脸图像,D处摄像头用于拍摄右后位的人脸图像。这样就可以实现整车任意位置的人员的活体检测。
图11中的(a)是车辆座舱的俯视图,图11中的(b)是车内的后视图,在图11中的(a)和(b)中分别标注了在不同视图中摄像头的布局情况。
本申请中的摄像头可以是独立的摄像头,也可以是具有拍摄功能的设备的摄像头,例如图11中的(b)中,C和D就是设置在前排座椅上的显示设备的摄像头。
应理解，图11同样只是一个摄像头布局方式的示例，实际中还可以按照需求设置出很多种布局方式，不再一一列举。
为了便于理解，下面结合图11的摄像头布局方式的一个实际应用场景进行介绍。假设在车辆上副驾驶位的人员正在通过车载智能系统购物，在需要支付的时候，副驾驶位发现没有支付权限，而左后位的人员是有支付权限的，此时启动C处摄像头拍摄左后位的人脸图像，就可以完成活体检测，进而在通过活体检测的前提下支付成功。假设在这个场景下，副驾驶位的人员戴上左后位的人员的3D面具，则是B处摄像头采集副驾驶位的人脸图像，但在活体检测的时候，副驾驶位的人脸图像被判定为非活体，就无法支付成功。
902、将人脸图像输入到目标活体检测模型,得到活体检测结果,活体检测结果用于指示人脸图像中的人是否为活体。
该目标活体检测模型包括第一卷积神经网络,第一卷积神经网络包括第二卷积神经网络和全连接层,第二卷积神经网络用于根据人脸图像得到人脸的类别特征向量,全连接层用于根据类别特征向量进行活体判别,得到活体检测结果。
上述第一卷积神经网络(即上文的活体检测神经网络)包括第二卷积神经网络(即上文的基础神经网络)和全连接层两部分,第一卷积神经网络(即上文的活体检测神经网络)可以理解为是在第二卷积神经网络(即上文的基础神经网络)的基础上改造得到的。第一卷积神经网络和第二卷积神经网络的解释可以参照上文活体检测神经网络和基础神经网络的介绍,上述类别特征向量可以参照上文鉴别性特征的向量的解释,为了简洁,不再重复。
第一卷积神经网络和第二卷积神经网络的训练数据集可以不相同,训练过程可以分阶段完成。
在一些实现方式中,上述目标活体检测模型是利用第一训练数据更新第一卷积神经网络的参数得到的,第二卷积神经网络是利用第二训练数据预训练好的,第一训练数据包括真实人脸的数据和非真实人脸的数据,第二训练数据包括真实人脸的数据。这样的实现方式可以进一步提高模型准确性,从而提高活体检测的准确性,同时还可以简化目标活体检测模型的训练过程和降低目标检测模型的训练成本。首先,第二卷积神经网络可以是利用公开的各类数据量丰富的真实人脸的数据集(即第二训练数据可以是公开的真实人脸的数据集中的数据),这就使得第二卷积神经网络能够得到更加充分的训练,具有更加好的提取人脸特征的能力。当然也可以在其中掺入一些非真实人脸的数据,不影响整体效果。甚至还可以直接从已经公开的用于人脸识别的CNN中选取一个作为第二卷积神经网络,例如上文的LightCNN-9。这就可以有效降低训练成本,也就是在对第一卷积神经网络进行训练(更新第一卷积神经网络的参数)的时候,对于训练数据和训练设备的要求都相对较低,因为不需要再对第二卷积神经网络的进行大量充足的训练。其次,第一训练数据(即用于提高活体判别能力的训练数据)不需要数量过多,即对于非真实人脸的数据需求量较小,且不需要非常多的训练次数就可以达到较好的效果,因为对第一卷积神经网络的训练阶段相当于对于参数进行微调的过程,使之具有活体判别能力。
在一些实现方式中,可以只更新第一卷积神经网络的浅层网络的参数和第一卷积神经网络的全连接层的参数。也就是说,目标活体检测模型可以是利用第一训练数据更新第二卷积神经网络的浅层网络的参数和第二卷积神经网络的全连接层的参数和第一卷积神经网络的全连接层的参数得到的。这样可以在不降低训练效果的前提下,大大减少训练过程的运算量,进一步简化第一卷积神经网络的训练和降低训练成本。
在一些实现方式中,第一卷积神经网络的中间层(也是第二卷积神经网络的中间层)的参数可以保持不变。相当于在更新第一卷积神经网络的参数的过程中,冻结了中间层的参数,或者理解为不更新中间层的参数。这是利用了中间层具有域独立性的特点,具体内容可以参照上文相关内容。这样可以在不降低训练效果的前提下,大大减少训练过程的运算量,进一步简化第一卷积神经网络的训练和降低训练成本。
在一些实现方式中,为了进一步减少运算和存储压力,第二卷积神经网络可以为用于人脸识别的轻量级神经网络。也就是采用结构简单、参数较少、需要存储空间较小的神经网络模型,这样利于部署在运算和存储能力有限的应用场景,例如车载场景。
可选地,还可以把上述人脸图像用于目标检测模型的训练。可以发送上述人脸图像,发送的人脸图像用于对目标活体检测模型进行训练,即更新目标活体检测模型的参数。这样可以实现在线更新的目的,使得目标活体检测模型的准确性得到进一步提高。
需要说明的是,上述发送可以是发送给本地设备,也可以是发送给云端设备,也就是发送给可以更新目标检测模型参数的设备即可,不存在限定。
可选地,对于不同应用场景,还可以根据活体检测结果生成不同的执行动作。即上述活体检测方法还可以包括:当活体检测结果指示人脸图像中的人为活体时,执行目标任务的决策,目标任务包括以下至少一项:解锁、账号登录、权限许可或确认支付。即决策是否解锁,是否登录,是否给予权限许可,是否确认支付。例如对于解锁任务来说,当活体检测结果为活体时,进一步判断该人是否具备解锁权限,如果具备解锁权限,就执行解锁,如果不具备则不解锁。
也就是说,在执行目标任务之前先进行是否为活体的判别,如果已经判定是非活体了就没有必要执行后面的任务决策了,提高了决策的安全性。当然也可以先执行决策任务的判定,再执行活体检测,也就是,先判断人脸图像中的人是否具备任务权限,当具备权限的情况下再进一步判断这个人脸图像是否是活体。
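结合上文解锁场景的描述，“先活体判别、再权限判定”的决策流程可以用如下示意性的Python代码草图表示（其中liveness_model、has_permission等接口均为示例性假设，仅用于说明决策顺序，并非对解锁决策模块实现方式的限定）：

```python
def unlock_decision(face_image, liveness_model, has_permission):
    """先进行活体检测，再判断是否具备解锁权限（接口均为示例性假设）。"""
    is_live = liveness_model(face_image)      # 活体检测结果：True表示人脸图像中的人为活体
    if not is_live:
        return False                          # 非活体：直接判定不具备解锁权限，不解锁
    return has_permission(face_image)         # 活体：进一步判定是否具备解锁权限，决定是否解锁
```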
图12是本申请实施例的活体检测模型的训练方法的示意性流程图,下面对图12的各个步骤进行介绍。
1201、获取第一训练数据。
在本申请实施例中，训练数据可以称之为活体检测数据，包括真实人脸的数据和非真实人脸的数据，可以是真实人脸和非真实人脸的图像、录像帧，也可以是真实人脸和非真实人脸的特征提取之后的特征向量。对于真实人脸和非真实人脸的解释可以参照上文相关内容，不再重复介绍。
训练数据可以利用相机、摄像头等采集得到,也可以是从存储设备上读取。
第一训练数据是指用于更新第一卷积神经网络的参数的训练数据，第一训练数据包括真实人脸的数据和非真实人脸的数据，也就是说，使得第一卷积神经网络具备活体判别能力的训练数据需要包括真实人脸的数据和非真实人脸的数据。
由于人脸在不同光照环境下的呈现也是不同的,所以,为了提高训练数据的丰富性,从而提高训练效果,该第一训练数据可以包括多个光照场景下的数据。例如可以为以下光照场景:室外晴天、室外多云或阴天、室内光线昏暗、室内光线明亮。也就是使得训练出能够适用于更多复杂光照场景的活体检测模型。
对于一些简单的应用场景，不需要考虑背景影响，所以获取的活体数据可以共享，也就是说，数据集不会过于考虑获取场景的不同。但对于车辆这一特殊场景，不可避免会受到车内装饰、玻璃反光、密闭等的影响，所以对于车内场景，如果直接将其他场景下的活体数据用来训练，很容易导致训练效果欠佳。为了适应车内场景，可以利用设置在车辆内的一个或多个摄像头拍摄得到第一训练数据，这样训练出来的活体检测模型在车内场景中可以具有良好的表现。
对于车辆场景,真实人脸的数据的获取方法就是直接拍摄得到,所以不再介绍,主要介绍车内非真实人脸的数据如何获取。
如上文所述,由于屏幕类活体数据无法在近红外摄像头上成像,所以采用近红外摄像头可以不获取屏幕类的活体数据。打印照片、3D面具和3D头模则可以是人员坐在车内佩戴上述工具,然后用摄像头获取图像。
由于人脸在从不同角度、距离拍摄时的呈现也是不同的,所以,为了提高训练数据的丰富性,从而提高训练效果,该待训练的活体数据可以包括多个角度下的活体数据。例如对于车内场景:当上述摄像头的数量为多个时,多个摄像头设置在车辆的不同位置,用于得到不同角度和/或不同距离的第一训练数据。例如,可以在车辆前车窗的两侧分别布置一个摄像头,在仪表位置布局一个摄像头,这些摄像头都可以拍摄到驾驶位的数据。对于车内如何布局摄像头以获得不同角度和/或不同距离的训练数据,可以参照上文获取人脸图像的方法,在此不再重复介绍。
可选地,上述摄像头可以为近红外摄像头,这样可以适用于不需要屏幕类作为训练数据的场景,也就是省去了屏幕类的非真实人脸的数据的采集和训练。也就是说,如果是使用近红外摄像头的应用场景,目标检测模型不需要具备对屏幕类的非真实人脸的数据的检测能力,此时只需要利用其他训练数据训练得到目标检测模型即可,可以省去屏幕类数据的采集、处理和训练,从而有效降低训练成本。
可以建立活体检测数据的数据集，从而可以从中选取一些活体检测数据作为上述训练数据或者用于测试活体检测模型的效果等。例如可以建立车内活体检测数据集。该数据集中可以包括真实人脸的图像和非真实人脸的图像，真实人脸的图像可以包括不同配饰（有无帽子、眼镜，以及眼镜的种类）、不同角度（仰头、低头、平视）、不同光照的真实人脸的图像。非真实人脸的图像可以包括2D打印照片、3D头模、3D面具等非真实人脸的图像。为了方便，还可以对数据集中的图像进行编号，例如可以按照行、列编号。
1202、根据上述第一训练数据,更新第一卷积神经网络的参数,得到目标活体检测模型。
目标活体检测模型就是更新后的第一卷积神经网络。
第一卷积神经网络包括第二卷积神经网络和全连接层,第二卷积神经网络用于根据训练数据得到人脸的类别特征向量,全连接层用于根据类别特征向量进行活体判别。
可选地，该第一卷积神经网络可以是上文所述任意一种活体检测神经网络，例如可以是如图8所示的活体检测神经网络。第二卷积神经网络可以是用于人脸识别的神经网络，即上文的基础神经网络。
在一些实现方式中,第二卷积神经网络是利用第二训练数据预训练好的,第二训练数据包括真实人脸的数据。也就是说,第一卷积神经网络和第二卷积神经网络可以分别用不同的训练数据集(第一训练数据和第二训练数据)进行训练。这样的实现方式可以进一步 提高模型准确性,从而提高活体检测的准确性,同时还可以简化目标活体检测模型的训练过程和降低目标检测模型的训练成本。首先,第二卷积神经网络可以是利用公开的各类数据量丰富的真实人脸的数据集(即第二训练数据可以是公开的真实人脸的数据集中的数据),这就使得第二卷积神经网络能够得到更加充分的训练,具有更加好的提取人脸特征的能力。当然也可以在其中掺入一些非真实人脸的数据,不影响整体效果。甚至还可以直接从已经公开的用于人脸识别的CNN中选取一个作为第二卷积神经网络,例如上文的LightCNN-9。这就可以有效降低训练成本,也就是在对第一卷积神经网络进行训练(更新第一卷积神经网络的参数)的时候,对于训练数据和训练设备的要求都相对较低,因为不需要再对第二卷积神经网络的进行大量充足的训练。其次,第一训练数据(即用于提高活体判别能力的训练数据)不需要数量过多,即对于非真实人脸的数据需求量较小,且不需要非常多的训练次数就可以达到较好的效果,因为对第一卷积神经网络的训练阶段相当于对于参数进行微调的过程,使之具有活体判别能力。
在一些情况下,可以理解为,第一卷积神经网络是在第二卷积神经网络的输出层之后增加一层或多层全连接层得到的,增加的全连接层用于进行活体判别,第二卷积神经网络是预训练好的,第二卷积神经网络用于得到人脸的类别特征向量。
在一些实现方式中，在更新第一卷积神经网络的参数时，可以是更新第一卷积神经网络的浅层网络的参数和第一卷积神经网络的全连接层的参数。这样可以在不降低训练效果的前提下，大大减少训练过程的运算量，进一步简化第一卷积神经网络的训练和降低训练成本。
在一些实现方式中,第一卷积神经网络的中间层(也是第二卷积神经网络的中间层)的参数可以保持不变。相当于在更新第一卷积神经网络的参数的过程中,冻结了中间层的参数,或者理解为不更新中间层的参数。这是利用了中间层具有域独立性的特点,具体内容可以参照上文相关内容。这样可以在不降低训练效果的前提下,大大减少训练过程的运算量,进一步简化第一卷积神经网络的训练和降低训练成本。
在一些实现方式中,为了进一步减少运算和存储压力,第二卷积神经网络可以为用于人脸识别的轻量级神经网络。也就是采用结构简单、参数较少、需要存储空间较小的神经网络模型,这样利于部署在运算和存储能力有限的应用场景,例如车载场景。
与现有技术中的训练方法相比，图12所示的训练方法具有训练相对简单、训练数据需求量相对较小、训练得到的模型准确性相对更高的优点。
首先,活体检测数据中真实人脸的数据相对较容易得到,且种类多样、数量充足,而例如2D/3D面具数据这类的非真实人脸的数据是较难获得的,且种类有限、数量很少,导致训练数据分布失衡(这不利于二分类的分类器的训练)和数量不足,也同样会导致现有的活体检测无法达到理想的准确性。其次,如上文所述现有技术的活体检测只注重活体判别的能力却忽视了人脸识别的能力(即人脸特征的提取能力),而事实上人脸特征提取的精度会直接影响到后续活体判别的准确性,所以现有技术无法达到较高的活体检测的准确性,而本申请实施例的方案则充分兼顾两种能力,使得活体检测的准确性有效提高。
此外，由于在训练阶段，训练模型的设备（简称训练设备）需要存储大量的训练数据、模型参数，以及进行训练过程运算；而在执行阶段，部署模型的设备（即利用模型来执行活体检测任务的设备，可以称之为推理设备）需要存储该模型、处理数据及其中间数据，以及进行推理过程的运算，所以，两种设备都需要足够的运算能力和存储能力。而现有技术中的活体检测模型的规模较大、训练复杂、运算量大，对于训练设备的存储和运算能力以及推理设备的存储和执行能力都有较高的要求，导致并不适用于运算能力和/或存储能力较弱的场景，例如车载场景中，车载设备是很难承担复杂运算和大模型存储的，所以现有技术的活体检测模型并不适用于车载场景。
图13是本申请实施例的活体检测装置的示意性框图。图13所示的活体检测装置2000包括获取单元2001和处理单元2002。
获取单元2001和处理单元2002可以用于执行本申请实施例的活体检测方法,具体地,获取单元2001可以执行上述步骤901,处理单元2002可以执行上述步骤902。
处理单元2002能够实现图8所示的活体检测神经网络的功能。
应理解,上述装置2000中的处理单元2002可以相当于下文中的装置3000中的处理器3002。
图14是本申请实施例提供的活体检测装置的硬件结构示意图。图14所示的活体检测装置3000(该装置3000具体可以是一种计算机设备)包括存储器3001、处理器3002、通信接口3003以及总线3004。其中,存储器3001、处理器3002、通信接口3003通过总线3004实现彼此之间的通信连接。
存储器3001可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器3001可以存储程序,当存储器3001中存储的程序被处理器3002执行时,处理器3002和通信接口3003用于执行本申请实施例的活体检测方法的各个步骤。
处理器3002可以采用通用的CPU,微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的活体检测装置中的单元所需执行的功能,或者执行本申请方法实施例的活体检测方法。
处理器3002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的活体检测方法的各个步骤可以通过处理器3002中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器3002还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、ASIC、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器3001,处理器3002读取存储器3001中的信息,结合其硬件完成本申请实施例的活体检测装置中包括的单元所需执行的功能,或者执行本申请方法实施例的活体检测方法。
通信接口3003使用例如但不限于收发器一类的收发装置,来实现装置3000与其他设备或通信网络之间的通信。例如,可以通过通信接口3003获取上述人脸图像。
总线3004可包括在装置3000各个部件(例如,存储器3001、处理器3002、通信接 口3003)之间传送信息的通路。
图15是本申请实施例的活体检测网络的训练装置的示意性框图。图15所示的活体检测网络的训练装置4000包括获取单元4001和训练单元4002。
获取单元4001和训练单元4002可以用于执行本申请实施例的活体检测模型的训练方法,具体地,获取单元4001可以执行上述步骤1201,训练单元4002可以执行上述步骤1202。
应理解,上述装置4000中的训练单元4002可以相当于下文中的装置5000中的处理器5002。
图16是本申请实施例提供的活体检测网络的训练装置的硬件结构示意图。图16所示的活体检测网络的训练装置5000(该装置5000具体可以是一种计算机设备)包括存储器5001、处理器5002、通信接口5003以及总线5004。其中,存储器5001、处理器5002、通信接口5003通过总线5004实现彼此之间的通信连接。
存储器5001可以是ROM,静态存储设备,动态存储设备或者RAM。存储器5001可以存储程序,当存储器5001中存储的程序被处理器5002执行时,处理器5002和通信接口5003用于执行本申请实施例的活体检测网络的训练方法的各个步骤。
处理器5002可以采用CPU,微处理器,ASIC,GPU或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的活体检测网络的训练装置中的单元所需执行的功能,或者执行本申请方法实施例的活体检测网络的训练方法。
处理器5002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的活体检测网络的训练方法的各个步骤可以通过处理器5002中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器5002,还可以是通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器5001,处理器5002读取存储器5001中的信息,结合其硬件完成本申请实施例的活体检测网络的训练装置中包括的单元所需执行的功能,或者执行本申请方法实施例的活体检测网络的训练方法。
通信接口5003使用例如但不限于收发器一类的收发装置,来实现装置5000与其他设备或通信网络之间的通信。例如,可以通过通信接口5003获取上述第一训练数据。
总线5004可包括在装置5000各个部件(例如,存储器5001、处理器5002、通信接口5003)之间传送信息的通路。
应注意,尽管图14所示的装置3000、图16所示的装置5000仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置3000、装置5000还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置3000、装置5000还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置3000、装置5000也可仅仅包括实现本申请实施例所必须的 器件,而不必包括图14、图16中所示的全部器件。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同装置来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统、装置和单元的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。
在本申请所提供的几个实施例中，应该理解到，所揭露的系统、方法和装置，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:通用串行总线闪存盘(USB flash disk,UFD),UFD也可以简称为U盘或者优盘、移动硬盘、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (46)

  1. 一种活体检测方法,其特征在于,包括:
    获取人脸图像;
    将所述人脸图像输入到目标活体检测模型,得到活体检测结果,所述活体检测结果用于指示所述人脸图像中的人是否为活体,所述目标活体检测模型包括第一卷积神经网络,所述第一卷积神经网络包括第二卷积神经网络和全连接层,所述第二卷积神经网络用于根据所述人脸图像得到人脸的类别特征向量,所述全连接层用于根据所述类别特征向量进行活体判别,得到所述活体检测结果。
  2. 如权利要求1所述的方法,其特征在于,所述目标活体检测模型是利用第一训练数据更新所述第一卷积神经网络的参数得到的,所述第二卷积神经网络是利用第二训练数据预训练好的,所述第一训练数据包括真实人脸的数据和非真实人脸的数据,所述第二训练数据包括真实人脸的数据。
  3. 如权利要求2所述的方法,其特征在于,所述目标活体检测模型是利用所述第一训练数据更新所述第二卷积神经网络的浅层网络的参数和所述第二卷积神经网络的全连接层的参数和所述第一卷积神经网络的所述全连接层的参数得到的。
  4. 如权利要求1至3中任一项所述的方法,其特征在于,所述第二卷积神经网络的中间层的参数保持不变。
  5. 如权利要求1至4中任一项所述的方法,其特征在于,所述第二卷积神经网络为用于人脸识别的轻量级神经网络。
  6. 如权利要求1至5中任一项所述的方法,其特征在于,所述人脸图像包括多个光照场景下的图像。
  7. 如权利要求1至6中任一项所述的方法,其特征在于,所述人脸图像是利用设置在车辆内的一个或多个摄像头拍摄得到的。
  8. 如权利要求1至7中任一项所述的方法,其特征在于,当所述摄像头的数量为多个时,多个所述摄像头设置在所述车辆的不同位置,用于得到不同角度和/或不同距离的所述人脸图像。
  9. 如权利要求7或8所述的方法,其特征在于,所述摄像头为近红外摄像头。
  10. 如权利要求1至9中任一项所述的方法,其特征在于,所述方法还包括:
    发送所述人脸图像,所述人脸图像用于对所述目标活体检测模型进行训练。
  11. 如权利要求1至10中任一项所述的方法,其特征在于,所述方法还包括:
    当所述活体检测结果指示所述人脸图像中的人为活体时,执行目标任务的决策,所述目标任务包括以下至少一项:解锁、账号登录、权限许可或确认支付。
  12. 一种活体检测模型的训练方法,其特征在于,包括:
    获取第一训练数据,所述第一训练数据包括真实人脸的数据和非真实人脸的数据;
    根据所述第一训练数据，更新第一卷积神经网络的参数，得到目标活体检测模型，所述第一卷积神经网络包括第二卷积神经网络和全连接层，所述第二卷积神经网络用于根据所述训练数据得到人脸的类别特征向量，所述全连接层用于根据所述类别特征向量进行活体判别。
  13. 如权利要求12所述的训练方法,其特征在于,所述第二卷积神经网络是利用第二训练数据预训练好的,所述第二训练数据包括真实人脸的数据。
  14. 如权利要求12或13所述的训练方法,其特征在于,所述更新第一卷积神经网络的参数,包括:
    更新所述第二卷积神经网络的浅层网络的参数和所述第二卷积神经网络的全连接层的参数和所述第一卷积神经网络的所述全连接层的参数。
  15. 如权利要求12至14中任一项所述的训练方法,其特征在于,所述第二卷积神经网络的中间层的参数保持不变。
  16. 如权利要求12至15中任一项所述的训练方法,其特征在于,所述第二卷积神经网络为用于人脸识别的轻量级神经网络。
  17. 如权利要求12至16中任一项所述的训练方法,其特征在于,所述第一训练数据包括多个光照场景下的数据。
  18. 如权利要求12至17中任一项所述的训练方法,其特征在于,所述第一训练数据是利用设置在车辆内的一个或多个摄像头拍摄得到的。
  19. 如权利要求18所述的训练方法,其特征在于,当所述摄像头的数量为多个时,多个所述摄像头设置在所述车辆的不同位置,用于得到不同角度和/或不同距离的所述第一训练数据。
  20. 如权利要求18或19所述的训练方法,其特征在于,所述摄像头为近红外摄像头。
  21. 一种活体检测装置,其特征在于,包括:
    获取单元,用于获取人脸图像;
    处理单元,用于将所述人脸图像输入到目标活体检测模型,得到活体检测结果,所述活体检测结果用于指示所述人脸图像中的人是否为活体,所述目标活体检测模型包括第一卷积神经网络,所述第一卷积神经网络包括第二卷积神经网络和全连接层,所述第二卷积神经网络用于根据所述人脸图像得到人脸的类别特征向量,所述全连接层用于根据所述类别特征向量进行活体判别,得到所述活体检测结果。
  22. 如权利要求21所述的装置,其特征在于,所述目标活体检测模型是利用第一训练数据更新所述第一卷积神经网络的参数得到的,所述第二卷积神经网络是利用第二训练数据预训练好的,所述第一训练数据包括真实人脸的数据和非真实人脸的数据,所述第二训练数据包括真实人脸的数据。
  23. 如权利要求22所述的装置,其特征在于,所述目标活体检测模型是利用所述第一训练数据更新所述第二卷积神经网络的浅层网络的参数和所述第二卷积神经网络的全连接层的参数和所述第一卷积神经网络的所述全连接层的参数得到的。
  24. 如权利要求21至23中任一项所述的装置,其特征在于,所述第二卷积神经网络的中间层的参数保持不变。
  25. 如权利要求21至24中任一项所述的装置,其特征在于,所述第二卷积神经网络为用于人脸识别的轻量级神经网络。
  26. 如权利要求21至25中任一项所述的装置,其特征在于,所述人脸图像包括多个光照场景下的图像。
  27. 如权利要求21至26中任一项所述的装置,其特征在于,所述人脸图像是利用设置在车辆内的一个或多个摄像头拍摄得到的。
  28. 如权利要求21至27中任一项所述的装置,其特征在于,当所述摄像头的数量为多个时,多个所述摄像头设置在所述车辆的不同位置,用于得到不同角度和/或不同距离的所述人脸图像。
  29. 如权利要求27或28所述的装置,其特征在于,所述摄像头为近红外摄像头。
  30. 如权利要求21至29中任一项所述的装置,其特征在于,所述装置还包括:
    发送单元,用于发送所述人脸图像,所述人脸图像用于对所述目标活体检测模型进行训练。
  31. 如权利要求21至30中任一项所述的装置,其特征在于,所述处理单元还用于:
    当所述活体检测结果指示所述人脸图像中的人为活体时,执行目标任务的决策,所述目标任务包括以下至少一项:解锁、账号登录、权限许可或确认支付。
  32. 一种活体检测模型的训练装置,其特征在于,包括:
    获取单元,用于获取第一训练数据,所述第一训练数据包括真实人脸的数据和非真实人脸的数据;
    训练单元,用于根据所述第一训练数据,更新第一卷积神经网络的参数,得到目标活体检测模型,所述第一卷积神经网络包括第二卷积神经网络和全连接层,所述第二卷积神经网络用于根据所述训练数据得到人脸的类别特征向量,所述全连接层用于根据所述类别特征向量进行活体判别。
  33. 如权利要求32所述的训练装置,其特征在于,所述第二卷积神经网络是利用第二训练数据预训练好的,所述第二训练数据包括真实人脸的数据。
  34. 如权利要求32或33所述的训练装置,其特征在于,所述训练单元具体用于:
    更新所述第二卷积神经网络的浅层网络的参数和所述第二卷积神经网络的全连接层的参数和所述第一卷积神经网络的全连接层的参数。
  35. 如权利要求32至34中任一项所述的训练装置,其特征在于,所述第二卷积神经网络的中间层的参数保持不变。
  36. 如权利要求32至35中任一项所述的训练装置,其特征在于,所述第二卷积神经网络为用于人脸识别的轻量级神经网络。
  37. 如权利要求32至36中任一项所述的训练装置,其特征在于,所述第一训练数据包括多个光照场景下的数据。
  38. 如权利要求32至37中任一项所述的训练装置,其特征在于,所述第一训练数据是利用设置在车辆内的一个或多个摄像头拍摄得到的。
  39. 如权利要求38所述的训练装置,其特征在于,当所述摄像头的数量为多个时,多个所述摄像头设置在所述车辆的不同位置,用于得到不同角度和/或不同距离的所述第一训练数据。
  40. 如权利要求38或39所述的训练装置,其特征在于,所述摄像头为近红外摄像头。
  41. 一种计算机可读存储介质,其特征在于,所述计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行如权利要求1至11中任一项或者权利要求12至20中任一项所述方法的指令。
  42. 一种活体检测装置,其特征在于,所述装置包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,以执行如权利要求1至11中任一项所述的方法。
  43. 一种活体检测模型的训练装置,其特征在于,所述训练装置包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,以执行如权利要求12至20中任一项所述的训练方法。
  44. 一种计算机程序产品,其特征在于,当所述计算机程序在计算机上执行时,使得所述计算机执行如权利要求1至11中任一项或者权利要求12至20中任一项所述的方法。
  45. 一种活体检测模型,其特征在于,所述活体检测模型包括第一卷积神经网络,所述第一卷积神经网络包括第二卷积神经网络和全连接层,所述第二卷积神经网络用于根据待处理人脸图像得到人脸的类别特征向量,所述全连接层用于根据所述类别特征向量进行活体判别,得到活体检测结果,所述活体检测结果用于指示所述待处理人脸图像中的人是否为活体。
  46. 如权利要求45所述的活体检测模型,其特征在于,所述活体检测模型是利用如权利要求12至20中任一项所述的训练方法得到的。
PCT/CN2021/095597 2021-05-24 2021-05-24 活体检测方法、活体检测模型的训练方法及其装置和*** WO2022246612A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180040878.6A CN116057587A (zh) 2021-05-24 2021-05-24 活体检测方法、活体检测模型的训练方法及其装置和***
PCT/CN2021/095597 WO2022246612A1 (zh) 2021-05-24 2021-05-24 活体检测方法、活体检测模型的训练方法及其装置和***

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/095597 WO2022246612A1 (zh) 2021-05-24 2021-05-24 活体检测方法、活体检测模型的训练方法及其装置和***

Publications (1)

Publication Number Publication Date
WO2022246612A1 true WO2022246612A1 (zh) 2022-12-01

Family

ID=84229396

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095597 WO2022246612A1 (zh) 2021-05-24 2021-05-24 活体检测方法、活体检测模型的训练方法及其装置和***

Country Status (2)

Country Link
CN (1) CN116057587A (zh)
WO (1) WO2022246612A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075295A1 (en) * 2016-09-14 2018-03-15 Kabushiki Kaisha Toshiba Detection apparatus, detection method, and computer program product
CN108416324A (zh) * 2018-03-27 2018-08-17 百度在线网络技术(北京)有限公司 用于检测活体的方法和装置
CN108983979A (zh) * 2018-07-25 2018-12-11 北京因时机器人科技有限公司 一种手势跟踪识别方法、装置和智能设备
CN109886087A (zh) * 2019-01-04 2019-06-14 平安科技(深圳)有限公司 一种基于神经网络的活体检测方法及终端设备

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116562338A (zh) * 2022-01-27 2023-08-08 美的集团(上海)有限公司 多分支卷积结构、神经网络模型及其确定方法、确定装置
CN115424335A (zh) * 2022-11-03 2022-12-02 智慧眼科技股份有限公司 活体识别模型训练方法、活体识别方法及相关设备
CN115424335B (zh) * 2022-11-03 2023-08-04 智慧眼科技股份有限公司 活体识别模型训练方法、活体识别方法及相关设备

Also Published As

Publication number Publication date
CN116057587A (zh) 2023-05-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21942201

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21942201

Country of ref document: EP

Kind code of ref document: A1