CN117275075B - Face occlusion detection method, system, device and storage medium - Google Patents

Face occlusion detection method, system, device and storage medium

Info

Publication number
CN117275075B
CN117275075B (application number CN202311438777.1A)
Authority
CN
China
Prior art keywords
face data
face
occlusion
loss function
gradient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311438777.1A
Other languages
Chinese (zh)
Other versions
CN117275075A (en)
Inventor
葛罗棋
夏鑫
金朝汇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Original Assignee
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Tonghuashun Intelligent Technology Co Ltd filed Critical Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority to CN202311438777.1A priority Critical patent/CN117275075B/en
Publication of CN117275075A publication Critical patent/CN117275075A/en
Application granted granted Critical
Publication of CN117275075B publication Critical patent/CN117275075B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The embodiments of this specification provide a face occlusion detection method, system, device, and storage medium. The method is executed by a processor and comprises: processing initial face data to determine target face data; and determining an occlusion result through an occlusion detection model based on the target face data. The occlusion detection model is a first machine learning model obtained through joint training with a gradient prediction model, which is a second machine learning model. The gradient prediction model determines a first gradient image based on an intermediate feature image, and the intermediate feature image is the output of an intermediate layer of the initial occlusion detection model.

Description

Face occlusion detection method, system, device and storage medium
Technical Field
The present disclosure relates to the field of computer vision processing technologies, and in particular, to a face occlusion detection method, system, device, and storage medium.
Background
With the development of computer vision processing technology, face recognition plays an important role in many fields, and occlusion of a face image directly affects the face recognition result. Face occlusion detection is therefore an essential part of a face recognition system. Current approaches typically recognize face cells segmented by image processing with a machine learning algorithm, detect occlusion from facial key points and/or facial features, or perform binary classification of occlusion with a deep convolutional neural network. These approaches suffer from limited detection scenarios, poor generalization capability and robustness, limited detection results, and large training sample requirements.
Therefore, it is desirable to provide a face occlusion detection method, system, device, and storage medium that can classify occlusion results for various face images and improve the efficiency and accuracy of face occlusion detection.
Disclosure of Invention
One or more embodiments of the present specification provide a face occlusion detection method, the method performed by a processor, comprising: processing initial face data to determine target face data; and determining an occlusion result through an occlusion detection model based on the target face data, wherein the occlusion detection model is a first machine learning model obtained through joint training with a gradient prediction model, the gradient prediction model is a second machine learning model, the gradient prediction model determines a first gradient image based on an intermediate feature image, and the intermediate feature image is the output of an intermediate layer of the initial occlusion detection model.
One or more embodiments of the present specification provide a face occlusion detection system, the system comprising: a first determining module configured to process initial face data and determine target face data; and a second determining module configured to determine an occlusion result through an occlusion detection model based on the target face data, wherein the occlusion detection model is a first machine learning model obtained through joint training with a gradient prediction model, the gradient prediction model is a second machine learning model, the gradient prediction model determines a first gradient image based on an intermediate feature image, and the intermediate feature image is the output of an intermediate layer of the initial occlusion detection model.
One or more embodiments of the present disclosure provide a face occlusion detection device, where the device includes at least one memory and at least one processor, where the at least one memory is configured to store computer instructions, and the at least one processor executes the computer instructions or a part of the instructions to implement the face occlusion detection method described above.
One or more embodiments of the present specification provide a computer-readable storage medium storing computer instructions that, when read by a computer, perform a face occlusion detection method as described above.
Drawings
The present specification will be further elucidated by way of example embodiments, which will be described in detail by means of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:
fig. 1 is a schematic view of an application scenario of a face occlusion detection system according to some embodiments of the present disclosure;
FIG. 2 is an exemplary block diagram of a face occlusion detection system shown in accordance with some embodiments of the present description;
FIG. 3 is an exemplary flow chart of a face occlusion detection method according to some embodiments of the present description;
FIG. 4 is an exemplary diagram of acquiring occlusion detection models through joint training, shown in accordance with some embodiments of the present description;
fig. 5 is an exemplary diagram of determining target face data according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
It will be appreciated that "system," "apparatus," "unit" and/or "module" as used herein is one method for distinguishing between different components, elements, parts, portions or assemblies at different levels. However, if other words can achieve the same purpose, the words can be replaced by other expressions.
As used in this specification and the claims, the terms "a," "an," and/or "the" are not specific to the singular and may include the plural unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.
A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that the preceding or following operations are not necessarily performed in order precisely. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
Face occlusion detection plays an important role in many application scenarios. For example, in the generation of digital human broadcast video, mosaic occlusion of faces easily occurs; through face occlusion detection, occluded faces can be found and removed in time, improving the visual quality of the digital human broadcast video. When a user uploads a photo, face occlusion detection can be used to judge whether the uploaded photo is qualified. In face recognition applications, the error rate when recognizing occluded faces is high; occluded pictures can be filtered out through face occlusion detection, improving the accuracy of face recognition.
At present, occlusion detection that segments the face through image processing can only extract simple hand-crafted features or features designed for specific scenarios; it is difficult to obtain high-level semantic features and complex feature information, so the generalization capability and robustness of face occlusion detection are poor and overall regulation is lacking. Techniques that detect occlusion based on facial key points and facial features struggle to detect occlusion of small areas or of regions outside the facial features. Deep learning methods for face occlusion detection require training sets containing large numbers of samples with different occluding objects and occlusion poses, so training costs are high and the generalization capability of the algorithm is low.
In view of this, some embodiments of the present disclosure provide a face occlusion detection method, system, device, and storage medium, which process target face data through an occlusion detection model to determine an occlusion result, where the target face data is obtained from initial face data and the occlusion detection model is obtained through joint training with a gradient prediction model, thereby improving the efficiency and accuracy of face occlusion detection.
Fig. 1 is a schematic view of an application scenario of a face occlusion detection system according to some embodiments of the present disclosure. As shown in fig. 1, an application scenario 100 of a face occlusion detection system may include a processor 110, a network 120, a user terminal 130, and a storage device 140.
The processor 110 may process data and/or information obtained from other devices or system components. The processor 110 may execute program instructions based on such data, information, and/or processing results to perform one or more of the functions described in the embodiments herein. For example, the processor 110 may detect initial face data by performing the face occlusion detection methods disclosed in this specification to determine an occlusion result. Illustratively, the processor 110 may process the initial face data to determine target face data, and determine an occlusion result through an occlusion detection model based on the target face data.
In some embodiments, the processor 110 may communicate with the user terminal 130, the storage device 140 over the network 120 to provide various functions of the online service.
In some embodiments, the processor 110 may contain one or more sub-processing devices (e.g., single-core or multi-core processing devices). By way of example only, the processor 110 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a digital signal processor (DSP), a microprocessor, or the like, or any combination thereof. In some embodiments, the processor 110 may be implemented on a cloud platform. For example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, or the like, or any combination thereof.
The network 120 may include any suitable network capable of facilitating the exchange of information and/or data of the application scenario 100 of the face occlusion detection system. Network 120 enables communication between components and other parts of the system to facilitate the exchange of data and/or information. For example, the initial face data stored by the storage device 140 may be transmitted to the processor 110 for processing via the network 120. For another example, the processor 110 may transmit the occlusion result to the storage device 140 over the network 120.
In some embodiments, network 120 may be any one or more of a wired network or a wireless network. For example, the network 120 may include a fiber optic network, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a bluetooth network, a ZigBee network (ZigBee), a cable connection, and the like, or any combination thereof. In some embodiments, network 120 may be a point-to-point, shared, centralized, etc. variety of topologies or combinations of topologies.
User terminal 130 refers to one or more terminal devices or software used by a user. The user may interact with the face occlusion detection system via the user terminal 130. By way of example, the user terminal 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, or the like, or any combination thereof. In some embodiments, the user may send face data to the processor 110 through the user terminal 130 and receive occlusion results fed back by the processor 110. In some embodiments, the processor 110 may be part of the user terminal 130.
Storage device 140 may be used to store data, instructions, and/or any other information. In some embodiments, the storage device 140 may store data and/or information acquired from at least one component of the application scenario 100 of the face occlusion detection system or an external data source. In some embodiments, the storage device 140 may also store data and/or instructions related to face occlusion detection. For example, the storage device 140 may store computer instructions for face occlusion detection. For another example, the storage device 140 may store initial face data, target face data, occlusion results, and the like.
In some embodiments, the storage device 140 may be connected to the network 120 for communication with the processor 110.
In some embodiments, the storage device 140 may include random access memory (RAM), read-only memory (ROM), mass storage, removable memory, volatile read-write memory, or the like, or any combination thereof. By way of example, mass storage may include magnetic disks, optical disks, solid state drives, and the like. In some embodiments, the storage device 140 may be implemented on a cloud platform.
For more details on the initial face data, target face data, occlusion results, etc. described above, see fig. 2-5 and their associated description.
It should be noted that the application scenario 100 of the face occlusion detection system is provided for illustrative purposes only and is not intended to limit the scope of the present application. Many modifications and variations will be apparent to those of ordinary skill in the art in light of the present description. For example, the application scenario 100 of the face occlusion detection system may implement similar or different functionality on other devices. However, such changes and modifications do not depart from the scope of the present application.
FIG. 2 is an exemplary block diagram of a face occlusion detection system according to some embodiments of the present description. As shown in fig. 2, the face occlusion detection system 200 may include a first determination module 210 and a second determination module 220. The face occlusion detection system 200 according to the embodiments of the present specification will be described in detail below. It should be noted that the following examples are only for explaining the present specification and do not constitute a limitation of it.
In some embodiments, the first determination module 210 may be configured to process the initial face data to determine the target face data.
In some embodiments, the first determination module 210 may be further configured to determine a face detection result by a face detection model based on the initial face data, the face detection model being a third machine learning model; and performing edge expansion processing on the face detection result to determine target face data.
In some embodiments, the second determination module 220 may be configured to determine the occlusion result by an occlusion detection model based on the target face data, wherein the occlusion detection model is a first machine learning model, the occlusion detection model is obtained by training in combination with a gradient prediction model, the gradient prediction model is a second machine learning model, the gradient prediction model determines the first gradient image based on an intermediate feature image, the intermediate feature image being an output of an intermediate layer of the initial occlusion detection model.
In some embodiments, the face occlusion detection system 200 may further include a training module 230, and the training module 230 may be configured to obtain the occlusion detection model through joint training with the gradient prediction model.
In some embodiments, the loss function of the joint training includes a first loss function; the training samples of the joint training comprise sample target face data, and the labels corresponding to the training samples comprise actual occlusion results. The first loss function reflects the difference between a sample occlusion result and the actual occlusion result, where the sample occlusion result is determined through the initial occlusion detection model based on the sample target face data.
In some embodiments, the loss function of the joint training further comprises a second loss function; the second loss function reflects a gradient difference between the first gradient image and a second gradient image of the sample target face data, and the intermediate feature image is determined by an intermediate layer of the initial occlusion detection model based on the sample target face data.
In some embodiments, the second gradient image is determined by a laplace transform based on the sample target face data.
In some embodiments, the weight corresponding to the first loss function is greater than the weight corresponding to the second loss function.
For more details on the contents of the above-described target face data, occlusion result, face detection result, occlusion detection model, gradient prediction model, etc., see description of other parts of the present specification (e.g., fig. 3).
It should be noted that the above description of the face occlusion detection system 200 and its modules is for descriptive convenience only and does not limit the present disclosure to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the principles of the system, various modules may be combined arbitrarily, or a subsystem may be constructed and connected to other modules, without departing from these principles. In some embodiments, the first determination module 210, the second determination module 220, and the training module 230 disclosed in fig. 2 may be different modules in one system, or one module may implement the functions of two or more of the modules described above. For example, the modules may share one memory module, or each module may have its own memory module. Such variations are within the scope of the present description.
Fig. 3 is an exemplary flow chart of a face occlusion detection method according to some embodiments of the present description. As shown in fig. 3, the process 300 may include the following steps. In some embodiments, the process 300 may be performed by a processor.
Step 310, the initial face data is processed to determine target face data.
The initial face data refers to initially acquired data related to face features. For example, the initial face data may include an initially acquired face image or the like. In some embodiments, the initial face data may be obtained in a variety of ways. For example, the initial face data may be captured by an image capturing device, which may include a camera or the like.
The target face data refers to processed data related to face features. In some embodiments, the target face data may be a processed face image, e.g., a face RGB image of size 3×112×112 (3 channels, 112×112 pixels).
In some embodiments, the processor may process the initial face data in a number of ways to determine the target face data. For example, the processor may process the initial face data according to preset rules, which cover multiple aspects of processing and may be set in advance by a technician based on experience; for instance, the preset rules may include image normalization of the brightness, contrast, and color of the initial face data.
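For illustration only, such preprocessing might look like the following sketch; the resize target, the RGB conversion, and the per-channel statistics are assumptions rather than details given in this specification:

```python
# A minimal preprocessing sketch (assumed details; the specification only
# names brightness/contrast/color normalization as examples of preset rules).
import cv2
import numpy as np

def normalize_face(img_bgr: np.ndarray, size: int = 112) -> np.ndarray:
    """Turn an initial face image into normalized target face data."""
    img = cv2.resize(img_bgr, (size, size))        # fixed spatial size
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)     # RGB channel order
    img = img.astype(np.float32) / 255.0           # scale to [0, 1]
    mean = img.reshape(-1, 3).mean(axis=0)
    std = img.reshape(-1, 3).std(axis=0) + 1e-6
    return (img - mean) / std                      # per-channel normalization
```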
In some embodiments, the processor may determine the face detection result through the face detection model based on the initial face data, and perform edge-expansion processing on the face detection result to determine the target face data. For more on this part see fig. 5 and its related description.
Step 320, determining an occlusion result through an occlusion detection model based on the target face data.
Occlusion detection model 320-2 refers to a model for determining occlusion results. The occlusion detection model may be a first machine learning model. In some embodiments, occlusion detection model 320-2 may be a multi-layer lightweight neural network, such as a MobileNetV3-Small network or another viable structure.
A Squeeze-and-Excitation (SE) attention module is introduced into the MobileNetV3-Small network; it uses the relationships among feature channels to enhance the learning capacity of the network, so that occlusion of non-central face areas such as the chin and ears can be dynamically adapted to and effectively judged. The MobileNetV3-Small network has a small parameter count and is lightweight, which allows face occlusion detection to run in resource-constrained environments and improves detection efficiency.
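For illustration, a minimal backbone sketch is given below, assuming the torchvision implementation of MobileNetV3-Small (whose bottleneck blocks already contain SE modules); the classification head and the split point used to expose an intermediate feature map are assumptions, not details from this specification:

```python
# A sketch of an occlusion detection backbone, assuming torchvision's
# mobilenet_v3_small; the head and the intermediate split point are assumptions.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class OcclusionDetector(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v3_small()        # SE attention is built into this network
        self.features = backbone.features      # convolutional stages
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(576, 1)          # 576 = channel width of the last stage

    def forward(self, x: torch.Tensor):
        # expose one intermediate layer's output for the gradient prediction branch;
        # the split position here is illustrative and would be chosen per requirements
        mid = self.features[:4](x)
        deep = self.features[4:](mid)
        logit = self.head(self.pool(deep).flatten(1))
        return logit, mid
```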
In some embodiments, the input of occlusion detection model 320-2 may include target face data 320-1 and the output may be occlusion result 320-3.
The occlusion result refers to the judgment of whether an occluding object is present in the face image. For example, occlusion results may include occluded and non-occluded: occlusion may be represented by the value 1 and non-occlusion by the value 0.
In some embodiments, the occlusion detection model may be obtained in a number of ways. For example, the occlusion detection model may be obtained by training the initial occlusion detection model alone. In some embodiments, the processor may input a first training sample (sample target face data) into the initial occlusion detection model, and construct a separately trained loss function based on the output result of the initial occlusion detection model (sample occlusion result) and the first label (actual occlusion result).
The separate training of the initial occlusion detection model may include one or more iterations. For example only, in the current iteration, for each first training sample, the processor may input the sample target face data into an intermediate occlusion detection model to determine a sample occlusion result. If the current iteration is the first iteration, the intermediate occlusion detection model may be the initial occlusion detection model; otherwise, it may be the occlusion detection model updated in the previous iteration.
In some embodiments, the processor may iteratively update parameters of the initial occlusion detection model based on the plurality of first training samples such that the individually trained loss function satisfies a preset condition (e.g., the loss function converges, or the loss function value is less than a preset value, etc.). When the loss function trained independently meets the preset condition, model training is completed, and the middle occlusion detection model at the completion of the iteration can be used as an occlusion detection model. Methods of updating parameters of the initial occlusion detection model may include, but are not limited to, batch gradient descent algorithms (Batch Gradient Descent, BGD), random gradient descent algorithms (Stochastic Gradient Descent, SGD), and the like. The first training sample and the first label of the separate training may be consistent with those of the joint training, and specific content of the sample target face data, the sample shielding result and the actual shielding result may be referred to as fig. 4 and related descriptions thereof.
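A minimal sketch of such separate training is given below; the optimizer settings, epoch count, and binary cross-entropy criterion are illustrative assumptions, and the model is assumed to return a (logit, intermediate-feature) pair as in the backbone sketch above:

```python
# A sketch of training the initial occlusion detection model alone.
import torch
import torch.nn as nn

def train_alone(model, loader, epochs: int = 10, lr: float = 1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # SGD, one of the methods named above
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for faces, labels in loader:     # sample target face data, actual occlusion results
            logit, _ = model(faces)
            loss = bce(logit.squeeze(1), labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```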
The initial occlusion detection model refers to the base model from which the occlusion detection model is obtained through training. In some embodiments, the initial occlusion detection model may be a machine learning model with a custom structure or another structure; for example, the initial occlusion detection model may be a MobileNetV3-Small model trained from scratch.
In some embodiments, the occlusion detection model may be obtained by co-training an initial occlusion detection model with a gradient prediction model. For more on joint training see fig. 4 and its related description.
The gradient prediction model refers to a model for predicting pixel gradients in an image. The gradient prediction model may be a second machine learning model. In some embodiments, the gradient prediction model may be a multi-layer convolutional neural network, e.g., a three-layer convolutional network or another viable structure, where each of the three layers may be a convolutional layer (Conv layer) + batch normalization layer (BN layer) + rectified linear unit layer (ReLU layer) structure.
In some embodiments, the input of the gradient prediction model may comprise an intermediate feature image and the output may be a first gradient image.
The intermediate feature image refers to an image capable of describing deeper features of the face, e.g., a feature map of size 64×24×24 (64 channels, 24×24 pixels). In some embodiments, the intermediate feature image may be obtained from the output of an intermediate layer of the initial occlusion detection model.
The intermediate layer refers to a layer structure located between the input layer and the output layer of the occlusion detection model. In some embodiments, an intermediate layer may be a layer structure that performs feature extraction. In some embodiments, there may be a plurality of intermediate layers, for example a plurality of convolution layers, whose inputs and outputs are ordered: the input of the first intermediate layer is the output of the input layer, and the output of each intermediate layer is the input of the next. The output of a chosen intermediate layer serves as the intermediate feature image; which layer is chosen may be preset according to actual requirements or historical experience.
The first gradient image refers to a visualized image reflecting the magnitude of change in the face data. In some embodiments, the first gradient image may be a face image of size 1×24×24 (1 channel, 24×24 pixels).
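A minimal sketch of a three-layer Conv+BN+ReLU gradient prediction branch consistent with the shapes above is given below; the intermediate channel widths are assumptions:

```python
# A sketch of the gradient prediction model: three Conv+BN+ReLU stages mapping
# a 64x24x24 intermediate feature image to a 1x24x24 first gradient image.
import torch.nn as nn

def conv_bn_relu(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class GradientPredictor(nn.Module):
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            conv_bn_relu(in_channels, 32),   # channel widths are illustrative
            conv_bn_relu(32, 16),
            conv_bn_relu(16, 1),             # one-channel gradient map
        )

    def forward(self, mid):
        return self.net(mid)
```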
In some embodiments, the processor may perform binary classification on the target face data through the occlusion detection model to determine the occlusion result.
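For illustration, inference might look like the following sketch; the sigmoid output and the 0.5 decision threshold are assumptions, and the model is assumed to return a (logit, intermediate-feature) pair as in the backbone sketch above:

```python
# A sketch of binary occlusion prediction on one target face image.
import torch

@torch.no_grad()
def detect_occlusion(model, face: torch.Tensor) -> int:
    logit, _ = model(face.unsqueeze(0))   # face: a 3x112x112 target face tensor
    prob = torch.sigmoid(logit).item()
    return 1 if prob >= 0.5 else 0        # 1 = occluded, 0 = not occluded
```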
In some embodiments of the present disclosure, processing the initial face data to obtain the target face data improves the quality of the input to the occlusion detection model and thus its recognition capability. Obtaining the occlusion detection model through joint training of the initial occlusion detection model and the gradient prediction model combines the feature representation capabilities of different models and learns richer, more abstract feature representations, improving the performance and generalization capability of the occlusion detection model so that occlusion results are determined more accurately.
FIG. 4 is an exemplary diagram illustrating acquisition of occlusion detection models through joint training according to some embodiments of the present description.
In some embodiments, the occlusion detection model may be obtained by co-training an initial occlusion detection model with a gradient prediction model. In some embodiments, the processor may iteratively update parameters of the initial occlusion detection model and the gradient prediction model based on a plurality of training samples of the joint training such that a loss function of the joint training meets a preset condition (e.g., the loss function converges, or the loss function value is less than a preset value). When the loss function of the combined training meets the preset condition, model training is completed, and the trained initial shielding detection model can be used as a shielding detection model.
In some embodiments, the jointly trained penalty function 409 may include a first penalty function 409-1.
In some embodiments, the training samples of the joint training may include sample target face data 401. Sample target face data 401 refers to target face data used as a training sample for joint training. For more content on the target face data, see fig. 3 and its associated description.
In some embodiments, the sample target face data may be obtained by processing the sample initial face data. For example, the processor may determine a sample face detection result by a face detection model based on the sample initial face data; and performing edge expansion processing on the sample face detection result to determine sample target face data. The above-mentioned process of processing the sample initial face data is similar to the process of processing the initial face data, and specific reference may be made to fig. 5, which is not repeated herein.
The sample initial face data may be obtained based on a face data set. In some embodiments, the processor may construct the face data set in a variety of ways, for example from a public face detection data set (e.g., the MAFA data set or an Asian face data set), occluded face pictures generated from digital human broadcast video, self-captured occluded face pictures, and the like. The number of sample initial face data items and the ratio of occluded to non-occluded face data may be preset manually based on experience or set by system default; for example, the ratio may be 1:1.
In some embodiments, the label corresponding to the training sample may include the actual occlusion result. The actual occlusion result refers to the actual category of whether the sample target face data is occluded, and it can be obtained through manual labeling.
In some embodiments, the first loss function 409-1 may reflect the difference between the sample occlusion result 403 and the actual occlusion result 404; for example, the first loss function 409-1 may be a cross-entropy loss function (Cross Entropy Loss Function). Illustratively, the first loss function may be expressed as shown in equation (1):

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right) \right] \tag{1}$$

where $L_1$ is the first loss function, $x_i$ is the $i$-th training sample (the $i$-th sample target face data), $y_i$ is the label corresponding to the $i$-th training sample (the actual occlusion result), $\hat{y}_i$ is the sample occlusion result predicted for $x_i$, and $N$ is the number of training samples.
The sample occlusion result 403 refers to the predicted category of whether the sample target face data is occluded. In some embodiments, the sample occlusion result 403 may be determined by the initial occlusion detection model 402 based on the sample target face data 401. For more on the initial occlusion detection model, see FIG. 3 and its associated description.
In some embodiments, the processor may iteratively update parameters of the initial occlusion detection model based on the plurality of sample target face data, and when the first loss function meets a preset condition (e.g., the first loss function converges, or the first loss function value is less than a preset value), model training is completed, and a trained initial occlusion detection model is obtained. Methods of updating parameters may include, but are not limited to, batch gradient descent algorithms, random gradient descent algorithms, and the like.
In some embodiments of the present disclosure, in the joint training process, by referring to the first loss function to reflect the difference between the sample occlusion result and the actual occlusion result, the model performance may be measured, and the initial occlusion detection model parameters may be adjusted, so that the occlusion detection model obtained by training is more optimized.
In some embodiments, the jointly trained penalty function 409 may also include a second penalty function 409-2.
In some embodiments, the second loss function 409-2 may reflect the gradient difference between the first gradient image 407 and the second gradient image 408 of the sample target face data; for example, the second loss function 409-2 may be a mean square error loss function (Mean Square Error Loss Function). Illustratively, the second loss function may be expressed as shown in equation (2):

$$L_2 = \frac{1}{N}\sum_{i=1}^{N}\left( G_i^{(1)} - G_i^{(2)} \right)^2 \tag{2}$$

where $L_2$ is the second loss function, $G_i^{(1)}$ is the image gradient of the first gradient image of the $i$-th sample target face data, $G_i^{(2)}$ is the image gradient of the second gradient image of the $i$-th sample target face data, and $N$ is the number of samples.
In some embodiments, the processor may determine the first gradient image 407 by the gradient prediction model 406 based on the intermediate feature image 405, and the intermediate feature image 405 may be determined by the intermediate layer 402-1 of the initial occlusion detection model based on the sample target face data 401. For more on the first gradient image see fig. 3 and its related description.
The second gradient image refers to another visualized image, different from the first gradient image, that reflects the magnitude of change in the face data. In some embodiments, the second gradient image may be of size 1×24×24.
In some embodiments, the second gradient image 408 may be determined by a laplace transform based on the sample target face data 401.
When a face picture is occluded, for example by a mask or sunglasses, or when another object (such as a hand or lipstick) touches the face, obvious gradient changes appear at the contact edges. Based on this characteristic, a Laplace-assisted supervised training method for the deep convolutional network is introduced, and a network structure that fuses Laplace features with deep convolutional neural network features is designed according to the gradient-map changes of the occluded face after the Laplace transform.
In some embodiments, the processor may fix the sample target face data to a specified size (e.g., 1×24×24) and then obtain the second gradient image through a Laplace transform followed by regularization. The Laplace transform (Laplace Transform) can effectively extract edge information and is rotation invariant; regularization can recover image edges and texture details while smoothing noise.
In some embodiments of the present disclosure, transforming the face image with the Laplace transform makes the gradient change at contact edges evident when an object occludes the face, which helps to accurately determine a second gradient image that captures edges and occlusion details in the image.
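A minimal sketch of deriving the second gradient image with OpenCV's Laplacian operator is given below; the grayscale conversion and the normalization used as regularization are assumptions:

```python
# A sketch of computing the second gradient image from sample target face data.
import cv2
import numpy as np

def second_gradient_image(face_rgb: np.ndarray, size: int = 24) -> np.ndarray:
    gray = cv2.cvtColor(face_rgb, cv2.COLOR_RGB2GRAY)
    gray = cv2.resize(gray, (size, size))                     # fix to the specified size
    lap = cv2.Laplacian(gray.astype(np.float32), cv2.CV_32F)  # edge/gradient response
    lap = np.abs(lap)
    return lap / (lap.max() + 1e-6)                           # simple normalization
```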
The gradient difference refers to the difference in image gradient between the first gradient image and the second gradient image of the sample target face data. In some embodiments, the processor may obtain the image gradient in a variety of ways, for example with a Sobel operator. In some embodiments, the processor may take the difference in image gradients as the gradient difference.
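For illustration, a Sobel-based image gradient might be computed as in the following sketch; using the gradient magnitude as the compared quantity is an assumption:

```python
# A sketch of one way to obtain an image gradient with the Sobel operator.
import cv2
import numpy as np

def sobel_gradient(img: np.ndarray) -> np.ndarray:
    gx = cv2.Sobel(img.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3)  # d/dx
    gy = cv2.Sobel(img.astype(np.float32), cv2.CV_32F, 0, 1, ksize=3)  # d/dy
    return np.sqrt(gx ** 2 + gy ** 2)                                  # per-pixel magnitude
```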
In some embodiments, the gradient prediction model may be obtained through training. The second training sample of the training gradient prediction model may include sample initial face data and intermediate feature images. The second label corresponding to the second training sample may be a second gradient image.
In some embodiments, the processor may determine the second loss function based on the gradient difference between the first gradient image output by the gradient prediction model and the second gradient image serving as the second label, and update the gradient prediction model based on the value of the loss function. When the second loss function satisfies a preset condition (e.g., the loss function converges, or the loss function value is less than a preset value), model training is completed and a trained gradient prediction model is obtained.
In some embodiments of the present disclosure, in the joint training process, by reflecting the gradient difference between the first gradient image and the second gradient image by referring to the second loss function, the performance of the intermediate layer may be measured, and the parameters of the intermediate layer may be adjusted, so that the occlusion detection model obtained by training is further optimized.
During joint training, in order to balance the contributions of the MobileNetV3-Small network and the Conv+BN+ReLU network to network learning, two parameters are introduced to control how the two losses drive the optimization of the model: the weight corresponding to the first loss function and the weight corresponding to the second loss function. The sum of the two weights is 1; for example, in joint training the weight corresponding to the first loss function may be 0.8 and the weight corresponding to the second loss function 0.2.
In some embodiments, the weight corresponding to the first loss function is greater than the weight corresponding to the second loss function.
In some embodiments, the loss function of the joint training may be determined by a weighted sum of the first loss function and the second loss function. Illustratively, the weighted sum is shown in equation (3):

$$L = w_1 L_1 + w_2 L_2, \qquad w_1 + w_2 = 1 \tag{3}$$

where $L$ is the loss function of the joint training, $w_1$ is the weight corresponding to the first loss function, and $w_2$ is the weight corresponding to the second loss function.
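A minimal sketch of this weighted joint loss is given below, using the example weights 0.8 and 0.2 from above:

```python
# A sketch of the joint loss of equation (3): w1 * L1 + w2 * L2 with w1 + w2 = 1.
import torch.nn.functional as F

def joint_loss(logit, label, pred_grad, target_grad, w1: float = 0.8, w2: float = 0.2):
    l1 = F.binary_cross_entropy_with_logits(logit.squeeze(1), label.float())  # equation (1)
    l2 = F.mse_loss(pred_grad, target_grad)                                   # equation (2)
    return w1 * l1 + w2 * l2                                                  # equation (3)
```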
In some embodiments of the present disclosure, giving the first loss function a greater weight than the second balances their contributions to network learning and optimizes the learning of the initial occlusion detection model, so that the trained occlusion detection model classifies face occlusion more accurately.
In some embodiments, the processor may input the sample target face data 401 into the initial occlusion detection model 402 to output the sample occlusion result 403, with the intermediate layer 402-1 of the initial occlusion detection model outputting the intermediate feature image 405; the intermediate feature image 405 is then input into the gradient prediction model 406, which outputs the first gradient image 407. The first loss function 409-1 is constructed from the sample occlusion result 403 and the actual occlusion result 404 corresponding to the training sample; the second loss function 409-2 is constructed from the first gradient image 407 and the second gradient image 408 obtained by applying the Laplace transform to the sample target face data 401; and the loss function 409 of the joint training is determined as the weighted sum of the two. Parameters are updated iteratively until the jointly trained loss function 409 satisfies the update condition (e.g., the loss value is less than a threshold, or the loss function converges), yielding the trained gradient prediction model 406 and initial occlusion detection model 402; the trained initial occlusion detection model is taken as the occlusion detection model 320-2.
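The following sketch ties the joint-training loop together; the optimizer, epoch count, and dataloader layout are assumptions, and it reuses the hypothetical OcclusionDetector, GradientPredictor, and joint_loss sketches above:

```python
# A sketch of joint training of the initial occlusion detection model
# and the gradient prediction model.
import torch

def train_jointly(model, grad_model, loader, epochs: int = 10, lr: float = 1e-3):
    params = list(model.parameters()) + list(grad_model.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        for faces, labels, laplace_maps in loader:  # second gradient images precomputed
            logit, mid = model(faces)       # sample occlusion result + intermediate feature image
            pred_grad = grad_model(mid)     # first gradient image
            loss = joint_loss(logit, labels, pred_grad, laplace_maps)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model  # the trained initial occlusion detection model serves as the occlusion detection model
```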
Fig. 5 is an exemplary diagram of determining target face data according to some embodiments of the present description.
In some embodiments, processing the initial face data 510 to determine the target face data 540 may include: determining a face detection result 530 through a face detection model 520 based on the initial face data 510, the face detection model 520 being a third machine learning model; and performing edge-expansion processing on the face detection result 530 to determine the target face data 540.
The face detection model 520 refers to a model for identifying and locating faces in images or videos. The face detection model 520 may be a third machine learning model. In some embodiments, the face detection model 520 may be a deep learning model, such as a RetinaFace model or other viable structures, or the like. In some embodiments, the input of the face detection model 520 may include initial face data 510 and the output may be a face detection result 530. For more on the initial face data see fig. 3 and its associated description.
In some embodiments, the face detection model may be obtained through training. The third training sample for training the face detection model may comprise sample initial face data. Sample initial face data refers to initial face data used as a training sample for training a face detection model. The third label corresponding to the third training sample may be an actual face detection result of the initial face data corresponding to the sample. In some embodiments, the third tag may be based on a manual annotation.
In some embodiments, parameters of the face detection model may be iteratively updated based on a plurality of third training samples such that a loss function of the face detection model satisfies a preset condition. For example, the loss function converges, or the loss function value is smaller than a preset value. And when the loss function meets the preset condition, model training is completed, and a trained face detection model is obtained.
The face detection result 530 refers to the identified and located face region and related information. For example, the face detection result may include a face bounding box or the like. The bounding box (bbox) may be a rectangular box with coordinate information, and the face bounding box contains a face region.
In some embodiments, the edge-expansion process may refer to expanding the face bounding box by a factor so that the face boundary includes the entire face contour, the ear positions, and so on.
In some embodiments, the expansion factor of the edge-expansion process may be between 1.05 and 1.5.
The expansion factor refers to the multiple by which the length, width, or area of the face bounding box is enlarged, for example 1.2.
In some embodiments of the present description, using an expansion factor between 1.05 and 1.5 slightly enlarges the face bounding box to take in more surrounding information, which can subsequently improve the robustness and recall of the occlusion detection model.
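For illustration, edge expansion of a face bounding box might look like the following sketch; clamping to the image border is an assumption, and 1.2 is the example factor given above:

```python
# A sketch of edge-expanding a face bounding box by a fixed factor.
def expand_bbox(x1, y1, x2, y2, img_w, img_h, factor: float = 1.2):
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0       # box center
    w, h = (x2 - x1) * factor, (y2 - y1) * factor   # expanded side lengths
    nx1, ny1 = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
    nx2, ny2 = min(img_w, int(cx + w / 2)), min(img_h, int(cy + h / 2))
    return nx1, ny1, nx2, ny2                       # covers full contour and ears
```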
In some embodiments of the present disclosure, the face detection model is used to identify the initial face data, so that a face region can be located in an image, so that the occlusion detection model only detects a face, thereby improving the detection efficiency of the model and reducing the false detection rate. The face detection result output by the face detection model is moderately subjected to edge expansion processing, so that the boundary box can be expanded, the face is ensured to be completely extracted, missed detection is avoided, and the quality of target face data is improved.
One or more embodiments of the present disclosure provide a face occlusion detection device. The device comprises at least one memory and at least one processor; the at least one memory is used to store computer instructions, and the at least one processor executes the computer instructions or part of the instructions to implement the face occlusion detection method described above.
One or more embodiments of the present specification provide a computer-readable storage medium storing computer instructions that, when read by a computer, perform the face occlusion detection method.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations to the present disclosure may occur to one skilled in the art. Such modifications, improvements, and adaptations are suggested in this specification and therefore remain within the spirit and scope of the exemplary embodiments of this specification.
Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.
Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.
Likewise, it should be noted that, in order to simplify the presentation disclosed in this specification and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, does not imply that more features are required than are recited in the claims; indeed, claimed subject matter may lie in less than all features of a single embodiment disclosed above.
In some embodiments, numbers are used to describe quantities of components and attributes; it should be understood that such numbers used in the description of embodiments are in some examples modified by the terms "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows a variation of 20%. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties of individual embodiments. In some embodiments, numerical parameters should take into account the specified significant digits and employ a general digit-preserving method. Although the numerical ranges and parameters used in some embodiments to confirm the breadth of a range are approximations, in particular embodiments such numerical values are set as precisely as practicable.
Each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, and the like, referred to in this specification is incorporated herein by reference in its entirety, except for any application history documents that are inconsistent with or conflict with the content of this specification, and any documents (currently or later attached to this specification) that limit the broadest scope of the claims of this specification. It is noted that if the description, definition, and/or use of a term in material attached to this specification does not conform to or conflicts with what is described in this specification, the description, definition, and/or use of the term in this specification controls.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.

Claims (7)

1. A method of face occlusion detection, the method performed by a processor, comprising:
Processing the initial face data to determine target face data;
determining an occlusion result through an occlusion detection model based on the target face data, wherein the occlusion detection model is a first machine learning model, the occlusion detection model is obtained through joint training with a gradient prediction model, the gradient prediction model is a second machine learning model, the gradient prediction model determines a first gradient image based on an intermediate feature image, and the intermediate feature image is output of an intermediate layer of an initial occlusion detection model;
the training samples of the joint training comprise sample target face data, and the labels corresponding to the training samples comprise actual occlusion results; the loss function of the joint training comprises a first loss function, wherein the first loss function reflects the difference between a sample occlusion result and the actual occlusion result, and the sample occlusion result is determined through the initial occlusion detection model based on the sample target face data;
the loss function of the joint training further comprises a second loss function, the second loss function reflects a gradient difference between the first gradient image and a second gradient image of the sample target face data, and the intermediate feature image is determined through the intermediate layer of the initial occlusion detection model based on the sample target face data;
wherein the second gradient image is determined by Laplace transform based on the sample target face data.
2. The method of claim 1, wherein the weight corresponding to the first loss function is greater than the weight corresponding to the second loss function.
3. The method of claim 1, wherein processing the initial face data to determine target face data comprises:
determining a face detection result through a face detection model based on the initial face data, wherein the face detection model is a third machine learning model;
and performing edge expansion processing on the face detection result to determine the target face data.
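As a rough illustration of the edge expansion processing in claim 3, the sketch below enlarges the detector's bounding box by a fixed margin before cropping. The function name expand_and_crop and the margin ratio of 0.2 are assumptions for the sketch; the claims fix neither.

import numpy as np

def expand_and_crop(image, box, ratio=0.2):
    # image: H x W x C array; box: (x1, y1, x2, y2) from the face detector
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = ratio * (x2 - x1), ratio * (y2 - y1)    # margin on each side
    x1e, y1e = max(0, int(x1 - dx)), max(0, int(y1 - dy))
    x2e, y2e = min(w, int(x2 + dx)), min(h, int(y2 + dy))
    return image[y1e:y2e, x1e:x2e]                   # target face data

frame = np.zeros((480, 640, 3), dtype=np.uint8)      # dummy input image
face = expand_and_crop(frame, (200, 120, 320, 280))

Expanding the crop keeps context around the detected face, such as the chin, hairline, or the edge of an occluding object, which plausibly gives the occlusion detection model a better-conditioned input.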
4. A face occlusion detection system, the system comprising:
the first determining module is configured to process initial face data and determine target face data;
the second determining module is configured to determine an occlusion result through an occlusion detection model based on the target face data, wherein the occlusion detection model is a first machine learning model, the occlusion detection model is obtained through joint training with a gradient prediction model, the gradient prediction model is a second machine learning model, the gradient prediction model determines a first gradient image based on an intermediate feature image, and the intermediate feature image is the output of an intermediate layer of an initial occlusion detection model;
the training samples of the joint training comprise sample target face data, and the labels corresponding to the training samples comprise actual occlusion results; the loss function of the joint training comprises a first loss function, wherein the first loss function reflects the difference between a sample occlusion result and the actual occlusion result, and the sample occlusion result is determined through the initial occlusion detection model based on the sample target face data;
the loss function of the joint training further comprises a second loss function, the second loss function reflects a gradient difference between the first gradient image and a second gradient image of the sample target face data, and the intermediate feature image is determined through the intermediate layer of the initial occlusion detection model based on the sample target face data;
wherein the second gradient image is determined by a Laplace transform of the sample target face data.
5. The system of claim 4, wherein the first determination module is further configured to:
determining a face detection result through a face detection model based on the initial face data, wherein the face detection model is a third machine learning model;
and performing edge expansion processing on the face detection result to determine the target face data.
6. A face occlusion detection device, characterized in that the device comprises at least one memory storing computer instructions and at least one processor that executes the computer instructions, or a portion of the instructions, to implement the face occlusion detection method of any one of claims 1-3.
7. A computer-readable storage medium storing computer instructions that, when read by a computer, cause the computer to perform the face occlusion detection method of any one of claims 1-3.
CN202311438777.1A 2023-11-01 2023-11-01 Face shielding detection method, system, device and storage medium Active CN117275075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311438777.1A CN117275075B (en) 2023-11-01 2023-11-01 Face shielding detection method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN117275075A (en) 2023-12-22
CN117275075B (en) 2024-02-13

Family

ID=89206356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311438777.1A Active CN117275075B (en) 2023-11-01 2023-11-01 Face shielding detection method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN117275075B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787432A (en) * 2016-01-15 2016-07-20 浙江工业大学 Method for detecting human face shielding based on structure perception
CN107633204A (en) * 2017-08-17 2018-01-26 平安科技(深圳)有限公司 Face occlusion detection method, apparatus and storage medium
CN111353411A (en) * 2020-02-25 2020-06-30 四川翼飞视科技有限公司 Face-shielding identification method based on joint loss function
WO2021174819A1 (en) * 2020-03-05 2021-09-10 平安科技(深圳)有限公司 Face occlusion detection method and system
CN111310718A (en) * 2020-03-09 2020-06-19 成都川大科鸿新技术研究所 High-accuracy detection and comparison method for face-shielding image
WO2021218899A1 (en) * 2020-04-30 2021-11-04 京东方科技集团股份有限公司 Method for training facial recognition model, and method and apparatus for facial recognition
CN112016464A (en) * 2020-08-28 2020-12-01 中移(杭州)信息技术有限公司 Method and device for detecting face shielding, electronic equipment and storage medium
WO2022078041A1 (en) * 2020-10-16 2022-04-21 上海哔哩哔哩科技有限公司 Occlusion detection model training method and facial image beautification method
CN113822157A (en) * 2021-08-19 2021-12-21 北京工业大学 Mask wearing face recognition method based on multi-branch network and image restoration
CN116434287A (en) * 2021-12-30 2023-07-14 北京字跳网络技术有限公司 Face image detection method and device, electronic equipment and storage medium
CN115937996A (en) * 2023-01-28 2023-04-07 深圳市光鉴科技有限公司 Face recognition method for mask and intelligent door lock

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Face Recognition Under Partial Occlusion: A Detection and Exclusion of Occluded Face Regions Approach; Judith Abiero et al.; Pattern Recognition and Artificial Intelligence; pp. 18-32 *
Single-sample face recognition combining Laplacian filtering with CS-LBP; Yang Huixian, Liu Fan, He Dilong; Computer Engineering and Applications; Vol. 53, No. 07; pp. 165-170 *
Research on robust 3D facial feature point localization; Wang Liang; China Masters' Theses Full-text Database, Information Science and Technology; Vol. 2020, No. 06; I138-1088 *

Also Published As

Publication number Publication date
CN117275075A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN109214343B (en) Method and device for generating face key point detection model
CN109919869B (en) Image enhancement method and device and storage medium
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
WO2020199931A1 (en) Face key point detection method and apparatus, and storage medium and electronic device
US11501563B2 (en) Image processing method and system
CN108197618B (en) Method and device for generating human face detection model
WO2022078041A1 (en) Occlusion detection model training method and facial image beautification method
US20110292051A1 (en) Automatic Avatar Creation
CN114821404B (en) Information processing method, device, computer equipment and storage medium
CN114782864B (en) Information processing method, device, computer equipment and storage medium
WO2023193474A1 (en) Information processing method and apparatus, computer device, and storage medium
KR102160128B1 (en) Method and apparatus for creating smart albums based on artificial intelligence
CN114783022B (en) Information processing method, device, computer equipment and storage medium
CN110516598A (en) Method and apparatus for generating image
CN112446322A (en) Eyeball feature detection method, device, equipment and computer-readable storage medium
CN114677730A (en) Living body detection method, living body detection device, electronic apparatus, and storage medium
CN105580050A (en) Providing control points in images
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN117275075B (en) Face shielding detection method, system, device and storage medium
EP4345771A1 (en) Information processing method and apparatus, and computer device and storage medium
Li et al. Inductive Guided Filter: Real-Time Deep Matting with Weakly Annotated Masks on Mobile Devices
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN112347843A (en) Method and related device for training wrinkle detection model
WO2024099026A1 (en) Image processing method and apparatus, device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant