CN112699826A - Face detection method and device, computer equipment and storage medium - Google Patents

Face detection method and device, computer equipment and storage medium

Info

Publication number
CN112699826A
CN112699826A (application CN202110009932.2A)
Authority
CN
China
Prior art keywords
target
face
features
scale
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110009932.2A
Other languages
Chinese (zh)
Other versions
CN112699826B (en)
Inventor
丘延君 (Qiu Yanjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Forchange Technology Shenzhen Co ltd
Original Assignee
Forchange Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Forchange Technology Shenzhen Co ltd filed Critical Forchange Technology Shenzhen Co ltd
Priority to CN202110009932.2A priority Critical patent/CN112699826B/en
Publication of CN112699826A publication Critical patent/CN112699826A/en
Application granted granted Critical
Publication of CN112699826B publication Critical patent/CN112699826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 40/161 Human faces: Detection; Localisation; Normalisation
    • G06V 40/168 Human faces: Feature extraction; Face representation
    • G06V 40/172 Human faces: Classification, e.g. identification
    • G06F 18/214 Pattern recognition: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Pattern recognition: Classification techniques
    • G06N 3/045 Neural networks: Combinations of networks
    • G06N 3/08 Neural networks: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of computer technologies, and in particular, to a face detection method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring an image to be detected, wherein the image to be detected comprises a face to be detected; extracting features of the image to be detected through a detection model to obtain face features, wherein the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer; and performing regression prediction on the face features to obtain a detection result of the face to be detected. By adopting the method, the accuracy of face detection on edge devices can be improved.

Description

Face detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for face detection, a computer device, and a storage medium.
Background
With the development of computer technology, face detection is widely applied across industries, for example in remote identity verification for banking systems, WeChat face payment, remote authentication of taxi drivers, and community access control systems.
Conventionally, to ensure the accuracy of face detection, a detection model needs to occupy hundreds of megabytes of memory and must be supported by a Graphics Processing Unit (GPU). On edge devices without a GPU, such conventional detection models cannot run properly at all, so the accuracy of face detection on these devices is low.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a face detection method, an apparatus, a computer device, and a storage medium, which can improve face detection accuracy of an edge device.
A method of face detection, the method comprising:
acquiring an image to be detected, wherein the image to be detected comprises a face to be detected;
extracting features of an image to be detected through a detection model to obtain human face features, wherein the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer;
and performing regression prediction on the face features to obtain a detection result of the face to be detected.
In one embodiment, extracting features of an image to be detected through a detection model to obtain human face features includes:
extracting multi-scale features of an image to be detected through each convolution module of the detection model to obtain initial features corresponding to each scale;
determining target scales from a plurality of scales, and determining each initial feature corresponding to each target scale as a target initial feature;
and carrying out deconvolution processing on the initial features of each target through a lightweight deconvolution module to generate the face features corresponding to the scales of each target.
In one embodiment, deconvoluting each target initial feature by a lightweight deconvolution module to generate a face feature corresponding to each target scale includes:
obtaining the face features corresponding to the minimum target scale according to the target initial features of the minimum scale in the target scales;
carrying out deconvolution processing on the target initial characteristic of the minimum scale through a lightweight deconvolution module to generate a target characteristic corresponding to a first target scale;
performing fusion processing on the target features corresponding to the first target scale and the target initial features corresponding to the first target scale to obtain the face features corresponding to the first target scale;
and taking the target initial feature corresponding to the first target scale as the target initial feature of the minimum scale, and continuing to perform deconvolution processing and fusion processing until the face features corresponding to all the target scales are obtained.
In one embodiment, deconvoluting the target initial feature of the minimum scale by a lightweight deconvolution module to generate a target feature corresponding to a first target scale includes:
performing first deconvolution processing on the target initial feature through the general convolutional layer to obtain a first feature satisfying a first-proportion output channel of the first target scale;
performing second deconvolution processing on the first feature through the deconvolution layer to obtain a second feature satisfying a second-proportion output channel of the first target scale, wherein the first proportion is smaller than the second proportion;
and performing third deconvolution processing on the second feature through the separable convolution to generate the target feature satisfying the output channels of the first target scale.
In one embodiment, performing regression prediction on the face features to obtain a detection result of a face to be detected includes:
determining a target anchor frame corresponding to each target scale, wherein the target anchor frame comprises an anchor frame of the current target scale and anchor frames of adjacent target scales, and the scales of the adjacent target scales are closest to the current target scale and are larger than the current target scale;
and performing regression prediction on the face features of each target scale according to the target anchor point frame corresponding to each target scale to obtain detection results of the face to be detected corresponding to different target scales.
In one embodiment, the detection model is a pre-trained detection model, and the training mode of the detection model includes:
acquiring a training set image;
calibrating each face in the training set image to obtain a calibrated training set image;
inputting the training set images into the constructed initial detection model, performing face detection on them through the initial detection model, and generating corresponding detection results;
calculating the model loss of the initial detection model according to the detection result and the calibrated training set data;
based on the model loss, adjusting model parameters of the initial detection model to obtain a parameter-adjusted initial detection model;
and performing iterative training on the initial detection model after parameter adjustment based on preset training parameters to obtain a trained detection model.
In one embodiment, calibrating each face in the training set image to obtain a calibrated training set image includes:
determining the image area ratio of each face in the training set image;
and removing the human face with the image area ratio smaller than the image ratio threshold value through a preset mask template, and obtaining a calibrated training set image by using the calibration frame to the human face with the image area ratio larger than or equal to the image ratio threshold value.
An apparatus for face detection, the apparatus comprising:
the image acquisition module to be detected is used for acquiring an image to be detected, wherein the image to be detected comprises a face to be detected;
the system comprises a feature extraction module, a detection module and a processing module, wherein the feature extraction module is used for extracting features of an image to be detected through a detection model to obtain human face features, the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer;
and the regression prediction module is used for carrying out regression prediction on the face characteristics to obtain the detection result of the face to be detected.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method of any of the above embodiments when the processor executes the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above embodiments.
According to the face detection method and apparatus, the computer device and the storage medium, an image to be detected is obtained, and features of the image to be detected are extracted through a detection model to obtain face features, wherein the image to be detected comprises a face to be detected, the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer; regression prediction is then performed on the face features to obtain a detection result of the face to be detected. Therefore, after the image to be detected is obtained, feature extraction and regression prediction can be performed with a detection model comprising the lightweight deconvolution module, which reduces the memory occupied by the model, allows face detection to run on edge devices, and thus improves the accuracy of face detection on edge devices. Moreover, since the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer, the model is simple in structure; compared with a traditional deconvolution network, the lightweight deconvolution module has fewer parameters and less computation, which reduces the amount of calculation in the detection process and further improves processing efficiency.
Drawings
FIG. 1 is a diagram of an application scenario of a face detection method in an embodiment;
FIG. 2 is a schematic flow chart of a face detection method according to an embodiment;
FIG. 3 is a schematic diagram of SeparableBlock and LightDeconv2DBlock in one embodiment;
FIG. 4 is a schematic illustration of a detection model in one embodiment;
FIG. 5 is a diagram illustrating face detection performed by a model according to one embodiment;
FIG. 6 is a block diagram of an embodiment of a face detection apparatus;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In this embodiment, the edge device may acquire an image to be detected through the image acquisition device, where the image to be detected includes a face to be detected, so that the edge device acquires the image to be detected. Further, after the edge device obtains the image to be detected, the edge device can extract features of the image to be detected through a detection model to obtain the human face features, wherein the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer. Further, the edge device can perform regression prediction on the face features to obtain a detection result of the face to be detected. The edge device may be, but not limited to, various terminals carrying an image capturing function, such as a personal computer, a notebook computer, a smart phone, a tablet computer, and a portable wearable device.
In an embodiment, as shown in fig. 2, a face detection method is provided, which is described by taking the method as an example of being applied to the foregoing edge device, i.e., a terminal, and includes the following steps:
and S102, acquiring an image to be detected, wherein the image to be detected comprises a face to be detected.
The image to be detected is an image including a face to be detected, and the image to be detected may include a single face or a plurality of faces to be detected.
In this embodiment, a user may acquire an image to be detected through an image acquisition device on the terminal, for example, a camera on a mobile phone or a camera on a notebook computer, so that the terminal, that is, the edge device, may acquire the image to be detected.
In one embodiment, the user may also capture a video stream, and each frame of the video stream may then be taken as an image to be detected for subsequent processing by the terminal.
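As an illustration, frames can be grabbed from a camera stream with OpenCV as follows (a minimal sketch; the cv2 library and the camera index are assumptions of this example, not part of the application):

    import cv2

    cap = cv2.VideoCapture(0)    # 0: default camera (e.g., a front camera); index is assumed
    while True:
        ok, frame = cap.read()   # each captured frame becomes an image to be detected
        if not ok:
            break                # stream ended or camera unavailable
        # here, frame would be passed to the detection model
    cap.release()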
And step S104, extracting the features of the image to be detected through a detection model to obtain the human face features, wherein the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer.
In this embodiment, the detection model is a model for performing face detection on an image to be detected, and the model inputs the image to be detected and outputs a detection result corresponding to the image to be detected.
In this embodiment, the detection model may include convolution modules and a deconvolution module. The convolution modules perform successive convolutions on the image to be detected, repeatedly convolving a larger-scale feature map into a smaller-scale one to obtain feature maps corresponding to different scales, i.e., the face features. The deconvolution module performs deconvolution processing on a small-scale feature map to obtain a large-scale feature map.
In this embodiment, the deconvolution module is a lightweight deconvolution module (Light decontv 2D Block), which may include a general convolutional layer, a deconvolution layer, and a separable convolutional layer.
In this embodiment, referring to fig. 3, schematic structural diagrams of the separable convolution module SeparableBlock and the lightweight deconvolution module LightDeconv2DBlock are shown. Unlike directly using the common depthwise separable convolution SeparableConv2D, the LightDeconv2DBlock proposed in this application comprises a generic convolutional layer Conv2D, a deconvolution layer Conv2DTranspose and a separable convolutional layer SeparableBlock. In LightDeconv2DBlock, BN layers are also used for normalization between the different convolutional layers.
In this embodiment, with continued reference to fig. 3(b), the convolution kernel of Conv2D is 1 × 1, the step size is 1, and the activation function is a linear function; Conv2DTranspose has a convolution kernel of 3 × 3, a step size of 1, and a linear activation function; the convolution kernel of SeparableBlock is 3 × 3, with step size 1, a relu activation function, and no residual connection. Since the SeparableBlock does not use a residual connection, it corresponds to SeparableConv2D. In other embodiments, the SeparableBlock may also include a residual connection, as shown in fig. 3(a).
In this embodiment, both Conv2D and Conv2DTranspose use linear activation, so these two layers act as compression and expansion of the features; the linear activation function retains feature information to the maximum extent, whereas relu filters out the part of the information whose value is less than 0.
Further, in LightDeconv2DBlock, BN layers may be applied between the different convolutional layers to normalize the results.
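To make the structure just described concrete, the following is a minimal Keras sketch of LightDeconv2DBlock, assuming TensorFlow 2.x. The function name is illustrative, the stride-2 setting of Conv2DTranspose is an assumption made for the common 2× upsampling case analysed below, and the 1/8 and 1/4 channel ratios follow the output-channel description given later in this publication.

    import tensorflow as tf
    from tensorflow.keras import layers

    def light_deconv2d_block(x, out_channels):
        # 1x1 Conv2D, linear activation: compress channels to 1/8 of the target.
        x = layers.Conv2D(out_channels // 8, 1, strides=1, padding="same", activation=None)(x)
        x = layers.BatchNormalization()(x)
        # 3x3 Conv2DTranspose, linear activation: expand to 1/4 of the target channels
        # (stride 2 assumed here for the 2x spatial upsampling case).
        x = layers.Conv2DTranspose(out_channels // 4, 3, strides=2, padding="same", activation=None)(x)
        x = layers.BatchNormalization()(x)
        # 3x3 SeparableConv2D, relu activation, no residual connection: full output channels.
        x = layers.SeparableConv2D(out_channels, 3, strides=1, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        return x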
The parameter quantity of the lightweight deconvolution module LightDeconv2DBlock is analyzed as follows.
In this example, C_in denotes the number of channels of the input feature map of a convolution, H_in and W_in denote the height and width of the input feature map, C_out denotes the number of channels of the output feature map, and H_out and W_out denote the height and width of the output feature map. The size of the convolution kernel is denoted by k (only the k × k case is considered), the step size is 1, and the bias is not considered. The parameter quantities and operation quantities of the normal convolution, the deconvolution, and the depthwise separable convolution can then be calculated as shown in equations (1) to (6): the parameter quantity of the normal convolution (Conv2D) is given by equation (1) and its operation quantity by equation (2); the parameter quantity of the deconvolution (Conv2DTranspose) is given by equation (3) and its operation quantity by equation (4); the parameter quantity of the depthwise separable convolution (SeparableConv2D) is given by equation (5) and its operation quantity by equation (6).
k² × C_in × C_out (1)
k² × C_in × C_out × H_out × W_out (2)
k² × C_in × C_out (3)
k² × C_in × C_out × H_out × W_out (4)
k² × C_in + C_in × C_out (5)
k² × C_in × H_out × W_out + C_in × C_out × H_out × W_out (6)
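To keep the bookkeeping explicit, these counting formulas can be sketched as small Python helpers (function names are illustrative; bias is ignored, as above):

    def conv2d_params(c_in, c_out, k):
        # Equation (1); Conv2DTranspose has the same count, equation (3).
        return k * k * c_in * c_out

    def conv2d_ops(c_in, c_out, k, h_out, w_out):
        # Equations (2) and (4).
        return k * k * c_in * c_out * h_out * w_out

    def separable_params(c_in, c_out, k):
        # Equation (5): depthwise part plus 1x1 pointwise part.
        return k * k * c_in + c_in * c_out

    def separable_ops(c_in, c_out, k, h_out, w_out):
        # Equation (6): depthwise operations plus pointwise operations.
        return k * k * c_in * h_out * w_out + c_in * c_out * h_out * w_out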
In this embodiment, the lightweight deconvolution module LightDeconv2DBlock is composed of a 1 × 1 Conv2D, a 3 × 3 Conv2DTranspose, and a 3 × 3 SeparableConv2D, and the input and output of each layer are as shown in Table 1 below.
TABLE 1 Input and output of the different parts of LightDeconv2DBlock
Layer Input Output
Conv2D (1 × 1) C_in × H_in × W_in (C_out/8) × H_in × W_in
Conv2DTranspose (3 × 3) (C_out/8) × H_in × W_in (C_out/4) × H_out × W_out
SeparableConv2D (3 × 3) (C_out/4) × H_out × W_out C_out × H_out × W_out
Further, the parameter quantity and operation quantity of each layer can be obtained from the parameter and operation counting formulas of the different convolution structures, as shown in Table 2.
TABLE 2 Parameters and operations of the different parts of LightDeconv2DBlock
Layer Parameters Operations
Conv2D (1 × 1) (C_in × C_out)/8 (C_in × C_out × H_in × W_in)/8
Conv2DTranspose (3 × 3) (9 × C_out²)/32 (9 × C_out² × H_out × W_out)/32
SeparableConv2D (3 × 3) (9 × C_out)/4 + C_out²/4 ((9 × C_out)/4 + C_out²/4) × H_out × W_out
Further, the total parameter quantity of the lightweight deconvolution module LightDeconv2DBlock can be obtained as
(C_in × C_out)/8 + (17/32) × C_out² + (9/4) × C_out
That is, dividing by the parameter quantity 9 × C_in × C_out of the common Conv2DTranspose, the corresponding ratio can be obtained as
1/72 + (17 × C_out)/(288 × C_in) + 1/(4 × C_in)
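As a quick numerical check of these expressions, the following plain-Python sketch evaluates the ratio for two example channel configurations (the channel counts are illustrative values, not values taken from the application):

    def light_deconv_params(c_in, c_out):
        conv1x1 = c_in * (c_out // 8)                      # 1x1 Conv2D
        deconv3x3 = 9 * (c_out // 8) * (c_out // 4)        # 3x3 Conv2DTranspose
        sep3x3 = 9 * (c_out // 4) + (c_out // 4) * c_out   # 3x3 SeparableConv2D
        return conv1x1 + deconv3x3 + sep3x3

    for c_in, c_out in [(256, 256), (512, 256)]:
        ratio = light_deconv_params(c_in, c_out) / (9 * c_in * c_out)
        print(f"C_in={c_in}, C_out={c_out}: 1/{1 / ratio:.1f}")
        # prints roughly 1/13.5 and 1/22.8, i.e. about 1/14 and 1/23 as in Table 3 below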
Further, a comparison of the parameter quantities of the lightweight deconvolution module LightDeconv2DBlock and the general deconvolution module (Conv2DTranspose) under different conditions can be obtained, as shown in Table 3.
TABLE 3 Parameter comparison of LightDeconv2DBlock and Conv2DTranspose
Case LightDeconv2DBlock / Conv2DTranspose
C_in = C_out 21/288 + 1/(4 × C_out) ≈ 1/14
C_in = 2 × C_out 25/576 + 1/(8 × C_out) ≈ 1/23
In this example, as can be seen from Table 3, when C_out is large, e.g., greater than 128, the 1/C_out term can be ignored. In the different cases, the parameter quantity of the lightweight deconvolution module LightDeconv2DBlock is about 1/23 to 1/14 of that of the normal deconvolution module Conv2DTranspose; in the C_in = C_out and C_in = 2 × C_out cases, the parameters of LightDeconv2DBlock are approximately 1/14 and 1/23 of those of Conv2DTranspose, respectively.
Further, for the operation quantity, if only the most common case of deconvolution is considered, namely H_out = W_out = 2 × H_in = 2 × W_in = h, the operation quantity of the lightweight deconvolution module LightDeconv2DBlock can be obtained accordingly (the expression is reproduced as an image in the original publication), while the operation quantity of the common Conv2DTranspose is 9 × C_out × C_in × h².
Therefore, the comparison of the operation quantities of the lightweight deconvolution module LightDeconv2DBlock and the general deconvolution module Conv2DTranspose can be obtained, as shown in Table 4.
TABLE 4 Operation comparison of LightDeconv2DBlock and Conv2DTranspose
(Table 4 is reproduced as an image in the original publication.)
As can be seen from Table 4, when C_out is large, e.g., greater than 128, the 4/C_out term can be ignored. In the different cases, the operation quantity of the lightweight deconvolution module LightDeconv2DBlock is about 1/12 to 1/3 of that of the common deconvolution module Conv2DTranspose; in the C_in = C_out and C_in = 2 × C_out cases, it is approximately 1/6 and 1/12, respectively.
From the above, both the operation quantity and the parameter quantity of the lightweight deconvolution module in the scheme of the application are far smaller than those of the common deconvolution module.
In this embodiment, the server may perform convolution processing and deconvolution processing on the image to be detected through the detection model to obtain feature data, i.e., a face feature, corresponding to the image to be detected.
And S106, performing regression prediction on the face features to obtain a detection result of the face to be detected.
In this embodiment, after obtaining the face features of the face to be detected in the image to be detected, the server may perform regression prediction on the face features to obtain a corresponding detection result.
Specifically, referring to fig. 5, the detection result may be the face in the image to be detected marked out with a face bounding box.
In the face detection method, an image to be detected is obtained, and features of the image to be detected are extracted through a detection model to obtain face features, wherein the image to be detected comprises a face to be detected, the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer; regression prediction is then performed on the face features to obtain a detection result of the face to be detected. Therefore, after the image to be detected is obtained, feature extraction and regression prediction can be performed with a detection model comprising the lightweight deconvolution module, which reduces the memory occupied by the model, allows face detection to run on edge devices, and thus improves the accuracy of face detection on edge devices. Moreover, since the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer, the model is simple in structure; compared with a traditional deconvolution network, the lightweight deconvolution module has fewer parameters and less computation, which reduces the amount of calculation in the detection process and further improves processing efficiency.
In one embodiment, extracting features of an image to be detected through a detection model to obtain a face feature may include: extracting multi-scale features of an image to be detected through each convolution module of the detection model to obtain initial features corresponding to each scale; determining target scales from a plurality of scales, and determining each initial feature corresponding to each target scale as a target initial feature; and carrying out deconvolution processing on the initial features of each target through a lightweight deconvolution module to generate the face features corresponding to the scales of each target.
In this embodiment, the model structure of the detection model, which may be named TFFD-FPN, is shown in fig. 4 and may include convolution modules corresponding to different scales, e.g., 128 × 128, 64 × 64, …, 4 × 4, together with the aforementioned lightweight deconvolution modules; the specific parameters of the model may be as shown in Table 5 below.
TABLE 5 TFFD-FPN model structure
(Table 5 is reproduced as an image in the original publication.)
In this embodiment, after acquiring the image to be detected, the terminal may input the image to be detected into the detection model, and perform convolution processing on the detection model for multiple times to obtain initial features corresponding to different scales, that is, initial features corresponding to 128 × 128, initial features corresponding to 64 × 64, initial features corresponding to 32 × 32, initial features corresponding to 16 × 16, initial features corresponding to 8 × 8, initial features corresponding to 4 × 4, and the like.
Further, the terminal may determine the target scales from the plurality of scales. Specifically, the server may determine the target scales according to the output requirement; with continued reference to fig. 4, the server may take the several smallest scales as the target scales, i.e., 16 × 16, 8 × 8, and 4 × 4.
Further, the terminal may determine, based on the determined target scale, each initial feature corresponding to each target scale as a target initial feature, and continue to combine the foregoing examples, that is, determine initial features corresponding to 16 × 16, initial features corresponding to 8 × 8, and initial features corresponding to 4 × 4 as target initial features.
In this embodiment, the terminal may perform deconvolution processing on each target initial feature through the lightweight deconvolution module to generate face features corresponding to each target scale, that is, face features corresponding to 16 × 16, 8 × 8, and 4 × 4 are obtained respectively.
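The per-scale bookkeeping in this embodiment can be sketched as follows (plain Python; the conv_modules list and the function names are hypothetical placeholders standing in for the model's convolution modules):

    def extract_initial_features(x, conv_modules):
        # conv_modules: ordered (scale, module) pairs; each module maps the running
        # feature map to the next smaller scale, e.g. 128, 64, 32, 16, 8, 4.
        feats = {}
        for scale, module in conv_modules:
            x = module(x)
            feats[scale] = x  # initial feature corresponding to this scale
        return feats

    def select_target_initial_features(feats, target_scales=(16, 8, 4)):
        # Keep only the initial features at the chosen target scales.
        return {s: feats[s] for s in target_scales}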
In one embodiment, deconvoluting each target initial feature by a lightweight deconvolution module to generate a face feature corresponding to each target scale may include: obtaining the face features corresponding to the minimum target scale according to the target initial features of the minimum scale in the target scales; carrying out deconvolution processing on the target initial characteristic of the minimum scale through a lightweight deconvolution module to generate a target characteristic corresponding to a first target scale; performing fusion processing on the target features corresponding to the first target scale and the target initial features corresponding to the first target scale to obtain the face features corresponding to the first target scale; and taking the target initial feature corresponding to the first target scale as the target initial feature of the minimum scale, and continuing to perform deconvolution processing and fusion processing until the face features corresponding to all the target scales are obtained.
Specifically, with reference to fig. 4, if the minimum scale is 4 × 4, the terminal may directly obtain the face feature corresponding to the minimum target scale according to the obtained target initial feature of the minimum scale 4 × 4, that is, obtain the face feature of 4 × 4.
Further, the terminal may perform deconvolution processing on the target initial feature of the minimum scale through a lightweight deconvolution module, that is, perform deconvolution processing on the face feature of 4 × 4, and generate a target feature corresponding to the first target scale, that is, a target feature corresponding to 8 × 8.
In this embodiment, the first target dimension is a dimension that is larger than and adjacent to the minimum dimension, and for 4 × 4, the corresponding first target dimension is 8 × 8, and for 8 × 8, the corresponding first target dimension is 16 × 16.
In this embodiment, the server may obtain the target initial feature corresponding to the first target scale and fuse it with the target feature obtained by the deconvolution calculation to obtain the face feature corresponding to the first target scale; that is, the terminal may perform feature fusion on the 8 × 8 target feature and the 8 × 8 target initial feature to obtain the fused feature, i.e., the face feature corresponding to 8 × 8.
Further, the terminal can take the target initial feature corresponding to the first target scale as the target initial feature of the minimum scale and continue the deconvolution processing and fusion processing until the face features corresponding to all the target scales are obtained. Continuing the previous example, the terminal may take the 8 × 8 target initial feature as the minimum-scale target initial feature, perform deconvolution processing through the aforementioned lightweight deconvolution module to obtain the target feature corresponding to the scale 16 × 16, and then fuse the 16 × 16 target feature with the 16 × 16 target initial feature to obtain the face feature corresponding to the scale 16 × 16; a sketch of this pass follows the next paragraph.
It will be understood by those skilled in the art that this is merely illustrative; in other embodiments, there may be more or fewer than three target scales, and the minimum scale may be greater or smaller than 4 × 4, which is not limited herein.
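A minimal sketch of this top-down pass, reusing the light_deconv2d_block sketch above (element-wise addition is assumed as the fusion operation; the publication does not fix the fusion method here):

    def top_down_fuse(target_feats, scales=(4, 8, 16)):
        # target_feats: dict mapping target scale -> target initial feature,
        # with scales listed from smallest to largest.
        face_feats = {scales[0]: target_feats[scales[0]]}  # smallest scale used directly
        prev_initial = target_feats[scales[0]]
        for s in scales[1:]:
            up = light_deconv2d_block(prev_initial, out_channels=target_feats[s].shape[-1])
            face_feats[s] = up + target_feats[s]  # fusion by element-wise addition (assumed)
            prev_initial = target_feats[s]        # per the text, the next pass starts from the initial feature
        return face_feats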
In one embodiment, deconvolving the target initial feature of the minimum scale by using a lightweight deconvolution module to generate a target feature corresponding to a first target scale, which may include: performing first deconvolution processing on the target initial characteristic through the general convolution layer to obtain a first characteristic of a first proportional output channel meeting a first target scale; performing second deconvolution processing on the first features through the deconvolution layer to obtain second features of a second proportion output channel meeting the first target scale, wherein the first proportion is smaller than the second proportion; and performing third deconvolution processing on the second characteristic through separable convolution to generate the target characteristic meeting the first target scale and corresponding to the output channel.
With continued reference to fig. 3, in this embodiment, the terminal may perform the first deconvolution processing on the target initial feature through the common convolutional layer Conv2D, using 1/8 of the output channels (1/8 filters) of the first target scale, so as to obtain the corresponding first feature.
Further, the terminal may perform the second deconvolution processing on the first feature through the deconvolution layer Conv2DTranspose, using 1/4 of the output channels (1/4 filters) of the first target scale, so as to obtain the corresponding second feature.
Further, the terminal may perform the third deconvolution processing on the second feature through the separable convolution SeparableBlock, using the full output channels (1.0 filters) of the first target scale, and generate the target feature satisfying the output channels of the first target scale.
In the above embodiment, the final target feature is obtained by performing deconvolution processing on the target initial feature multiple times; compared with the conventional approach of obtaining the final target feature by sequential convolution, this reduces the amount of calculation and improves the data processing speed and processing efficiency.
In one embodiment, performing regression prediction on the face features to obtain a detection result of the face to be detected may include: determining a target anchor frame corresponding to each target scale, wherein the target anchor frame comprises an anchor frame of the current target scale and anchor frames of adjacent target scales, and the scales of the adjacent target scales are closest to the current target scale and are larger than the current target scale; and performing regression prediction on the face features of each target scale according to the target anchor point frame corresponding to each target scale to obtain detection results of the face to be detected corresponding to different target scales.
In this embodiment, the terminal copies the anchor frames (anchors) of a large-scale feature layer and moves the copy down to the small-scale feature layer in a nested manner. That is, if the 16 × 16 feature layer corresponds to the anchors (16, 24), the 8 × 8 feature layer contains not only its own anchors (32, 48) but also the anchors (16, 24) of the 16 × 16 layer. Similarly, the 4 × 4 feature layer corresponds to (32, 48, 54, 86). In this way, positive samples can be generated more easily even when the total number of anchors is small, e.g., 8832 in fig. 3.
In addition, referring to fig. 5, for the 8 × 8 anchors with a width and height of 32, if a face falls exactly at the intersection of several anchors, the overlap between the face and any of the anchors it meets does not satisfy the requirement for a positive sample; the corresponding overlap expression is reproduced as an image in the original publication.
Therefore, in the scheme of the application, the anchors corresponding to the large-scale feature layer are copied down to the small-scale feature layer, so that the detection model can better distinguish the boundaries between positive and negative samples.
In addition, by copying the anchors corresponding to the large-scale feature layer down to the small-scale feature layer, and with continued reference to fig. 5, anchors with a width and height greater than those natively corresponding to the 4 × 4 layer appear on the 4 × 4 features, so that such a face can still be matched to an anchor.
Furthermore, copying the anchors corresponding to the large-scale feature layer down to the small-scale feature layer also alleviates the imbalance in the number of anchors across feature layers. For example, conventionally the anchor ratio of the feature layers at different scales is 16:4:1; after the scheme of the application is adopted, the ratio becomes 8:4:1. Table 7 shows the number of anchors in one embodiment of the application.
TABLE 7 Nested Down Anchor policy Table
Feature Stride Anchors Numbers
16x16 8 16,24 512
8x8 16 16,24,32,48 256
4x4 32 32,48,54,86 64
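The nested-down strategy of Table 7 can be expressed as a small configuration sketch (plain Python; the dictionary layout and names are illustrative):

    # Anchor sizes per target scale, following Table 7: each smaller feature layer
    # keeps its own anchors and also inherits those copied down from the layer above.
    ANCHORS = {
        16: {"stride": 8,  "sizes": (16, 24)},
        8:  {"stride": 16, "sizes": (16, 24, 32, 48)},  # own (32, 48) plus (16, 24) copied down
        4:  {"stride": 32, "sizes": (32, 48, 54, 86)},  # own plus anchors copied down from 8x8
    }

    def anchor_count(feature_size, sizes):
        # One anchor of each size at every cell of the feature map.
        return feature_size * feature_size * len(sizes)

    total = sum(anchor_count(f, cfg["sizes"]) for f, cfg in ANCHORS.items())
    print(total)  # 512 + 256 + 64 = 832 anchors for the three layers of Table 7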
In one embodiment, the detection model is a pre-trained detection model, and the training mode of the detection model may include: acquiring training set images; calibrating each face in the training set images to obtain calibrated training set images; inputting the training set images into the constructed initial detection model, performing face detection on them through the initial detection model, and generating corresponding detection results; calculating the model loss of the initial detection model according to the detection results and the calibrated training set data; adjusting the model parameters of the initial detection model based on the model loss to obtain a parameter-adjusted initial detection model; and performing iterative training on the parameter-adjusted initial detection model based on preset training parameters to obtain the trained detection model.
In this embodiment, since there is currently no open-source data set dedicated to training face detection on images acquired by terminals such as front cameras, data approximating front-camera images is obtained from the WIDER Face data set by cropping.
Further, face calibration is carried out on the obtained training set images, and each face in the training set images is calibrated to obtain calibrated training set images.
In this embodiment, the terminal may input the calibrated training set image into the constructed initial detection model, extract the face features through the initial detection model, perform regression prediction on the extracted face features, and generate a corresponding detection result. The detection result may include a detection frame for each face.
Further, the terminal may calculate a model loss of the initial detection model according to the detection result and the calibrated training set data, for example, calculate a difference between the calibrated face frame and the detected detection frame, and further calculate a loss value.
In this embodiment, the terminal may use a plurality of different loss functions to perform the loss calculation, for example, L1 or L2 loss functions.
In this embodiment, the loss may be divided into a classification loss and a regression loss, and the terminal may set different loss weights to train the model; for example, the weights of the classification loss and the regression loss may be 0.1 and 1.0, respectively, with the regression adopting a smooth-L1 loss and the classification adopting a cross-entropy loss.
Further, the terminal may adjust the model parameters of the initial detection model based on the model loss to obtain an initial detection model after the parameter adjustment, and perform iterative training on the initial detection model after the parameter adjustment based on a preset training parameter to obtain a trained detection model.
In this embodiment, the terminal may train the model with a corresponding training configuration. For example, during model training, conventional data enhancement methods such as rotation, mirroring, and random masking may be adopted, and the model may be trained with an Adam optimizer, with an initial learning rate of 0.001, a batch size of 4, and IoU (intersection over union) thresholds of 0.4 and 0.2, respectively: a sample with IoU greater than 0.4 is considered a positive sample, and a sample with IoU less than 0.2 is considered a negative sample.
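A hedged sketch of this training configuration follows (TensorFlow 2.x assumed; Huber loss with delta 1 stands in for smooth-L1, and the function name is a placeholder, not the application's code):

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # initial learning rate 0.001
    smooth_l1 = tf.keras.losses.Huber(delta=1.0)               # Huber(delta=1) matches smooth-L1
    cross_entropy = tf.keras.losses.BinaryCrossentropy()

    CLS_WEIGHT, REG_WEIGHT = 0.1, 1.0   # loss weights from the example above
    POS_IOU, NEG_IOU = 0.4, 0.2         # IoU thresholds for positive/negative samples
    BATCH_SIZE = 4

    def detection_loss(cls_true, cls_pred, box_true, box_pred):
        # Weighted sum of the classification and regression losses.
        return (CLS_WEIGHT * cross_entropy(cls_true, cls_pred)
                + REG_WEIGHT * smooth_l1(box_true, box_pred))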
In one embodiment, calibrating each face in the training set image to obtain a calibrated training set image may include: determining the image area ratio of each face in the training set image; and removing the human face with the image area ratio smaller than the image ratio threshold value through a preset mask template, and obtaining a calibrated training set image by using the calibration frame to the human face with the image area ratio larger than or equal to the image ratio threshold value.
As mentioned above, the scheme of the application obtains data approximating front-camera images from WIDER Face by cropping.
Specifically, a data set with relatively many faces may be cropped first, with the smallest retained face area being 0.0351562 of the image area. Faces smaller than this ratio are masked with the mean pixel value of the picture. Without such masking, the data set would effectively tell the model that a slightly larger face is a "face" while a smaller one is not, and the model would not converge easily.
In this embodiment, the terminal may determine an image area ratio of each face in the training set image, then remove the face whose image area ratio is smaller than the image ratio threshold through a preset mask template, and obtain the calibrated training set image by using the calibration frame for the face whose image area ratio is greater than or equal to the image ratio threshold.
In this embodiment, the model may be pre-trained based on the obtained training set image data set and, after convergence, fine-tuned on face data cropped from WIDER Face with a face area ratio of not less than 0.0625, thereby completing the training of the model.
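The masking rule described above can be sketched as follows (NumPy assumed; the box format and function name are illustrative):

    import numpy as np

    AREA_THRESHOLD = 0.0351562  # minimum face area / image area ratio from the text

    def mask_small_faces(image, boxes):
        # image: H x W x C uint8 array; boxes: iterable of (x1, y1, x2, y2) face boxes.
        # Faces below the area-ratio threshold are filled with the mean pixel value;
        # the remaining boxes are kept as calibration boxes.
        h, w = image.shape[:2]
        mean = image.mean(axis=(0, 1)).astype(image.dtype)
        kept = []
        for x1, y1, x2, y2 in boxes:
            ratio = (x2 - x1) * (y2 - y1) / (h * w)
            if ratio < AREA_THRESHOLD:
                image[int(y1):int(y2), int(x1):int(x2)] = mean
            else:
                kept.append((x1, y1, x2, y2))
        return image, kept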
The test effects of the present invention will be explained in detail below.
First, regarding model accuracy: TFFD-SSD and the detection model of the present application, TFFD-FPN, were evaluated on LFW, FDDB and IPNFace, with the results shown in Table 8.
TABLE 8 comparison of TFFD model results
Model Dataset AP
MobileNetV2-SSD 2k 97.95%
BlazeFace 2k 98.61%
ACF FDDB 3k 85.20%
CasCNN FDDB 3k 85.70%
FaceCraft FDDB 3k 90.80%
STN FDDB 3k 91.50%
MTCNN FDDB 3k 94.40%
FaceBoxes FDDB 3k 96.00%
TFFD-FPN FDDB 3k 98.34%
TFFD-SSD FDDB 3k 95.68%
TFFD-FPN LFW 13k 99.56%
TFFD-SSD LFW 13k 98.11%
TFFD-FPN IPNFace 10k 99.52%
TFFD-SSD IPNFace 10k 98.08%
As can be seen from Table 8, TFFD-FPN achieves an AP of 98.34% on FDDB, which is superior to FaceBoxes, ACF, CasCNN, FaceCraft, STN, MTCNN, and other models. Meanwhile, the accuracy is high across different data sets: the APs of TFFD-FPN on LFW and IPNFace are 99.56% and 99.52%, respectively, which shows that TFFD-FPN can detect faces well in complex real scenes.
On the other hand, the APs of TFFD-SSD on FDDB, LFW and IPNFace are 95.68%, 98.11% and 98.08%, respectively, all lower than those of TFFD-FPN, which shows that the LightDeconv2DBlock proposed by the present invention can better utilize the contextual information of the features while reducing the number of parameters.
Secondly, the model inference performance is analyzed.
TABLE 9 CPU inference performance comparison of TFFD with other models (the CPU-model entries are garbled in the source and shown as "-")
Model CPU FPS
ACF - 20
CasCNN - 14
FaceCraft N/A 10
STN - 10
MTCNN N/A 16
FaceBoxes - 20
TFFD-FPN - 140
As can be seen from Table 9, the inference speed of TFFD-FPN on a CPU can reach 140 fps, i.e., about 7 ms per image, far exceeding models such as FaceBoxes, ACF, CasCNN, FaceCraft, STN, and MTCNN. Moreover, TFFD-FPN occupies only 1.3 MB, and about 400 KB after model quantization, and can detect pictures shot by a front camera with complex scenes and multiple faces. Therefore, it can be ported to most devices involving front-camera scenarios, such as mobile phones, tablet computers, and notebook computers.
Further, with continued reference to fig. 5, the TFFD-FPN model in the scheme of the application also performs very well in multi-face and complex-scene cases, and can meet the practical requirements of most devices with front cameras, such as mobile terminals and clients.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least a portion of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of performance is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a face detection apparatus including: the image to be detected comprises an acquisition module 100, a feature extraction module 200 and a regression prediction module 300, wherein:
the image to be detected acquiring module 100 is configured to acquire an image to be detected, where the image to be detected includes a face to be detected.
The feature extraction module 200 is configured to extract features of an image to be detected through a detection model to obtain human face features, where the detection model includes a lightweight deconvolution module, and the lightweight deconvolution module includes a general convolutional layer, a deconvolution layer, and a separable convolutional layer.
The regression prediction module 300 is configured to perform regression prediction on the face features to obtain a detection result of the face to be detected.
In one embodiment, the feature extraction module 200 may include:
and the characteristic extraction submodule is used for extracting multi-scale characteristics of the image to be detected through each convolution module of the detection model to obtain initial characteristics corresponding to each scale.
And the target scale determining submodule is used for determining a target scale from a plurality of scales and determining each initial feature corresponding to each target scale as a target initial feature.
And the deconvolution processing submodule is used for performing deconvolution processing on the initial features of each target through the lightweight deconvolution module to generate the face features corresponding to the scales of each target.
In one embodiment, the deconvolution processing sub-module may include:
and the minimum target scale face feature determination unit is used for obtaining the face features corresponding to the minimum target scale according to the target initial features of the minimum scale in the target scales.
And the deconvolution processing unit is used for performing deconvolution processing on the target initial characteristic with the minimum scale through the lightweight deconvolution module to generate a target characteristic corresponding to the first target scale.
And the feature fusion unit is used for carrying out fusion processing on the target features corresponding to the first target scale and the target initial features corresponding to the first target scale to obtain the face features corresponding to the first target scale.
And the repeated processing unit is used for taking the target initial feature corresponding to the first target scale as the target initial feature of the minimum scale, and continuously performing deconvolution processing and fusion processing until the face features corresponding to all the target scales are obtained.
In one embodiment, the deconvolution processing unit may include:
and the first deconvolution processing subunit is used for performing first deconvolution processing on the target initial feature through the general convolutional layer to obtain a first feature of the first proportional output channel meeting the first target scale.
And the second deconvolution processing subunit is used for performing second deconvolution processing on the first features through the deconvolution layer to obtain second features of a second proportion output channel meeting the first target scale, wherein the first proportion is smaller than the second proportion.
And the third deconvolution processing subunit is used for performing third deconvolution processing on the second feature through separable convolution to generate the target feature meeting the first target scale and corresponding to the output channel.
In one embodiment, the regression prediction module 300 may include:
and the target anchor frame determining submodule is used for determining the target anchor frames corresponding to all the target scales, the target anchor frames comprise the anchor frame of the current target scale and the anchor frames of the adjacent target scales, and the scales of the adjacent target scales are closest to the current target scale and are larger than the current target scale.
And the regression prediction submodule is used for performing regression prediction on the face features of each target scale according to the target anchor point frame corresponding to each target scale to obtain the detection results of the face to be detected corresponding to different target scales.
In one embodiment, the detection model is a pre-trained detection model, and the apparatus may further include:
and the training set image acquisition module is used for acquiring training set images.
And the calibration module is used for calibrating each face in the training set image to obtain a calibrated training set image.
And the detection module is used for inputting the images of the training set into the constructed initial detection model, detecting the human face through the initial detection model and generating a corresponding detection result.
And the model loss calculation module is used for calculating the model loss of the initial detection model according to the detection result and the calibrated training set data.
And the parameter adjusting module is used for adjusting the model parameters of the initial detection model based on the model loss to obtain the initial detection model after the parameters are adjusted.
And the iterative training module is used for performing iterative training on the initial detection model after the parameter adjustment based on a preset training parameter to obtain a trained detection model.
In one embodiment, the calibration module may include:
and the image area ratio determining submodule is used for determining the image area ratio of each face in the training set image.
And the calibration submodule is used for removing the human face with the image area ratio smaller than the image ratio threshold value through a preset mask template, and obtaining a calibrated training set image by using the calibration frame to remove the human face with the image area ratio larger than or equal to the image ratio threshold value.
For specific limitations of the face detection apparatus, reference may be made to the above limitations of the face detection method, and details are not described here. All or part of the modules in the face detection device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing data such as images to be detected, human face characteristics, detection results and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a face detection method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program: acquiring an image to be detected, wherein the image to be detected comprises a face to be detected; extracting features of an image to be detected through a detection model to obtain human face features, wherein the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer; and performing regression prediction on the face features to obtain a detection result of the face to be detected.
In one embodiment, when the processor executes the computer program, the extracting features of the image to be detected through the detection model to obtain the face features may include: extracting multi-scale features of the image to be detected through each convolution module of the detection model to obtain initial features corresponding to each scale; determining target scales from the plurality of scales, and determining the initial feature corresponding to each target scale as a target initial feature; and performing deconvolution processing on each target initial feature through the lightweight deconvolution module to generate the face feature corresponding to each target scale.
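For illustration only (no code appears in the original filing), the multi-scale extraction and target-scale selection described above might look like the following PyTorch sketch; the stage count, channel widths, strides, and the choice of which scales become target scales are all assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    """Hypothetical backbone: each convolution module halves the resolution,
    so four stages emit initial features at strides 2, 4, 8, and 16."""
    def __init__(self, channels=(16, 32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_c = 3
        for out_c in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_c, out_c, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_c),
                nn.ReLU(inplace=True)))
            in_c = out_c

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # initial feature corresponding to this scale
        return feats

backbone = MultiScaleBackbone()
initial_feats = backbone(torch.randn(1, 3, 256, 256))
target_indices = [1, 2, 3]  # e.g. keep strides 4/8/16 as the target scales
target_initial_feats = [initial_feats[i] for i in target_indices]
```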
In one embodiment, when the processor executes the computer program, the performing deconvolution processing on each target initial feature through the lightweight deconvolution module to generate the face features corresponding to each target scale may include: obtaining the face feature corresponding to the minimum target scale directly from the target initial feature of the minimum scale among the target scales; performing deconvolution processing on the target initial feature of the minimum scale through the lightweight deconvolution module to generate a target feature corresponding to a first target scale; fusing the target feature corresponding to the first target scale with the target initial feature corresponding to the first target scale to obtain the face feature corresponding to the first target scale; and taking the target initial feature corresponding to the first target scale as the new minimum-scale target initial feature, and continuing the deconvolution and fusion processing until the face features corresponding to all the target scales are obtained.
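The top-down loop above can be pictured in a few lines. This is a minimal sketch, assuming elementwise addition as the fusion operation and plain transposed convolutions standing in for the lightweight deconvolution module (a fuller sketch of that module follows the next paragraph):

```python
import torch
import torch.nn as nn

def top_down_fuse(target_initial_feats, deconv_modules):
    """target_initial_feats is ordered from the largest to the smallest scale;
    deconv_modules[i] upsamples the feature at scale i+1 to scale i."""
    face_feats = [None] * len(target_initial_feats)
    face_feats[-1] = target_initial_feats[-1]  # minimum target scale, used directly
    for i in range(len(target_initial_feats) - 2, -1, -1):
        upsampled = deconv_modules[i](target_initial_feats[i + 1])  # deconvolution
        face_feats[i] = upsampled + target_initial_feats[i]         # fusion
    return face_feats

# Toy usage with assumed channel counts and 2x upsampling per step:
feats = [torch.randn(1, 32, 32, 32),
         torch.randn(1, 64, 16, 16),
         torch.randn(1, 128, 8, 8)]
deconvs = nn.ModuleList([
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)])
face_feats = top_down_fuse(feats, deconvs)  # one face feature per target scale
```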
In one embodiment, when the processor executes the computer program, the performing deconvolution processing on the target initial feature of the minimum scale through the lightweight deconvolution module to generate the target feature corresponding to the first target scale may include: performing first deconvolution processing on the target initial feature through the general convolutional layer to obtain a first feature whose output channels meet a first proportion of the first target scale; performing second deconvolution processing on the first feature through the deconvolution layer to obtain a second feature whose output channels meet a second proportion of the first target scale, wherein the first proportion is smaller than the second proportion; and performing third deconvolution processing on the second feature through the separable convolutional layer to generate a target feature whose output channels meet the first target scale.
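One possible realization of this three-layer module is sketched below. The 1/4 and 1/2 channel proportions, the kernel sizes, and the 2x upsampling factor are illustrative assumptions; the description only requires that the first proportion be smaller than the second.

```python
import torch
import torch.nn as nn

class LightweightDeconv(nn.Module):
    """Sketch: general 1x1 convolution -> transposed convolution ->
    depthwise-separable convolution, expanding the channel count in stages."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        c1 = out_channels // 4  # first proportion of the output channels (assumed)
        c2 = out_channels // 2  # second proportion, larger than the first (assumed)
        self.general = nn.Conv2d(in_channels, c1, kernel_size=1)
        self.deconv = nn.ConvTranspose2d(c1, c2, kernel_size=4,
                                         stride=2, padding=1)  # 2x upsample
        self.separable = nn.Sequential(
            nn.Conv2d(c2, c2, kernel_size=3, padding=1, groups=c2),  # depthwise
            nn.Conv2d(c2, out_channels, kernel_size=1))              # pointwise

    def forward(self, x):
        return self.separable(self.deconv(self.general(x)))

m = LightweightDeconv(128, 64)
y = m(torch.randn(1, 128, 8, 8))
print(y.shape)  # torch.Size([1, 64, 16, 16])
```

The design intent, under these assumptions, is that the expensive transposed convolution operates on a reduced channel count while the cheap separable convolution restores the full output width, which is what keeps the module lightweight.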
In one embodiment, when the processor executes the computer program, the performing regression prediction on the face features to obtain the detection result of the face to be detected may include: determining a target anchor frame corresponding to each target scale, wherein the target anchor frame comprises the anchor frames of the current target scale and the anchor frames of an adjacent target scale, the adjacent target scale being the scale closest to and larger than the current target scale; and performing regression prediction on the face features of each target scale according to the target anchor frame corresponding to each target scale to obtain detection results of the face to be detected at different target scales.
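The anchor grouping can be made concrete with a few lines of Python; the strides and anchor sizes below are invented for the example, and the reading that each scale additionally borrows the next larger scale's anchors (presumably so that faces near a scale boundary are covered by two prediction heads) is an interpretation of the description.

```python
# Map each target scale (keyed by stride) to its own anchor sizes plus the
# anchor sizes of the adjacent, next larger target scale.
scale_anchors = {8: [16, 32], 16: [64, 128], 32: [256, 512]}
strides = sorted(scale_anchors)

target_anchor_boxes = {}
for i, stride in enumerate(strides):
    anchors = list(scale_anchors[stride])          # current target scale
    if i + 1 < len(strides):                       # adjacent larger target scale
        anchors += scale_anchors[strides[i + 1]]
    target_anchor_boxes[stride] = anchors

print(target_anchor_boxes)
# {8: [16, 32, 64, 128], 16: [64, 128, 256, 512], 32: [256, 512]}
```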
In one embodiment, when the processor executes the computer program, the detection model is a pre-trained detection model, and the training of the detection model may include: acquiring training set images; calibrating each face in the training set images to obtain calibrated training set images; inputting the training set images into the constructed initial detection model, performing face detection through the initial detection model, and generating corresponding detection results; calculating the model loss of the initial detection model according to the detection results and the calibrated training set data; adjusting the model parameters of the initial detection model based on the model loss to obtain a parameter-adjusted initial detection model; and performing iterative training on the parameter-adjusted initial detection model based on preset training parameters to obtain a trained detection model.
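A generic sketch of this training procedure follows; the optimizer, learning rate, epoch count, and the placeholder detection_loss are assumptions standing in for the unspecified model loss and "preset training parameters".

```python
import torch

def train(model, loader, detection_loss, epochs=50, lr=1e-3):
    """Minimal loop: forward pass, loss against calibrated targets,
    parameter adjustment, repeated for a preset number of epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):                    # iterative training
        for images, calibrated_targets in loader:
            preds = model(images)              # face detection on training images
            loss = detection_loss(preds, calibrated_targets)  # model loss
            optimizer.zero_grad()
            loss.backward()                    # gradients of the model loss
            optimizer.step()                   # adjust the model parameters
    return model
```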
In one embodiment, when the processor executes the computer program, the calibrating each face in the training set image to obtain a calibrated training set image may include: determining the image area ratio of each face in the training set image; removing, through a preset mask template, faces whose image area ratio is smaller than the image ratio threshold; and marking faces whose image area ratio is greater than or equal to the image ratio threshold with calibration frames to obtain a calibrated training set image.
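A small sketch of this calibration rule, assuming a mean-fill mask template, pixel-coordinate boxes, and an arbitrary area-ratio threshold (all three are illustrative choices, not specified by the description):

```python
import numpy as np

def calibrate(image, face_boxes, ratio_threshold=0.001):
    """Blank out faces that occupy too small a fraction of the image and
    keep calibration frames only for the remaining faces."""
    h, w = image.shape[:2]
    fill = image.mean(axis=(0, 1)).astype(image.dtype)  # assumed mask template
    kept_boxes = []
    for (x1, y1, x2, y2) in face_boxes:
        area_ratio = (x2 - x1) * (y2 - y1) / float(h * w)
        if area_ratio < ratio_threshold:
            image[y1:y2, x1:x2] = fill           # mask out the too-small face
        else:
            kept_boxes.append((x1, y1, x2, y2))  # keep its calibration frame
    return image, kept_boxes
```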
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring an image to be detected, wherein the image to be detected comprises a face to be detected; extracting features of an image to be detected through a detection model to obtain human face features, wherein the detection model comprises a lightweight deconvolution module, and the lightweight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer; and performing regression prediction on the face features to obtain a detection result of the face to be detected.
In one embodiment, when the computer program is executed by the processor, the extracting features of the image to be detected through the detection model to obtain the face features may include: extracting multi-scale features of the image to be detected through each convolution module of the detection model to obtain initial features corresponding to each scale; determining target scales from the plurality of scales, and determining the initial feature corresponding to each target scale as a target initial feature; and performing deconvolution processing on each target initial feature through the lightweight deconvolution module to generate the face feature corresponding to each target scale.
In one embodiment, when the computer program is executed by the processor, the performing deconvolution processing on each target initial feature through the lightweight deconvolution module to generate the face features corresponding to each target scale may include: obtaining the face feature corresponding to the minimum target scale directly from the target initial feature of the minimum scale among the target scales; performing deconvolution processing on the target initial feature of the minimum scale through the lightweight deconvolution module to generate a target feature corresponding to a first target scale; fusing the target feature corresponding to the first target scale with the target initial feature corresponding to the first target scale to obtain the face feature corresponding to the first target scale; and taking the target initial feature corresponding to the first target scale as the new minimum-scale target initial feature, and continuing the deconvolution and fusion processing until the face features corresponding to all the target scales are obtained.
In one embodiment, when the computer program is executed by the processor, the performing deconvolution processing on the target initial feature of the minimum scale through the lightweight deconvolution module to generate the target feature corresponding to the first target scale may include: performing first deconvolution processing on the target initial feature through the general convolutional layer to obtain a first feature whose output channels meet a first proportion of the first target scale; performing second deconvolution processing on the first feature through the deconvolution layer to obtain a second feature whose output channels meet a second proportion of the first target scale, wherein the first proportion is smaller than the second proportion; and performing third deconvolution processing on the second feature through the separable convolutional layer to generate a target feature whose output channels meet the first target scale.
In one embodiment, when the computer program is executed by the processor, the performing regression prediction on the face features to obtain the detection result of the face to be detected may include: determining a target anchor frame corresponding to each target scale, wherein the target anchor frame comprises the anchor frames of the current target scale and the anchor frames of an adjacent target scale, the adjacent target scale being the scale closest to and larger than the current target scale; and performing regression prediction on the face features of each target scale according to the target anchor frame corresponding to each target scale to obtain detection results of the face to be detected at different target scales.
In one embodiment, when the computer program is executed by the processor, the detection model is a pre-trained detection model, and the training of the detection model may include: acquiring training set images; calibrating each face in the training set images to obtain calibrated training set images; inputting the training set images into the constructed initial detection model, performing face detection through the initial detection model, and generating corresponding detection results; calculating the model loss of the initial detection model according to the detection results and the calibrated training set data; adjusting the model parameters of the initial detection model based on the model loss to obtain a parameter-adjusted initial detection model; and performing iterative training on the parameter-adjusted initial detection model based on preset training parameters to obtain a trained detection model.
In one embodiment, when the computer program is executed by the processor, the calibrating each face in the training set image to obtain a calibrated training set image may include: determining the image area ratio of each face in the training set image; removing, through a preset mask template, faces whose image area ratio is smaller than the image ratio threshold; and marking faces whose image area ratio is greater than or equal to the image ratio threshold with calibration frames to obtain a calibrated training set image.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that, for a person of ordinary skill in the art, several variations and modifications can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A face detection method, comprising:
acquiring an image to be detected, wherein the image to be detected comprises a face to be detected;
extracting features of the image to be detected through a detection model to obtain human face features, wherein the detection model comprises a light-weight deconvolution module, and the light-weight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer;
and performing regression prediction on the face features to obtain a detection result of the face to be detected.
2. The method according to claim 1, wherein the extracting features of the image to be detected through the detection model to obtain the human face features comprises:
extracting multi-scale features of the image to be detected through each convolution module of the detection model to obtain initial features corresponding to each scale;
determining target scales from a plurality of scales, and determining each initial feature corresponding to each target scale as a target initial feature;
and carrying out deconvolution processing on each target initial feature through the lightweight deconvolution module to generate the human face features corresponding to each target scale.
3. The method of claim 2, wherein the deconvolving each of the target initial features by the lightweight deconvolution module to generate a face feature corresponding to each of the target scales, comprises:
obtaining the face features corresponding to the minimum target scale according to the target initial features of the minimum scale in the target scales;
carrying out deconvolution processing on the target initial feature of the minimum scale through the lightweight deconvolution module to generate a target feature corresponding to a first target scale;
performing fusion processing on the target features corresponding to the first target scale and the target initial features corresponding to the first target scale to obtain face features corresponding to the first target scale;
and taking the target initial feature corresponding to the first target scale as the target initial feature of the minimum scale, and continuing to perform deconvolution processing and fusion processing until the face features corresponding to all the target scales are obtained.
4. The method according to claim 3, wherein the deconvolving the target initial feature of the minimum scale by the lightweight deconvolution module to generate the target feature corresponding to the first target scale, includes:
performing first deconvolution processing on the target initial feature through the general convolutional layer to obtain a first feature whose output channels meet a first proportion of the first target scale;
performing second deconvolution processing on the first feature through the deconvolution layer to obtain a second feature whose output channels meet a second proportion of the first target scale, wherein the first proportion is smaller than the second proportion; and
performing third deconvolution processing on the second feature through the separable convolutional layer to generate a target feature whose output channels meet the first target scale.
5. The method according to claim 2, wherein the performing regression prediction on the face features to obtain the detection result of the face to be detected comprises:
determining a target anchor frame corresponding to each target scale, wherein the target anchor frame comprises an anchor frame of a current target scale and anchor frames of adjacent target scales, and the scales of the adjacent target scales are closest to the current target scale and are larger than the current target scale;
and performing regression prediction on the face features of each target scale according to the target anchor frame corresponding to each target scale to obtain detection results of the face to be detected at different target scales.
6. The method of claim 1, wherein the detection model is a pre-trained detection model, and the training of the detection model comprises:
acquiring a training set image;
calibrating each face in the training set image to obtain a calibrated training set image;
inputting the images of the training set into a constructed initial detection model, carrying out face detection through the initial detection model, and generating a corresponding detection result;
calculating the model loss of the initial detection model according to the detection result and the calibrated training set data;
based on the model loss, adjusting model parameters of the initial detection model to obtain a parameter-adjusted initial detection model;
and performing iterative training on the initial detection model after the parameter adjustment based on a preset training parameter to obtain a trained detection model.
7. The method according to claim 6, wherein the calibrating each face in the training set images to obtain calibrated training set images comprises:
determining the image area ratio of each face in the training set image;
and removing, through a preset mask template, faces whose image area ratio is smaller than an image ratio threshold, and marking faces whose image area ratio is greater than or equal to the image ratio threshold with calibration frames, to obtain a calibrated training set image.
8. An apparatus for face detection, the apparatus comprising:
the system comprises an image acquisition module to be detected, a face detection module and a face detection module, wherein the image acquisition module to be detected is used for acquiring an image to be detected, and the image to be detected comprises a face to be detected;
the feature extraction module is used for extracting features of the image to be detected through a detection model to obtain human face features, the detection model comprises a light-weight deconvolution module, and the light-weight deconvolution module comprises a general convolutional layer, a deconvolution layer and a separable convolutional layer;
and the regression prediction module is used for carrying out regression prediction on the human face features to obtain the detection result of the human face to be detected.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110009932.2A 2021-01-05 2021-01-05 Face detection method, device, computer equipment and storage medium Active CN112699826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110009932.2A CN112699826B (en) 2021-01-05 2021-01-05 Face detection method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110009932.2A CN112699826B (en) 2021-01-05 2021-01-05 Face detection method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112699826A true CN112699826A (en) 2021-04-23
CN112699826B CN112699826B (en) 2024-05-28

Family

ID=75514766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110009932.2A Active CN112699826B (en) 2021-01-05 2021-01-05 Face detection method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112699826B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106910247A (en) * 2017-03-20 2017-06-30 厦门幻世网络科技有限公司 Method and apparatus for generating three-dimensional head portrait model
US20180068198A1 (en) * 2016-09-06 2018-03-08 Carnegie Mellon University Methods and Software for Detecting Objects in an Image Using Contextual Multiscale Fast Region-Based Convolutional Neural Network
CN110263731A (en) * 2019-06-24 2019-09-20 电子科技大学 A kind of single step face detection system
US20190303715A1 (en) * 2018-03-29 2019-10-03 Qualcomm Incorporated Combining convolution and deconvolution for object detection
CN110427821A (en) * 2019-06-27 2019-11-08 高新兴科技集团股份有限公司 A kind of method for detecting human face and system based on lightweight convolutional neural networks
WO2019223254A1 (en) * 2018-05-21 2019-11-28 北京亮亮视野科技有限公司 Construction method for multi-scale lightweight face detection model and face detection method based on model
WO2020029572A1 (en) * 2018-08-10 2020-02-13 浙江宇视科技有限公司 Human face feature point detection method and device, equipment and storage medium
CN111325781A (en) * 2020-02-17 2020-06-23 合肥工业大学 Bit depth increasing method and system based on lightweight network
CN111400535A (en) * 2020-03-11 2020-07-10 广东宜教通教育有限公司 Lightweight face recognition method, system, computer device and storage medium
WO2020151489A1 (en) * 2019-01-25 2020-07-30 杭州海康威视数字技术股份有限公司 Living body detection method based on facial recognition, and electronic device and storage medium
CN112132164A (en) * 2020-11-20 2020-12-25 北京易真学思教育科技有限公司 Target detection method, system, computer device and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHU PENG; CHEN HU; LI KE; CHENG BINYANG: "A Lightweight Multi-Scale Feature Face Detection Method", Computer Technology and Development, no. 04 *
YANG LIANPING; SUN YUBO; ZHANG HONGLIANG; LI FENG; ZHANG XIANGDE: "Human Keypoint Matching Network Based on Encoder-Decoder Residuals", Computer Science, no. 06 *

Also Published As

Publication number Publication date
CN112699826B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN109493350B (en) Portrait segmentation method and device
CN110517278B (en) Image segmentation and training method and device of image segmentation network and computer equipment
US20200372243A1 (en) Image processing method and apparatus, facial recognition method and apparatus, and computer device
CN106778928B (en) Image processing method and device
CN110473137B (en) Image processing method and device
WO2022134337A1 (en) Face occlusion detection method and system, device, and storage medium
CN109416727B (en) Method and device for removing glasses in face image
CN109886077B (en) Image recognition method and device, computer equipment and storage medium
CN112258528B (en) Image processing method and device and electronic equipment
Yang et al. Fast image super-resolution based on in-place example regression
CN110569721A (en) Recognition model training method, image recognition method, device, equipment and medium
CN109829506B (en) Image processing method, image processing device, electronic equipment and computer storage medium
WO2018018470A1 (en) Method, apparatus and device for eliminating image noise and convolutional neural network
CN111968134B (en) Target segmentation method, device, computer readable storage medium and computer equipment
US8774519B2 (en) Landmark detection in digital images
CN113269149B (en) Method and device for detecting living body face image, computer equipment and storage medium
Lu et al. Rethinking prior-guided face super-resolution: A new paradigm with facial component prior
CN111062324A (en) Face detection method and device, computer equipment and storage medium
CN111080571A (en) Camera shielding state detection method and device, terminal and storage medium
CN113421276A (en) Image processing method, device and storage medium
CN112001285B (en) Method, device, terminal and medium for processing beauty images
CN115115552B (en) Image correction model training method, image correction device and computer equipment
CN109871814B (en) Age estimation method and device, electronic equipment and computer storage medium
CN117037244A (en) Face security detection method, device, computer equipment and storage medium
CN112699826A (en) Face detection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant