CN110490115B - Training method and device of face detection model, electronic equipment and storage medium - Google Patents

Training method and device of face detection model, electronic equipment and storage medium

Info

Publication number
CN110490115B
CN110490115B (application CN201910746118.1A)
Authority
CN
China
Prior art keywords
feature
layer
training
face detection
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910746118.1A
Other languages
Chinese (zh)
Other versions
CN110490115A (en)
Inventor
Yang Fan (杨帆)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910746118.1A priority Critical patent/CN110490115B/en
Publication of CN110490115A publication Critical patent/CN110490115A/en
Application granted granted Critical
Publication of CN110490115B publication Critical patent/CN110490115B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a training method and apparatus for a face detection model, an electronic device, and a storage medium, in the technical field of image processing, and aims to solve the problem that face detection models in the related art detect small-scale faces poorly. The method includes the following steps: determining the positive-sample pixel range detected by each feature layer according to the anchor points corresponding to each feature layer in the face detection model; determining, from the training images of a face detection data set and according to the ratio of the numbers of negative samples among the feature layers, the target training images corresponding to each feature layer, where a target training image for a feature layer is a training image whose face scales fall within the positive-sample pixel range detected by that feature layer; and training the face detection model with the target training images corresponding to each feature layer. Because the target training images corresponding to the feature layers are apportioned based on the negative-sample number ratio, the detection of small-scale faces is enhanced.

Description

Training method and device of face detection model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a training method and apparatus for a face detection model, an electronic device, and a storage medium.
Background
Face detection is the first step of all face-related tasks; face recognition, face attribute analysis, and face key point extraction can be performed only after a face has been detected. Its main task is to judge whether a face exists in a given image and, if one exists, to give the position and size of the face. Compared with general objects, faces vary much more in scale: for a 640x640 picture, the face size can range anywhere from 10 to 640 pixels.
In the field of face detection, detecting small-scale faces is a difficult problem: learning a small-scale face detector requires a large amount of small-scale face data, yet the detection performance on ordinary or large-scale faces must not be degraded. Face detection models in the related art can detect faces of different scales, but their detection of small-scale faces is poor.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a face detection model, an electronic device, and a storage medium, so as to at least solve the problem that face detection models in the related art detect small-scale faces poorly. The technical scheme of the disclosure is as follows:

According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a face detection model, including:
determining the positive-sample pixel range detected by each feature layer according to the anchor points corresponding to each feature layer in a face detection model, where the face detection model is a neural network model including at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor points represent the range of face scales detected by the feature layer;

determining target training images corresponding to each feature layer from the training images of a face detection data set according to the positive-sample pixel range detected by each feature layer, so that the ratio of the numbers of target training images among the feature layers is the same as a preset ratio of the numbers of negative samples among the feature layers, where the face scales contained in the target training images corresponding to a feature layer fall within the positive-sample pixel range detected by that feature layer;

and training the face detection model with the target training images corresponding to each feature layer.
In an optional implementation, the step of determining, according to the anchor points corresponding to each feature layer in the face detection model, the positive-sample pixel range detected by each feature layer includes:

determining the upper limit of the positive-sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to that middle feature layer and the minimum anchor point corresponding to the feature layer above it; and taking the upper limit of the positive-sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive-sample pixel range detected by the middle feature layer.

In an optional implementation, the step of determining, according to the anchor points corresponding to each feature layer in the face detection model, the positive-sample pixel range detected by each feature layer includes:

determining the upper limit of the positive-sample pixel range detected by the uppermost feature layer according to the size of the training image, and taking the upper limit of the positive-sample pixel range detected by the feature layer below the uppermost feature layer as the lower limit of the positive-sample pixel range detected by the uppermost feature layer.

In an optional implementation, the step of determining, according to the anchor points corresponding to each feature layer in the face detection model, the positive-sample pixel range detected by each feature layer includes:

determining the upper limit of the positive-sample pixel range detected by the lowermost feature layer according to the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it; and taking the minimum anchor point corresponding to the lowermost feature layer as the lower limit of the positive-sample pixel range detected by the lowermost feature layer.

In an optional embodiment, determining the upper limit of the positive-sample pixel range detected by the middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above it includes:

taking the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above it as the upper limit of the positive-sample pixel range detected by the middle feature layer.

In an optional implementation, determining the upper limit of the positive-sample pixel range detected by the lowermost feature layer according to the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it includes:

taking the average of the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it as the upper limit of the positive-sample pixel range detected by the lowermost feature layer.
In an alternative embodiment, the number of negative samples detected by each feature layer is proportional to the scale of that feature layer;

the step of determining the target training images corresponding to each feature layer from the face detection data set according to the positive-sample pixel range detected by each feature layer includes:

determining the target training images corresponding to the uppermost feature layer from the face detection data set;

for any other feature layer, if the image-number ratio corresponding to that feature layer is larger than the negative-sample number ratio between the uppermost feature layer and that feature layer, copying training images in the face detection data set and scaling the copied training images;

determining the target training images corresponding to that feature layer from the scaled training images;

where the image-number ratio corresponding to a feature layer is the ratio of the number of target training images corresponding to the uppermost feature layer to the number of candidate training images corresponding to that feature layer, and the candidate training images are the training images in the face detection data set whose face scales fall within the positive-sample pixel range detected by that feature layer.
In an optional implementation, copying the training images in the face detection data set and scaling the copied training images includes:

copying, at least once, a training image containing a face scale not less than the upper limit of the positive-sample pixel range detected by the feature layer;

and scaling the copied training images in equal proportion according to the positive-sample pixel range detected by the feature layer.
In an optional implementation, the step of determining the target training images corresponding to the feature layer from the scaled training images further includes:

selecting target training images corresponding to the feature layer from the candidate training images corresponding to the feature layer.

In an optional embodiment, the step of determining the target training images corresponding to the feature layer from the scaled training images includes:

selecting, from the scaled training images, training images whose face scales fall within the positive-sample pixel range detected by the feature layer, and taking the selected training images as the target training images corresponding to the feature layer.
According to a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for a face detection model, including:
a first determining unit configured to determine the positive-sample pixel range detected by each feature layer according to the anchor points corresponding to each feature layer in a face detection model, where the face detection model is a neural network model including at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor points represent the range of face scales detected by the feature layer;

a second determining unit configured to determine, from the training images of a face detection data set and according to the positive-sample pixel range detected by each feature layer, the target training images corresponding to each feature layer, so that the ratio of the numbers of target training images among the feature layers is the same as a preset ratio of the numbers of negative samples among the feature layers, where the face scales contained in the target training images corresponding to a feature layer fall within the positive-sample pixel range detected by that feature layer;

and a training unit configured to train the face detection model with the target training images corresponding to each feature layer.
In an optional implementation, the first determining unit is specifically configured to:

determine the upper limit of the positive-sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to that middle feature layer and the minimum anchor point corresponding to the feature layer above it; and take the upper limit of the positive-sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive-sample pixel range detected by the middle feature layer.

In an optional implementation, the first determining unit is specifically configured to:

determine the upper limit of the positive-sample pixel range detected by the uppermost feature layer according to the size of the training image, and take the upper limit of the positive-sample pixel range detected by the feature layer below the uppermost feature layer as the lower limit of the positive-sample pixel range detected by the uppermost feature layer.

In an optional implementation, the first determining unit is specifically configured to:

determine the upper limit of the positive-sample pixel range detected by the lowermost feature layer according to the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it; and take the minimum anchor point corresponding to the lowermost feature layer as the lower limit of the positive-sample pixel range detected by the lowermost feature layer.

In an optional implementation, the first determining unit is specifically configured to:

take the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above it as the upper limit of the positive-sample pixel range detected by the middle feature layer.

In an optional implementation, the first determining unit is specifically configured to:

take the average of the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it as the upper limit of the positive-sample pixel range detected by the lowermost feature layer.
In an alternative implementation, the number of negative samples detected by each feature layer is proportional to the scale of that feature layer;

the second determining unit is specifically configured to:

determine the target training images corresponding to the uppermost feature layer from the face detection data set;

for any other feature layer, if the image-number ratio corresponding to that feature layer is larger than the negative-sample number ratio between the uppermost feature layer and that feature layer, copy training images in the face detection data set and scale the copied training images;

determine the target training images corresponding to that feature layer from the scaled training images;

where the image-number ratio corresponding to a feature layer is the ratio of the number of target training images corresponding to the uppermost feature layer to the number of candidate training images corresponding to that feature layer, and the candidate training images are the training images in the face detection data set whose face scales fall within the positive-sample pixel range detected by that feature layer.

In an optional implementation, the second determining unit is specifically configured to:

copy, at least once, a training image containing a face scale not less than the upper limit of the positive-sample pixel range detected by the feature layer;

and scale the copied training images in equal proportion according to the positive-sample pixel range detected by the feature layer.

In an optional implementation, the second determining unit is further configured to:

select target training images corresponding to the feature layer from the candidate training images corresponding to the feature layer.

In an optional implementation, the second determining unit is specifically configured to:

select, from the scaled training images, training images whose face scales fall within the positive-sample pixel range detected by the feature layer, and take the selected training images as the target training images corresponding to the feature layer.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the training method of the face detection model according to any one of the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-volatile readable storage medium, where instructions of the storage medium, when executed by a processor of a training apparatus for a face detection model, enable the training apparatus for the face detection model to perform the training method for the face detection model according to any one of the first aspect of the embodiments of the present disclosure.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product which, when run on an electronic device, causes the electronic device to perform the method of the first aspect of the embodiments of the present disclosure and any of the possible implementations of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
training of the face detection model is apportioned according to the ratio of the numbers of negative samples of the different feature layers; the positive-sample pixel range detected by each feature layer is determined according to the anchor points corresponding to that feature layer, and the different positive-sample pixel ranges can be used to detect faces of different scales. Because the ratio of the numbers of target training images among the feature layers is the same as the preset ratio of the numbers of negative samples among the feature layers, face data of different scales are balanced, each feature layer can be well trained, a more robust multi-scale face detector can be trained to meet the requirements of multi-scale face detection, and the detection of small-scale faces is enhanced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating face detection according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of training a face detection model, according to an example embodiment.
FIG. 3 is a network architecture diagram illustrating a face detection model according to an exemplary embodiment.
FIG. 4A is a schematic diagram illustrating a first type of training image, according to an example embodiment.
FIG. 4B is a schematic diagram illustrating a second type of training image, according to an example embodiment.
FIG. 4C is a schematic diagram illustrating a third type of training image, according to an example embodiment.
FIG. 5A is a schematic diagram illustrating a first type of scaled training image, according to an example embodiment.
FIG. 5B is a schematic diagram illustrating a second type of scaled training image, according to an example embodiment.
FIG. 5C is a schematic diagram illustrating a third scaled training image in accordance with an exemplary embodiment.
FIG. 6 is a flow diagram illustrating a complete method of training of a face detection model, according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating an apparatus for training a face detection model according to an exemplary embodiment.
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Some of the words that appear in the text are explained below:
1. The term "and/or" in the embodiments of the present disclosure describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
2. The term "electronic equipment" in the embodiments of the present disclosure refers to equipment composed of electronic components such as integrated circuits, transistors, and electron tubes, and functioning through the application of electronic technology (including software); it includes electronic computers, robots controlled by electronic computers, numerical-control or program-control systems, and the like.
3. The term "positive sample" in the embodiments of the present disclosure refers to the target object to be detected by a task; in face detection the target object is a face in an image, such as faces of different ethnicities and ages, faces with different expressions, faces wearing different decorations, and the like, and it may be a face of any scale.
4. The term "negative sample" in the embodiments of the present disclosure refers to the background in which the object to be detected is located; more than 99% of an image's background is non-face area. For example, a face may appear in different environments, such as a street or a room.
5. The term "multi-scale face" in the embodiments of the present disclosure refers to faces of different scales; for example, the faces contained in the training images of a face detection data set span a large scale range, including relatively large-scale and relatively small-scale faces.
6. The term "RetinaFace" in the embodiments of the present disclosure refers to a powerful single-stage face detector that performs pixel-wise face localization across various face scales using joint supervised and self-supervised multi-task learning.
The application scenarios described in the embodiments of the present disclosure are intended to illustrate the technical solutions more clearly and do not limit them; as a person of ordinary skill in the art knows, as new application scenarios emerge, the technical solutions provided in the embodiments of the present disclosure remain applicable to similar technical problems. In the description of the present disclosure, unless otherwise indicated, "plurality" means two or more.
Face detection is a very hot and challenging algorithmic problem in the field of computer vision, and is also one of the most important business scenarios for artificial-intelligence algorithms. To improve their algorithms and demonstrate technical strength, many AI (Artificial Intelligence) companies select open data sets to verify their algorithmic capability. Among the many data sets, WIDER FACE, established by the Chinese University of Hong Kong in 2016, is currently the largest and most difficult publicly available face detection data set. It contains 20,000 training images with about 180,000 faces of different sizes.
The WIDER FACE data set is very difficult and close to real scenes: it gathers various influencing factors such as face size, varied shooting angles and face poses, different degrees of face occlusion and expression change, different types and intensities of illumination, and various makeup styles.
In recent years, with the development of deep learning, face detection methods based on deep neural networks have gradually appeared, such as the Cascade Convolutional Neural Network and Faster R-CNN (Faster Regions with Convolutional Neural Networks). Compared with conventional face detection methods, the features extracted by deep neural networks are more robust and descriptive. Fig. 1 is a schematic diagram of face detection based on a deep neural network, in which two faces are detected; both occupy a large proportion of the image, i.e., the face sizes are large. However, because faces vary widely in scale, current face detection models perform poorly on the small-scale faces (faces occupying a small proportion of the image) in multi-scale face detection.
Fig. 2 is a flowchart illustrating a training method of a face detection model according to an exemplary embodiment, and as shown in fig. 2, the method includes the following steps.
In step S21, the positive-sample pixel range detected by each feature layer is determined according to the anchor points corresponding to each feature layer in the face detection model, where the face detection model is a neural network model including at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor points represent the range of face scales detected by the feature layer.
In step S22, the target training images corresponding to each feature layer are determined from the training images of the face detection data set according to the positive-sample pixel range detected by each feature layer, so that the ratio of the numbers of target training images among the feature layers is the same as the preset ratio of the numbers of negative samples among the feature layers, where the face scales contained in the target training images corresponding to a feature layer fall within the positive-sample pixel range detected by that feature layer.
In step S23, the face detection model is trained with the target training images corresponding to each feature layer.
In the embodiments of the present disclosure, the face detection model refers to a neural network model for detecting a face.
With this scheme, training of the face detection model is apportioned according to the ratio of the numbers of negative samples of the different feature layers; the positive-sample pixel range detected by each feature layer is determined according to the sizes of that layer's anchors, and different positive-sample pixel ranges can be used to detect faces of different scales. Because the ratio of the numbers of target training images among the feature layers is the same as the preset ratio of the numbers of negative samples among the feature layers, face data of different scales are balanced, each feature layer can be well trained, a more robust multi-scale face detector can be trained to meet the requirements of multi-scale face detection, and the detection of small-scale faces is enhanced.
An anchor represents the range of face scales detected by a feature layer: the larger the anchor, the larger the detectable face scale. For example, an anchor of 16 means that a face of about 16 pixels can be detected, i.e., the detectable face scale is about 16 × 16, covered by a 16 × 16 anchor box; an anchor of 25 means that a face of about 25 pixels can be detected, i.e., the detectable face scale is about 25 × 25.
In practice, the arrangement of multi-scale training data depends on the choice of model. The face detection model in the embodiments of the present disclosure adopts RetinaFace, currently the best-performing detector, as the base model; its basic framework is shown in Fig. 3.
The base network structure is ResNet-50 (Residual Neural Network), whose block1 (p2), block2 (p3), block3 (p4), and block4 (p5) serve as feature layers, and a convolutional layer (c6/p6) with stride 2 is added on top of block4 and output as a further feature layer.
In the embodiments of the present disclosure, block1 to block4 are the 4 blocks included in the face detection model, corresponding to the 4 feature layers p2 to p5, where c2 to c5 denote convolutional layers, and c6/p6 is both a convolutional layer and a feature layer.
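To make this structure concrete, the following is a minimal sketch of such a backbone in PyTorch. It is an illustration only, assuming torchvision's ResNet-50 and the class and attribute names shown; the patent itself provides no code.

```python
import torch.nn as nn
import torchvision

class FaceDetectionBackbone(nn.Module):
    """Sketch of the backbone described above: ResNet-50 blocks 1-4
    yield the feature layers p2-p5, and an extra stride-2 convolution
    (c6/p6) is added on top of block4."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1,
                                  resnet.relu, resnet.maxpool)
        self.block1 = resnet.layer1   # -> p2 (160x160 for a 640x640 input)
        self.block2 = resnet.layer2   # -> p3 (80x80)
        self.block3 = resnet.layer3   # -> p4 (40x40)
        self.block4 = resnet.layer4   # -> p5 (20x20)
        self.c6 = nn.Conv2d(2048, 256, kernel_size=3,
                            stride=2, padding=1)  # c6/p6 (10x10)

    def forward(self, x):
        p2 = self.block1(self.stem(x))
        p3 = self.block2(p2)
        p4 = self.block3(p3)
        p5 = self.block4(p4)
        p6 = self.c6(p5)
        return p2, p3, p4, p5, p6
```

The feature-map sizes in the comments match the feature-layer scales listed in Table 2 below for a 640x640 input.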
Optionally, assuming that each feature layer corresponds to 3 anchors, the anchor sizes are selected according to the scale of each feature layer, as shown in Table 1 below.
(Table 1 is rendered as an image in the original and is not reproduced here; from the surrounding text, the P2 feature layer's anchors are 16, 20.16 and 25.40, the P3 feature layer's anchors run from 32 to 50.80, and the P4 feature layer's smallest anchor is 64.)
Table 1 Anchor settings table
The anchor settings in Table 1 are selected according to the scale of each feature layer; the larger the scale of the feature layer, the larger its anchors.
In the embodiments of the present disclosure, each feature layer corresponds to at least one anchor, and the numbers of anchors may be the same across feature layers, for example, 3 anchors per feature layer. In an alternative embodiment, the numbers of anchors may differ between feature layers, for example, 2 anchors for the P2 feature layer and 3 anchors for the P3 feature layer.
In the embodiments of the present disclosure, when a feature layer corresponds to only one anchor, say anchor1, the largest anchor and the smallest anchor of that feature layer are both anchor1.
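For illustration, a hypothetical generator for per-layer anchor sizes is sketched below. The 2^(1/3) geometric spacing is an assumption inferred from the anchor values quoted in the text (16, 20.16, 25.40 for P2; 32 to 50.80 for P3); the authoritative values are those of Table 1.

```python
def anchors_for_layer(base_size: float, num_anchors: int = 3) -> list:
    """Return num_anchors anchor sizes starting at base_size,
    spaced by an assumed factor of 2**(1/3)."""
    step = 2 ** (1.0 / 3.0)
    return [round(base_size * step ** i, 2) for i in range(num_anchors)]

for layer, base in [("P2", 16), ("P3", 32), ("P4", 64),
                    ("P5", 128), ("P6", 256)]:
    print(layer, anchors_for_layer(base))
# P2 [16, 20.16, 25.4]
# P3 [32, 40.32, 50.8]
# ...
```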
For the above network structure and anchor settings, the target training images for multi-scale face detection are determined according to the following steps.
First, the open-source face detection data set WIDER FACE is collected.
Then, the number of negative samples corresponding to each feature layer of the face detection model is analyzed. The number of negative samples is determined by the scale of the feature layer and is approximately equal to the feature layer's area; taking the P2 feature layer as an example, its scale is 160 x 160 = 25600, so the number of negative samples corresponding to the P2 feature layer is 25600.
In the embodiments of the present disclosure, the number of negative samples corresponding to each feature layer is approximately the feature layer's scale: the larger the scale (and hence the area) of a feature layer, the larger its number of negative samples, as shown in the following table:
Feature layer | Scale | Number of negative samples
P2 | 160 x 160 | 25600
P3 | 80 x 80 | 6400
P4 | 40 x 40 | 1600
P5 | 20 x 20 | 400
P6 | 10 x 10 | 100
Table 2 Feature layer scale table
Table 2 is an example, given in the embodiments of the present disclosure, of determining the number of negative samples corresponding to each feature layer according to its scale.
In the embodiments of the present disclosure, each anchor corresponds to one classifier. Taking the P2 feature layer as an example, the three anchors (16, 20.16, 25.40) on the P2 feature layer correspond to 3 classifiers, each with 25600 negative samples; the 3 anchors on the P3 feature layer each have 6400 negative samples; and so on: 1600 negative samples per classifier on the P4 feature layer, 400 on the P5 feature layer, and 100 on the P6 feature layer. The ratio of the numbers of negative samples between the feature layers is therefore 25600:6400:1600:400:100, that is, 256:64:16:4:1.
In the embodiments of the present disclosure, negative samples are non-face regions in an image. Taking the three anchors (16, 20.16, 25.40) on the P2 feature layer as an example, each of the 3 corresponding classifiers has 25600 negative samples: for the classifier of anchor = 16, these are 25600 non-face regions of scale about 16 × 16; for the classifier of anchor = 20.16, they are 25600 non-face regions of scale about 20.16 × 20.16. Multiple negative samples may exist in the same target training image; the number of negative samples refers to the number of non-face regions, not the number of training images.
In the embodiments of the present disclosure, to ensure that the classifier corresponding to each anchor can be well trained, the positive and negative samples of each classifier must be kept relatively balanced, so the ratio of the numbers of positive samples across the feature layers should also be 256:64:16:4:1.
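A short sketch of this bookkeeping, assuming the feature-map sizes of Table 2 (the negative-sample count per layer is taken to be the feature-map area):

```python
feature_map_sides = {"P2": 160, "P3": 80, "P4": 40, "P5": 20, "P6": 10}

# Negative samples per feature layer = feature-map area
negatives = {layer: s * s for layer, s in feature_map_sides.items()}
# {'P2': 25600, 'P3': 6400, 'P4': 1600, 'P5': 400, 'P6': 100}

# Ratio relative to the topmost layer (P6): 256:64:16:4:1
ratios = {layer: n // negatives["P6"] for layer, n in negatives.items()}
# {'P2': 256, 'P3': 64, 'P4': 16, 'P5': 4, 'P6': 1}
```

The same 256:64:16:4:1 ratio is then imposed on the positive samples, and hence on the target training images, per layer.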
In an optional implementation, the positive-sample pixel range detected by each feature layer is determined from the anchor points corresponding to that feature layer as follows.
In Fig. 3, the lowermost feature layer is P2, the uppermost feature layer is P6, and the middle feature layers are P3, P4 and P5; the feature layer P6 and the convolutional layer C6 are the same layer.
In an alternative embodiment, the upper limit of the positive-sample pixel range detected by the lowermost feature layer is determined from the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it; the minimum anchor point corresponding to the lowermost feature layer is taken as the lower limit of its positive-sample pixel range.
The P2 feature layer is the lowermost feature layer, and its 3 anchors are 16, 20.16, and 25.40, so the lower limit of the positive-sample pixel range detected by the P2 feature layer is 16.
The upper limit of the positive-sample pixel range detected by the P2 feature layer is the average of the maximum anchor of the P2 feature layer (25.40) and the minimum anchor of the P3 feature layer (32):
(25.40+32)/2=28.7.
In an alternative embodiment, the upper limit of the positive-sample pixel range detected by a middle feature layer is determined from the maximum anchor point corresponding to that feature layer and the minimum anchor point corresponding to the feature layer above it; the upper limit of the positive-sample pixel range detected by the feature layer below the middle feature layer is taken as the lower limit of the middle feature layer's positive-sample pixel range.
For example, the upper limit of the positive-sample pixel range detected by the P3 feature layer is the average of the maximum anchor of the P3 feature layer (50.80) and the minimum anchor of the P4 feature layer (64):
(50.80+64)/2=57.4.
In an alternative embodiment, if the average of the largest anchor corresponding to a feature layer and the smallest anchor corresponding to the feature layer above it is not an integer, the average may be rounded; rounding includes, but is not limited to, some or all of the following:
rounding up, rounding down, and the like.
For example, an average of 57.4 rounds down to 57 and up to 58; an average of 28.7 rounds down to 28 and up to 29.
Taking rounding down as the example, the upper limit of the positive-sample pixel range detected by the P2 feature layer is 28, by the P3 feature layer 57, by the P4 feature layer 114, and by the P5 feature layer 227.
Accordingly, 28 is the lower limit of the positive-sample pixel range detected by the P3 feature layer (equal to P2's upper limit), 57 is the lower limit for the P4 feature layer (equal to P3's upper limit), and 114 is the lower limit for the P5 feature layer (equal to P4's upper limit).
In an alternative embodiment, the upper limit of the positive-sample pixel range detected by the uppermost feature layer is determined from the size of the training image, and the upper limit of the positive-sample pixel range detected by the feature layer below it is taken as its lower limit.
The P6 feature layer is the uppermost feature layer, so the upper limit of its positive-sample pixel range is determined by the size of the training image; assuming a 640x640 training image, the upper limit is generally 640 pixels. The P5 feature layer is the feature layer below the P6 feature layer, so the lower limit of the positive-sample pixel range detected by the P6 feature layer is the upper limit of P5's range, 227.
In summary, for a 640 × 640 image, the positive-sample pixel range detected by the P2 feature layer is 16-28, by the P3 feature layer 28-57, by the P4 feature layer 57-114, by the P5 feature layer 114-227, and by the P6 feature layer 227-640.
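The following sketch reproduces this computation. The P2 and P3 anchors are those quoted above; the P4-P6 anchor lists are assumptions extrapolated from the same spacing, and with them the P5 upper limit comes out as 229 rather than the 227 quoted in the text, so the actual Table 1 values likely differ slightly.

```python
import math

def positive_ranges(anchors: dict, image_size: int) -> dict:
    """Per-layer positive-sample pixel ranges, following the rules above:
    a layer's upper limit is the rounded-down average of its largest
    anchor and the smallest anchor of the layer above; its lower limit is
    the upper limit of the layer below (for the lowest layer, its own
    smallest anchor); the topmost upper limit is the image size."""
    layers = list(anchors)            # ordered bottom (P2) to top (P6)
    ranges, lower = {}, min(anchors[layers[0]])
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            upper = math.floor((max(anchors[layer])
                                + min(anchors[layers[i + 1]])) / 2)
        else:
            upper = image_size
        ranges[layer] = (lower, upper)
        lower = upper
    return ranges

anchors = {"P2": [16, 20.16, 25.40], "P3": [32, 40.32, 50.80],
           "P4": [64, 80.63, 101.59],          # P4-P6 values assumed
           "P5": [128, 161.27, 203.19], "P6": [256, 322.54, 406.37]}
print(positive_ranges(anchors, image_size=640))
# {'P2': (16, 28), 'P3': (28, 57), 'P4': (57, 114),
#  'P5': (114, 229), 'P6': (229, 640)}
```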
In the embodiments of the present disclosure, the ratio of the numbers of target training images between the feature layers must equal the preset ratio of the numbers of negative samples between them. Taking the 5 feature layers P2-P6 above as the example, the training images in the 5 positive-sample pixel ranges must be kept in the ratio 256:64:16:4:1, which is achieved by the following process.
The number of negative samples corresponding to each feature layer is in direct proportion to that feature layer's scale. First, the target training images corresponding to the uppermost feature layer are determined from the face detection data set, specifically:
in an alternative embodiment, training images with face dimensions in the range of positive sample pixels detected by the feature layer located at the uppermost position in the face detection data set are determined, and part or all of the determined training images are selected as target training images corresponding to the feature layer located at the uppermost position.
In the embodiment of the present disclosure, if a certain training image includes at least one face whose face scale is within the range of the positive sample pixels detected by the feature layer located at the uppermost position, it may be determined that the training image may be used as the target training image.
Taking a P6 feature layer as an example, the range of the detected positive sample pixels is 227-640, and if a certain training image contains 3 faces, the assumed scales are: 228x 250, 238x 240 and 116x240, wherein two face scales are in the range of 227-640, the training image can be determined to be a target training image; assume that another training image contains 2 faces, and the scales are: 116x240, 116x 140, and thus the training image may not be the target training image.
Taking the P6 feature layer as an example, the range of positive sample pixels detected by the P6 feature layer is: 227-640, for example, the face scale included in the training image shown in fig. 4A is within a range of 227-640, it is assumed that statistics indicates that the face scale included in 100 training images in the widerface is within a range of 227-640, and it is assumed that 50 training images are selected from the 100 training images as target training images corresponding to the P6 feature layer, and the number of the target training images is a, that is, denoted as a-50, based on this.
Optionally, the 100 training images may be all used as the target training image corresponding to the P6 feature layer, where a is 100.
In an optional embodiment, when determining the target training images corresponding to any feature layer other than the uppermost one (the P6 feature layer): if the image-number ratio corresponding to that feature layer is larger than the negative-sample number ratio between the uppermost feature layer and that feature layer, training images in the face detection data set are copied and the copies are scaled, and the target training images corresponding to that feature layer are then determined from the scaled training images.
Here, the image-number ratio corresponding to a feature layer is the ratio of the number of target training images corresponding to the uppermost feature layer to the number of candidate training images corresponding to that feature layer, where the candidate training images are the training images in the face detection data set whose face scales fall within the positive-sample pixel range detected by that feature layer.
Taking Table 2 as an example, the number of negative samples corresponding to the P6 feature layer is 100 and to the P5 feature layer 400, so P6:P5 = 1/4.
Assuming A = 50: for the P5 feature layer, the number of training images with face scales within its positive-sample pixel range of 114-227 is counted; for example, the face scales in the training image shown in Fig. 4B fall within 114-227. Suppose WIDER FACE contains 50 training images with face scales within the P5 feature layer's positive-sample pixel range, i.e., the number of candidate training images corresponding to the P5 feature layer is 50, while the number of target training images corresponding to the P6 feature layer is 50. Then 50:50 = 1 > 1/4, that is, 50 is less than 4A = 200. In this case, training images can be selected from WIDER FACE and copied, the copies scaled, and the target training images corresponding to the P5 feature layer determined from the scaled images.
Similarly, for the P4 feature layer, the number of training images with face scales within its positive-sample pixel range of 57-114 is counted; for example, the face scales in the training image shown in Fig. 4C fall within 57-114. Suppose WIDER FACE contains 50 such training images, so the number of candidate training images corresponding to the P4 feature layer is 50, while the number of target training images corresponding to the P6 feature layer is 50. Then 50:50 = 1 > 1/16, that is, 50 is less than 16A = 800. In this case, training images can be selected from WIDER FACE and copied, the copies scaled, and the target training images corresponding to the P4 feature layer determined from the scaled images.
In an alternative embodiment, when selecting training images from WIDER FACE to copy, a training image containing a face scale not less than the upper limit of the positive-sample pixel range detected by the feature layer may be copied at least once.
In the embodiments of the present disclosure, a copied training image must contain at least one face whose scale is not less than the upper limit of the positive-sample pixel range detected by the feature layer.
For example, for the P5 feature layer, training images containing faces of not less than 227 pixels need to be copied. Specifically, all 100 such training images may be copied, or only the 50 selected as target training images may be copied, at least once; the number of copies can be determined by the number of target training images required by the P5 feature layer (i.e., 4A).
For example, if all 100 training images are copied, 4A = 200, and there are 50 candidate training images corresponding to the P5 feature layer, each image is copied at least twice; if only the 50 selected target training images are copied, at least three copies each may be needed.
For the P4 feature layer, training images containing faces of not less than 114 pixels need to be copied. Because the images containing faces of not less than 227 pixels were already copied at least once when determining the P5 feature layer's target training images, suppose there are now 300 training images containing faces of not less than 114 pixels; all 300 may be copied, or a subset may be copied at least once, with the number of copies determined by the number of target training images required by the P4 feature layer (i.e., 16A).
For example, if the 300 training images are copied, 16A = 800, and there are 50 candidate training images corresponding to the P4 feature layer, at least three copies each are required.
After copying, the copied training images are scaled, specifically: the copied training images are scaled in equal proportion according to the positive-sample pixel range detected by the feature layer.
Alternatively, a suitable scaling ratio is chosen according to the face scales contained in the copied training image and the positive-sample pixel range detected by the P5 feature layer; the scaling can be realized with a Resize function.
Fig. 5A shows the result of shrinking the training image of Fig. 4A by 1/2, Fig. 5B the result of shrinking the training image of Fig. 4B by 1/2, and Fig. 5C the result of shrinking the training image of Fig. 4C by 1/2.
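A hypothetical helper for this copy-and-scale step is sketched below; cv2.resize is the standard OpenCV resize call, while the choice of targeting the midpoint of the layer's range is an assumption, not something the text prescribes.

```python
import cv2  # OpenCV

def scale_into_range(image, face_scale: float, lo: float, hi: float):
    """Uniformly scale `image` so that a face of `face_scale` pixels
    lands inside [lo, hi]; here we aim at the middle of the range."""
    factor = ((lo + hi) / 2.0) / face_scale
    h, w = image.shape[:2]
    resized = cv2.resize(image, (int(w * factor), int(h * factor)),
                         interpolation=cv2.INTER_LINEAR)
    return resized, factor
```

For example, a copied image whose largest face is 250 pixels, scaled for the P5 range 114-227, would be shrunk by a factor of about 0.68, putting that face near 170 pixels.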
In an alternative embodiment, there are many ways to determine the target training images corresponding to each feature layer; two are listed below.
First way: only training images whose face scales fall within the positive-sample pixel range detected by the feature layer are selected from the scaled training images, and the selected images are taken as the target training images corresponding to that feature layer.
Taking the P3 feature layer as an example, with positive-sample pixel range 32-57: suppose the scaled training images include Figs. 5A, 5B and 5C, that the face scales in Fig. 5C fall within 32-57, and that those in Figs. 5A and 5B do not. Then Figs. 5A and 5B fail the condition and are not selected, while Fig. 5C may or may not be selected, depending on the actual situation.
Second way: the target training images corresponding to the feature layer are selected from both the candidate training images corresponding to that feature layer and the scaled training images.
Taking the P2 feature layer as an example, suppose there are 10,000 candidate training images corresponding to the P2 feature layer and at least 12,800 target training images are required. Then 2,800 training images with face scales in the range 16-28 can be selected from the scaled training images, and these 2,800 together with the 10,000 candidate images are taken as the target training images corresponding to the P2 feature layer. Alternatively, some training images are selected from the candidates and some from the scaled images, and the two selections together are taken as the P2 feature layer's target training images, provided their number is not less than 12,800.
By analogy, the target training images corresponding to the other feature layers, such as P5 and P4, can be determined in the same manner.
It should be noted that the ways of determining a feature layer's target training images from the scaled training images described in the embodiments of the present disclosure are only examples; any way of determining the target training images from the scaled training images is applicable to the embodiments of the present disclosure.
An alternative embodiment is: for WIDER FACE, count the number of training images with face scales within 227-640 and take it as the reference, denoted A; count the number of training images with face scales within 114-227, and if it is smaller than 4A, scale training images containing face scales above 227 until the number of training images with face scales within the 114-227 pixel range is approximately 4A; then count the number of training images with face scales within the 57-114 pixel range, and if it is smaller than 16A, scale training images containing face scales above 114 pixels until the number reaches 16A; and so on, completing the enhancement of the training images in all positive-sample pixel ranges.
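A sketch of this enhancement loop is given below, under assumed helpers and data representation (`images` is a list of (image, face_scales) pairs, and `scale_into_range` is the hypothetical helper above). The per-layer ranges and 4A/16A/64A/256A targets follow the text and Fig. 6; note the text gives both 28 and 32 as P3's lower bound, and the value 32 from the Fig. 6 flow is used here.

```python
import random

def count_in_range(images, lo, hi):
    """Number of images containing at least one face scale in [lo, hi)."""
    return sum(1 for _, scales in images
               if any(lo <= s < hi for s in scales))

def enhance(images):
    ranges = [("P5", 114, 227, 4), ("P4", 57, 114, 16),
              ("P3", 32, 57, 64), ("P2", 16, 28, 256)]
    A = count_in_range(images, 227, 640)   # reference count (P6)
    for _, lo, hi, mult in ranges:
        while count_in_range(images, lo, hi) < mult * A:
            # copy an image containing a face of at least `hi` pixels ...
            donor, scales = random.choice(
                [(im, sc) for im, sc in images if max(sc) >= hi])
            # ... and shrink it so the copied faces land in [lo, hi)
            scaled, f = scale_into_range(donor, max(scales), lo, hi)
            images.append((scaled, [s * f for s in scales]))
    return images
```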
In the embodiment of the present disclosure, the face detection model is trained with the target training images corresponding to each feature layer. Because the embodiment enhances the face data so that the numbers of target training images corresponding to the feature layers follow the ratio of the numbers of negative samples of the different feature layers, large-scale and small-scale face data are balanced, and positive and negative samples are balanced at the same time. Since the number of negative samples differs between the classifiers of different feature layers, enhancing the sample data in that same ratio allows each classifier to be trained well, so the trained model can detect faces across all scale ranges.
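As a cross-check on these ratios, the 1:4:16:64:256 proportion is exactly what relative feature-map areas predict: each finer pyramid level halves the stride and therefore roughly quadruples the number of anchors, and hence of negative samples. A short sketch, assuming a 640-pixel input and strides of 64, 32, 16, 8, and 4 for P6-P2 (the stride values are an assumption made for illustration, not taken from this disclosure):

    strides = {"P6": 64, "P5": 32, "P4": 16, "P3": 8, "P2": 4}
    areas = {k: (640 // s) ** 2 for k, s in strides.items()}  # feature-map cells
    base = areas["P6"]
    print({k: a // base for k, a in areas.items()})
    # {'P6': 1, 'P5': 4, 'P4': 16, 'P3': 64, 'P2': 256}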
Fig. 6 is a flowchart illustrating a complete method for training a face detection model according to an exemplary embodiment; the method includes the following steps (a table-driven sketch of steps 602-615 follows the list):
step 600, collecting the open-source face detection data set WIDER FACE;
step 601, analyzing the number of negative samples corresponding to the different feature layers of the face detection model;
step 602, determining the positive sample pixel range detected by each feature layer according to the anchor size corresponding to that feature layer;
step 603, counting the number of training images whose face scale falls within 227-640, taking these images as the target training images corresponding to the P6 feature layer, and recording their number as the reference A;
step 604, counting the number of training images whose face scale falls within 114-227 and judging whether that number is smaller than 4A; if so, executing step 605, otherwise executing step 606;
step 605, scaling training images whose face scale is above 227 so that the scaled face scale falls within the 114-227 pixel range, until the number of training images with face scales in the 114-227 pixel range is approximately equal to 4A;
step 606, selecting 4A training images whose face scale falls within 114-227 as the target training images corresponding to the P5 feature layer;
step 607, counting the number of training images whose face scale falls within 57-114 and judging whether that number is smaller than 16A; if so, executing step 608, otherwise executing step 609;
step 608, scaling training images whose face scale is above 114 so that the scaled face scale falls within the 57-114 pixel range, until the number of training images with face scales in the 57-114 pixel range is approximately equal to 16A;
step 609, selecting 16A training images whose face scale falls within 57-114 as the target training images corresponding to the P4 feature layer;
step 610, counting the number of training images whose face scale falls within 32-57 and judging whether that number is smaller than 64A; if so, executing step 611, otherwise executing step 612;
step 611, scaling training images whose face scale is above 57 so that the scaled face scale falls within the 32-57 pixel range, until the number of training images with face scales in the 32-57 pixel range is approximately equal to 64A;
step 612, selecting 64A training images whose face scale falls within 32-57 as the target training images corresponding to the P3 feature layer;
step 613, counting the number of training images whose face scale falls within 16-32 and judging whether that number is smaller than 256A; if so, executing step 614, otherwise executing step 615;
step 614, scaling training images whose face scale is above 32 so that the scaled face scale falls within the 16-32 pixel range, until the number of training images with face scales in the 16-32 pixel range is approximately equal to 256A;
step 615, selecting 256A training images whose face scale falls within 16-32 as the target training images corresponding to the P2 feature layer.
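Read as code, steps 602-615 reduce to one table-driven loop over the feature layers. The sketch below models each training image by its face scale alone and models "scaling" as inserting a shrunken copy whose face scale lands in the target range; the layer table and the demo numbers are illustrative assumptions.

    LAYERS = [  # (layer, positive sample pixel range, multiple of reference A)
        ("P6", (227, 640), 1),
        ("P5", (114, 227), 4),
        ("P4", (57, 114), 16),
        ("P3", (32, 57), 64),
        ("P2", (16, 32), 256),
    ]

    def build_targets(face_scales):
        targets = {}
        a = sum(227 <= s < 640 for s in face_scales)   # step 603: reference A
        for name, (lo, hi), mult in LAYERS:
            quota = a * mult
            have = [s for s in face_scales if lo <= s < hi]
            while len(have) < quota:                   # steps 604/607/610/613
                have.append((lo + hi) / 2)             # scaled-down copy (605/608/...)
            targets[name] = have[:quota]               # steps 606/609/612/615
        return targets

    demo = build_targets([300] * 10 + [150] * 25 + [80] * 100)
    print({k: len(v) for k, v in demo.items()})
    # {'P6': 10, 'P5': 40, 'P4': 160, 'P3': 640, 'P2': 2560}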
FIG. 7 is a block diagram illustrating an apparatus for training a face detection model according to an exemplary embodiment. Referring to fig. 7, the apparatus includes a first determining unit 700, a second determining unit 701, and a training unit 702.
The first determining unit 700 is configured to perform determining a positive sample pixel range detected by each feature layer according to an anchor point corresponding to each feature layer in a face detection model, where the face detection model is a neural network model including at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor point is used for representing a range of a face scale detected by a feature layer;
the second determining unit 701 is configured to perform determining, from the training images of the face detection data set, a target training image corresponding to each feature layer according to the positive sample pixel range detected by the feature layer, so that a ratio of the number of target training images between the feature layers is the same as a preset ratio of the number of negative samples between the feature layers, where the target training images corresponding to the feature layers include a face scale within the positive sample pixel range detected by the feature layer;
the training unit 702 is configured to perform training on the face detection model by using the target training images corresponding to the feature layers.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer; and taking the upper limit of the positive sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive sample pixel range detected by the middle feature layer.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by the topmost feature layer according to the size of the training image, and taking the upper limit of the positive sample pixel range detected by the feature layer below the topmost feature layer as the lower limit of the positive sample pixel range detected by the topmost feature layer.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by the lowest feature layer according to the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer; and taking the minimum anchor point corresponding to the lowest feature layer as the lower limit of the positive sample pixel range detected by the lowest feature layer.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
taking the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer as the upper limit of the positive sample pixel range detected by the middle feature layer.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
taking the average of the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer as the upper limit of the positive sample pixel range detected by the lowest feature layer (a sketch of these range-determination rules follows below).
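A sketch of the range-determination rules above. It assumes each feature layer owns a list of anchor sizes, ordered from the lowest layer to the topmost, and that the training image side length caps the topmost range; the anchor values in the example are hypothetical, chosen only so that the output reproduces the 16-32 / 32-57 / 57-114 / 114-227 / 227-640 ranges used in this disclosure.

    def positive_ranges(layer_anchors, image_size):
        # layer_anchors: anchor sizes per feature layer, lowest layer first.
        n = len(layer_anchors)
        uppers = []
        for i in range(n):
            if i == n - 1:
                uppers.append(image_size)  # topmost layer: capped by image size
            else:
                # average of this layer's max anchor and the min anchor of
                # the feature layer above it
                uppers.append((max(layer_anchors[i]) + min(layer_anchors[i + 1])) // 2)
        ranges = []
        for i in range(n):
            # lowest layer: its own min anchor; otherwise the upper limit
            # of the feature layer below
            lower = min(layer_anchors[0]) if i == 0 else uppers[i - 1]
            ranges.append((lower, uppers[i]))
        return ranges

    # Hypothetical anchors for P2..P6:
    anchors = [[16, 24], [40, 48], [66, 100], [128, 198], [256, 384]]
    print(positive_ranges(anchors, 640))
    # [(16, 32), (32, 57), (57, 114), (114, 227), (227, 640)]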
In an alternative implementation, the number of negative samples detected by each feature layer is proportional to the scale of each feature layer;
the second determining unit 701 is specifically configured to perform:
determining a target training image corresponding to the topmost feature layer from the face detection dataset;
if the image quantity proportion corresponding to other feature layers is larger than the quantity proportion of the negative samples between the uppermost feature layer and the other feature layers, copying the training images in the face detection data set, and scaling the copied training images;
determining target training images corresponding to the other feature layers from the scaled training images;
the image quantity proportion corresponding to the other feature layers is the ratio of the number of target training images corresponding to the topmost feature layer to the number of candidate training images corresponding to the other feature layers, and the candidate training images are training images from the face detection data set whose face scale falls within the positive sample pixel range detected by the other feature layers.
In an optional implementation, the second determining unit 701 is specifically configured to perform:
copying, at least once, the training images whose face scale is not less than the upper limit of the positive sample pixel range detected by the feature layer;
and scaling the copied training images in equal proportion according to the positive sample pixel range detected by the feature layer.
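A sketch of this copy-and-shrink step using Pillow. The sample layout (an image path plus the pixel scale of its face) and the choice to aim at the middle of the target range are assumptions made for illustration.

    from PIL import Image

    def shrink_copy(path, face_scale, lo, hi):
        # Only images whose face scale is at or above the range's upper
        # limit are copied and shrunk, matching the rule above.
        if face_scale < hi:
            raise ValueError("face scale must not be less than the upper limit")
        factor = (lo + hi) / 2 / face_scale  # land near the middle of [lo, hi)
        img = Image.open(path)
        size = (max(1, int(img.width * factor)), max(1, int(img.height * factor)))
        return img.resize(size, Image.BILINEAR)  # equal-proportion scaling

    # e.g. shrink a copy of an image with a 150-pixel face into the P4
    # range 57-114:
    # small = shrink_copy("sample.jpg", 150, 57, 114)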
In an optional implementation, the second determining unit 701 is further configured to perform:
and selecting a target training image corresponding to the characteristic layer from the candidate training images corresponding to the characteristic layer.
In an optional implementation, the second determining unit 701 is specifically configured to perform:
selecting, from the scaled training images, the training images whose face scale falls within the positive sample pixel range detected by the feature layer, and taking the selected training images as the target training images corresponding to the feature layer.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs operations has been described in detail in the embodiments of the method and will not be elaborated here.
FIG. 8 is a block diagram illustrating an electronic device 800 for training a face detection model, according to an exemplary embodiment. The electronic device includes:
a processor 810;
a memory 820 for storing instructions executable by the processor 810;
wherein the processor 810 is configured to execute the instructions to implement the training method of the face detection model in the embodiment of the present disclosure.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 820 comprising instructions, executable by the processor 810 of the electronic device 800 to perform the above-described method is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The embodiment of the present disclosure further provides a computer program product which, when run on an electronic device, causes the electronic device to perform any one of the above training methods of the face detection model.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

1. A training method of a face detection model is characterized by comprising the following steps:
determining the range of positive sample pixels detected by each characteristic layer according to anchor points corresponding to each characteristic layer in a face detection model, wherein the face detection model is a neural network model comprising at least two characteristic layers, each characteristic layer corresponds to at least one anchor point, and the anchor points are used for representing the range of face scales detected by the characteristic layers;
determining target training images corresponding to the feature layers from training images of a face detection data set according to the pixel range of the positive samples detected by the feature layers, so that the proportion of the number of the target training images among the feature layers is the same as the proportion of the number of the negative samples among the preset feature layers, wherein the face scale contained in the target training images corresponding to the feature layers is in the pixel range of the positive samples detected by the feature layers;
and training the face detection model by adopting the target training images corresponding to the characteristic layers.
2. The training method of the face detection model according to claim 1, wherein the step of determining the range of the positive sample pixels detected by each feature layer according to the anchor points corresponding to each feature layer in the face detection model comprises:
determining the upper limit of the positive sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer; and taking the upper limit of the positive sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive sample pixel range detected by the middle feature layer.
3. The training method of the face detection model according to claim 1, wherein the step of determining the range of the positive sample pixels detected by each feature layer according to the anchor points corresponding to each feature layer in the face detection model comprises:
determining the upper limit of the positive sample pixel range detected by the topmost feature layer according to the size of the training image, and taking the upper limit of the positive sample pixel range detected by the feature layer below the topmost feature layer as the lower limit of the positive sample pixel range detected by the topmost feature layer.
4. The training method of the face detection model according to claim 1, wherein the step of determining the range of the positive sample pixels detected by each feature layer according to the anchor points corresponding to each feature layer in the face detection model comprises:
determining the upper limit of the positive sample pixel range detected by the lowest feature layer according to the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer; and taking the minimum anchor point corresponding to the lowest feature layer as the lower limit of the positive sample pixel range detected by the lowest feature layer.
5. The training method of the face detection model according to claim 2, wherein the step of determining the upper limit of the positive sample pixel range detected by the middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer comprises:
taking the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer as the upper limit of the positive sample pixel range detected by the middle feature layer.
6. The training method of the face detection model according to claim 4, wherein the step of determining the upper limit of the positive sample pixel range detected by the lowest feature layer according to the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer comprises:
taking the average of the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer as the upper limit of the positive sample pixel range detected by the lowest feature layer.
7. The training method of the face detection model according to claim 1, wherein the number of negative samples detected by each feature layer is proportional to the scale of each feature layer;
the step of determining the target training image corresponding to each feature layer from the face detection data set according to the positive sample pixel range detected by each feature layer comprises:
determining a target training image corresponding to the topmost feature layer from the face detection dataset;
if the image quantity proportion corresponding to other feature layers is larger than the quantity proportion of the negative samples between the uppermost feature layer and the other feature layers, copying the training images in the face detection data set, and scaling the copied training images;
determining target training images corresponding to the other feature layers from the scaled training images;
the image quantity proportion corresponding to the other feature layers is the ratio of the number of target training images corresponding to the topmost feature layer to the number of candidate training images corresponding to the other feature layers, and the candidate training images are training images from the face detection data set whose face scale falls within the positive sample pixel range detected by the other feature layers.
8. The method for training a face detection model according to claim 7, wherein the step of copying the training images in the face detection data set and scaling the copied training images comprises:
copying, at least once, the training images whose face scale is not less than the upper limit of the positive sample pixel range detected by the feature layer;
and scaling the copied training images in equal proportion according to the positive sample pixel range detected by the feature layer.
9. The method for training a face detection model according to claim 7, wherein the step of determining the target training image corresponding to the feature layer from the scaled training images further comprises:
and selecting a target training image corresponding to the characteristic layer from the candidate training images corresponding to the characteristic layer.
10. The method for training the face detection model according to claim 7, wherein the step of determining the target training image corresponding to the feature layer from the scaled training images comprises:
selecting, from the scaled training images, the training images whose face scale falls within the positive sample pixel range detected by the feature layer, and taking the selected training images as the target training images corresponding to the feature layer.
11. A training device for a face detection model is characterized by comprising:
the first determination unit is configured to determine a positive sample pixel range detected by each feature layer according to an anchor point corresponding to each feature layer in a face detection model, wherein the face detection model is a neural network model comprising at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor point is used for representing a range of a face scale detected by the feature layer;
a second determining unit, configured to perform determining, from training images of a face detection data set, a target training image corresponding to each feature layer according to the positive sample pixel range detected by the feature layer, so that a ratio of the number of target training images between the feature layers is the same as a preset ratio of the number of negative samples between the feature layers, where the target training images corresponding to the feature layers include a face scale within the positive sample pixel range detected by the feature layer;
and the training unit is configured to execute training of the face detection model by using the target training images corresponding to the feature layers.
12. The apparatus for training a face detection model according to claim 11, wherein the first determining unit is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer; and taking the upper limit of the positive sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive sample pixel range detected by the middle feature layer.
13. The apparatus for training a face detection model according to claim 11, wherein the first determining unit is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by the topmost feature layer according to the size of the training image, and taking the upper limit of the positive sample pixel range detected by the feature layer below the topmost feature layer as the lower limit of the positive sample pixel range detected by the topmost feature layer.
14. The apparatus for training a face detection model according to claim 11, wherein the first determining unit is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by the lowest feature layer according to the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer; and taking the minimum anchor point corresponding to the lowest feature layer as the lower limit of the positive sample pixel range detected by the lowest feature layer.
15. The apparatus for training a face detection model according to claim 12, wherein the first determining unit is specifically configured to perform:
taking the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer as the upper limit of the positive sample pixel range detected by the middle feature layer.
16. The apparatus for training a face detection model according to claim 14, wherein the first determining unit is specifically configured to perform:
taking the average of the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer as the upper limit of the positive sample pixel range detected by the lowest feature layer.
17. The training device of the face detection model according to claim 11, wherein the number of negative samples detected by each feature layer is proportional to the scale of each feature layer;
the second determination unit is specifically configured to perform:
determining a target training image corresponding to the topmost feature layer from the face detection dataset;
if the image quantity proportion corresponding to other feature layers is larger than the quantity proportion of the negative samples between the uppermost feature layer and the other feature layers, copying the training images in the face detection data set, and scaling the copied training images;
determining target training images corresponding to the other feature layers from the scaled training images;
the image quantity proportion corresponding to the other feature layers is the ratio of the number of target training images corresponding to the topmost feature layer to the number of candidate training images corresponding to the other feature layers, and the candidate training images are training images from the face detection data set whose face scale falls within the positive sample pixel range detected by the other feature layers.
18. The apparatus for training a face detection model according to claim 17, wherein the second determining unit is specifically configured to perform:
copying, at least once, the training images whose face scale is not less than the upper limit of the positive sample pixel range detected by the feature layer;
and scaling the copied training images in equal proportion according to the positive sample pixel range detected by the feature layer.
19. The apparatus for training a face detection model according to claim 17, wherein the second determining unit is further configured to perform:
and selecting a target training image corresponding to the characteristic layer from the candidate training images corresponding to the characteristic layer.
20. The apparatus for training a face detection model according to claim 17, wherein the second determining unit is specifically configured to perform:
selecting, from the scaled training images, the training images whose face scale falls within the positive sample pixel range detected by the feature layer, and taking the selected training images as the target training images corresponding to the feature layer.
21. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the training method of the face detection model according to any one of claims 1 to 10.
22. A storage medium, wherein instructions in the storage medium, when executed by a processor of a training apparatus for a face detection model, enable the training apparatus for a face detection model to perform the training method for a face detection model according to any one of claims 1 to 10.
CN201910746118.1A 2019-08-13 2019-08-13 Training method and device of face detection model, electronic equipment and storage medium Active CN110490115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910746118.1A CN110490115B (en) 2019-08-13 2019-08-13 Training method and device of face detection model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110490115A CN110490115A (en) 2019-11-22
CN110490115B true CN110490115B (en) 2021-08-13

Family

ID=68549821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910746118.1A Active CN110490115B (en) 2019-08-13 2019-08-13 Training method and device of face detection model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110490115B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554045B (en) * 2020-04-23 2024-04-09 国家广播电视总局广播电视科学研究院 Data set manufacturing method, device, equipment and storage medium
CN111540203B (en) * 2020-04-30 2021-09-17 东华大学 Method for adjusting green light passing time based on fast-RCNN
CN111626193A (en) * 2020-05-26 2020-09-04 北京嘀嘀无限科技发展有限公司 Face recognition method, face recognition device and readable storage medium
CN112906789A (en) * 2021-02-19 2021-06-04 阳光保险集团股份有限公司 Method and device for selecting negative examples of area generation network and computer equipment
CN114926447B (en) * 2022-06-01 2023-08-29 北京百度网讯科技有限公司 Method for training a model, method and device for detecting a target

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407908A (en) * 2016-08-31 2017-02-15 广州市百果园网络科技有限公司 Training model generation method and human face detection method and device
CN107220618A (en) * 2017-05-25 2017-09-29 中国科学院自动化研究所 Method for detecting human face and device, computer-readable recording medium, equipment
CN107871134A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108694401A (en) * 2018-05-09 2018-10-23 北京旷视科技有限公司 Object detection method, apparatus and system
CN108985135A (en) * 2017-06-02 2018-12-11 腾讯科技(深圳)有限公司 A kind of human-face detector training method, device and electronic equipment
CN109376637A (en) * 2018-10-15 2019-02-22 齐鲁工业大学 Passenger number statistical system based on video monitoring image processing
CN109409210A (en) * 2018-09-11 2019-03-01 北京飞搜科技有限公司 A kind of method for detecting human face and system based on SSD frame
CN109409252A (en) * 2018-10-09 2019-03-01 杭州电子科技大学 A kind of traffic multi-target detection method based on modified SSD network
CN109711406A (en) * 2018-12-25 2019-05-03 中南大学 A kind of multidirectional image Method for text detection based on multiple dimensioned rotation anchor mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on pedestrian detection methods based on convolutional neural networks; Liu Jian; China Masters' Theses Full-text Database, Information Science and Technology; 2018-02-15; full text *
An improved Faster R-CNN algorithm for multi-scale object detection; Li Xiaoguang; Journal of Computer-Aided Design & Computer Graphics; 2019-07-31; Vol. 31, No. 7; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant