CN110490115B - Training method and device of face detection model, electronic equipment and storage medium - Google Patents

Training method and device of face detection model, electronic equipment and storage medium

Info

Publication number
CN110490115B
CN110490115B (application CN201910746118.1A)
Authority
CN
China
Prior art keywords
feature
layer
training
face detection
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910746118.1A
Other languages
Chinese (zh)
Other versions
CN110490115A (en)
Inventor
Yang Fan (杨帆)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910746118.1A priority Critical patent/CN110490115B/en
Publication of CN110490115A publication Critical patent/CN110490115A/en
Application granted granted Critical
Publication of CN110490115B publication Critical patent/CN110490115B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a training method and apparatus for a face detection model, an electronic device, and a storage medium, in the technical field of image processing, and aims to solve the problem that face detection models in the related art detect small-scale faces poorly. The method includes the following steps: determining the positive-sample pixel range detected by each feature layer according to the anchor points corresponding to each feature layer in the face detection model; determining, from the training images of a face detection data set and according to the ratio of the numbers of negative samples among the feature layers, the target training images corresponding to each feature layer, where a target training image for a feature layer is a training image whose face scales fall within the positive-sample pixel range detected by that feature layer; and training the face detection model with the target training images corresponding to each feature layer. Because the target training images corresponding to the feature layers are apportioned based on the negative-sample number ratio, the detection of small-scale faces is enhanced.

Description

Training method and device of face detection model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a training method and apparatus for a face detection model, an electronic device, and a storage medium.
Background
Face detection is the first step of all face-related tasks; face recognition, face attribute analysis, and face key point extraction can be performed only after a face has been detected. Its main task is to judge whether a face exists in a given image and, if one exists, to give the position and size of the face. Compared with general objects, faces vary much more in scale: for a 640x640 picture, the face size can range anywhere from 10 to 640 pixels.
In the field of face detection, detecting small-scale faces is a difficult problem: learning a small-scale face detector requires a large amount of small-scale face data, yet the detection performance on ordinary or large-scale faces must not be degraded. Face detection models in the related art can detect faces of different scales, but their detection of small-scale faces is poor.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a face detection model, an electronic device, and a storage medium, so as to at least solve the problem that face detection models in the related art detect small-scale faces poorly. The technical scheme of the disclosure is as follows:

According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a face detection model, including:
determining the positive-sample pixel range detected by each feature layer according to the anchor points corresponding to each feature layer in a face detection model, where the face detection model is a neural network model including at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor points represent the range of face scales detected by the feature layer;

determining target training images corresponding to each feature layer from the training images of a face detection data set according to the positive-sample pixel range detected by each feature layer, so that the ratio of the numbers of target training images among the feature layers is the same as a preset ratio of the numbers of negative samples among the feature layers, where the face scales contained in the target training images corresponding to a feature layer fall within the positive-sample pixel range detected by that feature layer;

and training the face detection model with the target training images corresponding to each feature layer.
In an optional implementation, the step of determining, according to the anchor points corresponding to each feature layer in the face detection model, the positive-sample pixel range detected by each feature layer includes:

determining the upper limit of the positive-sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to that middle feature layer and the minimum anchor point corresponding to the feature layer above it; and taking the upper limit of the positive-sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive-sample pixel range detected by the middle feature layer.

In an optional implementation, the step of determining, according to the anchor points corresponding to each feature layer in the face detection model, the positive-sample pixel range detected by each feature layer includes:

determining the upper limit of the positive-sample pixel range detected by the uppermost feature layer according to the size of the training image, and taking the upper limit of the positive-sample pixel range detected by the feature layer below the uppermost feature layer as the lower limit of the positive-sample pixel range detected by the uppermost feature layer.

In an optional implementation, the step of determining, according to the anchor points corresponding to each feature layer in the face detection model, the positive-sample pixel range detected by each feature layer includes:

determining the upper limit of the positive-sample pixel range detected by the lowermost feature layer according to the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it; and taking the minimum anchor point corresponding to the lowermost feature layer as the lower limit of the positive-sample pixel range detected by the lowermost feature layer.

In an optional embodiment, determining the upper limit of the positive-sample pixel range detected by the middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above it includes:

taking the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above it as the upper limit of the positive-sample pixel range detected by the middle feature layer.

In an optional implementation, determining the upper limit of the positive-sample pixel range detected by the lowermost feature layer according to the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it includes:

taking the average of the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it as the upper limit of the positive-sample pixel range detected by the lowermost feature layer.
In an alternative embodiment, the number of negative samples detected by each feature layer is proportional to the scale of that feature layer;

the step of determining the target training images corresponding to each feature layer from the face detection data set according to the positive-sample pixel range detected by each feature layer includes:

determining the target training images corresponding to the uppermost feature layer from the face detection data set;

for any other feature layer, if the image-number ratio corresponding to that feature layer is larger than the negative-sample number ratio between the uppermost feature layer and that feature layer, copying training images in the face detection data set and scaling the copied training images;

determining the target training images corresponding to that feature layer from the scaled training images;

where the image-number ratio corresponding to a feature layer is the ratio of the number of target training images corresponding to the uppermost feature layer to the number of candidate training images corresponding to that feature layer, and the candidate training images are the training images in the face detection data set whose face scales fall within the positive-sample pixel range detected by that feature layer.
In an optional implementation, copying the training images in the face detection data set and scaling the copied training images includes:

copying, at least once, a training image containing a face scale not less than the upper limit of the positive-sample pixel range detected by the feature layer;

and scaling the copied training images in equal proportion according to the positive-sample pixel range detected by the feature layer.
In an optional implementation, the step of determining the target training images corresponding to the feature layer from the scaled training images further includes:

selecting target training images corresponding to the feature layer from the candidate training images corresponding to the feature layer.

In an optional embodiment, the step of determining the target training images corresponding to the feature layer from the scaled training images includes:

selecting, from the scaled training images, training images whose face scales fall within the positive-sample pixel range detected by the feature layer, and taking the selected training images as the target training images corresponding to the feature layer.
According to a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for a face detection model, including:
a first determining unit configured to determine the positive-sample pixel range detected by each feature layer according to the anchor points corresponding to each feature layer in a face detection model, where the face detection model is a neural network model including at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor points represent the range of face scales detected by the feature layer;

a second determining unit configured to determine, from the training images of a face detection data set and according to the positive-sample pixel range detected by each feature layer, the target training images corresponding to each feature layer, so that the ratio of the numbers of target training images among the feature layers is the same as a preset ratio of the numbers of negative samples among the feature layers, where the face scales contained in the target training images corresponding to a feature layer fall within the positive-sample pixel range detected by that feature layer;

and a training unit configured to train the face detection model with the target training images corresponding to each feature layer.
In an optional implementation, the first determining unit is specifically configured to:

determine the upper limit of the positive-sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to that middle feature layer and the minimum anchor point corresponding to the feature layer above it; and take the upper limit of the positive-sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive-sample pixel range detected by the middle feature layer.

In an optional implementation, the first determining unit is specifically configured to:

determine the upper limit of the positive-sample pixel range detected by the uppermost feature layer according to the size of the training image, and take the upper limit of the positive-sample pixel range detected by the feature layer below the uppermost feature layer as the lower limit of the positive-sample pixel range detected by the uppermost feature layer.

In an optional implementation, the first determining unit is specifically configured to:

determine the upper limit of the positive-sample pixel range detected by the lowermost feature layer according to the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it; and take the minimum anchor point corresponding to the lowermost feature layer as the lower limit of the positive-sample pixel range detected by the lowermost feature layer.

In an optional implementation, the first determining unit is specifically configured to:

take the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above it as the upper limit of the positive-sample pixel range detected by the middle feature layer.

In an optional implementation, the first determining unit is specifically configured to:

take the average of the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it as the upper limit of the positive-sample pixel range detected by the lowermost feature layer.
In an alternative implementation, the number of negative samples detected by each feature layer is proportional to the scale of that feature layer;

the second determining unit is specifically configured to:

determine the target training images corresponding to the uppermost feature layer from the face detection data set;

for any other feature layer, if the image-number ratio corresponding to that feature layer is larger than the negative-sample number ratio between the uppermost feature layer and that feature layer, copy training images in the face detection data set and scale the copied training images;

determine the target training images corresponding to that feature layer from the scaled training images;

where the image-number ratio corresponding to a feature layer is the ratio of the number of target training images corresponding to the uppermost feature layer to the number of candidate training images corresponding to that feature layer, and the candidate training images are the training images in the face detection data set whose face scales fall within the positive-sample pixel range detected by that feature layer.

In an optional implementation, the second determining unit is specifically configured to:

copy, at least once, a training image containing a face scale not less than the upper limit of the positive-sample pixel range detected by the feature layer;

and scale the copied training images in equal proportion according to the positive-sample pixel range detected by the feature layer.

In an optional implementation, the second determining unit is further configured to:

select target training images corresponding to the feature layer from the candidate training images corresponding to the feature layer.

In an optional implementation, the second determining unit is specifically configured to:

select, from the scaled training images, training images whose face scales fall within the positive-sample pixel range detected by the feature layer, and take the selected training images as the target training images corresponding to the feature layer.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the training method of the face detection model according to any one of the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-volatile readable storage medium, where instructions of the storage medium, when executed by a processor of a training apparatus for a face detection model, enable the training apparatus for the face detection model to perform the training method for the face detection model according to any one of the first aspect of the embodiments of the present disclosure.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product which, when run on an electronic device, causes the electronic device to perform the method of the first aspect of the embodiments of the present disclosure and any of the possible implementations of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
training of the face detection model is apportioned according to the ratio of the numbers of negative samples of the different feature layers; the positive-sample pixel range detected by each feature layer is determined according to the anchor points corresponding to that feature layer, and the different positive-sample pixel ranges can be used to detect faces of different scales. Because the ratio of the numbers of target training images among the feature layers is the same as the preset ratio of the numbers of negative samples among the feature layers, face data of different scales are balanced, each feature layer can be well trained, a more robust multi-scale face detector can be trained to meet the requirements of multi-scale face detection, and the detection of small-scale faces is enhanced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating face detection according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of training a face detection model, according to an example embodiment.
FIG. 3 is a network architecture diagram illustrating a face detection model according to an exemplary embodiment.
FIG. 4A is a schematic diagram illustrating a first type of training image, according to an example embodiment.
FIG. 4B is a schematic diagram illustrating a second type of training image, according to an example embodiment.
FIG. 4C is a schematic diagram illustrating a third type of training image, according to an example embodiment.
FIG. 5A is a schematic diagram illustrating a first type of scaled training image, according to an example embodiment.
FIG. 5B is a schematic diagram illustrating a second type of scaled training image, according to an example embodiment.
FIG. 5C is a schematic diagram illustrating a third scaled training image in accordance with an exemplary embodiment.
FIG. 6 is a flow diagram illustrating a complete method of training of a face detection model, according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating an apparatus for training a face detection model according to an exemplary embodiment.
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Some of the words that appear in the text are explained below:
1. The term "and/or" in the embodiments of the present disclosure describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
2. The term "electronic equipment" in the embodiments of the present disclosure refers to equipment composed of electronic components such as integrated circuits, transistors, and electron tubes, and functioning through the application of electronic technology (including software); it includes electronic computers, robots controlled by electronic computers, numerical-control or program-control systems, and the like.
3. The term "positive sample" in the embodiments of the present disclosure refers to the target object to be detected by a task; in face detection the target object is a face in an image, such as faces of different ethnicities and ages, faces with different expressions, faces wearing different decorations, and the like, and it may be a face of any scale.
4. The term "negative sample" in the embodiments of the present disclosure refers to the background in which the object to be detected is located; more than 99% of an image's background is non-face area. For example, a face may appear in different environments, such as a street or a room.
5. The term "multi-scale face" in the embodiments of the present disclosure refers to faces of different scales; for example, the faces contained in the training images of a face detection data set span a large scale range, including relatively large-scale and relatively small-scale faces.
6. The term "RetinaFace" in the embodiments of the present disclosure refers to a powerful single-stage face detector that performs pixel-wise face localization across various face scales using joint supervised and self-supervised multi-task learning.
The application scenarios described in the embodiments of the present disclosure are intended to illustrate the technical solutions more clearly and do not limit them; as a person of ordinary skill in the art knows, as new application scenarios emerge, the technical solutions provided in the embodiments of the present disclosure remain applicable to similar technical problems. In the description of the present disclosure, unless otherwise indicated, "plurality" means two or more.
Face detection is a very hot and challenging algorithmic problem in the field of computer vision, and is also one of the most important business scenarios for artificial-intelligence algorithms. To improve their algorithms and demonstrate technical strength, many AI (Artificial Intelligence) companies select open data sets to verify their algorithmic capability. Among the many data sets, WIDER FACE, established by the Chinese University of Hong Kong in 2016, is currently the largest and most difficult publicly available face detection data set. It contains 20,000 training images with about 180,000 faces of different sizes.
The WIDER FACE data set is very difficult and close to real scenes: it gathers various influencing factors such as face size, varied shooting angles and face poses, different degrees of face occlusion and expression change, different types and intensities of illumination, and various makeup styles.
In recent years, with the development of deep learning, face detection methods based on deep neural networks have gradually appeared, such as the Cascade Convolutional Neural Network and Faster R-CNN (Faster Regions with Convolutional Neural Networks). Compared with conventional face detection methods, the features extracted by deep neural networks are more robust and descriptive. Fig. 1 is a schematic diagram of face detection based on a deep neural network, in which two faces are detected; both occupy a large proportion of the image, i.e., the face sizes are large. However, because faces vary widely in scale, current face detection models perform poorly on the small-scale faces (faces occupying a small proportion of the image) in multi-scale face detection.
Fig. 2 is a flowchart illustrating a training method of a face detection model according to an exemplary embodiment, and as shown in fig. 2, the method includes the following steps.
In step S21, the positive-sample pixel range detected by each feature layer is determined according to the anchor points corresponding to each feature layer in the face detection model, where the face detection model is a neural network model including at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor points represent the range of face scales detected by the feature layer.
In step S22, the target training images corresponding to each feature layer are determined from the training images of the face detection data set according to the positive-sample pixel range detected by each feature layer, so that the ratio of the numbers of target training images among the feature layers is the same as the preset ratio of the numbers of negative samples among the feature layers, where the face scales contained in the target training images corresponding to a feature layer fall within the positive-sample pixel range detected by that feature layer.
In step S23, the face detection model is trained with the target training images corresponding to each feature layer.
In the embodiments of the present disclosure, the face detection model refers to a neural network model for detecting a face.
With this scheme, training of the face detection model is apportioned according to the ratio of the numbers of negative samples of the different feature layers; the positive-sample pixel range detected by each feature layer is determined according to the sizes of that layer's anchors, and different positive-sample pixel ranges can be used to detect faces of different scales. Because the ratio of the numbers of target training images among the feature layers is the same as the preset ratio of the numbers of negative samples among the feature layers, face data of different scales are balanced, each feature layer can be well trained, a more robust multi-scale face detector can be trained to meet the requirements of multi-scale face detection, and the detection of small-scale faces is enhanced.
An anchor represents the range of face scales detected by a feature layer: the larger the anchor, the larger the detectable face scale. For example, an anchor of 16 means that a face of about 16 pixels can be detected, i.e., the detectable face scale is about 16 × 16, covered by a 16 × 16 anchor box; an anchor of 25 means that a face of about 25 pixels can be detected, i.e., the detectable face scale is about 25 × 25.
In practice, the arrangement of multi-scale training data depends on the choice of model. The face detection model in the embodiments of the present disclosure adopts RetinaFace, currently the best-performing detector, as the base model; its basic framework is shown in Fig. 3.
The base network structure is ResNet-50 (Residual Neural Network), whose block1 (p2), block2 (p3), block3 (p4), and block4 (p5) serve as feature layers, and a convolutional layer (c6/p6) with stride 2 is added on top of block4 and output as a further feature layer.
In the embodiments of the present disclosure, block1 to block4 are the 4 blocks included in the face detection model, corresponding to the 4 feature layers p2 to p5, where c2 to c5 denote convolutional layers, and c6/p6 is both a convolutional layer and a feature layer.
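To make this structure concrete, the following is a minimal sketch of such a backbone in PyTorch. It is an illustration only, assuming torchvision's ResNet-50 and the class and attribute names shown; the patent itself provides no code.

```python
import torch.nn as nn
import torchvision

class FaceDetectionBackbone(nn.Module):
    """Sketch of the backbone described above: ResNet-50 blocks 1-4
    yield the feature layers p2-p5, and an extra stride-2 convolution
    (c6/p6) is added on top of block4."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1,
                                  resnet.relu, resnet.maxpool)
        self.block1 = resnet.layer1   # -> p2 (160x160 for a 640x640 input)
        self.block2 = resnet.layer2   # -> p3 (80x80)
        self.block3 = resnet.layer3   # -> p4 (40x40)
        self.block4 = resnet.layer4   # -> p5 (20x20)
        self.c6 = nn.Conv2d(2048, 256, kernel_size=3,
                            stride=2, padding=1)  # c6/p6 (10x10)

    def forward(self, x):
        p2 = self.block1(self.stem(x))
        p3 = self.block2(p2)
        p4 = self.block3(p3)
        p5 = self.block4(p4)
        p6 = self.c6(p5)
        return p2, p3, p4, p5, p6
```

The feature-map sizes in the comments match the feature-layer scales listed in Table 2 below for a 640x640 input.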
Optionally, assuming that each feature layer corresponds to 3 anchors, the anchor sizes are selected according to the scale of each feature layer, as shown in Table 1 below.
(Table 1 is rendered as an image in the original and is not reproduced here; from the surrounding text, the P2 feature layer's anchors are 16, 20.16 and 25.40, the P3 feature layer's anchors run from 32 to 50.80, and the P4 feature layer's smallest anchor is 64.)
Table 1 Anchor settings table
The anchor settings in Table 1 are selected according to the scale of each feature layer; the larger the scale of the feature layer, the larger its anchors.
In the embodiments of the present disclosure, each feature layer corresponds to at least one anchor, and the numbers of anchors may be the same across feature layers, for example, 3 anchors per feature layer. In an alternative embodiment, the numbers of anchors may differ between feature layers, for example, 2 anchors for the P2 feature layer and 3 anchors for the P3 feature layer.
In the embodiments of the present disclosure, when a feature layer corresponds to only one anchor, say anchor1, the largest anchor and the smallest anchor of that feature layer are both anchor1.
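For illustration, a hypothetical generator for per-layer anchor sizes is sketched below. The 2^(1/3) geometric spacing is an assumption inferred from the anchor values quoted in the text (16, 20.16, 25.40 for P2; 32 to 50.80 for P3); the authoritative values are those of Table 1.

```python
def anchors_for_layer(base_size: float, num_anchors: int = 3) -> list:
    """Return num_anchors anchor sizes starting at base_size,
    spaced by an assumed factor of 2**(1/3)."""
    step = 2 ** (1.0 / 3.0)
    return [round(base_size * step ** i, 2) for i in range(num_anchors)]

for layer, base in [("P2", 16), ("P3", 32), ("P4", 64),
                    ("P5", 128), ("P6", 256)]:
    print(layer, anchors_for_layer(base))
# P2 [16, 20.16, 25.4]
# P3 [32, 40.32, 50.8]
# ...
```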
For the above network structure and anchor settings, the target training images for multi-scale face detection are determined according to the following steps.
First, the open-source face detection data set WIDER FACE is collected.
Then, the number of negative samples corresponding to each feature layer of the face detection model is analyzed. The number of negative samples is determined by the scale of the feature layer and is approximately equal to the feature layer's area; taking the P2 feature layer as an example, its scale is 160 x 160 = 25600, so the number of negative samples corresponding to the P2 feature layer is 25600.
In the embodiments of the present disclosure, the number of negative samples corresponding to each feature layer is approximately the feature layer's scale: the larger the scale (and hence the area) of a feature layer, the larger its number of negative samples, as shown in the following table:
Feature layer | Scale | Number of negative samples
P2 | 160 x 160 | 25600
P3 | 80 x 80 | 6400
P4 | 40 x 40 | 1600
P5 | 20 x 20 | 400
P6 | 10 x 10 | 100
Table 2 Feature layer scale table
Table 2 is an example, given in the embodiments of the present disclosure, of determining the number of negative samples corresponding to each feature layer according to its scale.
In the embodiments of the present disclosure, each anchor corresponds to one classifier. Taking the P2 feature layer as an example, the three anchors (16, 20.16, 25.40) on the P2 feature layer correspond to 3 classifiers, each with 25600 negative samples; the 3 anchors on the P3 feature layer each have 6400 negative samples; and so on: 1600 negative samples per classifier on the P4 feature layer, 400 on the P5 feature layer, and 100 on the P6 feature layer. The ratio of the numbers of negative samples between the feature layers is therefore 25600:6400:1600:400:100, that is, 256:64:16:4:1.
In the embodiments of the present disclosure, negative samples are non-face regions in an image. Taking the three anchors (16, 20.16, 25.40) on the P2 feature layer as an example, each of the 3 corresponding classifiers has 25600 negative samples: for the classifier of anchor = 16, these are 25600 non-face regions of scale about 16 × 16; for the classifier of anchor = 20.16, they are 25600 non-face regions of scale about 20.16 × 20.16. Multiple negative samples may exist in the same target training image; the number of negative samples refers to the number of non-face regions, not the number of training images.
In the embodiments of the present disclosure, to ensure that the classifier corresponding to each anchor can be well trained, the positive and negative samples of each classifier must be kept relatively balanced, so the ratio of the numbers of positive samples across the feature layers should also be 256:64:16:4:1.
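A short sketch of this bookkeeping, assuming the feature-map sizes of Table 2 (the negative-sample count per layer is taken to be the feature-map area):

```python
feature_map_sides = {"P2": 160, "P3": 80, "P4": 40, "P5": 20, "P6": 10}

# Negative samples per feature layer = feature-map area
negatives = {layer: s * s for layer, s in feature_map_sides.items()}
# {'P2': 25600, 'P3': 6400, 'P4': 1600, 'P5': 400, 'P6': 100}

# Ratio relative to the topmost layer (P6): 256:64:16:4:1
ratios = {layer: n // negatives["P6"] for layer, n in negatives.items()}
# {'P2': 256, 'P3': 64, 'P4': 16, 'P5': 4, 'P6': 1}
```

The same 256:64:16:4:1 ratio is then imposed on the positive samples, and hence on the target training images, per layer.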
In an optional implementation, the positive-sample pixel range detected by each feature layer is determined from the anchor points corresponding to that feature layer as follows.
In Fig. 3, the lowermost feature layer is P2, the uppermost feature layer is P6, and the middle feature layers are P3, P4 and P5; the feature layer P6 and the convolutional layer C6 are the same layer.
In an alternative embodiment, the upper limit of the positive-sample pixel range detected by the lowermost feature layer is determined from the maximum anchor point corresponding to the lowermost feature layer and the minimum anchor point corresponding to the feature layer above it; the minimum anchor point corresponding to the lowermost feature layer is taken as the lower limit of its positive-sample pixel range.
The P2 feature layer is the lowermost feature layer, and its 3 anchors are 16, 20.16, and 25.40, so the lower limit of the positive-sample pixel range detected by the P2 feature layer is 16.
The upper limit of the positive-sample pixel range detected by the P2 feature layer is the average of the maximum anchor of the P2 feature layer (25.40) and the minimum anchor of the P3 feature layer (32):
(25.40+32)/2=28.7.
In an alternative embodiment, the upper limit of the positive-sample pixel range detected by a middle feature layer is determined from the maximum anchor point corresponding to that feature layer and the minimum anchor point corresponding to the feature layer above it; the upper limit of the positive-sample pixel range detected by the feature layer below the middle feature layer is taken as the lower limit of the middle feature layer's positive-sample pixel range.
For example, the upper limit of the positive-sample pixel range detected by the P3 feature layer is the average of the maximum anchor of the P3 feature layer (50.80) and the minimum anchor of the P4 feature layer (64):
(50.80+64)/2=57.4.
In an alternative embodiment, if the average of the largest anchor corresponding to a feature layer and the smallest anchor corresponding to the feature layer above it is not an integer, the average may be rounded; rounding includes, but is not limited to, some or all of the following:
rounding up, rounding down, and the like.
For example, an average of 57.4 rounds down to 57 and up to 58; an average of 28.7 rounds down to 28 and up to 29.
Taking rounding down as the example, the upper limit of the positive-sample pixel range detected by the P2 feature layer is 28, by the P3 feature layer 57, by the P4 feature layer 114, and by the P5 feature layer 227.
Accordingly, 28 is the lower limit of the positive-sample pixel range detected by the P3 feature layer (equal to P2's upper limit), 57 is the lower limit for the P4 feature layer (equal to P3's upper limit), and 114 is the lower limit for the P5 feature layer (equal to P4's upper limit).
In an alternative embodiment, the upper limit of the positive-sample pixel range detected by the uppermost feature layer is determined from the size of the training image, and the upper limit of the positive-sample pixel range detected by the feature layer below it is taken as its lower limit.
The P6 feature layer is the uppermost feature layer, so the upper limit of its positive-sample pixel range is determined by the size of the training image; assuming a 640x640 training image, the upper limit is generally 640 pixels. The P5 feature layer is the feature layer below the P6 feature layer, so the lower limit of the positive-sample pixel range detected by the P6 feature layer is the upper limit of P5's range, 227.
In summary, for a 640 × 640 image, the positive-sample pixel range detected by the P2 feature layer is 16-28, by the P3 feature layer 28-57, by the P4 feature layer 57-114, by the P5 feature layer 114-227, and by the P6 feature layer 227-640.
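The following sketch reproduces this computation. The P2 and P3 anchors are those quoted above; the P4-P6 anchor lists are assumptions extrapolated from the same spacing, and with them the P5 upper limit comes out as 229 rather than the 227 quoted in the text, so the actual Table 1 values likely differ slightly.

```python
import math

def positive_ranges(anchors: dict, image_size: int) -> dict:
    """Per-layer positive-sample pixel ranges, following the rules above:
    a layer's upper limit is the rounded-down average of its largest
    anchor and the smallest anchor of the layer above; its lower limit is
    the upper limit of the layer below (for the lowest layer, its own
    smallest anchor); the topmost upper limit is the image size."""
    layers = list(anchors)            # ordered bottom (P2) to top (P6)
    ranges, lower = {}, min(anchors[layers[0]])
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            upper = math.floor((max(anchors[layer])
                                + min(anchors[layers[i + 1]])) / 2)
        else:
            upper = image_size
        ranges[layer] = (lower, upper)
        lower = upper
    return ranges

anchors = {"P2": [16, 20.16, 25.40], "P3": [32, 40.32, 50.80],
           "P4": [64, 80.63, 101.59],          # P4-P6 values assumed
           "P5": [128, 161.27, 203.19], "P6": [256, 322.54, 406.37]}
print(positive_ranges(anchors, image_size=640))
# {'P2': (16, 28), 'P3': (28, 57), 'P4': (57, 114),
#  'P5': (114, 229), 'P6': (229, 640)}
```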
In the embodiments of the present disclosure, the ratio of the numbers of target training images between the feature layers must equal the preset ratio of the numbers of negative samples between them. Taking the 5 feature layers P2-P6 above as the example, the training images in the 5 positive-sample pixel ranges must be kept in the ratio 256:64:16:4:1, which is achieved by the following process.
The number of negative samples corresponding to each feature layer is in direct proportion to that feature layer's scale. First, the target training images corresponding to the uppermost feature layer are determined from the face detection data set, specifically:
in an alternative embodiment, training images with face dimensions in the range of positive sample pixels detected by the feature layer located at the uppermost position in the face detection data set are determined, and part or all of the determined training images are selected as target training images corresponding to the feature layer located at the uppermost position.
In the embodiment of the present disclosure, if a certain training image includes at least one face whose face scale is within the range of the positive sample pixels detected by the feature layer located at the uppermost position, it may be determined that the training image may be used as the target training image.
Taking a P6 feature layer as an example, the range of the detected positive sample pixels is 227-640, and if a certain training image contains 3 faces, the assumed scales are: 228x 250, 238x 240 and 116x240, wherein two face scales are in the range of 227-640, the training image can be determined to be a target training image; assume that another training image contains 2 faces, and the scales are: 116x240, 116x 140, and thus the training image may not be the target training image.
Taking the P6 feature layer as an example, the range of positive sample pixels detected by the P6 feature layer is: 227-640, for example, the face scale included in the training image shown in fig. 4A is within a range of 227-640, it is assumed that statistics indicates that the face scale included in 100 training images in the widerface is within a range of 227-640, and it is assumed that 50 training images are selected from the 100 training images as target training images corresponding to the P6 feature layer, and the number of the target training images is a, that is, denoted as a-50, based on this.
Optionally, the 100 training images may be all used as the target training image corresponding to the P6 feature layer, where a is 100.
In an optional embodiment, when determining the target training images corresponding to any feature layer other than the uppermost one (the P6 feature layer): if the image-number ratio corresponding to that feature layer is larger than the negative-sample number ratio between the uppermost feature layer and that feature layer, training images in the face detection data set are copied and the copies are scaled, and the target training images corresponding to that feature layer are then determined from the scaled training images.
Here, the image-number ratio corresponding to a feature layer is the ratio of the number of target training images corresponding to the uppermost feature layer to the number of candidate training images corresponding to that feature layer, where the candidate training images are the training images in the face detection data set whose face scales fall within the positive-sample pixel range detected by that feature layer.
Taking Table 2 as an example, the number of negative samples corresponding to the P6 feature layer is 100 and to the P5 feature layer 400, so P6:P5 = 1/4.
Assuming A = 50: for the P5 feature layer, the number of training images with face scales within its positive-sample pixel range of 114-227 is counted; for example, the face scales in the training image shown in Fig. 4B fall within 114-227. Suppose WIDER FACE contains 50 training images with face scales within the P5 feature layer's positive-sample pixel range, i.e., the number of candidate training images corresponding to the P5 feature layer is 50, while the number of target training images corresponding to the P6 feature layer is 50. Then 50:50 = 1 > 1/4, that is, 50 is less than 4A = 200. In this case, training images can be selected from WIDER FACE and copied, the copies scaled, and the target training images corresponding to the P5 feature layer determined from the scaled images.
Similarly, for the P4 feature layer, the number of training images with face scales within its positive-sample pixel range of 57-114 is counted; for example, the face scales in the training image shown in Fig. 4C fall within 57-114. Suppose WIDER FACE contains 50 such training images, so the number of candidate training images corresponding to the P4 feature layer is 50, while the number of target training images corresponding to the P6 feature layer is 50. Then 50:50 = 1 > 1/16, that is, 50 is less than 16A = 800. In this case, training images can be selected from WIDER FACE and copied, the copies scaled, and the target training images corresponding to the P4 feature layer determined from the scaled images.
In an alternative embodiment, when selecting training images from WIDER FACE to copy, a training image containing a face scale not less than the upper limit of the positive-sample pixel range detected by the feature layer may be copied at least once.
In the embodiments of the present disclosure, a copied training image must contain at least one face whose scale is not less than the upper limit of the positive-sample pixel range detected by the feature layer.
For example, for the P5 feature layer, training images containing faces of not less than 227 pixels need to be copied. Specifically, all 100 such training images may be copied, or only the 50 selected as target training images may be copied, at least once; the number of copies can be determined by the number of target training images required by the P5 feature layer (i.e., 4A).
For example, if all 100 training images are copied, 4A = 200, and there are 50 candidate training images corresponding to the P5 feature layer, each image is copied at least twice; if only the 50 selected target training images are copied, at least three copies each may be needed.
For the P4 feature layer, training images containing faces of not less than 114 pixels need to be copied. Because the images containing faces of not less than 227 pixels were already copied at least once when determining the P5 feature layer's target training images, suppose there are now 300 training images containing faces of not less than 114 pixels; all 300 may be copied, or a subset may be copied at least once, with the number of copies determined by the number of target training images required by the P4 feature layer (i.e., 16A).
For example, if the 300 training images are copied, 16A = 800, and there are 50 candidate training images corresponding to the P4 feature layer, at least three copies each are required.
After copying, the copied training images are scaled, specifically: the copied training images are scaled in equal proportion according to the positive-sample pixel range detected by the feature layer.
Alternatively, a suitable scaling ratio is chosen according to the face scales contained in the copied training image and the positive-sample pixel range detected by the P5 feature layer; the scaling can be realized with a Resize function.
Fig. 5A shows the result of shrinking the training image of Fig. 4A by 1/2, Fig. 5B the result of shrinking the training image of Fig. 4B by 1/2, and Fig. 5C the result of shrinking the training image of Fig. 4C by 1/2.
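A hypothetical helper for this copy-and-scale step is sketched below; cv2.resize is the standard OpenCV resize call, while the choice of targeting the midpoint of the layer's range is an assumption, not something the text prescribes.

```python
import cv2  # OpenCV

def scale_into_range(image, face_scale: float, lo: float, hi: float):
    """Uniformly scale `image` so that a face of `face_scale` pixels
    lands inside [lo, hi]; here we aim at the middle of the range."""
    factor = ((lo + hi) / 2.0) / face_scale
    h, w = image.shape[:2]
    resized = cv2.resize(image, (int(w * factor), int(h * factor)),
                         interpolation=cv2.INTER_LINEAR)
    return resized, factor
```

For example, a copied image whose largest face is 250 pixels, scaled for the P5 range 114-227, would be shrunk by a factor of about 0.68, putting that face near 170 pixels.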
In an alternative embodiment, there are many ways to determine the target training images corresponding to each feature layer; two are listed below.
First way: only training images whose face scales fall within the positive-sample pixel range detected by the feature layer are selected from the scaled training images, and the selected images are taken as the target training images corresponding to that feature layer.
Taking the P3 feature layer as an example, with positive-sample pixel range 32-57: suppose the scaled training images include Figs. 5A, 5B and 5C, that the face scales in Fig. 5C fall within 32-57, and that those in Figs. 5A and 5B do not. Then Figs. 5A and 5B fail the condition and are not selected, while Fig. 5C may or may not be selected, depending on the actual situation.
Second way: the target training images corresponding to the feature layer are selected from both the candidate training images corresponding to that feature layer and the scaled training images.
Taking the P2 feature layer as an example, suppose there are 10,000 candidate training images corresponding to the P2 feature layer and at least 12,800 target training images are required. Then 2,800 training images with face scales in the range 16-28 can be selected from the scaled training images, and these 2,800 together with the 10,000 candidate images are taken as the target training images corresponding to the P2 feature layer. Alternatively, some training images are selected from the candidates and some from the scaled images, and the two selections together are taken as the P2 feature layer's target training images, provided their number is not less than 12,800.
By analogy, the target training images corresponding to the other feature layers, such as P5 and P4, can be determined in the same manner.
It should be noted that the ways of determining a feature layer's target training images from the scaled training images described in the embodiments of the present disclosure are only examples; any way of determining the target training images from the scaled training images is applicable to the embodiments of the present disclosure.
An alternative embodiment is: for WIDER FACE, count the number of training images with face scales within 227-640 and take it as the reference, denoted A; count the number of training images with face scales within 114-227, and if it is smaller than 4A, scale training images containing face scales above 227 until the number of training images with face scales within the 114-227 pixel range is approximately 4A; then count the number of training images with face scales within the 57-114 pixel range, and if it is smaller than 16A, scale training images containing face scales above 114 pixels until the number reaches 16A; and so on, completing the enhancement of the training images in all positive-sample pixel ranges.
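A sketch of this enhancement loop is given below, under assumed helpers and data representation (`images` is a list of (image, face_scales) pairs, and `scale_into_range` is the hypothetical helper above). The per-layer ranges and 4A/16A/64A/256A targets follow the text and Fig. 6; note the text gives both 28 and 32 as P3's lower bound, and the value 32 from the Fig. 6 flow is used here.

```python
import random

def count_in_range(images, lo, hi):
    """Number of images containing at least one face scale in [lo, hi)."""
    return sum(1 for _, scales in images
               if any(lo <= s < hi for s in scales))

def enhance(images):
    ranges = [("P5", 114, 227, 4), ("P4", 57, 114, 16),
              ("P3", 32, 57, 64), ("P2", 16, 28, 256)]
    A = count_in_range(images, 227, 640)   # reference count (P6)
    for _, lo, hi, mult in ranges:
        while count_in_range(images, lo, hi) < mult * A:
            # copy an image containing a face of at least `hi` pixels ...
            donor, scales = random.choice(
                [(im, sc) for im, sc in images if max(sc) >= hi])
            # ... and shrink it so the copied faces land in [lo, hi)
            scaled, f = scale_into_range(donor, max(scales), lo, hi)
            images.append((scaled, [s * f for s in scales]))
    return images
```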
In the embodiment of the present disclosure, the face detection model is trained with the target training images corresponding to each feature layer. Because the embodiment enhances the face data so that the numbers of target training images corresponding to the feature layers follow the ratio of the numbers of negative samples of the different feature layers, large-scale and small-scale face data are balanced, and positive and negative samples are balanced at the same time. Since the number of negative samples differs between the classifiers of different feature layers, enhancing the sample data in that same ratio allows each classifier to be trained well, so the trained model can detect faces across all scale ranges.
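As a cross-check on these ratios, the 1:4:16:64:256 proportion is exactly what relative feature-map areas predict: each finer pyramid level halves the stride and therefore roughly quadruples the number of anchors, and hence of negative samples. A short sketch, assuming a 640-pixel input and strides of 64, 32, 16, 8, and 4 for P6-P2 (the stride values are an assumption made for illustration, not taken from this disclosure):

    strides = {"P6": 64, "P5": 32, "P4": 16, "P3": 8, "P2": 4}
    areas = {k: (640 // s) ** 2 for k, s in strides.items()}  # feature-map cells
    base = areas["P6"]
    print({k: a // base for k, a in areas.items()})
    # {'P6': 1, 'P5': 4, 'P4': 16, 'P3': 64, 'P2': 256}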
Fig. 6 is a flowchart illustrating a complete method for training a face detection model according to an exemplary embodiment; the method includes the following steps (a table-driven sketch of steps 602-615 follows the list):
step 600, collecting the open-source face detection data set WIDER FACE;
step 601, analyzing the number of negative samples corresponding to the different feature layers of the face detection model;
step 602, determining the positive sample pixel range detected by each feature layer according to the anchor size corresponding to that feature layer;
step 603, counting the number of training images whose face scale falls within 227-640, taking these images as the target training images corresponding to the P6 feature layer, and recording their number as the reference A;
step 604, counting the number of training images whose face scale falls within 114-227 and judging whether that number is smaller than 4A; if so, executing step 605, otherwise executing step 606;
step 605, scaling training images whose face scale is above 227 so that the scaled face scale falls within the 114-227 pixel range, until the number of training images with face scales in the 114-227 pixel range is approximately equal to 4A;
step 606, selecting 4A training images whose face scale falls within 114-227 as the target training images corresponding to the P5 feature layer;
step 607, counting the number of training images whose face scale falls within 57-114 and judging whether that number is smaller than 16A; if so, executing step 608, otherwise executing step 609;
step 608, scaling training images whose face scale is above 114 so that the scaled face scale falls within the 57-114 pixel range, until the number of training images with face scales in the 57-114 pixel range is approximately equal to 16A;
step 609, selecting 16A training images whose face scale falls within 57-114 as the target training images corresponding to the P4 feature layer;
step 610, counting the number of training images whose face scale falls within 32-57 and judging whether that number is smaller than 64A; if so, executing step 611, otherwise executing step 612;
step 611, scaling training images whose face scale is above 57 so that the scaled face scale falls within the 32-57 pixel range, until the number of training images with face scales in the 32-57 pixel range is approximately equal to 64A;
step 612, selecting 64A training images whose face scale falls within 32-57 as the target training images corresponding to the P3 feature layer;
step 613, counting the number of training images whose face scale falls within 16-32 and judging whether that number is smaller than 256A; if so, executing step 614, otherwise executing step 615;
step 614, scaling training images whose face scale is above 32 so that the scaled face scale falls within the 16-32 pixel range, until the number of training images with face scales in the 16-32 pixel range is approximately equal to 256A;
step 615, selecting 256A training images whose face scale falls within 16-32 as the target training images corresponding to the P2 feature layer.
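Read as code, steps 602-615 reduce to one table-driven loop over the feature layers. The sketch below models each training image by its face scale alone and models "scaling" as inserting a shrunken copy whose face scale lands in the target range; the layer table and the demo numbers are illustrative assumptions.

    LAYERS = [  # (layer, positive sample pixel range, multiple of reference A)
        ("P6", (227, 640), 1),
        ("P5", (114, 227), 4),
        ("P4", (57, 114), 16),
        ("P3", (32, 57), 64),
        ("P2", (16, 32), 256),
    ]

    def build_targets(face_scales):
        targets = {}
        a = sum(227 <= s < 640 for s in face_scales)   # step 603: reference A
        for name, (lo, hi), mult in LAYERS:
            quota = a * mult
            have = [s for s in face_scales if lo <= s < hi]
            while len(have) < quota:                   # steps 604/607/610/613
                have.append((lo + hi) / 2)             # scaled-down copy (605/608/...)
            targets[name] = have[:quota]               # steps 606/609/612/615
        return targets

    demo = build_targets([300] * 10 + [150] * 25 + [80] * 100)
    print({k: len(v) for k, v in demo.items()})
    # {'P6': 10, 'P5': 40, 'P4': 160, 'P3': 640, 'P2': 2560}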
FIG. 7 is a block diagram illustrating an apparatus for training a face detection model according to an exemplary embodiment. Referring to fig. 7, the apparatus includes a first determining unit 700, a second determining unit 701, and a training unit 702.
The first determining unit 700 is configured to perform determining a positive sample pixel range detected by each feature layer according to an anchor point corresponding to each feature layer in a face detection model, where the face detection model is a neural network model including at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor point is used for representing a range of a face scale detected by a feature layer;
the second determining unit 701 is configured to perform determining, from the training images of the face detection data set, a target training image corresponding to each feature layer according to the positive sample pixel range detected by the feature layer, so that a ratio of the number of target training images between the feature layers is the same as a preset ratio of the number of negative samples between the feature layers, where the target training images corresponding to the feature layers include a face scale within the positive sample pixel range detected by the feature layer;
the training unit 702 is configured to perform training on the face detection model by using the target training images corresponding to the feature layers.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer; and taking the upper limit of the positive sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive sample pixel range detected by the middle feature layer.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by the topmost feature layer according to the size of the training image, and taking the upper limit of the positive sample pixel range detected by the feature layer below the topmost feature layer as the lower limit of the positive sample pixel range detected by the topmost feature layer.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by the lowest feature layer according to the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer; and taking the minimum anchor point corresponding to the lowest feature layer as the lower limit of the positive sample pixel range detected by the lowest feature layer.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
taking the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer as the upper limit of the positive sample pixel range detected by the middle feature layer.
In an optional implementation, the first determining unit 700 is specifically configured to perform:
taking the average of the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer as the upper limit of the positive sample pixel range detected by the lowest feature layer (a sketch of these range-determination rules follows below).
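A sketch of the range-determination rules above. It assumes each feature layer owns a list of anchor sizes, ordered from the lowest layer to the topmost, and that the training image side length caps the topmost range; the anchor values in the example are hypothetical, chosen only so that the output reproduces the 16-32 / 32-57 / 57-114 / 114-227 / 227-640 ranges used in this disclosure.

    def positive_ranges(layer_anchors, image_size):
        # layer_anchors: anchor sizes per feature layer, lowest layer first.
        n = len(layer_anchors)
        uppers = []
        for i in range(n):
            if i == n - 1:
                uppers.append(image_size)  # topmost layer: capped by image size
            else:
                # average of this layer's max anchor and the min anchor of
                # the feature layer above it
                uppers.append((max(layer_anchors[i]) + min(layer_anchors[i + 1])) // 2)
        ranges = []
        for i in range(n):
            # lowest layer: its own min anchor; otherwise the upper limit
            # of the feature layer below
            lower = min(layer_anchors[0]) if i == 0 else uppers[i - 1]
            ranges.append((lower, uppers[i]))
        return ranges

    # Hypothetical anchors for P2..P6:
    anchors = [[16, 24], [40, 48], [66, 100], [128, 198], [256, 384]]
    print(positive_ranges(anchors, 640))
    # [(16, 32), (32, 57), (57, 114), (114, 227), (227, 640)]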
In an alternative implementation, the number of negative samples detected by each feature layer is proportional to the scale of each feature layer;
the second determining unit 701 is specifically configured to perform:
determining a target training image corresponding to the topmost feature layer from the face detection dataset;
if the image quantity proportion corresponding to other feature layers is larger than the quantity proportion of the negative samples between the uppermost feature layer and the other feature layers, copying the training images in the face detection data set, and scaling the copied training images;
determining target training images corresponding to the other feature layers from the scaled training images;
the image quantity proportion corresponding to the other feature layers is the ratio of the number of target training images corresponding to the topmost feature layer to the number of candidate training images corresponding to the other feature layers, and the candidate training images are training images from the face detection data set whose face scale falls within the positive sample pixel range detected by the other feature layers.
In an optional implementation, the second determining unit 701 is specifically configured to perform:
copying, at least once, the training images whose face scale is not less than the upper limit of the positive sample pixel range detected by the feature layer;
and scaling the copied training images in equal proportion according to the positive sample pixel range detected by the feature layer.
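A sketch of this copy-and-shrink step using Pillow. The sample layout (an image path plus the pixel scale of its face) and the choice to aim at the middle of the target range are assumptions made for illustration.

    from PIL import Image

    def shrink_copy(path, face_scale, lo, hi):
        # Only images whose face scale is at or above the range's upper
        # limit are copied and shrunk, matching the rule above.
        if face_scale < hi:
            raise ValueError("face scale must not be less than the upper limit")
        factor = (lo + hi) / 2 / face_scale  # land near the middle of [lo, hi)
        img = Image.open(path)
        size = (max(1, int(img.width * factor)), max(1, int(img.height * factor)))
        return img.resize(size, Image.BILINEAR)  # equal-proportion scaling

    # e.g. shrink a copy of an image with a 150-pixel face into the P4
    # range 57-114:
    # small = shrink_copy("sample.jpg", 150, 57, 114)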
In an optional implementation, the second determining unit 701 is further configured to perform:
and selecting a target training image corresponding to the characteristic layer from the candidate training images corresponding to the characteristic layer.
In an optional implementation, the second determining unit 701 is specifically configured to perform:
selecting, from the scaled training images, the training images whose face scale falls within the positive sample pixel range detected by the feature layer, and taking the selected training images as the target training images corresponding to the feature layer.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs operations has been described in detail in the embodiments of the method and will not be elaborated here.
FIG. 8 is a block diagram illustrating an electronic device 800 for training a face detection model, according to an exemplary embodiment. The electronic device includes:
a processor 810;
a memory 820 for storing instructions executable by the processor 810;
wherein the processor 810 is configured to execute the instructions to implement the training method of the face detection model in the embodiment of the present disclosure.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 820 comprising instructions, executable by the processor 810 of the electronic device 800 to perform the above-described method is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The embodiment of the present disclosure further provides a computer program product which, when run on an electronic device, causes the electronic device to perform any one of the above training methods of the face detection model.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

1. A training method of a face detection model is characterized by comprising the following steps:
determining the range of positive sample pixels detected by each characteristic layer according to anchor points corresponding to each characteristic layer in a face detection model, wherein the face detection model is a neural network model comprising at least two characteristic layers, each characteristic layer corresponds to at least one anchor point, and the anchor points are used for representing the range of face scales detected by the characteristic layers;
determining target training images corresponding to the feature layers from training images of a face detection data set according to the pixel range of the positive samples detected by the feature layers, so that the proportion of the number of the target training images among the feature layers is the same as the proportion of the number of the negative samples among the preset feature layers, wherein the face scale contained in the target training images corresponding to the feature layers is in the pixel range of the positive samples detected by the feature layers;
and training the face detection model by adopting the target training images corresponding to the characteristic layers.
2. The training method of the face detection model according to claim 1, wherein the step of determining the range of the positive sample pixels detected by each feature layer according to the anchor points corresponding to each feature layer in the face detection model comprises:
determining the upper limit of the positive sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer; and taking the upper limit of the positive sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive sample pixel range detected by the middle feature layer.
3. The training method of the face detection model according to claim 1, wherein the step of determining the range of the positive sample pixels detected by each feature layer according to the anchor points corresponding to each feature layer in the face detection model comprises:
determining the upper limit of the positive sample pixel range detected by the topmost feature layer according to the size of the training image, and taking the upper limit of the positive sample pixel range detected by the feature layer below the topmost feature layer as the lower limit of the positive sample pixel range detected by the topmost feature layer.
4. The training method of the face detection model according to claim 1, wherein the step of determining the range of the positive sample pixels detected by each feature layer according to the anchor points corresponding to each feature layer in the face detection model comprises:
determining the upper limit of the positive sample pixel range detected by the lowest feature layer according to the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer; and taking the minimum anchor point corresponding to the lowest feature layer as the lower limit of the positive sample pixel range detected by the lowest feature layer.
5. The training method of the face detection model according to claim 2, wherein the step of determining the upper limit of the positive sample pixel range detected by the middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer comprises:
taking the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer as the upper limit of the positive sample pixel range detected by the middle feature layer.
6. The training method of the face detection model according to claim 4, wherein the step of determining the upper limit of the positive sample pixel range detected by the lowest feature layer according to the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer comprises:
taking the average of the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer as the upper limit of the positive sample pixel range detected by the lowest feature layer.
7. The training method of the face detection model according to claim 1, wherein the number of negative samples detected by each feature layer is proportional to the scale of each feature layer;
the step of determining the target training image corresponding to each feature layer from the face detection data set according to the positive sample pixel range detected by each feature layer comprises:
determining a target training image corresponding to the topmost feature layer from the face detection dataset;
if the image quantity proportion corresponding to other feature layers is larger than the quantity proportion of the negative samples between the uppermost feature layer and the other feature layers, copying the training images in the face detection data set, and scaling the copied training images;
determining target training images corresponding to the other feature layers from the scaled training images;
the image quantity proportion corresponding to the other feature layers is the ratio of the number of target training images corresponding to the topmost feature layer to the number of candidate training images corresponding to the other feature layers, and the candidate training images are training images from the face detection data set whose face scale falls within the positive sample pixel range detected by the other feature layers.
8. The method for training a face detection model according to claim 7, wherein the step of copying the training images in the face detection data set and scaling the copied training images comprises:
copying, at least once, the training images whose face scale is not less than the upper limit of the positive sample pixel range detected by the feature layer;
and scaling the copied training images in equal proportion according to the positive sample pixel range detected by the feature layer.
9. The method for training a face detection model according to claim 7, wherein the step of determining the target training image corresponding to the feature layer from the scaled training images further comprises:
and selecting a target training image corresponding to the characteristic layer from the candidate training images corresponding to the characteristic layer.
10. The method for training the face detection model according to claim 7, wherein the step of determining the target training image corresponding to the feature layer from the scaled training images comprises:
selecting, from the scaled training images, the training images whose face scale falls within the positive sample pixel range detected by the feature layer, and taking the selected training images as the target training images corresponding to the feature layer.
11. A training device for a face detection model is characterized by comprising:
the first determination unit is configured to determine a positive sample pixel range detected by each feature layer according to an anchor point corresponding to each feature layer in a face detection model, wherein the face detection model is a neural network model comprising at least two feature layers, each feature layer corresponds to at least one anchor point, and the anchor point is used for representing a range of a face scale detected by the feature layer;
a second determining unit, configured to perform determining, from training images of a face detection data set, a target training image corresponding to each feature layer according to the positive sample pixel range detected by the feature layer, so that a ratio of the number of target training images between the feature layers is the same as a preset ratio of the number of negative samples between the feature layers, where the target training images corresponding to the feature layers include a face scale within the positive sample pixel range detected by the feature layer;
and the training unit is configured to execute training of the face detection model by using the target training images corresponding to the feature layers.
12. The apparatus for training a face detection model according to claim 11, wherein the first determining unit is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by a middle feature layer according to the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer; and taking the upper limit of the positive sample pixel range detected by the feature layer below the middle feature layer as the lower limit of the positive sample pixel range detected by the middle feature layer.
13. The apparatus for training a face detection model according to claim 11, wherein the first determining unit is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by the topmost feature layer according to the size of the training image, and taking the upper limit of the positive sample pixel range detected by the feature layer below the topmost feature layer as the lower limit of the positive sample pixel range detected by the topmost feature layer.
14. The apparatus for training a face detection model according to claim 11, wherein the first determining unit is specifically configured to perform:
determining the upper limit of the positive sample pixel range detected by the lowest feature layer according to the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer; and taking the minimum anchor point corresponding to the lowest feature layer as the lower limit of the positive sample pixel range detected by the lowest feature layer.
15. The apparatus for training a face detection model according to claim 12, wherein the first determining unit is specifically configured to perform:
taking the average of the maximum anchor point corresponding to the middle feature layer and the minimum anchor point corresponding to the feature layer above the middle feature layer as the upper limit of the positive sample pixel range detected by the middle feature layer.
16. The apparatus for training a face detection model according to claim 14, wherein the first determining unit is specifically configured to perform:
taking the average of the maximum anchor point corresponding to the lowest feature layer and the minimum anchor point corresponding to the feature layer above the lowest feature layer as the upper limit of the positive sample pixel range detected by the lowest feature layer.
17. The training device of the face detection model according to claim 11, wherein the number of negative samples detected by each feature layer is proportional to the scale of each feature layer;
the second determination unit is specifically configured to perform:
determining a target training image corresponding to the topmost feature layer from the face detection dataset;
if the image quantity proportion corresponding to other feature layers is larger than the quantity proportion of the negative samples between the uppermost feature layer and the other feature layers, copying the training images in the face detection data set, and scaling the copied training images;
determining target training images corresponding to the other feature layers from the scaled training images;
the image quantity proportion corresponding to the other feature layers is the ratio of the number of target training images corresponding to the topmost feature layer to the number of candidate training images corresponding to the other feature layers, and the candidate training images are training images from the face detection data set whose face scale falls within the positive sample pixel range detected by the other feature layers.
18. The apparatus for training a face detection model according to claim 17, wherein the second determining unit is specifically configured to perform:
copying, at least once, the training images whose face scale is not less than the upper limit of the positive sample pixel range detected by the feature layer;
and scaling the copied training images in equal proportion according to the positive sample pixel range detected by the feature layer.
19. The apparatus for training a face detection model according to claim 17, wherein the second determining unit is further configured to perform:
and selecting a target training image corresponding to the characteristic layer from the candidate training images corresponding to the characteristic layer.
20. The apparatus for training a face detection model according to claim 17, wherein the second determining unit is specifically configured to perform:
selecting, from the scaled training images, the training images whose face scale falls within the positive sample pixel range detected by the feature layer, and taking the selected training images as the target training images corresponding to the feature layer.
21. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the training method of the face detection model according to any one of claims 1 to 10.
22. A storage medium, wherein instructions in the storage medium, when executed by a processor of a training apparatus for a face detection model, enable the training apparatus for a face detection model to perform the training method for a face detection model according to any one of claims 1 to 10.
CN201910746118.1A 2019-08-13 2019-08-13 Training method and device of face detection model, electronic equipment and storage medium Active CN110490115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910746118.1A CN110490115B (en) 2019-08-13 2019-08-13 Training method and device of face detection model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110490115A CN110490115A (en) 2019-11-22
CN110490115B true CN110490115B (en) 2021-08-13

Family

ID=68549821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910746118.1A Active CN110490115B (en) 2019-08-13 2019-08-13 Training method and device of face detection model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110490115B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554045B (en) * 2020-04-23 2024-04-09 国家广播电视总局广播电视科学研究院 Data set manufacturing method, device, equipment and storage medium
CN111540203B (en) * 2020-04-30 2021-09-17 东华大学 Method for adjusting green light passing time based on fast-RCNN
CN111626193A (en) * 2020-05-26 2020-09-04 北京嘀嘀无限科技发展有限公司 Face recognition method, face recognition device and readable storage medium
CN112906789A (en) * 2021-02-19 2021-06-04 阳光保险集团股份有限公司 Method and device for selecting negative examples of area generation network and computer equipment
CN114926447B (en) * 2022-06-01 2023-08-29 北京百度网讯科技有限公司 Method for training a model, method and device for detecting a target

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407908A (en) * 2016-08-31 2017-02-15 广州市百果园网络科技有限公司 Training model generation method and human face detection method and device
CN107220618A (en) * 2017-05-25 2017-09-29 中国科学院自动化研究所 Method for detecting human face and device, computer-readable recording medium, equipment
CN107871134A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108694401A (en) * 2018-05-09 2018-10-23 北京旷视科技有限公司 Object detection method, apparatus and system
CN108985135A (en) * 2017-06-02 2018-12-11 腾讯科技(深圳)有限公司 A kind of human-face detector training method, device and electronic equipment
CN109376637A (en) * 2018-10-15 2019-02-22 齐鲁工业大学 Passenger number statistical system based on video monitoring image processing
CN109409210A (en) * 2018-09-11 2019-03-01 北京飞搜科技有限公司 A kind of method for detecting human face and system based on SSD frame
CN109409252A (en) * 2018-10-09 2019-03-01 杭州电子科技大学 A kind of traffic multi-target detection method based on modified SSD network
CN109711406A (en) * 2018-12-25 2019-05-03 中南大学 A kind of multidirectional image Method for text detection based on multiple dimensioned rotation anchor mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on pedestrian detection methods based on convolutional neural networks; Liu Jian; China Masters' Theses Full-text Database, Information Science and Technology; 2018-02-15; full text *
An improved Faster R-CNN algorithm for multi-scale object detection; Li Xiaoguang; Journal of Computer-Aided Design & Computer Graphics; 2019-07-31; Vol. 31, No. 7; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant