WO2020140772A1 - Face detection method, apparatus, device, and storage medium - Google Patents

Face detection method, apparatus, device, and storage medium

Info

Publication number
WO2020140772A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
feature
scale
network
face detection
Application number
PCT/CN2019/127003
Other languages
English (en)
French (fr)
Inventor
武文琦
叶泽雄
肖万鹏
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Priority to EP19906810.7A (published as EP3910551A4)
Publication of WO2020140772A1
Priority to US17/325,862 (published as US12046012B2)

Classifications

    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06F 18/25: Fusion techniques
    • G06F 18/285: Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/161: Human faces: Detection; Localisation; Normalisation
    • G06F 18/24143: Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]

Definitions

  • the present application relates to the field of image processing, and in particular, to a face detection method, device, equipment, and storage medium.
  • Face detection is an important research hotspot in the field of computer vision. Its main task is to locate the faces contained in an image.
  • An embodiment of the present application provides a face detection method, which is executed by a computing device, and the method includes:
  • the face detection model includes a multi-layer convolution network
  • if the size parameter of the face candidate region is less than a first scale condition, it is determined that the face candidate region corresponds to a small-scale face;
  • the at least two layers of convolutional networks include a first convolutional network and a second convolutional network; the feature resolution of the feature map output by the first convolutional network is suitable for the size parameter; the second convolutional network is the convolutional network of the layer adjacent to the first convolutional network, and the feature resolution of the feature map output by the second convolutional network is lower than the feature resolution of the feature map output by the first convolutional network;
  • An embodiment of the present application provides a face detection method, which is executed by a computing device, and the method includes:
  • the target scale of the face corresponding to the face candidate region is determined according to the size relationship between the size parameter of the face candidate region and a scale condition, where the target scale is one of multiple scales, and faces of different scales correspond to different detection models;
  • face detection is performed on the face candidate region according to the detection model corresponding to the face of the target scale.
  • An embodiment of the present application provides a face detection device.
  • the device includes:
  • a first determining unit configured to determine a face candidate region in the image to be detected according to a face detection model;
  • the face detection model includes a multi-layer convolution network;
  • a second determining unit configured to determine that the face candidate region corresponds to a small-scale face if the size parameter of the face candidate region is less than the first scale condition
  • the first detection unit is used to:
  • the at least two layers of convolutional networks include a first convolutional network and a second convolutional network; the feature resolution of the feature map output by the first convolutional network is suitable for the size parameter; the second convolutional network is the convolutional network of the layer adjacent to the first convolutional network, and the feature resolution of the feature map output by the second convolutional network is lower than the feature resolution of the feature map output by the first convolutional network;
  • An embodiment of the present application provides a face detection device.
  • the device includes:
  • the first determining module is used to determine the candidate face region in the image to be detected according to the face detection model
  • the second determining module is used to determine the target scale of the face corresponding to the face candidate region according to the size relationship between the size parameter of the face candidate region and the scale condition, where the target scale is one of multiple scales, and faces of different scales correspond to different detection models;
  • the detection module is configured to perform face detection on the face candidate region according to the detection model corresponding to the face of the target scale.
  • An embodiment of the present application provides a face detection device.
  • the device includes a processor and a memory:
  • the memory is used to store program code and transmit the program code to the processor
  • the processor is used to execute the face detection method described above according to the instructions in the program code.
  • An embodiment of the present application provides a computer-readable storage medium, in which a program code is stored, and the program code may be executed by a processor to implement the face detection method described above.
  • FIG. 1a is a schematic diagram of an image to be detected provided by an embodiment of this application.
  • FIG. 1b is a schematic diagram of an implementation environment of a face detection method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an exemplary scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a face detection method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for performing face detection using a first detection model provided by an embodiment of the present application
  • FIG. 5 is a schematic flowchart of a method for determining a projection feature of a first convolution network provided by an embodiment of this application;
  • FIG. 6 is a schematic flowchart of a method for determining a candidate region for a face provided by an embodiment of the present application
  • FIG. 7 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of results of a face detection model provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a detection model provided by an embodiment of this application.
  • FIG. 10b is another precision-recall curve chart provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the detection effect of the face detection method provided by the embodiment of the present application.
  • FIG. 12 is a schematic flowchart of a face detection method provided by an embodiment of the present application.
  • FIG. 13a is a schematic structural diagram of a face detection device according to an embodiment of the present application.
  • FIG. 13b is a schematic structural diagram of a face detection device according to an embodiment of the present application.
  • FIG. 14a is a schematic structural diagram of a face detection device according to an embodiment of the present application.
  • FIG. 14b is a schematic structural diagram of a face detection device according to an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a face detection device according to an embodiment of this application.
  • FIG. 16 is a schematic structural diagram of a face detection device according to an embodiment of the present application.
  • In conventional face detection methods, the detection accuracy for the small-scale face in the image is still not high.
  • In the related art, a multi-layer convolutional network can be used to extract the features of the face candidate region, and face recognition is performed based on the features output by the last layer of the convolutional network. A multi-layer convolutional network generally extracts features layer by layer: each later convolutional layer continues extraction based on the features output by the previous layer, so that its output carries more semantic information. This continued extraction in effect downsamples the features output by the previous layer.
  • As a result, the feature resolution corresponding to the features output by a later convolutional layer is lower than the feature resolution corresponding to the features output by the previous layer.
  • Because the multi-layer convolutional network uses this layer-by-layer extraction method, when it extracts the features of the face candidate region, the features output by a low-layer convolutional network generally have a higher feature resolution but carry less semantic information, while the features output by a high-layer convolutional network have a lower feature resolution but carry more semantic information.
  • For example, the first-layer convolutional network first extracts the features of the face candidate region, and the second-layer convolutional network continues to extract features based on the features output by the first layer. Relative to the second-layer convolutional network, the first-layer convolutional network is a low-level convolutional network, and the second-layer convolutional network is a high-level convolutional network relative to the first layer.
  • In view of this, in the embodiments of the present application, the features of the face candidate region output by at least two adjacent convolutional layers can be used.
  • Specifically, the features of the face candidate region output by the at least two adjacent convolutional layers may be feature-fused, the fusion feature obtained after fusion may be used as the output feature of the low-level convolutional network, and face detection may then be performed on the face candidate region in combination with the output features of the at least two adjacent convolutional layers.
  • Because the fusion feature obtained by fusion not only has the higher feature resolution embodied by the features extracted by the low-level convolutional network but also carries the semantic information carried by the features extracted by the high-level convolutional network, it is helpful for detecting small-scale faces. A minimal sketch of this resolution/semantics trade-off follows.
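  • As an illustrative aside (not from the patent itself), the following Python sketch, using PyTorch with purely illustrative layer sizes, shows why stacked convolutional stages trade resolution for semantic depth: each stride-2 stage halves the spatial resolution of its input while increasing the channel depth.

    import torch
    import torch.nn as nn

    # Three illustrative conv stages: each halves spatial resolution
    # (stride-2 downsampling) while increasing channel depth, mirroring
    # the layer-by-layer extraction described above.
    stages = nn.ModuleList([
        nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
        nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
    ])

    x = torch.randn(1, 3, 224, 224)  # a dummy image to be detected
    for i, stage in enumerate(stages, start=1):
        x = stage(x)
        # Low layers: high resolution, few channels (little semantics);
        # high layers: low resolution, many channels (richer semantics).
        print(f"layer {i}: resolution {tuple(x.shape[2:])}, channels {x.shape[1]}")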
  • FIG. 1b is a schematic diagram of an implementation environment of a face detection method provided by an embodiment of the present application.
  • the terminal device 10 and the server device 20 are communicatively connected through a network 30, and the network 30 may be a wired network or a wireless network.
  • the face detection device provided by any embodiment of the present application may be integrated on the terminal device 10 or the server device 20 to implement the face detection method provided by any embodiment of the present application.
  • For example, the terminal device 10 may directly execute the face detection method provided in any embodiment of the present application; or, the terminal device 10 may send the image to be detected to the server device 20, and the server device 20 may perform the face detection method provided in the embodiments of the present application and return the detection result to the terminal device 10.
  • the face detection model 202 may be used to determine the face candidate region 203 in the image to be detected 201.
  • the face detection model 202 may be configured on a face detection device, such as a server or another computing device that can be used to detect faces.
  • the face candidate region mentioned in the embodiments of the present application refers to a region that may contain a face in the image to be detected. It can be understood that, an image to be detected 201 may include several face candidate regions 203. One face candidate area 203 may correspond to one face.
  • the face detection model 202 mentioned in the embodiment of the present application includes a multi-layer convolutional network.
  • the embodiment of the present application does not specifically limit the number of layers of the convolutional neural network included in the face detection model 202. In FIG. 2, three layers are used as an example for description, but this does not constitute a limitation on the embodiments of the present application.
  • the number of layers of the convolutional network included in the face detection model 202 may also be other numbers.
  • the face detection model 202 may include a 5-layer convolutional network like the VGG16 network.
  • the embodiment of the present application does not specifically limit the specific implementation manner of the face detection model 202 determining the face candidate region in the image to be detected 201.
  • the face detection model 202 may extract image features of the image to be detected 201, and use the image features to determine the face candidate region 203.
  • the size parameter of the face candidate region 203 is not much different from the size parameter of the face that may be included in the face candidate region 203. Therefore, the size parameter of the face candidate region 203 can, to a certain extent, characterize the size parameter of the face contained in the face candidate region 203.
  • the embodiment of the present application does not specifically limit the size parameter.
  • the size parameter may be, for example, the area of the face candidate region 203, or, for example, the ratio of the number of pixels included in the face candidate region 203 to the number of pixels included in the feature map output by the face detection model 202.
  • the small-scale face mentioned in the embodiments of the present application refers to a face whose size parameter is less than the first scale condition. It is a concept relative to the large-scale face that can be detected by traditional face detection methods.
  • That is, the small-scale face is a general term for faces other than the large-scale faces that can be detected by the conventional face detection method.
  • the first detection model 204 is used to perform face detection on the face candidate area 203.
  • each layer of convolutional network in the face detection model 202 can extract image features of the image to be detected and output a corresponding feature map.
  • When performing face recognition on the face candidate region 203, the features of the face candidate region 203 also need to be used. Therefore, in the embodiment of the present application, when performing face recognition on the face candidate region 203, the first detection model 204 may do so in combination with the features of the face candidate region 203 that the face detection model 202 extracted from the image to be detected 201.
  • the first detection model 204 may project the face candidate region 203 onto the feature map output by the first convolution network of the face detection model 202, A first projection feature 205 is obtained, and the face candidate region 203 is projected onto a feature map output by the second convolution network of the face detection model 202, to obtain a second projection feature 206. Then, the first projection feature 205 and the second projection feature 206 are used to perform face recognition on the face candidate region 203.
  • the first projection feature is the feature of the face candidate region 203 extracted by the first convolutional network, and the second projection feature is the feature of the face candidate region 203 extracted by the second convolutional network.
  • the first convolutional network and the second convolutional network are adjacent layer convolutional networks.
  • the feature resolution of the feature map output by the first convolutional network is suitable for the size parameter of the face candidate region 203. That is to say, by using the first projection feature, the resolution requirement for performing face recognition on the face candidate region 203 can be satisfied.
  • the feature resolution of the feature map output by the second convolution network is lower than the feature resolution of the feature map output by the first convolution network.
  • the second convolutional network is a high-level convolutional network relative to the first convolutional network.
  • the semantic information carried by the feature map output by the second convolution network is higher than the semantic information carried by the feature map output by the first convolution network.
  • When the first detection model 204 uses the first projection feature 205 and the second projection feature 206 to perform face recognition on the face candidate region 203, in order to identify the face more accurately, feature fusion may be performed on the first projection feature 205 and the second projection feature 206 to obtain a fusion feature 207 that has a higher feature resolution and carries more semantic information, and the fusion feature 207 and the second projection feature 206 are then used to perform face detection on the face candidate region 203. Compared with detecting a small-scale face using conventional techniques, this improves the accuracy of face detection on the face candidate region 203, that is, the detection accuracy for the small-scale face.
  • the embodiment of the present application does not specifically limit a specific implementation manner of performing face detection on the face candidate region 203 by using the fusion feature 207 and the second projection feature 206.
  • the fusion feature 207 and the second projection feature 206 may be used as the input of the ROI pooling layer to obtain the corresponding detection result.
  • FIG. 3 is a schematic flowchart of a face detection method provided by an embodiment of the present application.
  • the face detection method provided in this embodiment of the present application may be implemented by the following steps S301-S303, for example.
  • Step S301 Determine a face candidate region in the image to be detected according to a face detection model, where the face detection model includes a multi-layer convolution network.
  • Step S302 If the size parameter of the face candidate region is less than the first scale condition, it is determined that the face candidate region corresponds to a small-scale face.
  • the first scale condition may be, for example, a first scale threshold.
  • the size parameter may be a ratio of the number of pixels included in the face candidate region to the number of pixels included in the feature map output by the face detection model.
  • that the size parameter of the face candidate region is smaller than the first scale condition may mean, for example, that the size parameter of the face candidate region is smaller than the first scale threshold.
  • That the size parameter is less than the first scale condition may mean, for example, that the ratio of the number of pixels included in the face candidate region, w_p * h_p, to the number of pixels included in the feature map output by the face detection model, w_oi * h_oi, is less than 1/10, that is, (w_p * h_p) / (w_oi * h_oi) < 1/10.
  • Here, the face candidate region can be regarded as a rectangular area: w_p is the number of pixels included in the width of the face candidate region, and h_p is the number of pixels included in its height; w_oi is the number of pixels included in the width of the feature map output by the face detection model, and h_oi is the number of pixels included in its height.
  • Step S303 Perform face detection on the face candidate region through the first detection model corresponding to the small-scale face.
  • The reason why the traditional technology cannot accurately detect small-scale faces is that the feature resolution corresponding to the features used to detect them is relatively low. Therefore, in the embodiment of the present application, when recognizing a small-scale face, the projection feature of the convolutional network whose feature resolution is suitable for the size parameter of the face candidate region is used.
  • However, the projection feature of the convolutional network whose feature resolution is suitable for the face candidate region often does not carry much semantic information. Therefore, if only that projection feature is used, the small-scale face may still not be accurately identified.
  • In contrast, the projection features of a higher-level convolutional network carry more semantic information.
  • the projection features of at least two layers of convolutional networks may be combined to perform face detection on the face candidate region.
  • the at least two layers of convolutional networks include the first convolutional network, whose projection feature can meet the resolution requirement for performing face detection on the face candidate region, and the second convolutional network, whose projection feature can meet the semantic information requirement for performing face detection on the face candidate region.
  • the first convolutional network and the second convolutional network are adjacent layer convolutional networks.
  • the embodiment of the present application does not limit other convolutional networks except the first convolutional network and the second convolutional network in the at least two layers of convolutional networks.
  • the other convolutional networks may be, for example, a high-level convolutional network adjacent to the second convolutional network.
  • When step S303 is specifically implemented, it may be implemented by the face detection method shown in FIG. 4, specifically through the following steps S401-S403.
  • Step S401 Obtain the projection feature of the face candidate region on the feature map output by at least two layers of convolutional networks in the face detection model through the first detection model.
  • The at least two layers of convolutional networks include a first convolutional network and a second convolutional network; the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter; the second convolutional network is the convolutional network of the layer adjacent to the first convolutional network, and the feature resolution of the feature map output by the second convolutional network is lower than that of the first convolutional network.
  • That the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter means that this feature resolution satisfies the resolution requirements for performing face recognition on the face candidate region.
  • When the size parameter is different, the applicable feature resolution is also different; accordingly, which layer of the multi-layer convolutional network serves as the first convolutional network corresponding to the size parameter is also different. Therefore, in the embodiment of the present application, which layer of the multi-layer convolutional network is specifically the first convolutional network can be determined according to the size parameter of the face candidate region.
  • the first convolution network corresponding to the size range may be determined according to the correspondence between the size parameter range and the number of convolutional network layers.
  • For example, suppose the face detection model includes a 5-layer convolutional network, where the five layers are the layer-1 convolutional network to the layer-5 convolutional network in order from low to high.
  • When the size parameter is a smaller size parameter, such as a first size parameter, a lower layer in the 5-layer convolutional network, such as the layer-3 convolutional network, can be determined to be the first convolutional network; when the size parameter is a size parameter larger than the first size parameter, such as a second size parameter, a convolutional network higher than the layer-3 convolutional network in the 5-layer convolutional network, such as the layer-4 convolutional network, can be determined to be the first convolutional network.
  • the features output by the high-level convolutional network carry more semantic information than the features output by the low-level convolutional network.
  • Meanwhile, the feature resolution corresponding to the features output by the high-level convolutional network is lower than the feature resolution corresponding to the features output by the low-level convolutional network. Therefore, the fact that the feature resolution of the feature map output by the second convolutional network is lower than that of the first convolutional network can characterize that the features output by the second convolutional network carry more semantic information than the features output by the first convolutional network.
  • That is, the semantic information carried by the features output by the second convolutional network can meet the semantic information requirements for performing face recognition on the face candidate region.
  • the feature map output by the convolutional neural network in the face detection model includes not only the features corresponding to the face candidate region, but also the features corresponding to other parts of the image to be detected.
  • When face detection is performed on the face candidate region, it is performed in combination with the features corresponding to the face candidate region.
  • Therefore, the face candidate region may be projected onto the feature map output by the convolutional network, to obtain the projection feature of the face candidate region on the feature map output by the convolutional network in the face detection model; the projection feature is the feature corresponding to the face candidate region. A sketch of this projection follows.
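  • As an illustrative aside (not from the patent itself), projecting an image-space region onto a convolutional feature map is commonly done by dividing the region's pixel coordinates by the cumulative stride of the layer that produced the map. The helper below is a hypothetical Python/PyTorch sketch under that assumption; the function name and stride value are illustrative.

    import torch

    def project_region(box_xyxy, feature_map, stride):
        """Project an image-space box onto a conv feature map.

        box_xyxy: (x1, y1, x2, y2) in image pixels.
        stride:   cumulative downsampling factor of the layer that
                  produced feature_map (e.g. 8 after three stride-2 stages).
        """
        x1, y1, x2, y2 = (int(round(c / stride)) for c in box_xyxy)
        # Slice out the projected area: these are the projection features.
        return feature_map[:, :, y1:y2 + 1, x1:x2 + 1]

    # e.g. a 48x48-pixel candidate region on a stride-8 feature map
    feat = torch.randn(1, 256, 28, 28)
    proj = project_region((80, 64, 128, 112), feat, stride=8)
    print(proj.shape)  # torch.Size([1, 256, 7, 7])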
  • Step S402 Fusion features obtained by fusing the projection features of the first convolutional network with the projection features of the second convolutional network are used as the projection features of the first convolutional network.
  • The projection feature of the first convolutional network mentioned in the embodiments of the present application can be understood as the feature corresponding to the projection area of the face candidate region in the feature map output by the first convolutional network; the projection feature of the second convolutional network can be understood as the feature corresponding to the projection area of the face candidate region in the feature map output by the second convolutional network. It can be understood that the projection feature of the first convolutional network corresponds to a higher feature resolution, and the projection feature of the second convolutional network carries more semantic information. Therefore, if the projection features of the first and second convolutional networks are fused, a fusion feature that has both a higher feature resolution and more semantic information can be obtained.
  • In addition, because the first convolutional network and the second convolutional network are convolutional networks of adjacent layers, their projection features have a relatively high feature correlation, so the fusion of the two has a better processing effect, which is more conducive to accurately detecting the small-scale face in the face candidate region.
  • Step S403 Perform face detection on the face candidate region according to the projection feature on the feature map output by the at least two-layer convolutional network.
  • In the embodiment of the present application, the fusion feature is used as the projection feature of the first convolutional network. Compared with the projection feature obtained by directly projecting the face candidate region onto the feature map output by the first convolutional network, the fusion feature not only has a relatively high feature resolution but also carries relatively rich semantic information. Therefore, after the fusion feature is used as the projection feature of the first convolutional network, performing face detection on the face candidate region with the projection features on the feature maps output by the at least two layers of convolutional networks can accurately detect the face in the face candidate region.
  • the aforementioned small-scale human face may be divided into several small-scale human faces of different scales.
  • the parameter range interval in which the size parameter is located may be used as a basis for dividing the small-scale human face.
  • A size parameter within a given parameter range interval is applicable to the feature resolution of the output feature map of the Nth-layer convolutional network in the multi-layer convolutional network.
  • For example, if the size parameter is within a first parameter range interval whose maximum value is relatively small, the face candidate region corresponds to the smaller-scale face among small-scale faces, and the applicable feature resolution is, for example, that of the output feature map of the layer-3 convolutional network.
  • If the size parameter is within a parameter range interval with larger values, the face candidate region corresponds to a larger-scale face among small-scale faces, such as a medium-scale face, and the applicable feature resolution is, for example, that of the output feature map of the layer-4 convolutional network.
  • the several small-scale human faces of different scales have respective corresponding first detection models, so as to realize the face detection of the small-scale human faces of the several scales.
  • For example, the smaller-scale face among the aforementioned small-scale faces corresponds to one first detection model, whose network structure is shown in FIG. 9(a); the larger-scale face among the aforementioned small-scale faces, such as a medium-scale face, corresponds to another first detection model, whose network structure is shown in FIG. 9(b).
  • For a specific description of the network structures of these first detection models, reference may be made to the description of FIG. 9 below; details are not described here.
  • For example, small-scale faces may be divided into faces of two scales: faces of one scale correspond to the first parameter range interval, for example the size parameter lies in [0, 1/100]; faces of the other scale correspond to the second parameter range interval, for example the size parameter lies in (1/100, 1/10).
  • For the description of the size parameter, please refer to the description of the relevant content in step S302 above; details are not repeated here.
  • In the embodiment of the present application, the corresponding detection model is determined based on the size parameter of the face candidate region to perform face detection on that region; that is, the corresponding first detection model is determined according to the size parameter of the face candidate region. In other words, for the several face candidate regions included in the image to be detected, the detection model corresponding to each region's size parameter can be adaptively selected to perform face detection on that region. This improves the detection accuracy for faces of different scales and accurately and effectively detects faces of various scales, instead of using the same detection model to detect all face candidate regions as in the conventional technology, where small-scale faces cannot be accurately recognized.
  • When the size parameter is greater than a second scale condition, it may be determined that the face candidate region corresponds to a large-scale face.
  • The embodiment of the present application does not specifically limit the second scale condition, and it may be determined according to actual conditions.
  • That the size parameter is greater than the second scale condition may mean, for example, that the ratio of the number of pixels included in the face candidate region, w_p * h_p, to the number of pixels included in the feature map output by the face detection model, w_oi * h_oi, is greater than 1/10, that is, (w_p * h_p) / (w_oi * h_oi) > 1/10.
  • For the description of the size parameter, please refer to the description of the relevant content in step S302 above; details are not repeated here.
  • In this case, face detection may be performed on the face candidate region through a second detection model corresponding to the large-scale face; a sketch of this scale-based model selection follows.
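  • As an illustrative aside (not from the patent itself), the following Python sketch routes a face candidate region to a detection model by its size parameter, using the example thresholds mentioned in the text (1/100 and 1/10); the function name and the treatment of exact boundary values are assumptions.

    def select_detection_model(w_p, h_p, w_oi, h_oi):
        # Size parameter: ratio of candidate-region pixels to
        # feature-map pixels, as defined in step S302.
        size_param = (w_p * h_p) / (w_oi * h_oi)
        if size_param <= 1 / 100:
            return "first detection model (a): smaller-scale face"
        elif size_param <= 1 / 10:
            return "first detection model (b): medium-scale face"
        else:
            return "second detection model: large-scale face"

    print(select_detection_model(8, 8, 320, 320))      # smaller-scale face
    print(select_detection_model(60, 60, 320, 320))    # medium-scale face
    print(select_detection_model(150, 150, 320, 320))  # large-scale face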
  • When the second detection model performs face detection on the face candidate region, in specific implementation, it can be implemented through the following steps A-B.
  • Step A Obtain, through the second detection model, the projection features of the face candidate region on the feature maps output by at least two layers of convolutional networks in the face detection model; the at least two layers of convolutional networks include a third convolutional network, and the feature resolution of the feature map output by the third convolutional network is suitable for the size parameter.
  • Step B Perform face detection on the face candidate region according to the projection features on the feature map output by the at least two-layer convolutional network.
  • Here, "the feature resolution of the feature map output by the third convolutional network is applicable to the size parameter" is similar to "the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter"; for related content, please refer to the above description of the latter, which is not repeated here.
  • In step B, the projection features on the feature maps output by at least two layers of convolutional networks in the face detection model are likewise used to perform face detection. For related content, please refer to the description of steps S301-S303 above, which is not repeated here.
  • The difference between the two is that, when face detection is performed on the face candidate region corresponding to the small-scale face, because the corresponding size parameter is relatively small, the semantic information carried by the projection feature of the first convolutional network may be insufficient. Therefore, the fusion feature obtained by fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network is used as the projection feature of the first convolutional network, to compensate for the semantic information lacking in the projection feature of the first convolutional network.
  • the third convolution network is likely to be a high-level convolution network in the multi-layer convolution network included in the face detection model.
  • The third convolutional network can not only meet the feature resolution requirements for large-scale face recognition but also itself carries more semantic information. Therefore, when performing face detection on the face candidate region corresponding to the large-scale face, there is no need to further process the projection feature of the third convolutional network; face detection is performed on the face candidate region directly according to the projection features on the feature maps output by the at least two layers of convolutional networks.
  • the image to be detected may contain multiple face candidate regions, and the face scales corresponding to the multiple face candidate regions may also be different.
  • In the traditional face detection method, only one face detection model is used for the multiple face candidate regions, and this single model performs face detection on face candidate regions corresponding to faces of all scales.
  • In the embodiment of the present application, however, the corresponding detection model can be selected for each face candidate region, and the first detection model and the second detection model can perform detection in parallel, thereby improving the efficiency of detecting faces in the image to be detected.
  • For example, when the image to be detected includes a face candidate region corresponding to a small-scale face and another corresponding to a large-scale face, the first detection model can be used to perform face detection on the former, and the second detection model can be used to detect the latter, thereby realizing recognition of faces of different scales.
  • the two can be executed at the same time, thereby improving the efficiency of face detection on the two face candidate regions.
  • In the embodiment of the present application, weight coefficients may be set for the projection features of the at least two layers of convolutional networks; the weight coefficients are used to reflect the importance of each convolutional network's projection feature in face detection.
  • For the first detection model, the projection feature of the first convolutional network is the fusion feature. Compared with the projection features of the other convolutional networks, it not only has a feature resolution suitable for the small-scale face size but also carries more semantic information. Therefore, the projection feature of the first convolutional network is more important in detecting the small-scale face than the projection features of the other convolutional networks.
  • Similarly, in the aforementioned second detection model for detecting large-scale faces, the projection feature of the third convolutional network both has a feature resolution suitable for the large-scale face size and carries more semantic information. Therefore, the projection feature of the third convolutional network is more important in detecting the large-scale face than the projection features of the other convolutional networks.
  • The first convolutional network is the convolutional network whose feature resolution is applicable to the size parameter of the face candidate region corresponding to the small-scale face, and the third convolutional network is the convolutional network whose feature resolution is applicable to the size parameter of the face candidate region corresponding to the large-scale face.
  • Therefore, in the embodiment of the present application, when setting the weight coefficients, the weight coefficient of the convolutional network whose feature resolution is suitable for the size parameter is made larger than the weight coefficients of the other convolutional networks, to characterize that the projection feature of that convolutional network is of the highest importance. In this way, among the features used to perform face detection on the face candidate region, important features account for a larger proportion, which is more helpful for accurately identifying the face in the face candidate region.
  • For the first detection model, the weight coefficient of the projection feature of the first convolutional network is higher than the weight coefficients of the projection features of the other convolutional networks, such as the second convolutional network.
  • Correspondingly, for the second detection model, the weight coefficients of the projection features of the third convolutional network are higher than the weight coefficients of the projection features of the other convolutional networks.
  • The embodiments of the present application do not specifically limit the values of the weight coefficients corresponding to the projection features of the at least two layers of convolutional networks; the values can be determined according to the actual situation.
  • the "face projection candidates on the feature map output from the at least two-layer convolutional network for the face candidates according to step S303 and step B" face detection can be performed on the face candidate regions according to the projection features on the feature map output by the at least two layers of convolutional networks and the corresponding weight coefficients.
  • the embodiment of the present application does not specifically limit the implementation method of performing face detection on the face candidate region based on the projection features on the feature map output by the at least two-layer convolutional network and the corresponding weight coefficients, as an example .
  • the projection features on the feature map output by the at least two layers of convolutional networks can be multiplied by their corresponding weight coefficients respectively, and then the projected features after being multiplied by the weight coefficients can be used as the input of the ROI pooling layer to obtain the corresponding Test results.
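  • As an illustrative aside (not from the patent itself), the following Python/PyTorch sketch scales each layer's projection feature by a scalar weight coefficient before ROI pooling; the weight values and tensor shapes are illustrative assumptions.

    import torch

    def weight_projection_features(projections, weights):
        # Multiply each projection feature by its weight coefficient;
        # the largest weight is given to the layer whose feature
        # resolution suits the candidate region's size parameter.
        return [w * p for p, w in zip(projections, weights)]

    fused = torch.randn(1, 128, 14, 14)   # fusion feature (first conv network)
    deeper = torch.randn(1, 512, 7, 7)    # projection feature (second conv network)
    weighted = weight_projection_features([fused, deeper], weights=[0.7, 0.3])
    # `weighted` would then be fed to the ROI pooling layer.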
  • FIG. 5 is a schematic flowchart of a method for determining a projection feature of a first convolution network provided by an embodiment of the present application.
  • the method may be implemented by the following steps S501-S504, for example.
  • Step S501 The first feature is obtained by reducing the number of channels in the projected feature of the first convolutional network.
  • Although the fusion feature carries the semantic information lacking in the first projection feature, when the fusion feature is used as the projection feature of the first convolutional network for the face candidate region, the computational complexity increases accordingly.
  • In view of this, in the embodiment of the present application, the number of channels of the projection feature of the first convolutional network is first reduced to obtain the first feature, and the first feature is then fused with the projection feature of the second convolutional network. Compared with directly fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network, this greatly reduces the computational complexity.
  • Step S502 The second feature is obtained by increasing the feature resolution of the projection feature of the second convolution network to be consistent with the feature resolution of the projection feature of the first convolution network.
  • Regarding step S502, it should be noted that the feature resolution of the projection feature of the second convolutional network is lower than that of the projection feature of the first convolutional network, while the feature resolution of the first feature is the same as that of the projection feature of the first convolutional network. Therefore, the feature resolution of the projection feature of the second convolutional network is lower than that of the first feature.
  • Feature fusion is performed pixel by pixel. Therefore, in the embodiment of the present application, before fusing the first feature with the projection feature of the second convolutional network, the projection feature of the second convolutional network needs to be processed so that the feature resolution of the processed feature is the same as that of the first feature.
  • the feature resolution of the projection feature of the second convolutional network may be increased to be consistent with the feature resolution of the projection feature of the first convolutional network to obtain the second feature.
  • The projection feature of the second convolutional network is obtained by taking the projection feature of the first convolutional network as the input of the second convolutional network and performing downsampling in the second convolutional network. Therefore, in the embodiment of the present application, the projection feature of the second convolutional network may be upsampled to obtain the second feature, whose feature resolution is consistent with that of the projection feature of the first convolutional network.
  • Step S503 Perform pixel addition operation on the first feature and the second feature to obtain the fusion feature.
  • the feature resolution of the first feature is the same as the feature resolution of the second feature, and therefore, feature fusion may be performed on the first feature and the second feature.
  • the pixel addition operation mentioned in the embodiment of the present application refers to adding the feature of each pixel in the first feature to the feature of the corresponding pixel in the second feature.
  • Step S504 Use the fusion feature as the projection feature of the first convolutional network.
  • Since the fusion feature is obtained by performing a pixel addition operation on the first feature and the second feature, each pixel in the fusion feature carries both the feature information of the first feature and the feature information of the second feature. Therefore, the fusion feature not only has a higher feature resolution but also carries more semantic information. A sketch of steps S501-S504 follows.
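  • As an illustrative aside (not from the patent itself), the following Python/PyTorch sketch implements steps S501-S504 under stated assumptions: a 1x1 convolution for the channel reduction in S501 (the text does not fix the operator), bilinear interpolation for the upsampling in S502, and an extra 1x1 convolution to give the second feature the same channel count so the pixel addition in S503 is well defined.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ProjectionFusion(nn.Module):
        def __init__(self, c1, c2, c_out):
            super().__init__()
            self.reduce = nn.Conv2d(c1, c_out, kernel_size=1)  # S501
            self.align = nn.Conv2d(c2, c_out, kernel_size=1)   # assumption

        def forward(self, proj1, proj2):
            first = self.reduce(proj1)              # S501: fewer channels
            second = F.interpolate(                 # S502: match resolution
                self.align(proj2), size=first.shape[2:],
                mode="bilinear", align_corners=False)
            return first + second                   # S503/S504: pixel addition

    fusion = ProjectionFusion(c1=256, c2=512, c_out=128)
    out = fusion(torch.randn(1, 256, 14, 14), torch.randn(1, 512, 7, 7))
    print(out.shape)  # torch.Size([1, 128, 14, 14])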
  • The following describes, with reference to the drawings, an implementation of "determining a face candidate region in an image to be detected according to a face detection model" in step S301.
  • In the conventional technology, when determining face candidate regions, they can be determined by uniformly generating anchor frames in the image to be detected.
  • generating an anchor frame means that a certain pixel in the image to be detected is used as the center point of the anchor frame to generate a pixel frame including several pixels.
  • The number of face candidate regions determined by the traditional method is relatively large, so the number of regions to be processed when the face detection model performs face detection on the image to be detected is relatively large, and the efficiency of face detection on the image to be detected is relatively low.
  • In the embodiment of the present application, the method for determining face candidate regions is improved so that the number of determined face candidate regions is reduced, thereby improving the efficiency of face detection on the image to be detected.
  • FIG. 6 is a schematic flowchart of a method for determining a face candidate region provided by an embodiment of the present application.
  • the method may be implemented by the following steps S601-S604, for example.
  • Step S601 Acquire the face interest region in the image to be detected.
  • The face interest region mentioned here is a concept similar to the face candidate region; both refer to a region that may contain a face.
  • the face interest region may be used to determine a face candidate region.
  • the embodiment of the present application does not specifically limit the implementation manner of acquiring the region of interest of the face.
  • a cascade boosting-based face detector can be used to obtain the face interest region in the image to be detected.
  • Step S602 Project the face interest region onto the feature map output by the face detection model to obtain a first feature map.
  • Step S603 Generate anchor frames on the first feature map to obtain a second feature map; in the process of generating the anchor frames, if the center point of the target anchor frame does not overlap with the face interest region, increase the window step size of the target anchor frame.
  • the feature map output by the face detection model may be the feature map output by the last layer of the convolution network in the multi-layer convolution network included in the face detection model.
  • The first feature map can be understood as the image features, in the feature map output by the face detection model, that correspond to the face interest region.
  • Because the face interest region is relatively likely to be a face candidate region, when the face interest region and the image features extracted by the face detection model are combined to determine the face candidate region, the image features corresponding to the face interest region can be analyzed with emphasis.
  • In the embodiment of the present application, the face candidate region is determined by generating anchor frames on the first feature map. Specifically, since the face interest region is relatively likely to be a face candidate region, anchor frames can be generated uniformly where the center point of the anchor frame overlaps the face interest region. The area outside the face interest region is less likely to be a face candidate region; in view of this, in order to reduce the number of determined face candidate regions, when the center point of the target anchor frame does not overlap with the face interest region, the window step size of the target anchor frame may be increased. That is to say, over the entire image to be detected, the generated anchor frames are non-uniformly distributed.
  • Outside the face interest region, the distribution density of anchor frames is lower than within the face interest region, thereby reducing the number of anchor frames and correspondingly reducing the number of determined face candidate regions.
  • For example, within the face interest region, anchor frames can be generated uniformly with a window step size of 1.
  • Outside the face interest region, the window step size of the target anchor frame can be set to 2.
  • In addition, the aspect ratios of each anchor frame can be set to 1:1 and 1:2, and the anchor frame sizes can be set to include 128² and 256² pixels. A sketch of this non-uniform anchor generation follows.
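  • As an illustrative aside (not from the patent itself), the following Python sketch places anchor centers with step size 1 inside the face interest region and step size 2 outside it; the membership test in_roi and the use of a single row step are simplifying assumptions.

    def generate_anchor_centers(width, height, in_roi):
        # Walk the feature map; sample densely (step 1) inside the
        # face interest region and sparsely (step 2) elsewhere.
        centers = []
        for y in range(height):
            x = 0
            while x < width:
                centers.append((x, y))
                x += 1 if in_roi(x, y) else 2
        return centers

    # Each center then carries anchors with aspect ratios 1:1 and 1:2
    # and sizes 128^2 and 256^2 pixels, as described above.
    roi = lambda x, y: 10 <= x < 20 and 10 <= y < 20
    print(len(generate_anchor_centers(40, 40, roi)))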
  • Step S604 Calculate the face candidate region in the second feature map according to the loss functions of multiple face detection tasks, and use the determined face candidate region as the face candidate region of the image to be detected.
  • In specific implementation, the second feature map may be used as the input of the loss function, so as to determine the face candidate region, and the determined face candidate region is used as the face candidate region of the image to be detected.
  • the loss function may be obtained through joint training based on the multiple face detection tasks. Among them, the multiple tasks have a high correlation.
  • the multiple face detection tasks include a classification task for a face target, a position regression task for a face target frame, and a position regression task for key points of a face.
  • The classification task for face targets refers to distinguishing faces from non-faces.
  • The position regression task for face target frames refers to regressing the position of the face target frame under the premise that a face is detected.
  • The position regression task for key points of the face refers to detecting, under the premise that a face is detected, the positions of key points on the face, for example, any one or a combination of the nose, eyes, mouth, and eyebrows.
  • the classification task for the face target and the position regression task for the face target frame may be considered necessary.
  • As for the position regression task for key points of the face, although it has a relatively high correlation with face detection, it is not necessary.
  • In view of this, in the embodiment of the present application, the classification task for the face target and the position regression task for the face target frame are used as main tasks, the position regression task for key points of the face is used as an auxiliary task, and the corresponding loss function is trained accordingly.
  • The loss function obtained based on the aforementioned main-task and auxiliary-task training can be expressed by the following formula (1):

$$L = \frac{1}{N}\sum_{i=1}^{N} L_{cls}\big(y_i^{r}, f(x_i; w^{r})\big) + \frac{1}{N}\sum_{i=1}^{N} L_{reg}\big(y_i^{r}, f(x_i; w^{r})\big) + \lambda_{a}\,\frac{1}{N}\sum_{i=1}^{N} L_{lmk}\big(y_i^{a}, f(x_i; w^{a})\big) + \Phi(w) \tag{1}$$

  • Formula (1) consists of four parts added together: the first part is the loss function of the aforementioned classification task for face targets, the second part is the loss function of the position regression task for the face target box, the third part is the loss function of the position regression task for face key points, and the fourth part is the weight term.
  • Since the first two parts are similar to traditional loss functions, they are not described in detail here; it need only be emphasized that the superscript r in the first and second parts, as in $w^{r}$, denotes the main tasks. In the third part, the superscript a, as in $w^{a}$, denotes the auxiliary task, that is, the position regression task for face key points; the subscript i denotes the index of the input data; N denotes the total amount of data; $\lambda_{a}$ denotes the importance coefficient of the a-th auxiliary task; x denotes an input sample; y denotes the actual output corresponding to the input sample; and $f(x_i; w^{a})$ denotes the model's predicted result for the input sample.
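  • The following PyTorch-style sketch mirrors the structure of formula (1); the choice of cross-entropy and smooth-L1 for the individual terms, the value of lambda_a, and the L2 form of the weight term are assumptions for illustration, not values given in the text.

```python
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_targets, box_preds, box_targets,
                   lmk_preds, lmk_targets, model, lambda_a=0.5,
                   weight_decay=5e-4):
    # Main task 1: face / non-face classification (first term of (1)).
    loss_cls = F.cross_entropy(cls_logits, cls_targets)
    # Main task 2: face target box position regression (second term).
    loss_box = F.smooth_l1_loss(box_preds, box_targets)
    # Auxiliary task: face key-point position regression (third term),
    # weighted by the importance coefficient lambda_a.
    loss_lmk = F.smooth_l1_loss(lmk_preds, lmk_targets)
    # Weight term (fourth term), realized here as an L2 penalty (assumption).
    reg = sum(p.pow(2).sum() for p in model.parameters())
    return loss_cls + loss_box + lambda_a * loss_lmk + weight_decay * reg
```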
  • the training process provided by the embodiment of the present application can ensure the convergence of the model while improving the accuracy of face detection through face key point detection.
  • the face detection model is obtained through training.
  • In training the face detection model, the present invention uses 60k iterations of stochastic gradient descent (SGD) to fine-tune the model; the initial learning rate is set to 0.001 and is reduced to 0.0001 after 20k iterations.
  • The momentum and weight decay are set to 0.9 and 0.0005, respectively, and the mini-batch size is set to 128.
  • Hard example mining refers to sorting all negative samples by their highest confidence score and selecting only the highest-scoring negative samples, iterating this process until the ratio of positive to negative samples is 1:3.
  • Such a hard example mining method can accelerate network optimization and make the network training process more stable.
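  • A minimal sketch of this mining step (the function name is hypothetical): negatives are ranked by confidence and only the hardest ones are kept, at three negatives per positive.

```python
import torch

def mine_hard_negatives(neg_scores, num_pos, neg_pos_ratio=3):
    """Sort negative samples by confidence score and keep only the
    highest-scoring (hardest) ones so that negatives : positives = 3 : 1."""
    num_keep = min(neg_scores.numel(), neg_pos_ratio * num_pos)
    _, order = torch.sort(neg_scores, descending=True)
    return order[:num_keep]   # indices of the retained negative samples
```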
  • The data augmentation processing may include the following three cases: (1) flipping the original image; (2) randomly sampling a sample patch, with the scale of each sample set within [0.5, 1] of the original image and the aspect ratio of the rectangular box set within [0.5, 2], thereby generating new training samples; (3) randomly cropping the original image.
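  • A compact sketch of these three augmentation cases follows; the function name is hypothetical and the input is assumed to be an HxWxC array.

```python
import random

def augment(image):
    """Apply one of the three augmentation cases to an HxWxC array."""
    h, w = image.shape[:2]
    case = random.choice((1, 2, 3))
    if case == 1:                        # (1) flip the original image
        return image[:, ::-1]
    if case == 2:                        # (2) sample a patch with scale in
        scale = random.uniform(0.5, 1)   #     [0.5, 1] and aspect ratio
        ratio = random.uniform(0.5, 2)   #     (h / w) in [0.5, 2]
        ph = max(1, min(h, int(h * scale * ratio ** 0.5)))
        pw = max(1, min(w, int(w * scale / ratio ** 0.5)))
    else:                                # (3) plain random crop
        ph, pw = random.randint(1, h), random.randint(1, w)
    y0 = random.randint(0, h - ph)
    x0 = random.randint(0, w - pw)
    return image[y0:y0 + ph, x0:x0 + pw]
```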
  • FIG. 7 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • the scenario shown in FIG. 7 includes two first detection models, namely a first detection model (a) and a first detection model (b).
  • the size parameter in the size range interval applicable to the first detection model (a) is smaller than the size parameter in the size range interval applicable to the first detection model (b).
  • the method shown in FIG. 7 can be implemented through the following steps S701-S709.
  • Step S701: Acquire the face region of interest in the image to be detected based on the cascade Boosting face detector.
  • Step S702: Project the face region of interest onto the feature map output by the face detection model to obtain a first feature map.
  • Step S703: Generate anchor boxes on the first feature map to obtain a second feature map.
  • Step S704: Determine whether the center point of the target anchor box overlaps the face region of interest; if yes, perform step S705a; if no, perform step S705b.
  • Step S705a: Set the window stride to 1.
  • Step S705b: Set the window stride to 2.
  • Step S706: Compute the face candidate region in the second feature map according to the loss functions of multiple face detection tasks.
  • Step S707: Determine whether the size parameter of the face candidate region is smaller than the first scale condition.
  • If the size parameter of the face candidate region is smaller than the first scale condition, step S708 is performed; otherwise, it can be determined that the face candidate region corresponds to a large-scale face, and the second detection model then performs face detection on the face candidate region.
  • Step S708: Determine whether the size parameter of the face candidate region is within the first parameter range interval.
  • If the size parameter of the face candidate region is within the first parameter range interval, it is determined that the face candidate region corresponds to a smaller-scale face among the small-scale faces, and the first detection model (a) is used to perform face detection on the region; if it is not within the first parameter range interval, it is determined that the region corresponds to a larger-scale face among the small-scale faces, and the first detection model (b) is used to perform face detection on the region.
  • Step S709 Combine the detection results.
  • the detection results of the first detection model (a), the first detection model (b), and the second detection model are combined to realize the detection of human faces at various scales in the image to be detected.
  • The first detection model (a), the first detection model (b), and the second detection model can run in parallel, that is, at most three face candidate regions can be detected at the same time, which improves the efficiency of face recognition in the image to be detected.
  • FIG. 8 is a schematic structural diagram of a face detection model provided by an embodiment of the present application.
  • The face detection model shown in FIG. 8 adopts a network structure similar to VGG16, which includes five convolutional networks, namely conv1, conv2, conv3, conv4, and conv5.
  • conv1 includes two convolutional layers, 801 and 802; conv2 includes two convolutional layers, 803 and 804; conv3, conv4, and conv5 each include three convolutional layers, shown as 805-813 in FIG. 8.
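  • A sketch of this backbone is given below; the channel widths follow the standard VGG16 and are assumptions, since the text only specifies the number of convolutional layers per stage.

```python
import torch.nn as nn

def _block(in_ch, out_ch, n_convs):
    """A stage of n_convs 3x3 conv+ReLU layers followed by 2x2 max-pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# conv1..conv5 with 2, 2, 3, 3, 3 convolutional layers (801-813 in FIG. 8).
backbone = nn.ModuleDict({
    "conv1": _block(3, 64, 2),
    "conv2": _block(64, 128, 2),
    "conv3": _block(128, 256, 3),
    "conv4": _block(256, 512, 3),
    "conv5": _block(512, 512, 3),
})
```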
  • The cascade Boosting detector 814 can be used to obtain the face region of interest; the face region of interest is then projected onto the feature map output by the convolutional network to obtain the first feature map (not shown in FIG. 8); anchor boxes are generated on the first feature map to obtain a second feature map (not shown in FIG. 8), and the second feature map is used as the input of the loss function layer 815 to obtain the face candidate region.
  • The loss functions of the loss function layer 815 include the loss function softmax for the classification task for face targets, the loss function bbox regressor for the position regression task for the face target box, and the loss function landmark regressor for the position regression task for face key points.
  • The projection features of three convolutional networks are used to perform face detection on the face candidate region; specifically, the projection features of conv3, conv4, and conv5 are used.
  • The projection feature 816 of conv3, the projection feature 817 of conv4, and the projection feature 818 of conv5 are input into the ROI pooling layer, which processes the projection features 816, 817, and 818 to obtain the feature 819; the feature 819 is then normalized to obtain the feature 820, and finally the feature 820 is input into two fully connected layers (FC layers for short) to obtain the face detection result.
  • The detection results include whether it is a face (corresponding to the classification result 821 of the classification task for face targets in FIG. 8) and the position of the face box (the result 822 of the position regression task for the face target box in FIG. 8).
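  • A sketch of this head is shown below; all layer sizes, and the use of torchvision's roi_pool, are assumptions for illustration rather than values given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool

class DetectionHead(nn.Module):
    """ROI-pool the (already fused/weighted) projection feature, normalize
    it (feature 820), then two FC layers feeding a face/non-face score (821)
    and a face box position regression (822)."""
    def __init__(self, in_channels=512, pool=7, hidden=1024):
        super().__init__()
        self.pool = pool
        self.fc1 = nn.Linear(in_channels * pool * pool, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.cls = nn.Linear(hidden, 2)   # face / non-face
        self.box = nn.Linear(hidden, 4)   # face box position

    def forward(self, feat, rois, spatial_scale):
        # rois: (K, 5) tensor whose first column is the batch index.
        x = roi_pool(feat, rois, (self.pool, self.pool), spatial_scale)
        x = F.normalize(x.flatten(1), dim=1)      # normalization step (820)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.cls(x), self.box(x)           # results 821 and 822
```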
  • the "smaller face in the small-scale face” described in step S709 is called a small-scale face
  • the "larger-scale face in the small-scale face described in step S709" Is called a mesoscale human face.
  • FIG. 9 is a schematic structural diagram of a detection model provided by an embodiment of the present application.
  • FIG. 9 shows a schematic diagram of recognizing small-scale human faces using the projection features of conv3, conv4, and conv5.
  • In FIG. 9(a), conv3 is the first convolutional network and conv4 is the second convolutional network.
  • The projection feature of conv3_3 is channel-reduced by a 1×1 convolutional layer (the 1×1 conv shown in (a)). Between conv3_3 and conv4_3 there are two convolutional layers, conv4_1 and conv4_2 (i.e., 808 and 809 shown in FIG. 8).
  • conv3_3 represents the third convolutional layer of conv3, that is, the convolutional layer 807 shown in FIG. 8; conv4_3 represents the third convolutional layer of conv4, that is, the convolutional layer 810 shown in FIG. 8; and conv5_3 represents the third convolutional layer of conv5, that is, the convolutional layer 813 shown in FIG. 8.
  • FIG. 9(b) is used to detect medium-scale faces, so the convolutional network whose feature resolution matches their size parameter is conv4; hence, in (b), conv4 is the first convolutional network and conv5 is the second convolutional network.
  • The specific values of the weight coefficients α_small, β_small, and γ_small in (c) can be determined according to the actual situation. For example, if the feature resolution of conv5 is applicable to the size parameter of the face of the larger scale, the weight coefficient corresponding to conv5 may be set to the maximum; if it is not, then, since the projection feature of conv4 carries more semantic information than that of conv3, the weight coefficient corresponding to conv4 can be set to the maximum, so that the detected face is more accurate.
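  • A minimal sketch of this weighted combination follows; the coefficient values shown are purely illustrative assumptions, not values from the text.

```python
def weighted_projections(feats, weights):
    """Weight the conv3/conv4/conv5 projection features before they are
    pooled together; the network whose feature resolution suits the
    candidate's size parameter receives the largest coefficient."""
    return [w * f for w, f in zip(weights, feats)]

# Illustrative coefficients for the small-face branch: the fused conv3
# projection feature dominates (the ordering here is an assumption).
alpha_small, beta_small, gamma_small = 0.6, 0.3, 0.1
```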
  • FIG. 10a shows the precision-recall curves obtained by performing face detection on a verification set using the face detection method provided by the embodiment of the present application and traditional face detection methods.
  • The verification set may include multiple images, which may, for example, contain faces of different scales; the images in the verification set can be used to examine the detection performance of the face detection model during the iterative training process.
  • the curve 1 in Figure 10a is the precision-recall curve obtained by performing face detection on the verification set using the ACF-WIDER face detection method
  • the curve 2 in Figure 10a is the precision-recall curve obtained by performing face detection on the verification set using the two-stage-CNN face detection method
  • the curve 3 in Figure 10a is the precision-recall curve obtained by performing face detection on the verification set using the Faceness-WIDER face detection method
  • the curve 4 in Figure 10a is the precision-recall curve obtained by performing face detection on the verification set using the Multiscale Cascade CNN face detection method
  • the curve 5 in Fig. 10a is the precision-recall curve obtained by performing face detection on the verification set using the LDCF+ face detection method
  • the curve 6 in Figure 10a is the precision-recall curve obtained by performing face detection on the verification set using the Multitask Cascade CNN face detection method
  • the curve 7 in Figure 10a is the precision-recall curve obtained by performing face detection on the verification set using the CMS-RCNN face detection method
  • the curve 8 in Fig. 10a is the precision-recall curve obtained by performing face detection on the verification set using the HR face detection method
  • the curve 9 in FIG. 10a is a precision-recall curve obtained by performing face detection on the verification set using the face detection method provided by the embodiment of the present application.
  • As can be seen from FIG. 10a, at the same recall rate, the face detection method provided by the embodiment of the present application has higher face detection accuracy, and at the same detection accuracy, it has a higher recall rate.
  • That is, the face detection method provided by the embodiments of the present application performs better than the traditional face detection methods in terms of both detection accuracy and recall rate, and the detection accuracy and recall rate of the face detection model in the embodiments of the present application during the iterative process are relatively high.
  • FIG. 10b shows the precision-recall curves obtained by performing face detection on the test set used in the face detection model training process, using the face detection method provided by the embodiment of the present application and traditional face detection methods.
  • The test set may include multiple images, which may, for example, contain faces of different scales, and can be used to examine the detection performance of the trained face detection model.
  • the curve 1 in Figure 10b is the precision-recall curve obtained by performing face detection on the test set using the ACF-WIDER face detection method
  • the curve 2 in Figure 10b is the precision-recall curve obtained by performing face detection on the test set using the two-stage-CNN face detection method
  • the curve 3 in Fig. 10b is the precision-recall curve obtained by performing face detection on the test set using the Faceness-WIDER face detection method
  • the curve 4 in Figure 10b is the precision-recall curve obtained by performing face detection on the test set using the Multiscale Cascade CNN face detection method
  • the curve 5 in Figure 10b is the precision-recall curve obtained by performing face detection on the test set using the LDCF+ face detection method
  • the curve 6 in Figure 10b is the precision-recall curve obtained by performing face detection on the test set using the Multitask Cascade CNN face detection method
  • Curve 7 in Figure 10b is the precision-recall curve obtained by performing face detection on the test set using the CMS-RCNN face detection method
  • the curve 8 in Figure 10b is the precision-recall curve obtained by performing face detection on the test set using the HR face detection method
  • the curve 9 in FIG. 10b is a precision-recall curve obtained by performing face detection on the test set using the face detection method provided by the embodiment of the present application.
  • As can be seen from FIG. 10b, at the same recall rate, the face detection method provided by the embodiment of the present application has higher face detection accuracy, and at the same detection accuracy, it has a higher recall rate.
  • That is, the face detection method provided by the embodiments of the present application performs better than the traditional face detection methods in terms of both detection accuracy and recall rate, and the trained face detection model used in the embodiments of the present application has relatively high precision and recall when performing face detection on the image to be detected.
  • In other words, whether during the iterative training of the face detection model or when using the trained face detection model for face recognition, the approach provided by the embodiments of the present application has higher accuracy and a higher recall rate than traditional face detection methods.
  • the aforementioned verification set and test set are all image sets containing multiple images.
  • The images in the verification set (or test set) may contain faces of multiple scales, and the face detection method provided in the embodiment of the present application can effectively detect faces at all scales in images containing multi-scale faces, which can be understood in conjunction with FIG. 11.
  • FIG. 11 shows the detection effect of the face detection method provided in the embodiment of the present application.
  • Each small box in FIG. 11 represents a recognized face. It can be seen from FIG. 11 that, using the face detection method provided by the embodiment of the present application, faces at various scales can be detected; for example, in the image in the upper-left corner of FIG. 11, both the small-scale faces near the stairs and the large-scale faces of the people sitting on the sofa can be accurately detected.
  • FIG. 12 is a schematic flowchart of another face detection method provided by an embodiment of the present application. This method can be implemented by the following steps S1201-S1203, for example.
  • Step S1201 Determine a face candidate region in the image to be detected according to the face detection model.
  • the face detection model mentioned here may be the same as the face detection model mentioned in step S301 of the foregoing embodiment, and the face detection model may include a multi-layer convolution network.
  • the implementation of determining the face candidate region in the image to be detected according to the face detection model is the same as “determining the face candidate region in the image to be detected according to the face detection model” in step S301 of the foregoing embodiment, and reference may be made to the foregoing implementation For example, the description of related content in step S301 will not be repeated here.
  • Step S1202: Determine the target scale of the face corresponding to the face candidate region according to the size relationship between the size parameter of the face candidate region and the scale condition; the target scale is one of multiple scales, and faces of different scales correspond to different detection models.
  • Step S1203 Perform face detection on the face candidate region according to the detection model corresponding to the face of the target scale.
  • the face detection method provided in the embodiment of the present application can detect faces of multiple scales in the image to be detected.
  • Considering that the size parameter of the face candidate region can, to a certain extent, characterize the size of the face it contains, the target scale of the face corresponding to the face candidate region can be determined from the size relationship between the region's size parameter and the scale condition.
  • the target scale may be, for example, a small scale or a large scale.
  • If the size parameter of the face candidate region is less than or equal to the first scale condition, it is determined that the target scale of the face corresponding to the region is a small scale; if the size parameter is greater than the second scale condition, it is determined that the target scale is a large scale.
  • In the embodiment of the present application, multiple face detection models are included, which are used to detect faces of various scales; therefore, after the target scale of the face corresponding to the face candidate region is determined, the detection model corresponding to the face of the target scale can be used to perform face detection on the region, as sketched below.
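  • A minimal dispatch sketch under the interval values used elsewhere in the document; the function name and dictionary keys are hypothetical.

```python
def pick_detector(size_param, detectors, t_small=1/100, t_mid=1/10):
    """Route a face candidate region to a detection model by its size
    parameter (candidate pixels / feature-map pixels). The thresholds
    mirror the intervals [0, 1/100] and (1/100, 1/10) discussed in the
    text; above 1/10 the region is treated as a large-scale face."""
    if size_param <= t_small:
        return detectors["small"]    # first detection model (a)
    if size_param <= t_mid:
        return detectors["medium"]   # first detection model (b)
    return detectors["large"]        # second detection model
```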
  • If the target scale of the face corresponding to the face candidate region is a small scale, the first detection model mentioned in the foregoing embodiment may be used to detect the face candidate region; for the specific implementation of detecting small-scale faces with the first detection model, reference may be made to the description in the foregoing embodiment, which is not repeated here.
  • If the target scale is a large scale, the second detection model mentioned in the foregoing embodiment may be used to detect the face candidate region; that is, the second detection model may be used to detect large-scale faces.
  • In addition, the small scale can be further subdivided into multiple small scales of different sizes, and the small-scale faces of these different scales have their own corresponding first face detection models, so as to realize face detection for small-scale faces of these scales.
  • The image to be detected may include multiple face candidate regions.
  • For each face candidate region, the methods of steps S1202-S1203 may be performed to carry out face detection, so that multiple face detection results of the multiple face candidate regions can be obtained respectively.
  • In this way, for each face candidate region, the corresponding detection model can be selected to perform face detection, thereby realizing the recognition of faces of different scales.
  • This embodiment provides a face detection apparatus 1300. The apparatus 1300 includes a first determining unit 1301, a second determining unit 1302, and a first detection unit 1303.
  • the first determining unit 1301 is configured to determine a face candidate region in the image to be detected according to a face detection model; the face detection model includes a multi-layer convolution network;
  • the second determining unit 1302 is configured to determine that the face candidate region corresponds to a small-scale face if the size parameter of the face candidate region is less than the first scale condition;
  • The first detection unit 1303 is configured to perform face detection on the face candidate region through the first detection model corresponding to the small-scale face, including: acquiring, through the first detection model, projection features of the face candidate region on feature maps output by at least two convolutional networks in the face detection model; the at least two convolutional networks include a first convolutional network and a second convolutional network, the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter, the adjacent-layer convolutional network of the first convolutional network is the second convolutional network, and the feature resolution of the feature map output by the second convolutional network is lower than that of the feature map output by the first convolutional network; using the fusion feature, obtained by fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network, as the projection feature of the first convolutional network; and performing face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks.
  • the device 1300 further includes: a third determination unit 1304 and a second detection unit 1305.
  • The third determining unit 1304 is configured to determine that the face candidate region corresponds to a large-scale face if the size parameter of the face candidate region is greater than the second scale condition.
  • The second detection unit 1305 is configured to perform face detection on the face candidate region through a second detection model corresponding to the large-scale face, including: acquiring, through the second detection model, projection features of the face candidate region on feature maps output by at least two convolutional networks in the face detection model, where the at least two convolutional networks include a third convolutional network whose output feature map has a feature resolution applicable to the size parameter; and performing face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks.
  • The at least two convolutional networks are respectively provided with weight coefficients, and the weight coefficient of the convolutional network whose feature resolution is applicable to the size parameter is greater than the weight coefficients of the other convolutional networks; performing face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks then includes performing face detection according to the projection features and their respective weight coefficients.
  • Using the fusion feature obtained by fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network as the projection feature of the first convolutional network includes: reducing the number of channels in the projection feature of the first convolutional network to obtain a first feature; raising the feature resolution of the projection feature of the second convolutional network to be consistent with that of the first convolutional network's projection feature to obtain a second feature; performing a pixel-wise addition on the first feature and the second feature to obtain the fusion feature; and using the fusion feature as the projection feature of the first convolutional network.
  • The first determining unit 1301 is specifically configured to: acquire a face region of interest in the image to be detected; project the face region of interest onto the feature map output by the face detection model to obtain a first feature map; generate anchor boxes on the first feature map to obtain a second feature map, where, in the process of generating anchor boxes, if the center point of a target anchor box does not overlap the face region of interest, the window stride of the target anchor box is increased; and compute the face candidate region in the second feature map according to the loss functions of multiple face detection tasks, using the determined face candidate region as the face candidate region of the image to be detected.
  • The multiple face detection tasks include a classification task for face targets, a position regression task for the face target box, and a position regression task for face key points, and the loss functions of the multiple face detection tasks are trained in the following way: the classification task for face targets and the position regression task for the face target box are used as main tasks, and the position regression task for face key points is used as an auxiliary task, to jointly train the corresponding loss functions.
  • In summary, the face candidate region in the image to be detected is determined according to the face detection model that includes a multi-layer convolutional network, and whether the face candidate region corresponds to a small-scale face is determined according to the region's size parameter. If it does, face detection is performed on the region through the first detection model for recognizing small-scale faces: the projection features of the face candidate region on the feature maps output by at least two convolutional networks in the face detection model are acquired, where the at least two convolutional networks include a first convolutional network and a second convolutional network.
  • The first convolutional network is determined based on the size parameter of the face candidate region, so the feature resolution of its output feature map is relatively high and suitable for detecting a face candidate region with that size parameter; the second convolutional network is the adjacent-layer convolutional network of the first, and although its feature resolution is not as high as the first's, its output features carry more semantic information. The fusion feature obtained by fusing the two networks' projection features therefore has both a relatively high feature resolution and relatively rich semantic information, which helps accurately detect the small-scale face in the face candidate region.
  • This embodiment provides a face detection apparatus 1400. The apparatus 1400 includes a first determining module 1401, a second determining module 1402, and a detection module 1403.
  • the first determining module 1401 is configured to determine a face candidate region in the image to be detected according to the face detection model
  • the second determining module 1402 is configured to determine the target scale of the face corresponding to the face candidate area according to the size relationship between the size parameter of the face candidate area and the scale condition, the target scale being among multiple scales One, faces of different scales correspond to different detection models;
  • the detection module 1403 is configured to perform face detection on the face candidate region according to the detection model corresponding to the face of the target scale.
  • The second determining module 1402 is specifically configured to: if the size parameter of the face candidate region is less than or equal to the first scale condition, determine that the target scale of the face corresponding to the face candidate region is a small scale; and if the size parameter of the face candidate region is greater than the second scale condition, determine that the target scale of the face corresponding to the face candidate region is a large scale.
  • the image to be detected includes a plurality of face candidate regions.
  • the device 1400 further includes: an obtaining unit 1404 and a merging unit 1405.
  • An obtaining unit 1404 configured to respectively obtain a plurality of face detection results of the plurality of face candidate regions
  • the merging unit 1405 is configured to merge the multiple face detection results as the face detection results of the image to be detected.
  • the corresponding detection model can be selected to perform face detection, thereby realizing the recognition of faces of different scales.
  • an embodiment of the present application also provides a face detection device.
  • the face detection device will be described below with reference to the drawings.
  • an embodiment of the present application provides a face detection device 1500.
  • The device 1500 may be a computing device such as a server, and may vary considerably in configuration or performance; it may include one or more central processing units (CPUs) 1522 (for example, one or more processors), memory 1532, and one or more storage media 1530 (for example, one or more mass storage devices) for storing application programs 1542 or data 1544.
  • the memory 1532 and the storage medium 1530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the central processor 1522 may be configured to communicate with the storage medium 1530 and execute a series of instruction operations in the storage medium 1530 on the face detection device 1500 to implement the face detection method described in any embodiment of the present application .
  • The face detection device 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input/output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • the method for performing the face detection described in FIGS. 2 to 9 in the above embodiment may be based on the server structure shown in FIG. 15.
  • The CPU 1522 is used to perform the following steps:
  • determining a face candidate region in the image to be detected according to a face detection model, the face detection model including a multi-layer convolutional network;
  • if the size parameter of the face candidate region is smaller than the first scale condition, determining that the face candidate region corresponds to a small-scale face;
  • performing face detection on the face candidate region through the first detection model corresponding to the small-scale face, including: acquiring, through the first detection model, projection features of the face candidate region on feature maps output by at least two convolutional networks in the face detection model, where the at least two convolutional networks include a first convolutional network and a second convolutional network, the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter, the adjacent-layer convolutional network of the first convolutional network is the second convolutional network, and the feature resolution of the feature map output by the second convolutional network is lower than that of the first convolutional network; using the fusion feature obtained by fusing the two networks' projection features as the projection feature of the first convolutional network; and performing face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks.
  • the implementation of the face detection method described in FIG. 12 in the above embodiment may be based on the server structure shown in FIG. 15.
  • the CPU 1522 is used to perform the following steps:
  • determining the target scale of the face corresponding to the face candidate region according to the size relationship between the size parameter of the face candidate region and the scale condition, the target scale being one of multiple scales, with faces of different scales corresponding to different detection models;
  • Face detection is performed on the candidate face region according to the detection model corresponding to the face on the target scale.
  • an embodiment of the present application provides a face detection device 1600.
  • The device 1600 may also be a computing device such as a terminal device.
  • The terminal device may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and the like; the following takes the terminal device being a mobile phone as an example.
  • The mobile phone includes components such as a radio frequency (RF) circuit 1610, a memory 1620, an input unit 1630, a display unit 1640, a sensor 1650, an audio circuit 1660, a wireless fidelity (WiFi) module 1670, a processor 1680, and a power supply 1690.
  • Those skilled in the art can understand that the structure shown in FIG. 16 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, combine some components, or use a different arrangement of components.
  • the RF circuit 1610 can be used to receive and send signals during the sending and receiving of information or during a call. In particular, after receiving the downlink information of the base station, it is processed by the processor 1680; in addition, the designed uplink data is sent to the base station.
  • the RF circuit 1610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 1610 can also communicate with other devices through a wireless communication network.
  • The above wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and so on.
  • the memory 1620 may be used to store software programs and modules.
  • the processor 1680 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1620.
  • The memory 1620 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required by at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created through the use of the mobile phone (such as audio data, phone books, etc.).
  • In addition, the memory 1620 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the input unit 1630 may be used to receive input numeric or character information, and generate key signal input related to user settings and function control of the mobile phone.
  • the input unit 1630 may include a touch panel 1631 and other input devices 1632.
  • The touch panel 1631, also known as a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 1631 using a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 1631 may include a touch detection device and a touch controller.
  • The touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into contact coordinates, and sends them to the processor 1680, and can receive and execute commands sent by the processor 1680.
  • the touch panel 1631 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 1630 may also include other input devices 1632.
  • other input devices 1632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), trackball, mouse, joystick, and so on.
  • the display unit 1640 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
  • the display unit 1640 may include a display panel 1641.
  • The display panel 1641 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • Furthermore, the touch panel 1631 may cover the display panel 1641; when the touch panel 1631 detects a touch operation on or near it, it transmits the operation to the processor 1680 to determine the type of the touch event, and the processor 1680 then provides corresponding visual output on the display panel 1641 according to the type of the touch event.
  • Although in FIG. 16 the touch panel 1631 and the display panel 1641 are implemented as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 1631 and the display panel 1641 may be integrated to realize the input and output functions of the mobile phone.
  • the mobile phone may also include at least one sensor 1650, such as a light sensor, a motion sensor, and other sensors.
  • The light sensor may include an ambient light sensor and a proximity sensor; the ambient light sensor may adjust the brightness of the display panel 1641 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 1641 and/or the backlight when the mobile phone moves close to the ear.
  • As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (generally three axes), can detect the magnitude and direction of gravity when at rest, and can be used in applications that recognize the posture of the mobile phone (such as horizontal/vertical screen switching, related games, and magnetometer attitude calibration) and in vibration-recognition-related functions (such as a pedometer or tapping).
  • Other sensors such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors may also be configured on the mobile phone, which will not be repeated here.
  • the audio circuit 1660, the speaker 1661, and the microphone 1662 can provide an audio interface between the user and the mobile phone.
  • On one hand, the audio circuit 1660 can convert received audio data into an electrical signal and transmit it to the speaker 1661, which converts it into a sound signal for output; on the other hand, the microphone 1662 converts collected sound signals into electrical signals, which are received by the audio circuit 1660 and converted into audio data; the audio data is then processed by the processor 1680 and sent through the RF circuit 1610 to another mobile phone, or output to the memory 1620 for further processing.
  • WiFi is a short-range wireless transmission technology.
  • Through the WiFi module 1670, the mobile phone can help users send and receive emails, browse web pages, and access streaming media, providing users with wireless broadband Internet access.
  • Although FIG. 16 shows the WiFi module 1670, it can be understood that it is not an essential component of the mobile phone and can be omitted as needed without changing the essence of the invention.
  • The processor 1680 is the control center of the mobile phone; it uses various interfaces and lines to connect all the parts of the entire mobile phone, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1620 and calling the data stored in the memory 1620, thereby monitoring the mobile phone as a whole.
  • the processor 1680 may include one or more processing units; in the embodiment of the present application, the processor 1680 may integrate an application processor and a modem processor, where the application processor mainly processes the operating system , User interface and application programs, etc., the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may not be integrated into the processor 1680.
  • the mobile phone also includes a power supply 1690 (such as a battery) that supplies power to various components.
  • The power supply can be logically connected to the processor 1680 through a power management system, thereby implementing functions such as charge management, discharge management, and power consumption management through the power management system.
  • the mobile phone may also include a camera, a Bluetooth module, etc., which will not be repeated here.
  • Embodiments of the present application also provide a computer-readable storage medium for storing program code, which can be executed by a processor to perform any implementation of the face detection methods described in the foregoing embodiments.
  • Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium, and, when executed, performs the steps of the above method embodiments. The foregoing storage medium may be at least one of the following media: read-only memory (ROM), RAM, magnetic disk, optical disk, and the like.


Abstract

The embodiments of the present application disclose a face detection method, apparatus, device, and storage medium. A face candidate region in an image to be detected is determined according to a face detection model that includes a multi-layer convolutional network. When it is determined from the size parameter of the face candidate region that the region corresponds to a small-scale face, face detection is performed on the face candidate region through a first detection model: projection features of the face candidate region on the feature maps output by at least two convolutional networks in the face detection model are acquired, the fusion feature obtained by fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network is used as the projection feature of the first convolutional network, and face detection is performed on the face candidate region according to the projection features of the at least two convolutional networks.

Description

Face detection method, apparatus, device, and storage medium
This application claims priority to Chinese Patent Application No. 201910002499.2, entitled "Face detection method and apparatus", filed with the National Intellectual Property Administration on January 2, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of image processing, and in particular to a face detection method, apparatus, device, and storage medium.
Background
Face detection is an important research topic in the field of computer vision. Its main task is to detect the faces present in an image.
There are currently many traditional face detection approaches, which can improve the precision and speed of face detection from different angles.
Summary
An embodiment of the present application provides a face detection method, executed by a computing device, the method including:
determining a face candidate region in an image to be detected according to a face detection model, the face detection model including a multi-layer convolutional network;
if the size parameter of the face candidate region is smaller than a first scale condition, determining that the face candidate region corresponds to a small-scale face;
acquiring, through a first detection model corresponding to the small-scale face, projection features of the face candidate region on feature maps output by at least two convolutional networks in the face detection model; the at least two convolutional networks include a first convolutional network and a second convolutional network, the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter, the adjacent-layer convolutional network of the first convolutional network is the second convolutional network, and the feature resolution of the feature map output by the second convolutional network is lower than that of the feature map output by the first convolutional network;
using the fusion feature, obtained by fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network, as the projection feature of the first convolutional network;
performing face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks.
An embodiment of the present application provides a face detection method, executed by a computing device, the method including:
determining a face candidate region in an image to be detected according to a face detection model;
determining, according to the size relationship between the size parameter of the face candidate region and a scale condition, the target scale of the face corresponding to the face candidate region, the target scale being one of multiple scales, with faces of different scales corresponding to different detection models;
performing face detection on the face candidate region according to the detection model corresponding to the face of the target scale.
An embodiment of the present application provides a face detection apparatus, the apparatus including:
a first determining unit, configured to determine a face candidate region in an image to be detected according to a face detection model, the face detection model including a multi-layer convolutional network;
a second determining unit, configured to determine that the face candidate region corresponds to a small-scale face if the size parameter of the face candidate region is smaller than a first scale condition;
a first detection unit, configured to:
acquire, through a first detection model corresponding to the small-scale face, projection features of the face candidate region on feature maps output by at least two convolutional networks in the face detection model; the at least two convolutional networks include a first convolutional network and a second convolutional network, the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter, the adjacent-layer convolutional network of the first convolutional network is the second convolutional network, and the feature resolution of the feature map output by the second convolutional network is lower than that of the feature map output by the first convolutional network;
use the fusion feature, obtained by fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network, as the projection feature of the first convolutional network;
perform face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks.
An embodiment of the present application provides a face detection apparatus, the apparatus including:
a first determining module, configured to determine a face candidate region in an image to be detected according to a face detection model;
a second determining module, configured to determine, according to the size relationship between the size parameter of the face candidate region and a scale condition, the target scale of the face corresponding to the face candidate region, the target scale being one of multiple scales, with faces of different scales corresponding to different detection models;
a detection module, configured to perform face detection on the face candidate region according to the detection model corresponding to the face of the target scale.
An embodiment of the present application provides a face detection device, the device including a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to execute the face detection method described above according to instructions in the program code.
An embodiment of the present application provides a computer-readable storage medium storing program code, which can be executed by a processor to implement the face detection method described above.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1a is a schematic diagram of an image to be detected according to an embodiment of the present application;
FIG. 1b is a schematic diagram of an implementation environment of a face detection method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an exemplary scenario according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a face detection method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for performing face detection using a first detection model according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a method for determining the projection feature of a first convolutional network according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of a method for determining a face candidate region according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an application scenario according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a face detection model according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a detection model according to an embodiment of the present application;
FIG. 10a is a precision-recall graph according to an embodiment of the present application;
FIG. 10b is another precision-recall graph according to an embodiment of the present application;
FIG. 11 is a schematic diagram of the detection effect of a face detection method according to an embodiment of the present application;
FIG. 12 is a schematic flowchart of a face detection method according to an embodiment of the present application;
FIG. 13a is a schematic structural diagram of a face detection apparatus according to an embodiment of the present application;
FIG. 13b is a schematic structural diagram of a face detection apparatus according to an embodiment of the present application;
FIG. 14a is a schematic structural diagram of a face detection apparatus according to an embodiment of the present application;
FIG. 14b is a schematic structural diagram of a face detection apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a face detection device according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a face detection device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application are described below with reference to the drawings.
In the related art, the detection precision for small-scale faces in images is still not high; for example, in the image shown in FIG. 1, traditional approaches have difficulty detecting the small-scale faces in the stands.
It can be seen that, in face detection, the detection of small-scale faces in images is a problem that urgently needs to be solved.
The inventors of the present invention found in research that, in traditional face detection approaches, a multi-layer convolutional network can be used to extract the features of a face candidate region, and face recognition is performed based on the features output by the last convolutional layer. When a multi-layer convolutional network extracts the features of a face candidate region, it generally does so layer by layer: each later convolutional network continues the extraction based on the features output by the previous one, so as to obtain features carrying more semantic information. This process of extracting features on top of the previous layer's output is, in effect, downsampling the previous layer's output. As a result, the feature resolution of the features output by a later convolutional network is lower than that of the previous one, and the feature resolution of the last convolutional network's output is the lowest among all layers of the multi-layer convolutional network. Recognizing small-scale faces, however, places relatively high demands on feature resolution, which the features output by the last convolutional layer often cannot meet. In other words, the feature resolution output by the last convolutional layer often fails to satisfy the resolution requirement for recognizing small-scale faces, which is why traditional approaches often cannot recognize small-scale faces well.
Because a multi-layer convolutional network extracts features in the layer-by-layer manner described above, when it is used to extract the features of a face candidate region, the features output by a lower-layer convolutional network generally have a relatively high feature resolution but carry relatively little semantic information, whereas the features output by a higher-layer convolutional network have a relatively low feature resolution but carry more semantic information.
Regarding lower-layer and higher-layer convolutional networks, it should be noted that "lower" and "higher" are relative concepts here. For example, if a first-layer convolutional network first extracts the features of the face candidate region and a second-layer convolutional network continues the extraction based on the first layer's output, then the first layer is a lower-layer convolutional network relative to the second layer, and the second layer is a higher-layer convolutional network relative to the first.
In view of this, in the embodiments of the present application, considering that the features output by adjacent convolutional layers are highly correlated, the features of the face candidate region output by at least two adjacent-layer convolutional networks can be used to detect small-scale faces. Specifically, feature fusion can be performed on the features of the face candidate region output by the at least two adjacent-layer convolutional networks, the resulting fusion feature is used as the output feature of the lower-layer convolutional network, and face detection is then performed on the face candidate region in combination with the output features of the at least two adjacent-layer convolutional networks. Because the fusion feature not only has the relatively high feature resolution embodied by the features extracted by the lower-layer convolutional network, but also carries the semantic information carried by the features extracted by the higher-layer convolutional network, it helps detect small-scale faces.
FIG. 1b is a schematic diagram of an implementation environment of the face detection method provided by an embodiment of the present application. A terminal device 10 and a server device 20 are communicatively connected through a network 30, which may be a wired network or a wireless network. The face detection apparatus provided by any embodiment of the present application is integrated on the terminal device 10 and the server device 20 to implement the face detection method provided by any embodiment of the present application. Specifically, the terminal device 10 may directly execute the face detection method provided by any embodiment of the present application; alternatively, the terminal device 10 may send the image to be detected to the server device 20, which executes the face detection method provided by any embodiment of the present application and returns the detection result to the terminal device 10.
The face detection method provided by the embodiments of the present application is introduced below with reference to the scenario shown in FIG. 2.
In the scenario shown in FIG. 2, a face detection model 202 can be used to determine a face candidate region 203 in an image to be detected 201.
In the embodiments of the present application, the face detection model 202 may be configured on a face detection device, for example, a computing device such as a server that can be used to detect faces.
The face candidate region mentioned in the embodiments of the present application refers to a region of the image to be detected that may contain a face. It can be understood that one image to be detected 201 may include several face candidate regions 203, and one face candidate region 203 may correspond to one face.
The face detection model 202 mentioned in the embodiments of the present application includes a multi-layer convolutional network; the embodiments do not specifically limit the number of layers of the convolutional neural network included in the face detection model 202. FIG. 2 takes three layers as an example, but this does not constitute a limitation on the embodiments of the present application. The number of convolutional network layers included in the face detection model 202 may also be other values; for example, the face detection model 202 may include five convolutional networks, like the VGG16 network.
The embodiments of the present application do not specifically limit how the face detection model 202 determines the face candidate region in the image to be detected 201. As an example, the face detection model 202 may extract the image features of the image to be detected 201 and use them to determine the face candidate region 203.
Generally speaking, the size parameter of a face candidate region 203 does not differ much from the size parameter of the face it may contain; therefore, the size parameter of the face candidate region 203 can, to a certain extent, characterize the size of the face contained in it. In view of this, in the embodiments of the present application, after the face candidate region 203 is determined, whether it corresponds to a small-scale face can be determined according to its size parameter. The embodiments do not specifically limit the size parameter; it may be, for example, the area of the face candidate region 203, or the ratio of the number of pixels contained in the face candidate region 203 to the number of pixels contained in the feature map output by the face detection model 202.
The small-scale face mentioned in the embodiments of the present application refers to a face whose size parameter is smaller than a first scale condition. It is a concept relative to the large-scale faces that traditional face detection methods can detect, and is a collective term for faces of scales other than those large-scale faces.
In the embodiments of the present application, after it is determined that the face candidate region 203 corresponds to a small-scale face, a first detection model 204 is used to perform face detection on the face candidate region 203.
It can be understood that, when the face detection model 202 determines the face candidate region in the image to be detected 201, each convolutional layer of the model can extract image features of the image to be detected and output a corresponding feature map. Performing face detection on the face candidate region 203 also requires the features of that region. Therefore, in the embodiments of the present application, when recognizing the face candidate region 203, the first detection model 204 may use the features of the face candidate region 203 that were extracted when the face detection model 202 identified the face candidate regions in the image to be detected 201.
Specifically, in the scenario shown in FIG. 2, the first detection model 204 may project the face candidate region 203 onto the feature map output by a first convolutional network of the face detection model 202 to obtain a first projection feature 205, and project it onto the feature map output by a second convolutional network to obtain a second projection feature 206, and then use the first projection feature 205 and the second projection feature 206 to recognize the face in the face candidate region 203. It can be understood that the first projection feature is the feature of the face candidate region 203 extracted by the first convolutional network, and the second projection feature is the feature of the face candidate region 203 extracted by the second convolutional network.
In the embodiments of the present application, the first and second convolutional networks are adjacent-layer convolutional networks. The feature resolution of the feature map output by the first convolutional network is applicable to the size parameter of the face candidate region 203; that is, the first projection feature can meet the resolution requirement for recognizing the face in the face candidate region 203. The feature resolution of the feature map output by the second convolutional network is lower than that of the first convolutional network; in other words, the second convolutional network is a higher-layer convolutional network relative to the first. Correspondingly, the semantic information carried by the feature map output by the second convolutional network is richer than that carried by the first convolutional network's output, so the second projection feature can, to a certain extent, meet the semantic-information requirement for recognizing the face in the face candidate region 203.
When the first detection model 204 uses the first projection feature 205 and the second projection feature 206 to recognize the face in the face candidate region 203, in order to recognize the face more accurately, the two projection features are fused to obtain a fusion feature 207 that has both a relatively high feature resolution and relatively rich semantic information, and the fusion feature 207 and the second projection feature 206 are then used to perform face detection on the face candidate region 203. Compared with how traditional techniques detect small-scale faces, this improves the face detection precision for the face candidate region 203, that is, the detection precision for small-scale faces.
The embodiments of the present application do not specifically limit how the fusion feature 207 and the second projection feature 206 are used to perform face detection on the face candidate region 203. As an example, the fusion feature 207 and the second projection feature 206 may be used as the input of a region-of-interest pooling (ROI pooling) layer to obtain the corresponding detection result.
The face detection method provided by this application is introduced below through specific embodiments.
Referring to FIG. 3, which is a schematic flowchart of a face detection method provided by an embodiment of the present application.
The face detection method provided by the embodiment of the present application can be implemented, for example, through the following steps S301-S303.
Step S301: Determine a face candidate region in the image to be detected according to a face detection model, the face detection model including a multi-layer convolutional network.
For descriptions of the face detection model and of determining the face candidate region, reference may be made to the relevant content above, which is not repeated here.
Step S302: If the size parameter of the face candidate region is smaller than a first scale condition, determine that the face candidate region corresponds to a small-scale face.
The embodiment of the present application does not specifically limit the first scale condition; it may be, for example, a first ratio threshold. As described above, the size parameter may be the ratio of the number of pixels contained in the face candidate region to the number of pixels contained in the feature map output by the face detection model. Correspondingly, the size parameter being smaller than the first scale condition may be, for example, the size parameter being smaller than the first ratio threshold.
For example, the size parameter being smaller than the first scale condition may be that the ratio of the number of pixels $w_p \times h_p$ contained in the face candidate region to the number of pixels $w_{oi} \times h_{oi}$ contained in the feature map output by the face detection model is smaller than 1/10, that is,

$$\frac{w_p \times h_p}{w_{oi} \times h_{oi}} < \frac{1}{10}$$

where the face candidate region can be regarded as a rectangular region, $w_p$ is the number of pixels in the width of the face candidate region, $h_p$ is the number of pixels in its height, and $w_{oi}$ and $h_{oi}$ are the numbers of pixels in the width and height of the feature map output by the face detection model.
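As a minimal illustration of this criterion (the helper name and the example numbers are hypothetical, not values from the text), the size parameter and the small-scale test can be computed as follows:

```python
def size_parameter(w_p, h_p, w_oi, h_oi):
    """Ratio between the pixels contained in the face candidate region
    (w_p * h_p) and in the feature map output by the face detection
    model (w_oi * h_oi)."""
    return (w_p * h_p) / (w_oi * h_oi)

# A candidate is treated as a small-scale face when the ratio is below
# the first scale condition, here 1/10 as in the example above.
is_small = size_parameter(12, 15, 64, 48) < 1 / 10   # True
```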
Step S303: Perform face detection on the face candidate region through a first detection model corresponding to the small-scale face.
On the one hand, the reason traditional techniques cannot accurately detect small-scale faces is that the features used for detecting them have a relatively low feature resolution. Therefore, in the embodiment of the present application, when recognizing a small-scale face, the projection feature of a convolutional network whose feature resolution is applicable to the size parameter of the face candidate region is used.
On the other hand, to detect a face accurately, not only must the feature resolution of the features used for recognition meet the requirement, but the semantic information they carry must also meet the requirement. For a small-scale face, the projection feature of the convolutional network whose feature resolution is applicable to the face candidate region often does not carry much semantic information. Therefore, using only that projection feature may not accurately recognize the small-scale face.
Furthermore, among the multiple convolutional layers included in the face detection model, the higher the layer, the more semantic information its projection feature carries. Considering that the features output by adjacent convolutional layers are highly correlated, using the features output by adjacent-layer convolutional networks to recognize the face candidate region allows the face to be recognized more accurately.
Therefore, in the embodiment of the present application, the projection features of at least two convolutional networks can be combined to perform face detection on the face candidate region. The at least two convolutional networks include the first convolutional network, whose projection feature can meet the resolution requirement for detecting the face candidate region, and the second convolutional network, whose projection feature can meet the semantic-information requirement, where the first and second convolutional networks are adjacent-layer convolutional networks.
The embodiment of the present application does not limit the other convolutional networks, besides the first and the second, among the at least two; as an example, such another convolutional network may be a higher-layer convolutional network adjacent to the second convolutional network.
In specific implementation, step S303 can be realized through the face detection method shown in FIG. 4, specifically through the following steps S401-S403.
Step S401: Acquire, through the first detection model, projection features of the face candidate region on feature maps output by at least two convolutional networks in the face detection model. The at least two convolutional networks include a first convolutional network and a second convolutional network; the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter; the adjacent-layer convolutional network of the first convolutional network is the second convolutional network; and the feature resolution of the feature map output by the second convolutional network is lower than that of the first convolutional network.
In the embodiment of the present application, the feature resolution of the first convolutional network's output being applicable to the size parameter means that it satisfies the resolution requirement for recognizing the face in the face candidate region.
It can be understood that different size parameters suit different feature resolutions, and correspondingly the layer of the first convolutional network within the multi-layer convolutional network differs. Therefore, in the embodiment of the present application, which layer of the multi-layer convolutional network serves as the first convolutional network can be determined according to the size parameter of the face candidate region, for example, according to a correspondence between size-parameter ranges and convolutional layer numbers. For example, suppose the face detection model includes five convolutional networks, which from low to high are the first to the fifth convolutional network. When the size parameter is relatively small, for example a first size parameter, its requirement on resolution is relatively high, so a lower-layer network among the five, for example the third convolutional network, can be determined as the first convolutional network; when the size parameter is larger than the first size parameter, for example a second size parameter, its requirement on resolution is lower, so a network higher than the third layer, for example the fourth convolutional network, can be determined as the first convolutional network.
As described above, the features output by a higher-layer convolutional network carry more semantic information than those of a lower-layer network, while their feature resolution is lower. Therefore, the feature resolution of the second convolutional network's output being lower than that of the first characterizes the second convolutional network as a higher-layer network relative to the first, whose output features carry more semantic information and can accordingly meet the semantic-information requirement for recognizing the face candidate region.
It can be understood that the feature map output by a convolutional network of the face detection model includes not only the features corresponding to the face candidate region but also features corresponding to other parts of the image to be detected, while face detection on the face candidate region must be performed with the features corresponding to that region. In view of this, in the embodiment of the present application, the face candidate region can be projected onto the feature map output by the convolutional network to acquire the projection feature of the face candidate region on that feature map, which is the feature corresponding to the face candidate region.
Step S402: Use the fusion feature, obtained by fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network, as the projection feature of the first convolutional network.
The projection feature of the first convolutional network mentioned in the embodiment of the present application can be understood as the feature, in the feature map output by the first convolutional network, corresponding to the projected region of the face candidate region; likewise for the projection feature of the second convolutional network. It can be understood that the projection feature of the first convolutional network has a relatively high feature resolution, while the projection feature of the second convolutional network carries more semantic information. Therefore, fusing the two yields a fusion feature that both has a relatively high feature resolution and carries more semantic information.
It can be understood that, since the first and second convolutional networks are adjacent-layer convolutional networks, the correlation between their projection features is relatively high, so the fusion of the two projection features works better, which is more conducive to accurately detecting the small-scale face in the face candidate region.
Step S403: Perform face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks.
It can be understood that, since the fusion feature is used as the projection feature of the first convolutional network in step S402, compared with the projection feature obtained by directly projecting the face candidate region onto the first convolutional network's output feature map, it not only has a relatively high feature resolution but also carries richer semantic information. Therefore, after the fusion feature is taken as the first convolutional network's projection feature, performing face detection on the face candidate region with the projection features on the feature maps output by the at least two convolutional networks allows the face in the face candidate region to be detected accurately.
In an implementation of the embodiment of the present application, the aforementioned small-scale faces can be divided into several small-scale faces of different scales. Specifically, the parameter range interval in which the size parameter falls can be used as the basis for dividing the small-scale faces. For a given parameter range interval, the feature resolution applicable to all size parameters within it is the feature resolution of the feature map output by the N-th convolutional network of the multi-layer convolutional network. For example, taking the aforementioned face detection model with five convolutional networks: if the size parameter falls within a first parameter range interval whose maximum value is relatively small, the face candidate region corresponds to a smaller-scale face among the small-scale faces, and the applicable feature resolution is that of the third convolutional network's output feature map; if the size parameter falls within a second parameter range interval that does not overlap the first and whose minimum value is greater than the first interval's maximum value, the face candidate region corresponds to a larger-scale face among the small-scale faces, for example a medium-scale face, and the applicable feature resolution is that of the fourth convolutional network's output feature map.
Correspondingly, the several small-scale faces of different scales each have their own corresponding first detection model, so as to realize face detection for small-scale faces of these scales. For example, the smaller-scale face among the small-scale faces corresponds to one first detection model, whose network structure is shown in FIG. 9(a); the larger-scale face among the small-scale faces, for example a medium-scale face, corresponds to another first detection model, whose network structure is shown in FIG. 9(b). For a specific description of the network structure of the first detection model corresponding to the smaller-scale face, reference may be made to the description of FIG. 9 below, which is not repeated here.
For example, in the embodiment of the present application, the small-scale faces can be divided into faces of two scales. Faces of one scale correspond to the first parameter range interval, for example, the size parameter $\frac{w_p \times h_p}{w_{oi} \times h_{oi}}$ falls within the first parameter range interval [0, 1/100]; faces of the other scale correspond to the second parameter range interval, for example, the size parameter falls within the second parameter range interval (1/100, 1/10). For the description of the size parameter, reference may be made to the relevant content in step S302 above, which is not repeated here.
It can thus be seen that, in the embodiment of the present application, the detection model that performs face detection on a face candidate region is determined based on the region's size parameter; that is, the first detection model corresponding to the size parameter of the face candidate region is determined to detect that region. In other words, for the several face candidate regions included in the image to be detected, the detection model corresponding to each region's size parameter can be adaptively selected to perform targeted face detection on that region, which improves the detection precision for faces of different scales and accurately and effectively detects faces of various scales, instead of using a single detection model to detect all face candidate regions as in traditional techniques, which prevents small-scale faces from being accurately recognized.
In the embodiment of the present application, face detection can be realized not only for small-scale faces but also for large-scale faces. Specifically, in one example of the embodiment of the present application, when the size parameter is greater than a second scale condition, it can further be determined that the face candidate region corresponds to a large-scale face.
The embodiment of the present application does not specifically limit the second scale condition, which can be determined according to the actual situation.
For example, the size parameter being greater than the second scale condition may be that the ratio of the number of pixels $w_p \times h_p$ contained in the face candidate region to the number of pixels $w_{oi} \times h_{oi}$ contained in the feature map output by the face detection model is greater than 1/10, that is,

$$\frac{w_p \times h_p}{w_{oi} \times h_{oi}} > \frac{1}{10}$$

For the description of the size parameter, reference may be made to the relevant content in step S302 above, which is not repeated here.
In the embodiment of the present application, after it is determined that the face candidate region corresponds to a large-scale face, face detection can be performed on the region through a second detection model corresponding to the large-scale face, which in specific implementation can be realized through the following steps A-B.
Step A: Acquire, through the second detection model, projection features of the face candidate region on feature maps output by at least two convolutional networks in the face detection model; the at least two convolutional networks include a third convolutional network, and the feature resolution of the feature map output by the third convolutional network is applicable to the size parameter.
Step B: Perform face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks.
It should be noted that the feature resolution of the third convolutional network's output being applicable to the size parameter is similar to the earlier case of "the feature resolution of the feature map output by the first convolutional network is applicable to the size parameter", so for the relevant content reference may be made to the description above, which is not repeated here.
The method of face detection for large-scale faces described in steps A and B has both similarities to and differences from the method for small-scale faces described in steps S301-S303.
The similarity is that both perform face detection using projection features on feature maps output by at least two convolutional networks of the face detection model. For the part the two have in common (step B), reference may be made to the description of steps S301-S303 above, which is not repeated here.
The difference is that, when performing face detection on the face candidate region corresponding to a small-scale face, the corresponding size parameter is relatively small, so the projection feature of the first convolutional network may carry relatively little semantic information; therefore, the fusion feature obtained by fusing the first convolutional network's projection feature with the second's is used as the first convolutional network's projection feature to compensate for that deficiency. When performing face detection on the face candidate region corresponding to a large-scale face, the corresponding size parameter is relatively large, so the third convolutional network is very likely a higher-layer network among the multi-layer convolutional networks included in the face detection model. That is, the third convolutional network can not only meet the feature-resolution requirement for recognizing large-scale faces, but itself also carries relatively rich semantic information. Therefore, when detecting the face candidate region corresponding to a large-scale face, the third convolutional network's projection feature need not be processed; face detection is performed directly according to the projection features on the feature maps output by the at least two convolutional networks.
It can be understood that, in practical applications, the image to be detected may contain multiple face candidate regions, and the face scales they correspond to may also differ. A traditional face detection approach contains only one face detection model for the multiple candidate regions and uses it to detect the regions corresponding to faces of every scale. In the embodiment of the present application, since a first detection model and a second detection model are included, after the face scale corresponding to a candidate region is determined, the corresponding detection model can be selected for face detection; moreover, the first and second detection models can detect in parallel, which improves the efficiency of detecting faces in the image to be detected. For example, if the image to be detected includes two face candidate regions, the first corresponding to a small-scale face and the second to a large-scale face, the first detection model can be used to detect the first region and the second detection model to detect the second region, realizing recognition of faces of different scales; and the two can run at the same time, improving the efficiency of face detection on the two regions.
Considering that, in the process of performing face detection on the face candidate region, the projection features on the feature maps output by the at least two convolutional networks differ in importance, increasing the proportion of the more important projection features during detection is more conducive to accurately detecting the face in the region. In view of this, in one implementation of the embodiment of the present application, the projection features of the at least two convolutional networks are provided with weight coefficients to reflect the importance of each network's projection feature in face detection.
It can be understood that, among the projection features used by the aforementioned first detection model to detect small-scale faces, the first convolutional network's projection feature is the fusion feature; compared with the projection features of the other convolutional networks, it both has a feature resolution better suited to the small face size and carries more semantic information. Therefore, the first convolutional network's projection feature is more important for detecting small-scale faces than the projection features of the other networks.
Correspondingly, among the projection features used by the second detection model to detect large-scale faces, the third convolutional network's projection feature both has a feature resolution better suited to the large face size and carries more semantic information, so it is more important for detecting large-scale faces than the projection features of the other networks.
As described above, the first convolutional network is the one whose feature resolution is applicable to the size parameter of the face candidate region corresponding to a small-scale face, and the third convolutional network is the one whose feature resolution is applicable to that of a large-scale face. Therefore, in the embodiment of the present application, when setting the weight coefficients, the weight coefficient of the convolutional network whose feature resolution is applicable to the size parameter is greater than those of the other convolutional networks, to characterize that its projection feature is the most important. In this way, among the features used to perform face detection on the face candidate region, the important features carry greater weight, which helps recognize the face in the region accurately.
Specifically, when performing face detection on the face candidate region corresponding to a small-scale face as in steps S301-S303, the weight coefficient of the first convolutional network's projection feature is higher than those of other networks such as the second convolutional network; when performing face detection on the region corresponding to a large-scale face as in steps A-B, the weight coefficient of the third convolutional network's projection feature is higher than those of the other networks.
The embodiment of the present application does not specifically limit the values of the weight coefficients corresponding to the projection features of the at least two convolutional networks, which can be determined according to the actual situation.
After the weight coefficients are set for the projection features of the at least two convolutional networks, "performing face detection on the face candidate region according to the projection features on the feature maps output by the at least two convolutional networks" in step S403 and step B can, in specific implementation, be performed according to the projection features on the feature maps output by the at least two convolutional networks together with their respective weight coefficients.
The embodiment of the present application does not specifically limit how face detection is performed according to the projection features and their respective weight coefficients. As an example, the projection features on the feature maps output by the at least two convolutional networks may each be multiplied by their respective weight coefficients, and the weighted projection features are then used as the input of the ROI pooling layer to obtain the corresponding detection result.
The face detection method provided by the embodiment of the present application has been introduced above. An implementation of step S402, "using the fusion feature obtained by fusing the projection feature of the first convolutional network with the projection feature of the second convolutional network as the projection feature of the first convolutional network", is introduced below with reference to the drawings.
Referring to FIG. 5, which is a schematic flowchart of a method for determining the projection feature of the first convolutional network provided by an embodiment of the present application.
The method can be implemented, for example, through the following steps S501-S504.
Step S501: Obtain a first feature by reducing the number of channels in the projection feature of the first convolutional network.
It should be noted that, since the fusion feature carries the semantic information lacking in the first projection feature, using the fusion feature as the first convolutional network's projection feature to recognize the face candidate region correspondingly increases the computational complexity. In view of this, in the embodiment of the present application, the number of channels in the first convolutional network's projection feature is reduced to obtain the first feature, and the first feature is then fused with the second convolutional network's projection feature, so that the computational complexity of the resulting fusion feature is greatly reduced compared with the fusion feature obtained by directly fusing the first convolutional network's projection feature with the second's.
Step S502: Obtain a second feature by raising the feature resolution of the projection feature of the second convolutional network to be consistent with that of the projection feature of the first convolutional network.
Regarding step S502, it should be noted that the feature resolution of the second convolutional network's projection feature is lower than that of the first's, while the first feature's resolution is the same as that of the first convolutional network's projection feature; therefore, the second convolutional network's projection feature has a lower feature resolution than the first feature. Since feature fusion is performed on a per-pixel basis, in the embodiment of the present application, before the first feature is fused with the second convolutional network's projection feature, the latter must be processed so that the resulting feature has the same feature resolution as the first feature.
In the embodiment of the present application, the feature resolution of the second convolutional network's projection feature can be raised to be consistent with that of the first convolutional network's projection feature to obtain the second feature. Considering that the first convolutional network is a lower-layer network of the second, and the second convolutional network's projection feature is obtained by the second network downsampling the first network's projection feature, which serves as its input, the second convolutional network's projection feature can be upsampled to obtain a second feature whose feature resolution is consistent with that of the first convolutional network's projection feature.
Step S503: Perform a pixel-wise addition operation on the first feature and the second feature to obtain the fusion feature.
It can be understood that the first feature and the second feature have the same feature resolution, so feature fusion can be performed on the first feature and the second feature.
The pixel-wise addition operation mentioned in the embodiment of the present application means adding the feature of each pixel in the first feature to the feature of the corresponding pixel in the second feature.
Step S504: Use the fusion feature as the projection feature of the first convolutional network.
It can be understood that, since the fusion feature is obtained by the pixel-wise addition of the first and second features, every pixel of the fusion feature carries both the feature information of the first feature and that of the second feature. Therefore, the fusion feature not only has a relatively high feature resolution but also carries more semantic information.
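A minimal PyTorch sketch of steps S501-S504 follows. The channel counts are assumptions, and the extra 1×1 convolution on the second feature is also an assumption introduced so that the channel counts match for the pixel-wise addition; the text itself only specifies the channel reduction on the first feature and the upsampling of the second.

```python
import torch.nn as nn
import torch.nn.functional as F

class FuseProjection(nn.Module):
    """S501: reduce the channels of the first (lower-layer) projection
    feature with a 1x1 conv; S502: upsample the second (higher-layer)
    projection feature to the same resolution; S503/S504: add pixel-wise
    and use the result as the first network's projection feature."""
    def __init__(self, c1=256, c2=512, c_out=128):
        super().__init__()
        self.reduce = nn.Conv2d(c1, c_out, kernel_size=1)   # S501
        self.align = nn.Conv2d(c2, c_out, kernel_size=1)    # channel match

    def forward(self, proj1, proj2):
        f1 = self.reduce(proj1)                              # first feature
        f2 = F.interpolate(self.align(proj2), size=f1.shape[-2:],
                           mode="bilinear", align_corners=False)  # S502
        return f1 + f2                                       # S503 / S504
```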
以下结合附图介绍步骤S301中“根据人脸检测模型确定待检测图像中的人脸候选区域”的一种实现方式。
考虑到传统技术中，确定人脸候选区域时，可以采用在待检测图像中均匀地生成锚框的方式确定人脸候选区域。其中，生成锚框是指，以待检测图像中的某个像素点作为锚框的中心点，生成包含若干像素点的像素框。但是，利用传统方式所确定的人脸候选区域的数量比较多，因此，使得人脸检测模型在对待检测图像进行人脸检测时要检测的人脸候选区域的数量比较多，导致对所述待检测图像进行人脸检测的效率比较低。
鉴于此，在本申请实施例中，对确定人脸候选区域的方式做出了改进，使得确定出的人脸候选区域的数量变少了，从而提升了对待检测图像进行人脸检测的效率。具体地，可参见图6，该图为本申请实施例提供的一种确定人脸候选区域的方法的流程示意图。
所述方法例如可以通过如下步骤S601-S604实现。
步骤S601:获取所述待检测图像中的人脸感兴趣区域。
此处提及的人脸感兴趣区域，与人脸候选区域是比较相似的概念，均是指可能包含人脸的区域。在本申请实施例中，所述人脸感兴趣区域可以用于确定人脸候选区域。本申请实施例不具体限定获取人脸感兴趣区域的实现方式，作为一种示例，考虑到基于级联Boosting的人脸检测器能够快速地确定出待检测图像中的人脸感兴趣区域，因此，可以利用基于级联Boosting的人脸检测器获取待检测图像中的人脸感兴趣区域。
步骤S602:将所述人脸感兴趣区域投影到根据所述人脸检测模型输出的特征图上,得到第一特征图。
步骤S603:在所述第一特征图上生成锚框,得到第二特征图;在生成锚框的过程中,若目标锚框的中心点未与所述人脸感兴趣区域重叠,增大所述目标锚框的划窗步长。
在本申请实施例中，所述人脸检测模型输出的特征图，可以为所述人脸检测模型所包括的多层卷积网络中最后一层卷积网络所输出的特征图。所述第一特征图，可以理解为能够体现所述人脸感兴趣区域在所述人脸检测模型输出的特征图中所对应位置的特征图。
在本申请实施例中，考虑到所述人脸感兴趣区域是人脸候选区域的可能性比较大，因此，在结合所述人脸感兴趣区域和所述人脸检测模型提取的图像特征来确定所述人脸候选区域时，可以重点分析所述人脸感兴趣区域对应的图像特征。
在本申请实施例中，采用在所述第一特征图上生成锚框的方式确定所述人脸候选区域。具体地，由于所述人脸感兴趣区域为人脸候选区域的可能性比较大，因此，当锚框的中心点与所述人脸感兴趣区域重叠时，可以均匀地生成锚框。而人脸感兴趣区域之外的区域为人脸候选区域的可能性比较小，鉴于此，在本申请实施例中，为了减少确定出的所述人脸候选区域的数量，当目标锚框的中心点未与所述人脸感兴趣区域重叠时，可以增大所述目标锚框的划窗步长。也就是说，对于整个待检测图像而言，所生成的锚框是非均匀分布的：在人脸感兴趣区域之外，锚框的分布密度相对于人脸感兴趣区域内的分布密度低，从而减少了锚框数量，相应地减少了确定出的人脸候选区域的数量。举例说明，当锚框的中心点与所述人脸感兴趣区域重叠时，可以以步长1均匀地生成锚框；在均匀生成锚框的过程中，若目标锚框的中心点位于所述人脸感兴趣区域之外，则可以将所述目标锚框的划窗步长设置为2。
在本申请实施例中，考虑到人脸目标的形状特性，可以将每个锚框的长宽比设定为1:1和1:2，而锚框尺度设定为包含128²、256²和512²个像素的三种像素框，因此人脸感兴趣区域内的每一个位置都对应6个不同的锚框，从而有利于准确地确定出人脸候选区域。
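作为上述非均匀生成锚框方式的一个最小示意（并非本申请的限定实现；其中roi_mask的输入形式为假设，且为简化仅在水平方向上应用划窗步长）：

```python
import itertools

def generate_anchors(fm_h, fm_w, roi_mask):
    """在第一特征图上非均匀地生成锚框的示意实现。
    roi_mask[y][x]为True表示位置(x, y)与人脸感兴趣区域重叠（假设的输入形式）。"""
    scales = [128, 256, 512]        # 三种锚框尺度：面积约为128^2、256^2、512^2个像素
    ratios = [(1, 1), (1, 2)]       # 两种长宽比：1:1与1:2，共3x2=6种锚框
    anchors = []
    for y in range(fm_h):
        x = 0
        while x < fm_w:
            for s, (rh, rw) in itertools.product(scales, ratios):
                h = s * (rh / rw) ** 0.5   # 保持面积约为s^2，长宽比为rh:rw
                w = s * (rw / rh) ** 0.5
                anchors.append((x, y, w, h))  # 以(x, y)为中心点的锚框
            # 中心点与感兴趣区域重叠时步长为1，否则增大为2（对应步骤S705a/S705b）
            x += 1 if roi_mask[y][x] else 2
    return anchors
```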
步骤S604:根据多个人脸检测任务的损失函数计算所述第二特征图中的人脸候选区域,并将确定出的人脸候选区域作为所述待检测图像的人脸候选区域。
在本申请实施例中,可以将所述第二特征图作为所述损失函数的输入,从而确定出人脸候选区域,并将所述确定出的人脸候选区域作为所述待检测图像的人脸候选区域。
在本申请实施例中,考虑到在多任务损失函数中,加入高相关度的任务时会提高主任务的精度。因此,所述损失函数,可以是基于所述多个人脸检测任务进行联合训练得到的。其中,所述多个任务之间具有高相关度。
在本申请实施例中，所述多个人脸检测任务包括针对人脸目标的分类任务、针对人脸目标框的位置回归任务以及针对人脸关键点的位置回归任务。其中，所述针对人脸目标的分类任务，是指检测出人脸和非人脸；所述针对人脸目标框的位置回归任务，是指在检测出人脸的前提下，要检测出人脸所处的位置；所述针对人脸关键点的位置回归任务，是指在检测出人脸的前提下，检测出人脸上的关键位置，所述关键位置例如可以为鼻子、眼睛、嘴巴以及眉毛中的任意一个或组合。
可以理解的是,在对人脸进行检测时,首先要检测出人脸和非人脸;其次,在检测出人脸的前提下,要检测出人脸所处的位置,即确定出人脸目标框的位置。因此,在本申请实施例中,在训练所述损失函数时,所述针对人脸目标的分类任务和针对人脸目标框的位置回归任务可以认为是必须的。而对于所述针对人脸关键点的位置回归任务,其虽然与人脸检测具有比较高的相关度,但是其并不是必须的。
因此，在本申请实施例中，可以将所述针对人脸目标的分类任务和针对人脸目标框的位置回归任务作为主任务，将所述针对人脸关键点的位置回归任务作为辅助任务，联合训练各自对应的损失函数。
在本申请实施例中,基于前述主任务和辅助任务训练得到的损失函数,可以用以下公式(1)来表示:
$$\min_{w^{r},\,w^{a}}\ \sum_{i=1}^{N}\ell_{cls}\big(y_{i},f(x_{i};w^{r})\big)\;+\;\sum_{i=1}^{N}\ell_{box}\big(y_{i},f(x_{i};w^{r})\big)\;+\;\sum_{a}\lambda^{a}\sum_{i=1}^{N}\ell^{a}\big(y_{i},f(x_{i};w^{a})\big)\;+\;\Phi(w^{r},w^{a})\qquad(1)$$

所述公式(1)由四部分相加组成：第一部分 $\sum_{i=1}^{N}\ell_{cls}\big(y_{i},f(x_{i};w^{r})\big)$ 即为前述针对人脸目标的分类任务的损失函数；第二部分 $\sum_{i=1}^{N}\ell_{box}\big(y_{i},f(x_{i};w^{r})\big)$ 即为前述针对人脸目标框的位置回归任务的损失函数；第三部分 $\sum_{a}\lambda^{a}\sum_{i=1}^{N}\ell^{a}\big(y_{i},f(x_{i};w^{a})\big)$ 为前述针对人脸关键点的位置回归任务的损失函数；第四部分 $\Phi(w^{r},w^{a})$ 为权重项。
关于前两部分，由于与传统损失函数的表示方式类似，故在此不再详细介绍。只是需要强调的是，第一部分和第二部分的损失函数以及 $w^{r}$ 中的上标 $r$ 表示主任务。
关于第三部分需要说明的是，损失函数 $\ell^{a}$ 以及 $w^{a}$ 中的上标 $a$ 表示辅助任务，即人脸关键点的位置回归任务；下标 $i$ 表示输入数据标号，$N$ 表示数据总量，$\lambda^{a}$ 表示第 $a$ 个辅助任务的重要系数，$x$ 表示输入样本，$y$ 表示输入样本对应的实际输出结果，$f(x_{i};w^{a})$ 表示模型对输入样本的预测结果。
需要说明的是，虽然在人脸检测主任务中加入人脸关键点检测的辅助任务，可以有效地提高人脸检测主任务的检测精度，但在损失函数中加入辅助任务可能会导致整个模型难以收敛，并出现模型参数陷入局部极小值、从而无法得到最优解的情况。因此，本申请实施例提供的训练过程可以在保证模型良好收敛的同时，通过人脸关键点检测来提高人脸检测的准确率。
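作为公式(1)所示多任务联合损失的一个最小示意（并非本申请的限定实现；其中各任务损失的具体形式、λ_a与权重衰减系数均为示例性假设）：

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_labels, box_pred, box_target,
                   lmk_pred, lmk_target, params, lambda_a=0.5, weight_decay=5e-4):
    """公式(1)的示意性实现：前两项为主任务（上标r），
    第三项为辅助任务（上标a），第四项为权重项。"""
    loss_cls = F.cross_entropy(cls_logits, cls_labels)        # 针对人脸目标的分类任务
    loss_box = F.smooth_l1_loss(box_pred, box_target)         # 针对人脸目标框的位置回归任务
    loss_lmk = F.smooth_l1_loss(lmk_pred, lmk_target)         # 针对人脸关键点的位置回归任务
    phi = weight_decay * sum(p.pow(2).sum() for p in params)  # 权重项Φ(w)
    return loss_cls + loss_box + lambda_a * loss_lmk + phi
```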
在本申请实施例中，所述人脸检测模型是通过训练得到的。在训练所述人脸检测模型的过程中，使用随机梯度下降方法（SGD）对模型进行60k次迭代的微调，起始学习率设定为0.001，在经过20k次迭代后将学习率下降为0.0001。另外，将动量（Momentum）和权重衰减（Weight decay）分别设置为0.9和0.0005，Mini-batch的大小设置为128。
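以下按本段所述超参数给出SGD微调配置的一个示意（其中model与data_iter为假设对象，data_iter每次产出大小为128的mini-batch，并非本申请的限定实现）：

```python
import torch

def finetune(model, data_iter, num_iters=60000):
    """按上文超参数对人脸检测模型进行SGD微调的示意流程。"""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=0.0005)
    # 起始学习率0.001，经过20k次迭代后降为0.0001（gamma=0.1）
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[20000], gamma=0.1)
    for _ in range(num_iters):
        images, targets = next(data_iter)
        loss = model(images, targets)   # 假设模型前向直接返回总损失
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```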
为了提高人脸检测模型对于各个尺度人脸以及小尺度人脸的检测效果,在训练人脸检测模型的过程中,采用了难分样本挖掘和数据增广操作,以实现对训练样本的扩增,从而加快训练得到所述人脸检测模型的速度。
在本申请实施例中，所述难分样本挖掘是指，通过最高置信得分将所有的负样本进行排序，只选取得分最高的负样本，通过不断迭代该过程使正负样本的比例达到1:3。这样的难分样本挖掘方法可以加快网络优化的速度，并且使网络训练过程更加稳定。
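作为上述难分样本挖掘的一个最小示意（并非本申请的限定实现；输入neg_scores假设为所有负样本的置信得分张量）：

```python
import torch

def hard_negative_mining(neg_scores, num_pos, ratio=3):
    """按最高置信得分对所有负样本排序，只保留得分最高的负样本，
    使正负样本比例约为1:3（示意实现，返回被保留负样本的下标）。"""
    num_keep = min(ratio * num_pos, neg_scores.numel())
    _, idx = neg_scores.topk(num_keep)
    return idx
```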
在本申请实施例中，所述数据增广处理可以包括以下三种情况（其示意代码见下文）。
(1)对原始图像进行翻折操作。
(2)随机采样一个样本碎片，每个样本碎片的尺度设定在原始图像的[0.5,1]范围内，并且矩形框的长宽比例关系设定在[0.5,2]范围内，从而生成新的训练样本。
(3)随机对原始图像进行剪裁操作。
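作为上述三种数据增广操作的一个最小示意（并非本申请的限定实现；采样范围取自上文描述，翻折的触发概率为假设值，此处以返回采样碎片同时示意随机剪裁）：

```python
import random
from PIL import Image, ImageOps

def augment(img: Image.Image) -> Image.Image:
    """本段三种数据增广操作的示意实现。"""
    # (1) 对原始图像进行翻折操作
    if random.random() < 0.5:
        img = ImageOps.mirror(img)
    # (2) 随机采样一个样本碎片：尺度在原图的[0.5, 1]内，长宽比在[0.5, 2]内
    scale = random.uniform(0.5, 1.0)
    ratio = random.uniform(0.5, 2.0)
    w = int(img.width * scale)
    h = min(int(w * ratio), img.height)
    x = random.randint(0, img.width - w)
    y = random.randint(0, img.height - h)
    # (3) 随机剪裁出该碎片，作为新的训练样本
    return img.crop((x, y, x + w, y + h))
```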
以上对本申请实施例提供的人脸检测方法进行了介绍,以下结合具体场景,对以上实施例介绍的人脸检测方法进行介绍。
参见图7,该图为本申请实施例提供的一个应用场景示意图。在图7所示的场景中,包括两个第一检测模型,分别为第一检测模型(a)和第一检测模型(b)。第一检测模型(a)适用的尺寸范围区间内的尺寸参数,小于第一检测模型(b)适用的尺寸范围区间内的尺寸参数。
图7所示的方法,可以通过如下步骤S701-S709实现。
步骤S701：利用基于级联Boosting的人脸检测器获取待检测图像中的人脸感兴趣区域。
步骤S702:将所述人脸感兴趣区域投影到根据所述人脸检测模型输出的特征图上,得到第一特征图。
步骤S703:在所述第一特征图上生成锚框得到第二特征图。
步骤S704:判断目标锚框的中心点是否与所述人脸感兴趣区域重叠,如果是,执行步骤S705a,如果否,执行步骤S705b。
步骤S705a:将划窗步长设置为1。
步骤S705b:将划窗步长设置为2。
步骤S706:根据多个人脸检测任务的损失函数计算所述第二特征图中的人脸候选区域。
步骤S707:判断人脸候选区域的尺寸参数是否小于第一比例条件。
如果人脸候选区域的尺寸参数小于第一比例条件,执行步骤S708,如果人脸候选区域的尺寸参数大于第一比例条件,则可以确定所述人脸候选区域对应大尺度人脸,则利用第二检测模型对所述人脸候选区域进行人脸检测。
步骤S708：判断人脸候选区域的尺寸参数是否位于第一参数范围区间。
如果人脸候选区域的尺寸参数位于第一参数范围区间,则确定所述人脸候选区域对应小尺度人脸中较小尺度人脸,则利用第一检测模型(a)对所述人脸候选区域进行人脸检测。
如果人脸候选区域的尺寸参数不位于第一参数范围区间,则确定所述人脸候选区域对应小尺度人脸中较大尺度人脸,则利用第一检测模型(b)对所述人脸候选区域进行人脸检测。
步骤S709:合并检测结果。
对所述第一检测模型(a)、第一检测模型(b)以及第二检测模型的检测结果进行合并,实现对所述待检测图像中各个尺度的人脸的检测。
而且，所述第一检测模型(a)、第一检测模型(b)以及第二检测模型可以并行处理，也就是说，同时最多可以对三个人脸候选区域进行人脸检测，提升了对所述待检测图像进行人脸识别的效率。
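作为步骤S707-S709所述按尺度分派检测模型并合并结果的一个最小示意（并非本申请的限定实现；其中阈值t1、区间range_a以及region.size_param属性均为示例性假设，实际中各模型可并行执行）：

```python
def detect_faces(candidates, model_a, model_b, model_large,
                 t1=64, range_a=(0, 32)):
    """按人脸候选区域的尺寸参数选择检测模型并合并检测结果（示意实现）。"""
    results = []
    for region in candidates:
        size = region.size_param                 # 假设的尺寸参数属性
        if size < t1:                            # 小于第一比例条件：小尺度人脸
            if range_a[0] <= size < range_a[1]:  # 位于第一参数范围区间：较小尺度
                results.append(model_a(region))  # 第一检测模型(a)
            else:                                # 较大尺度
                results.append(model_b(region))  # 第一检测模型(b)
        else:                                    # 大尺度人脸
            results.append(model_large(region))  # 第二检测模型
    return results                               # 步骤S709：合并检测结果
```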
关于图7所示的方法,以下结合图8所示的人脸检测模型来进行介绍。
参见图8,该图为本申请实施例提供的一种人脸检测模型的结构示意图。
图8所示的人脸检测模型采用类似VGG16的网络结构,其包括5层卷积网络,分别为conv1、conv2、conv3、conv4和conv5。其中,conv1 包括两个卷积层,分别为801和802;conv2包括两个卷积层,分别为803和804;conv3、conv4和conv5分别包括三个卷积层,如图8中所示805-813。
如图8所示，可以利用级联Boosting检测器814获取人脸感兴趣区域，然后，将所述人脸感兴趣区域投影到卷积层813输出的特征图上，得到第一特征图（图8中未示出）；在所述第一特征图上生成锚框，得到第二特征图（图8中未示出），将所述第二特征图作为损失函数层815的输入，以得到人脸候选区域。其中，损失函数层815的损失函数包括针对人脸目标的分类任务的损失函数softmax、针对人脸目标框的位置回归任务的损失函数bbox regressor、以及针对人脸关键点的位置回归任务的损失函数landmark regressor。
在图8所示的场景中，是利用三层卷积网络的投影特征对人脸候选区域进行人脸检测，具体地，利用conv3、conv4和conv5的投影特征对人脸候选区域进行人脸检测。将conv3的投影特征816、conv4的投影特征817和conv5的投影特征818输入ROI Pooling层，ROI Pooling层对投影特征816、817和818进行处理，得到特征819，然后对特征819进行归一化处理，得到特征820，最后将特征820输入两层全连接层（简称FC层），得到人脸检测结果。其中检测结果包括：是否是人脸（对应图8中人脸目标的分类任务的分类结果821）以及人脸框的位置（对应图8中人脸目标框的位置回归任务的结果822）。
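作为图8所示检测分支前向过程的一个最小示意（并非本申请的限定实现；其中各层spatial_scale按类VGG16结构假设，特征819以通道拼接方式得到、归一化采用L2归一化均为示例性假设，fc1、fc2、cls_fc、box_fc为假设的全连接层）：

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def detect_head(proj3, proj4, proj5, boxes, fc1, fc2, cls_fc, box_fc):
    """图8检测分支的示意性前向过程：ROI Pooling -> 归一化 -> 两层全连接。"""
    feats = [roi_align(p, boxes, output_size=7, spatial_scale=s)
             for p, s in ((proj3, 1 / 4.0), (proj4, 1 / 8.0), (proj5, 1 / 16.0))]
    x = torch.cat(feats, dim=1)            # 对应特征819（此处假设以通道拼接得到）
    x = F.normalize(x.flatten(1), dim=1)   # 归一化处理，得到特征820
    x = F.relu(fc1(x))
    x = F.relu(fc2(x))                     # 两层全连接层（FC层）
    return cls_fc(x), box_fc(x)            # 分类结果821与人脸框位置822
```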
为方便描述，将步骤S708中所述的“小尺度人脸中的较小尺度人脸”称为小尺度人脸，将步骤S708中所述的“小尺度人脸中的较大尺度人脸”称为中尺度人脸。以下结合图9，介绍利用卷积网络conv3、conv4和conv5的投影特征，对上述三种尺度人脸的检测方法。
参见图9，该图为本申请实施例提供的一种检测模型的结构示意图。
在图9中,(a)示出了利用conv3、conv4和conv5的投影特征识别小尺度人脸的示意图。
具体地，所述conv3为第一卷积网络，conv4为第二卷积网络。对conv3_3的投影特征通过1×1卷积层（(a)中所示1×1 conv）进行降通道处理。由于conv4_3和conv3_3之间包括两层卷积层，分别为conv4_1和conv4_2（即图8所示808和809），因此，从所述conv3_3的投影特征到所述conv4_3的投影特征，经历了两次降采样。为了使得conv4_3的投影特征的特征分辨率与所述conv3_3的投影特征的特征分辨率相同，故而在此对conv4_3的投影特征进行两次上采样处理（(a)中所示×2 upsampling）。而后将降通道处理之后得到的特征与上采样处理得到的特征进行像素相加，得到融合特征，将所述融合特征作为第一卷积网络conv3的投影特征。然后基于所述conv3、conv4和conv5的投影特征，以及分别对应的权重系数α_small、β_small和γ_small，对所述小尺度人脸进行检测。其中，conv3_3表示conv3的第三层卷积层，即图8所示的卷积层807；conv4_3表示conv4的第三层卷积层，即图8所示的卷积层810；conv5_3表示conv5的第三层卷积层，即图8所示的卷积层813。
关于图9所示(b),其原理与(a)类似,故在此不再赘述。两者的不同之处体现为两点。
第一、(b)用于检测中尺度人脸,因此,与其尺寸参数匹配的卷积网络为conv4,故而在(b)中,所述conv4为第一卷积网络,conv5为第二卷积网络。
第二、(a)中的权重系数α_small、β_small和γ_small中，conv3对应的权重系数α_small最大，因为conv3对应的特征分辨率适用于小尺度人脸的尺寸参数；而由于conv4对应的特征分辨率适用于中尺度人脸的尺寸参数，因此，在(b)中所示的权重系数α_small、β_small和γ_small中，conv4对应的权重系数β_small最大。
关于图9所示(c)，在对大尺度人脸进行人脸识别时，无需进行融合处理，直接利用所述conv3、conv4和conv5的投影特征，以及分别对应的权重系数α_small、β_small和γ_small，对所述大尺度人脸进行检测。可以理解的是，由于conv3对应的特征分辨率适用于小尺度人脸的尺寸参数，conv4对应的特征分辨率适用于中尺度人脸的尺寸参数，因此，无论是conv3还是conv4，都可以满足对所述大尺度人脸进行人脸检测的特征分辨率的要求。甚至，所述conv5也适用于所述大尺度人脸的尺寸参数。关于(c)中权重系数α_small、β_small和γ_small的具体取值，可以根据实际情况确定。例如，若所述conv5适用于所述大尺度人脸的尺寸参数，则可以将conv5对应的权重系数γ_small设置为最大；若所述conv5不适用于所述大尺度人脸的尺寸参数，则由于相较于conv3而言，conv4的投影特征携带的语义信息更多，因此可以将conv4对应的权重系数β_small设置为最大，以使得检测出的人脸更加准确。
以下结合具体检测数据说明本申请实施例提供的人脸检测方法的检测效果,参见图10a所示,图10a示出了利用本申请实施例提供的人脸检测方法和传统的人脸检测方法,对人脸检测模型训练过程中使用的验证集进行人脸检测得到的精度-召回率(precision-recall)曲线图。
所述验证集可以包括多张图像,所述多张图像例如可以为包括不同尺度的人脸的图像。所述验证集中的多张图像可以用于检测人脸检测模型在训练的迭代过程中的人脸检测效果。
其中:
图10a中的曲线①为利用ACF-WIDER人脸检测方法对验证集进行人脸检测得到的precision-recall曲线;
图10a中的曲线②为利用Two-stage-CNN人脸检测方法对验证集进行人脸检测得到的precision-recall曲线;
图10a中的曲线③为利用Faceness-WIDER人脸检测方法对验证集进行人脸检测得到的precision-recall曲线;
图10a中的曲线④为利用Multiscale Cascade CNN人脸检测方法对验证集进行人脸检测得到的precision-recall曲线;
图10a中的曲线⑤为利用LDCF+人脸检测方法对验证集进行人脸检测得到的precision-recall曲线;
图10a中的曲线⑥为利用Multitask Cascade CNN人脸检测方法对验证集进行人脸检测得到的precision-recall曲线；
图10a中的曲线⑦为利用CMS-RCNN人脸检测方法对验证集进行人脸检测得到的precision-recall曲线;
图10a中的曲线⑧为利用HR人脸检测方法对验证集进行人脸检测得到的precision-recall曲线;
图10a中的曲线⑨为利用本申请实施例提供的人脸检测方法对验证集进行人脸检测得到的precision-recall曲线。
从图10a可以看出,在召回率相同时,本申请实施例提供的人脸检测方法的人脸检测精度更高;在检测精度相同时,本申请实施例提供的人脸检测方法的召回率更高。也就是说,本申请实施例提供的人脸检测方法,无论是检测精度、还是召回率都比传统方式的人脸检测方法的效果好。换言之,本申请实施例的人脸检测模型在迭代过程中的检测精度和召回率都比较高。
参见图10b所示,图10b示出了利用本申请实施例提供的人脸检测方法和传统的人脸检测方法,对人脸检测模型训练过程中使用的测试集进行人脸检测得到的精度-召回率(precision-recall)曲线图。
所述测试集可以包括多张图像,所述多张图像例如可以为包括不同尺度的人脸的图像。所述多张图像可以用于检测训练得到的人脸检测模型的人脸检测效果。
其中:
图10b中的曲线①为利用ACF-WIDER人脸检测方法对测试集进行人脸检测得到的precision-recall曲线;
图10b中的曲线②为利用Two-stage-CNN人脸检测方法对测试集进行人脸检测得到的precision-recall曲线;
图10b中的曲线③为利用Faceness-WIDER人脸检测方法对测试集进行人脸检测得到的precision-recall曲线;
图10b中的曲线④为利用Multiscale Cascade CNN人脸检测方法对测试集进行人脸检测得到的precision-recall曲线;
图10b中的曲线⑤为利用LDCF+人脸检测方法对测试集进行人脸检测得到的precision-recall曲线；
图10b中的曲线⑥为利用Multitask Cascade CNN人脸检测方法对测试集进行人脸检测得到的precision-recall曲线;
图10b中的曲线⑦为利用CMS-RCNN人脸检测方法对测试集进行人脸检测得到的precision-recall曲线;
图10b中的曲线⑧为利用HR人脸检测方法对测试集进行人脸检测得到的precision-recall曲线;
图10b中的曲线⑨为利用本申请实施例提供的人脸检测方法对测试集进行人脸检测得到的precision-recall曲线。
从图10b可以看出,在召回率相同时,本申请实施例提供的人脸检测方法的人脸检测精度更高;在检测精度相同时,本申请实施例提供的人脸检测方法的召回率更高。也就是说,本申请实施例提供的人脸检测方法,无论是检测精度、还是召回率都比传统方式的人脸检测方法的效果好。换言之,本申请实施例中使用的训练得到的人脸检测模型,对待检测图像进行人脸检测的精度和召回率都比较高。
结合图10a和图10b可以看出,利用本申请实施例提供的人脸检测模型进行人脸识别,无论是在对人脸检测模型训练的迭代过程中,还是利用训练得到的人脸检测模型,与传统的人脸检测方法相比,都具有较高的精度和较高的召回率。
可以理解的是,前述提及的验证集以及测试集,均为包含多个图像的图像集合。所述验证集(或者测试集)中的图像,可以为包含多种尺度的人脸的图像,利用本申请实施例提供的人脸检测方法,可以有效的检测出包含多尺度人脸的图像中各个尺度的人脸。可结合图11进行理解。
图11示出了本申请实施例提供的人脸检测方法的检测效果，图11中的每一个小框表示一个识别出的人脸。从图11可以看出，利用本申请实施例提供的人脸检测方法，可以检测出各个尺度的人脸，例如图11中左上角的图像中楼梯附近的小尺度人脸和坐在沙发上的大尺度人脸，都可以精确地检测出来。
基于以上实施例提供的人脸检测方法,以下从整体角度描述本申请实施例提供的又一种人脸检测方法。
参见图12,该图为本申请实施例提供的又一种人脸检测方法的流程示意图。该方法例如可以通过如下步骤S1201-S1203实现。
步骤S1201:根据人脸检测模型确定待检测图像中的人脸候选区域。
需要说明的是,此处提及的人脸检测模型,可以与前述实施例步骤S301中提及的人脸检测模型相同,该人脸检测模型可以包括多层卷积网络。关于根据人脸检测模型确定待检测图像中的人脸候选区域的实现方式,与前述实施例步骤S301中“根据人脸检测模型确定待检测图像中的人脸候选区域”相同,可以参考前述实施例步骤S301中相关内容的描述,此处不再赘述。
步骤S1202:根据所述人脸候选区域的尺寸参数与比例条件的大小关系,确定所述人脸候选区域所对应人脸的目标尺度,所述目标尺度为多个尺度中的一个,不同尺度的人脸对应不同的检测模型。
步骤S1203:根据所述目标尺度的人脸对应的检测模型,对所述人脸候选区域进行人脸检测。
本申请实施例提供的人脸检测方法,可以对待检测图像中的多个尺度的人脸进行检测。在本申请实施例中,考虑到人脸候选区域的尺寸参数,在一定程度上可以表征人脸候选区域中包含的人脸的尺寸参数,因此,可以根据人脸候选区域的尺寸参数与比例条件的大小关系,确定所述人脸候选区域所对应人脸的目标尺度。
在本申请实施例中,所述目标尺度例如可以为小尺度或者大尺度。具体地,在本申请实施例中,若所述人脸候选区域的尺寸参数小于或等于第一比例条件,确定所述人脸候选区域所对应人脸的目标尺度为小尺度;若所述人脸候选区域的尺寸参数大于第二比例条件,确定所述人脸候选区域所对应人脸的目标尺度为大尺度。
关于人脸候选区域的尺寸参数以及第一比例条件的描述，可以参考前述实施例步骤S301中关于尺寸参数的描述部分，此处不再赘述；关于所述第二比例条件，可以参考前述实施例对于第二比例条件的描述，此处不再详述。
在本申请实施例中,包括多个人脸检测模型,分别用于检测各个尺度的人脸。因此,在确定人脸候选区域所对应人脸的目标尺度之后,可以利用目标尺度的人脸对应的检测模型,对所述人脸候选区域进行人脸检测。
在本申请实施例中,若所述人脸候选区域所对应人脸的目标尺度为小尺度,则可以利用前述实施例提及的第一检测模型,对所述人脸候选区域进行检测。关于利用第一检测模型对小尺度人脸进行检测的具体实现方式,可以参考前述实施例的描述部分,此处不再详述。
在本申请实施例中,若所述人脸候选区域所对应人脸的目标尺度为大尺度,则可以利用前述实施例提及的第二检测模型,对所述人脸候选区域进行检测。关于利用第二检测模型对大尺度人脸进行检测的具体实现方式,可以参考前述实施例的描述部分,此处不再详述。
所述小尺度还可以进一步细分为若干种不同的小尺度。所述若干种不同小尺度的人脸分别具有各自对应的第一检测模型，以实现对所述若干种小尺度人脸的人脸检测。
可以理解的是，待检测图像可以包括多个人脸候选区域，对于任意一个人脸候选区域，均可以执行步骤S1202-S1203的方法，对该人脸候选区域进行人脸检测。在本申请实施例中，若待检测图像包括多个人脸候选区域，可以分别利用步骤S1202-S1203的方法对多个人脸候选区域进行人脸检测，然后分别获取所述多个人脸候选区域的多个人脸检测结果；将所述多个人脸检测结果合并，以得到包括该待检测图像中各个尺度的人脸的人脸检测结果。
由此可见,利用本申请实施例提供的人脸检测方法,确定人脸候选区域对应的人脸尺度之后,可以选择对应的检测模型进行人脸检测,实现了对不同尺度的人脸的识别。
基于前述图2至图9对应的实施例提供的一种人脸检测方法，本实施例提供一种人脸检测装置1300，参见图13a，所述装置1300包括：第一确定单元1301、第二确定单元1302和第一检测单元1303。
第一确定单元1301,用于根据人脸检测模型确定待检测图像中的人脸候选区域;所述人脸检测模型包括多层卷积网络;
第二确定单元1302,用于若所述人脸候选区域的尺寸参数小于第一比例条件,确定所述人脸候选区域对应小尺度人脸;
第一检测单元1303,用于通过对应所述小尺度人脸的第一检测模型对所述人脸候选区域进行人脸检测,包括:
通过所述第一检测模型获取所述人脸候选区域在所述人脸检测模型中至少两层卷积网络所输出特征图上的投影特征;所述至少两层卷积网络包括第一卷积网络和第二卷积网络,所述第一卷积网络所输出特征图的特征分辨率适用于所述尺寸参数,所述第一卷积网络的相邻层卷积网络为所述第二卷积网络,所述第二卷积网络所输出特征图的特征分辨率低于所述第一卷积网络所输出特征图的特征分辨率;
将所述第一卷积网络的投影特征与所述第二卷积网络的投影特征融合得到的融合特征作为所述第一卷积网络的投影特征;
根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测。
在一种实现方式中,若所述人脸候选区域的尺寸参数大于第二比例条件,参见图13b,所述装置1300还包括:第三确定单元1304和第二检测单元1305。
第三确定单元1304,用于确定所述人脸候选区域对应大尺度人脸;
第二检测单元1305,用于通过对应所述大尺度人脸的第二检测模型对所述人脸候选区域进行人脸检测,包括:
通过所述第二检测模型获取所述人脸候选区域在所述人脸检测模型中至少两层卷积网络所输出特征图上的投影特征;所述至少两层卷积网络包括第三卷积网络,所述第三卷积网络所输出特征图的特征分辨率适用于所述尺寸参数;
根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测。
在一种实现方式中,所述至少两层卷积网络分别设置有权重系数,在所述至少两层卷积网络中,特征分辨率适用于所述尺寸参数的卷积网络的权重系数大于其它卷积网络的权重系数;
所述根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测,包括:
根据所述至少两层卷积网络所输出特征图上的投影特征,以及分别对应的权重系数对所述人脸候选区域进行人脸检测。
在一种实现方式中,所述将所述第一卷积网络的投影特征与所述第二卷积网络的投影特征融合得到的融合特征作为所述第一卷积网络的投影特征,包括:
通过降低所述第一卷积网络的投影特征中的通道数量得到第一特征;
通过将所述第二卷积网络的投影特征的特征分辨率提高到与所述第一卷积网络的投影特征的特征分辨率一致,得到第二特征;
将所述第一特征和所述第二特征进行像素相加操作得到所述融合特征;
将所述融合特征作为所述第一卷积网络的投影特征。
在一种实现方式中,所述第一确定单元1301,具体用于:
获取所述待检测图像中的人脸感兴趣区域;
将所述人脸感兴趣区域投影到根据所述人脸检测模型输出的特征图上,得到第一特征图;
在所述第一特征图上生成锚框,得到第二特征图;在生成锚框的过程中,若目标锚框的中心点未与所述人脸感兴趣区域重叠,增大所述目标锚框的划窗步长;
根据多个人脸检测任务的损失函数计算所述第二特征图中的人脸候选区域,并将确定出的人脸候选区域作为所述待检测图像的人脸候选区域。
在一种实现方式中，所述多个人脸检测任务包括针对人脸目标的分类任务、针对人脸目标框的位置回归任务和针对人脸关键点的位置回归任务，所述多个人脸检测任务的损失函数根据下列方式训练得到：
将所述针对人脸目标的分类任务和针对人脸目标框的位置回归任务作为主任务,将所述针对人脸关键点的位置回归任务作为辅助任务联合训练各自对应的损失函数。
由上述技术方案可以看出,根据包括多层卷积网络的人脸检测模型确定待检测图像中的人脸候选区域,根据人脸候选区域的尺寸参数确定人脸候选区域是否对应的是小尺度人脸,若是,通过用于识别小尺度人脸的第一检测模型对人脸候选区域进行人脸检测,在对人脸候选区域进行人脸检测中,获取人脸候选区域在人脸检测模型中至少两层卷积网络所输出特征图上的投影特征,至少两层卷积网络包括第一卷积网络和第二卷积网络,其中第一卷积网络是根据人脸候选区域的尺寸参数确定的,故第一卷积网络所输出特征图的特征分辨率相对较高,适用于检测具有该尺寸参数的人脸候选区域,而第二卷积网络为第一卷积网络的相邻层卷积网络,虽然特征分辨率没有第一卷积网络高,但是基于人脸检测模型的特性,相对于第一卷积网络,第二卷积网络所输出的特征图携带有更多的语义信息,故将第一卷积网络的投影特征与第二卷积网络的投影特征融合得到的融合特征,不仅具有较高的特征分辨率,而且携带较多的语义信息,有助于检测小尺度人脸,故将该融合特征作为第一卷积网络的投影特征并根据至少两层卷积网络的投影特征对所述人脸候选区域进行人脸检测时,可以提高小尺度人脸的检测精度。
基于前述图12对应的实施例提供的一种人脸检测方法,本实施例提供一种人脸检测装置1400,参见图14a,所述装置1400包括:第一确定模块1401、第二确定模块1402和检测模块1403。
第一确定模块1401,用于根据人脸检测模型确定待检测图像中的人脸候选区域;
第二确定模块1402，用于根据所述人脸候选区域的尺寸参数与比例条件的大小关系，确定所述人脸候选区域所对应人脸的目标尺度，所述目标尺度为多个尺度中的一个，不同尺度的人脸对应不同的检测模型；
检测模块1403,用于根据所述目标尺度的人脸对应的检测模型,对所述人脸候选区域进行人脸检测。
在一种实现方式中,所述第二确定模块1402,具体用于:
若所述人脸候选区域的尺寸参数小于或等于第一比例条件,确定所述人脸候选区域所对应人脸的目标尺度为小尺度;
若所述人脸候选区域的尺寸参数大于第二比例条件,确定所述人脸候选区域所对应人脸的目标尺度为大尺度。
在一种实现方式中,所述待检测图像包括多个人脸候选区域,参见图14b,所述装置1400还包括:获取单元1404和合并单元1405。
获取单元1404,用于分别获取所述多个人脸候选区域的多个人脸检测结果;
合并单元1405,用于将所述多个人脸检测结果合并作为所述待检测图像的人脸检测结果。
由此可见,利用本申请实施例提供的人脸检测装置,确定人脸候选区域对应的人脸尺度之后,可以选择对应的检测模型进行人脸检测,实现了对不同尺度的人脸的识别。
本申请实施例还提供了一种人脸检测设备，下面结合附图对人脸检测设备进行介绍。请参见图15所示，本申请实施例提供了一种人脸检测设备1500，该设备1500可以是服务器等计算设备，可因配置或性能不同而产生比较大的差异，可以包括一个或一个以上中央处理器（Central Processing Units，简称CPU）1522（例如，一个或一个以上处理器）和存储器1532，一个或一个以上存储应用程序1542或数据1544的存储介质1530（例如一个或一个以上海量存储设备）。其中，存储器1532和存储介质1530可以是短暂存储或持久存储。存储在存储介质1530的程序可以包括一个或一个以上模块（图示没标出），每个模块可以包括对服务器中的一系列指令操作。更进一步地，中央处理器1522可以设置为与存储介质1530通信，在人脸检测设备1500上执行存储介质1530中的一系列指令操作，以实现本申请任一实施例所述的人脸检测方法。
人脸检测设备1500还可以包括一个或一个以上电源1526，一个或一个以上有线或无线网络接口1550，一个或一个以上输入输出接口1558，和/或，一个或一个以上操作系统1541，例如Windows Server™、Mac OS X™、Unix™、Linux™、FreeBSD™等等。
上述实施例中执行图2至图9所述的人脸检测方法可以基于该图15所示的服务器结构。
其中,CPU 1522用于执行如下步骤:
根据人脸检测模型确定待检测图像中的人脸候选区域;所述人脸检测模型包括多层卷积网络;
若所述人脸候选区域的尺寸参数小于第一比例条件,确定所述人脸候选区域对应小尺度人脸;
通过对应所述小尺度人脸的第一检测模型对所述人脸候选区域进行人脸检测,包括:
通过所述第一检测模型获取所述人脸候选区域在所述人脸检测模型中至少两层卷积网络所输出特征图上的投影特征;所述至少两层卷积网络包括第一卷积网络和第二卷积网络,所述第一卷积网络所输出特征图的特征分辨率适用于所述尺寸参数,所述第一卷积网络的相邻层卷积网络为所述第二卷积网络,所述第二卷积网络所输出特征图的特征分辨率低于所述第一卷积网络所输出特征图的特征分辨率;
将所述第一卷积网络的投影特征与所述第二卷积网络的投影特征融合得到的融合特征作为所述第一卷积网络的投影特征;
根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测。
上述实施例中执行图12所述的人脸检测方法可以基于该图15所示的服务器结构。
其中,CPU 1522用于执行如下步骤:
根据人脸检测模型确定待检测图像中的人脸候选区域;
根据所述人脸候选区域的尺寸参数与比例条件的大小关系，确定所述人脸候选区域所对应人脸的目标尺度，所述目标尺度为多个尺度中的一个，不同尺度的人脸对应不同的检测模型；
根据所述目标尺度的人脸对应的检测模型,对所述人脸候选区域进行人脸检测。
请参见图16所示,本申请实施例提供了一种人脸检测设备1600,该设备1600还可以是终端设备等计算设备,该终端设备可以为包括手机、平板电脑、个人数字助理(Personal Digital Assistant,简称PDA)、销售终端(Point of Sales,简称POS)、车载电脑等任意终端设备,以终端设备为手机为例。
图16示出的是与本申请实施例提供的终端设备相关的手机的部分结构的框图。参考图16,手机包括:射频(Radio Frequency,简称RF)电路1610、存储器1620、输入单元1630、显示单元1640、传感器1650、音频电路1660、无线保真(wireless fidelity,简称WiFi)模块1670、处理器1680、以及电源1690等部件。本领域技术人员可以理解,图16中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图16对手机的各个构成部件进行具体的介绍:
RF电路1610可用于收发信息或通话过程中信号的接收和发送，特别地，将基站的下行信息接收后，给处理器1680处理；另外，将涉及上行的数据发送给基站。通常，RF电路1610包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器（Low Noise Amplifier，简称LNA）、双工器等。此外，RF电路1610还可以通过无线通信与网络和其它设备通信。上述无线通信可以使用任一通信标准或协议，包括但不限于全球移动通讯系统（Global System of Mobile communication，简称GSM）、通用分组无线服务（General Packet Radio Service，简称GPRS）、码分多址（Code Division Multiple Access，简称CDMA）、宽带码分多址（Wideband Code Division Multiple Access，简称WCDMA）、长期演进（Long Term Evolution，简称LTE）、电子邮件、短消息服务（Short Messaging Service，简称SMS）等。
存储器1620可用于存储软件程序以及模块,处理器1680通过运行存储在存储器1620的软件程序以及模块,从而执行手机的各种功能应用以及数据处理。存储器1620可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作***、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器1620可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其它易失性固态存储器件。
输入单元1630可用于接收输入的数字或字符信息,以及产生与手机的用户设置以及功能控制有关的键信号输入。具体地,输入单元1630可包括触控面板1631以及其它输入设备1632。触控面板1631,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板1631上或在触控面板1631附近的操作),并根据预先设定的程式驱动相应的连接装置。在本申请实施例中,触控面板1631可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器1680,并能接收处理器1680发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板1631。除了触控面板1631,输入单元1630还可以包括其它输入设备1632。具体地,其它输入设备1632可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元1640可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元1640可包括显示面板1641，在本申请实施例中，可以采用液晶显示器（Liquid Crystal Display，简称LCD）、有机发光二极管（Organic Light-Emitting Diode，简称OLED）等形式来配置显示面板1641。进一步地，触控面板1631可覆盖显示面板1641，当触控面板1631检测到在其上或附近的触摸操作后，传送给处理器1680以确定触摸事件的类型，随后处理器1680根据触摸事件的类型在显示面板1641上提供相应的视觉输出。虽然在图16中，触控面板1631与显示面板1641是作为两个独立的部件来实现手机的输入和输出功能，但是在某些实施例中，可以将触控面板1631与显示面板1641集成而实现手机的输入和输出功能。
手机还可包括至少一种传感器1650,比如光传感器、运动传感器以及其它传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板1641的亮度,接近传感器可在手机移动到耳边时,关闭显示面板1641和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其它传感器,在此不再赘述。
音频电路1660、扬声器1661，传声器1662可提供用户与手机之间的音频接口。音频电路1660可将接收到的音频数据转换后的电信号，传输到扬声器1661，由扬声器1661转换为声音信号输出；另一方面，传声器1662将收集的声音信号转换为电信号，由音频电路1660接收后转换为音频数据，再将音频数据输出至处理器1680处理后，经RF电路1610以发送给比如另一手机，或者将音频数据输出至存储器1620以便进一步处理。
WiFi属于短距离无线传输技术,手机通过WiFi模块1670可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图16示出了WiFi模块1670,但是可以理解的是,其并不属于手机的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。
处理器1680是手机的控制中心，利用各种接口和线路连接整个手机的各个部分，通过运行或执行存储在存储器1620内的软件程序和/或模块，以及调用存储在存储器1620内的数据，执行手机的各种功能和处理数据，从而对手机进行整体监控。在本申请实施例中，处理器1680可包括一个或多个处理单元；在本申请实施例中，处理器1680可集成应用处理器和调制解调处理器，其中，应用处理器主要处理操作系统、用户界面和应用程序等，调制解调处理器主要处理无线通信。可以理解的是，上述调制解调处理器也可以不集成到处理器1680中。
手机还包括给各个部件供电的电源1690（比如电池），在本申请实施例中，电源可以通过电源管理系统与处理器1680逻辑相连，从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
尽管未示出,手机还可以包括摄像头、蓝牙模块等,在此不再赘述。
本申请实施例还提供一种计算机可读存储介质,用于存储程序代码,该程序代码可被处理器执行,以用于执行前述各个实施例所述的一种人脸检测方法中的任意一种实施方式。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述程序可以存储于一计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质可以是下述介质中的至少一种:只读存储器(英文:read-only memory,缩写:ROM)、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
需要说明的是，本说明书中的各个实施例均采用递进的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其它实施例的不同之处。尤其，对于设备及系统实施例而言，由于其基本相似于方法实施例，所以描述得比较简单，相关之处参见方法实施例的部分说明即可。以上所描述的设备及系统实施例仅仅是示意性的，其中作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下，即可以理解并实施。
以上所述,仅为本申请的一种具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应该以权利要求的保护范围为准。

Claims (15)

  1. 一种人脸检测方法,由计算设备执行,所述方法包括:
    根据人脸检测模型确定待检测图像中的人脸候选区域;所述人脸检测模型包括多层卷积网络;
    若所述人脸候选区域的尺寸参数小于第一比例条件,确定所述人脸候选区域对应小尺度人脸;
    通过对应所述小尺度人脸的第一检测模型获取所述人脸候选区域在所述人脸检测模型中至少两层卷积网络所输出特征图上的投影特征;所述至少两层卷积网络包括第一卷积网络和第二卷积网络,所述第一卷积网络所输出特征图的特征分辨率适用于所述尺寸参数,所述第一卷积网络的相邻层卷积网络为所述第二卷积网络,所述第二卷积网络所输出特征图的特征分辨率低于所述第一卷积网络所输出特征图的特征分辨率;
    将所述第一卷积网络的投影特征与所述第二卷积网络的投影特征融合得到的融合特征作为所述第一卷积网络的投影特征;
    根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测。
  2. 根据权利要求1所述的方法,还包括:
    若所述人脸候选区域的尺寸参数大于第二比例条件,确定所述人脸候选区域对应大尺度人脸;
    通过对应所述大尺度人脸的第二检测模型获取所述人脸候选区域在所述人脸检测模型中至少两层卷积网络所输出特征图上的投影特征;所述至少两层卷积网络包括第三卷积网络,所述第三卷积网络所输出特征图的特征分辨率适用于所述尺寸参数;
    根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测。
  3. 根据权利要求1或2所述的方法,所述至少两层卷积网络分别设置有权重系数,在所述至少两层卷积网络中,特征分辨率适用于所述尺寸参数的卷积网络的权重系数大于其它卷积网络的权重系数;
    所述根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测,包括:
    根据所述至少两层卷积网络所输出特征图上的投影特征,以及分别对应的权重系数对所述人脸候选区域进行人脸检测。
  4. 根据权利要求1所述的方法,所述将所述第一卷积网络的投影特征与所述第二卷积网络的投影特征融合得到的融合特征作为所述第一卷积网络的投影特征,包括:
    通过降低所述第一卷积网络的投影特征中的通道数量得到第一特征;
    通过将所述第二卷积网络的投影特征的特征分辨率提高到与所述第一卷积网络的投影特征的特征分辨率一致,得到第二特征;
    将所述第一特征和所述第二特征进行像素相加操作得到所述融合特征;
    将所述融合特征作为所述第一卷积网络的投影特征。
  5. 根据权利要求1所述的方法,所述根据人脸检测模型确定待检测图像中的人脸候选区域,包括:
    获取所述待检测图像中的人脸感兴趣区域;
    将所述人脸感兴趣区域投影到根据所述人脸检测模型输出的特征图上,得到第一特征图;
    在所述第一特征图上生成锚框,得到第二特征图;在生成锚框的过程中,若目标锚框的中心点未与所述人脸感兴趣区域重叠,增大所述目标锚框的划窗步长;
    根据多个人脸检测任务的损失函数计算所述第二特征图中的人脸候选区域,并将确定出的人脸候选区域作为所述待检测图像的人脸候选区域。
  6. 根据权利要求5所述的方法,所述多个人脸检测任务包括针对人脸目标的分类任务、针对人脸目标框的位置回归任务和针对人脸关键点的位置回归任务,所述多个人脸检测任务的损失函数根据下列方式训练得到:
    将所述针对人脸目标的分类任务和针对人脸目标框的位置回归任务作为主任务,将所述针对人脸关键点的位置回归任务作为辅助任务联合训练各自对应的损失函数。
  7. 一种人脸检测方法,由计算设备执行,所述方法包括:
    根据人脸检测模型确定待检测图像中的人脸候选区域;
    根据所述人脸候选区域的尺寸参数与比例条件的大小关系,确定所述人脸候选区域所对应人脸的目标尺度,所述目标尺度为多个尺度中的一个,不同尺度的人脸对应不同的检测模型;
    根据所述目标尺度的人脸对应的检测模型,对所述人脸候选区域进行人脸检测。
  8. 根据权利要求7所述的方法,所述根据所述人脸候选区域的尺寸参数与比例条件的大小关系,确定所述人脸候选区域所对应人脸的目标尺度,包括:
    若所述人脸候选区域的尺寸参数小于或等于第一比例条件,确定所述人脸候选区域所对应人脸的目标尺度为小尺度;
    若所述人脸候选区域的尺寸参数大于第二比例条件,确定所述人脸候选区域所对应人脸的目标尺度为大尺度。
  9. 根据权利要求7所述的方法,所述待检测图像包括多个人脸候选区域,所述方法还包括:
    分别获取所述多个人脸候选区域的多个人脸检测结果;
    将所述多个人脸检测结果合并作为所述待检测图像的人脸检测结果。
  10. 一种人脸检测装置,包括:
    第一确定单元,用于根据人脸检测模型确定待检测图像中的人脸候选区域;所述人脸检测模型包括多层卷积网络;
    第二确定单元,用于若所述人脸候选区域的尺寸参数小于第一比例条件,确定所述人脸候选区域对应小尺度人脸;
    第一检测单元,用于:
    通过对应所述小尺度人脸的第一检测模型获取所述人脸候选区域在所述人脸检测模型中至少两层卷积网络所输出特征图上的投影特征；所述至少两层卷积网络包括第一卷积网络和第二卷积网络，所述第一卷积网络所输出特征图的特征分辨率适用于所述尺寸参数，所述第一卷积网络的相邻层卷积网络为所述第二卷积网络，所述第二卷积网络所输出特征图的特征分辨率低于所述第一卷积网络所输出特征图的特征分辨率；
    将所述第一卷积网络的投影特征与所述第二卷积网络的投影特征融合得到的融合特征作为所述第一卷积网络的投影特征;
    根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测。
  11. 根据权利要求10所述的装置,还包括:
    第三确定单元,用于,若所述人脸候选区域的尺寸参数大于第二比例条件,确定所述人脸候选区域对应大尺度人脸;
    第二检测单元,用于:
    通过对应所述大尺度人脸的第二检测模型获取所述人脸候选区域在所述人脸检测模型中至少两层卷积网络所输出特征图上的投影特征;所述至少两层卷积网络包括第三卷积网络,所述第三卷积网络所输出特征图的特征分辨率适用于所述尺寸参数;
    根据所述至少两层卷积网络所输出特征图上的投影特征对所述人脸候选区域进行人脸检测。
  12. 一种人脸检测装置,包括:
    第一确定模块,用于根据人脸检测模型确定待检测图像中的人脸候选区域;
    第二确定模块,用于根据所述人脸候选区域的尺寸参数与比例条件的大小关系,确定所述人脸候选区域所对应人脸的目标尺度,所述目标尺度为多个尺度中的一个,不同尺度的人脸对应不同的检测模型;
    检测模块,用于根据所述目标尺度的人脸对应的检测模型,对所述人脸候选区域进行人脸检测。
  13. 根据权利要求12所述的装置，所述待检测图像包括多个人脸候选区域，所述装置还包括：
    获取单元,用于分别获取所述多个人脸候选区域的多个人脸检测结果;
    合并单元,用于将所述多个人脸检测结果合并作为所述待检测图像的人脸检测结果。
  14. 一种人脸检测设备,包括处理器以及存储器:
    所述存储器用于存储程序代码,并将所述程序代码传输给所述处理器;
    所述处理器用于根据所述程序代码中的指令执行权利要求1-9任一项所述的人脸检测方法。
  15. 一种计算机可读存储介质,所述存储介质中存储有程序代码,所述程序代码可以被处理器执行以实现权利要求1-9任一项所述的人脸检测方法。
PCT/CN2019/127003 2019-01-02 2019-12-20 一种人脸检测方法、装置、设备以及存储介质 WO2020140772A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19906810.7A EP3910551A4 (en) 2019-01-02 2019-12-20 FACE RECOGNITION METHOD, DEVICE, DEVICE AND STORAGE MEDIA
US17/325,862 US12046012B2 (en) 2019-01-02 2021-05-20 Face detection method, apparatus, and device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910002499.2 2019-01-02
CN201910002499.2A CN109753927A (zh) 2019-01-02 2019-01-02 一种人脸检测方法和装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/325,862 Continuation US12046012B2 (en) 2019-01-02 2021-05-20 Face detection method, apparatus, and device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020140772A1 true WO2020140772A1 (zh) 2020-07-09

Family

ID=66405145

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127003 WO2020140772A1 (zh) 2019-01-02 2019-12-20 一种人脸检测方法、装置、设备以及存储介质

Country Status (4)

Country Link
US (1) US12046012B2 (zh)
EP (1) EP3910551A4 (zh)
CN (1) CN109753927A (zh)
WO (1) WO2020140772A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985448A (zh) * 2020-09-02 2020-11-24 深圳壹账通智能科技有限公司 车辆图像识别方法、装置、计算机设备及可读存储介质
CN112036404A (zh) * 2020-08-31 2020-12-04 上海大学 一种海上船只目标检测方法及***
CN112580435A (zh) * 2020-11-25 2021-03-30 厦门美图之家科技有限公司 人脸定位方法、人脸模型训练与检测方法及装置
CN113688663A (zh) * 2021-02-23 2021-11-23 北京澎思科技有限公司 人脸检测方法、装置、电子设备以及可读存储介质

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109753927A (zh) * 2019-01-02 2019-05-14 腾讯科技(深圳)有限公司 一种人脸检测方法和装置
JP6635208B1 (ja) * 2019-02-22 2020-01-22 日本電気株式会社 検索装置、検索方法、およびプログラム
CN110263731B (zh) * 2019-06-24 2021-03-16 电子科技大学 一种单步人脸检测***
US20210350139A1 (en) * 2020-05-11 2021-11-11 Nvidia Corporation Highlight determination using one or more neural networks
CN111695430B (zh) * 2020-05-18 2023-06-30 电子科技大学 一种基于特征融合和视觉感受野网络的多尺度人脸检测方法
CN112085701B (zh) * 2020-08-05 2024-06-11 深圳市优必选科技股份有限公司 一种人脸模糊度检测方法、装置、终端设备及存储介质
CN112232258B (zh) * 2020-10-27 2024-07-09 腾讯科技(深圳)有限公司 一种信息处理方法、装置及计算机可读存储介质
CN112800942B (zh) * 2021-01-26 2024-02-13 泉州装备制造研究所 一种基于自校准卷积网络的行人检测方法
CN113128479B (zh) * 2021-05-18 2023-04-18 成都市威虎科技有限公司 一种学习噪声区域信息的人脸检测方法及装置
CN114358795B (zh) * 2022-03-18 2022-06-14 武汉乐享技术有限公司 一种基于人脸的支付方法和装置
CN114550223B (zh) * 2022-04-25 2022-07-12 中国科学院自动化研究所 人物交互检测方法、装置及电子设备
CN114882243B (zh) * 2022-07-11 2022-11-22 浙江大华技术股份有限公司 目标检测方法、电子设备及计算机可读存储介质
CN116229336B (zh) * 2023-05-10 2023-08-18 江西云眼视界科技股份有限公司 视频移动目标识别方法、***、存储介质及计算机

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341517A (zh) * 2017-07-07 2017-11-10 哈尔滨工业大学 一种基于深度学习层级间特征融合的多尺度小物体检测方法
US20180096457A1 (en) * 2016-09-08 2018-04-05 Carnegie Mellon University Methods and Software For Detecting Objects in Images Using a Multiscale Fast Region-Based Convolutional Neural Network
CN108460411A (zh) * 2018-02-09 2018-08-28 北京市商汤科技开发有限公司 实例分割方法和装置、电子设备、程序和介质
CN108564097A (zh) * 2017-12-05 2018-09-21 华南理工大学 一种基于深度卷积神经网络的多尺度目标检测方法
CN109117876A (zh) * 2018-07-26 2019-01-01 成都快眼科技有限公司 一种稠密小目标检测模型构建方法、模型及检测方法
CN109753927A (zh) * 2019-01-02 2019-05-14 腾讯科技(深圳)有限公司 一种人脸检测方法和装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198303B (zh) * 2013-04-12 2016-03-02 南京邮电大学 一种基于人脸图像的性别识别方法
US10354159B2 (en) * 2016-09-06 2019-07-16 Carnegie Mellon University Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network
US10360494B2 (en) * 2016-11-30 2019-07-23 Altumview Systems Inc. Convolutional neural network (CNN) system based on resolution-limited small-scale CNN modules
CN107273845B (zh) * 2017-06-12 2020-10-02 大连海事大学 一种基于置信区域和多特征加权融合的人脸表情识别方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180096457A1 (en) * 2016-09-08 2018-04-05 Carnegie Mellon University Methods and Software For Detecting Objects in Images Using a Multiscale Fast Region-Based Convolutional Neural Network
CN107341517A (zh) * 2017-07-07 2017-11-10 哈尔滨工业大学 一种基于深度学习层级间特征融合的多尺度小物体检测方法
CN108564097A (zh) * 2017-12-05 2018-09-21 华南理工大学 一种基于深度卷积神经网络的多尺度目标检测方法
CN108460411A (zh) * 2018-02-09 2018-08-28 北京市商汤科技开发有限公司 实例分割方法和装置、电子设备、程序和介质
CN109117876A (zh) * 2018-07-26 2019-01-01 成都快眼科技有限公司 一种稠密小目标检测模型构建方法、模型及检测方法
CN109753927A (zh) * 2019-01-02 2019-05-14 腾讯科技(深圳)有限公司 一种人脸检测方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3910551A4 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036404A (zh) * 2020-08-31 2020-12-04 上海大学 一种海上船只目标检测方法及***
CN112036404B (zh) * 2020-08-31 2024-01-02 上海大学 一种海上船只目标检测方法及***
CN111985448A (zh) * 2020-09-02 2020-11-24 深圳壹账通智能科技有限公司 车辆图像识别方法、装置、计算机设备及可读存储介质
CN112580435A (zh) * 2020-11-25 2021-03-30 厦门美图之家科技有限公司 人脸定位方法、人脸模型训练与检测方法及装置
CN112580435B (zh) * 2020-11-25 2024-05-31 厦门美图之家科技有限公司 人脸定位方法、人脸模型训练与检测方法及装置
CN113688663A (zh) * 2021-02-23 2021-11-23 北京澎思科技有限公司 人脸检测方法、装置、电子设备以及可读存储介质

Also Published As

Publication number Publication date
EP3910551A4 (en) 2022-05-11
US20210326574A1 (en) 2021-10-21
EP3910551A1 (en) 2021-11-17
CN109753927A (zh) 2019-05-14
US12046012B2 (en) 2024-07-23

Similar Documents

Publication Publication Date Title
WO2020140772A1 (zh) 一种人脸检测方法、装置、设备以及存储介质
US11907851B2 (en) Image description generation method, model training method, device and storage medium
US20210264227A1 (en) Method for locating image region, model training method, and related apparatus
EP3985990A1 (en) Video clip positioning method and apparatus, computer device, and storage medium
US10607120B2 (en) Training method and apparatus for convolutional neural network model
US20220004794A1 (en) Character recognition method and apparatus, computer device, and storage medium
CN106919918B (zh) 一种人脸跟踪方法和装置
WO2018103525A1 (zh) 人脸关键点跟踪方法和装置、存储介质
CN109213732B (zh) 一种改善相册分类的方法、移动终端及计算机可读存储介质
WO2018113512A1 (zh) 图像处理方法以及相关装置
WO2020215949A1 (zh) 对象处理方法及终端设备
WO2021098695A1 (zh) 信息分享方法及电子设备
WO2019233216A1 (zh) 一种手势动作的识别方法、装置以及设备
CN110162604B (zh) 语句生成方法、装置、设备及存储介质
US20220404959A1 (en) Search method and electronic device
WO2021057301A1 (zh) 文件控制方法及电子设备
CN110830713A (zh) 一种变焦方法及电子设备
KR20230071720A (ko) 얼굴 이미지의 랜드마크 좌표 예측 방법 및 장치
CN114020188A (zh) 控制方法、智能终端及存储介质
WO2024022149A1 (zh) 数据增强方法、装置及电子设备
US11877057B2 (en) Electronic device and focusing method
CN108255389B (zh) 图像编辑方法、移动终端及计算机可读存储介质
WO2023273345A1 (zh) 基于图像的车辆定损方法、装置及***
CN113536876A (zh) 一种图像识别方法和相关装置
CN111696051A (zh) 人像修复方法及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19906810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019906810

Country of ref document: EP

Effective date: 20210802