CN113011356A - Face feature detection method, device, medium and electronic equipment - Google Patents

Face feature detection method, device, medium and electronic equipment

Info

Publication number
CN113011356A
Authority
CN
China
Prior art keywords
face
detection network
image
detected
face area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110324684.0A
Other languages
Chinese (zh)
Inventor
王猛
阮良
陈功
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Langhe Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Langhe Technology Co Ltd
Priority to CN202110324684.0A
Publication of CN113011356A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide a face feature detection method, apparatus, medium and electronic device, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a face feature detection model and outputting face region coordinates and face key point coordinates corresponding to the image to be detected, wherein the face feature detection model comprises a target face region detection network and a face key point detection network; and labeling the image to be detected according to the face region coordinates and the face key point coordinates to generate the image to be detected with the face region and the face key points labeled. According to the technical solution of the embodiments of the present disclosure, the face region coordinates and the face key point coordinates can be output simultaneously, an image containing both face region labels and face key point labels can be obtained quickly, and the efficiency of face feature extraction is effectively improved.

Description

Face feature detection method, device, medium and electronic equipment
Technical Field
The embodiments of the present disclosure relate to the technical field of artificial intelligence, and more particularly, to a face feature detection method, a face feature detection apparatus, a computer-readable storage medium, and an electronic device.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the rapid development of science and technology, face detection technology is used more and more widely. Face detection is a technology that searches a given image with a certain strategy to determine whether it contains a face; it is a key link in automatic face recognition systems. Face detection mainly comprises face region detection and face key point detection, and a precondition of face key point detection is that the face region has been determined first.
At present, related technical solutions use a separate face region detection network and a separate face key point detection network to respectively detect the face regions and the face key points in a given image. Specifically, the face regions in the given image are determined through the separate face region detection network, and the different face regions are then input in turn into the separate face key point detection network to obtain the face key points corresponding to each face region. However, in this scheme, the face key point detection network must wait for the face region detection network to output a face region before it can start detecting face key points on that region, which results in low face feature detection efficiency and long time consumption; meanwhile, because the network structure of the separate face region detection network is complex, the computation required to detect the face regions is large and also consumes a long time.
Disclosure of Invention
In this context, embodiments of the present disclosure are expected to provide a face feature detection method, a face feature detection apparatus, a computer-readable storage medium, and an electronic device, so as to overcome, at least to a certain extent, the problems of low face feature detection efficiency and long time consumption in related face feature detection schemes.
In a first aspect of the embodiments of the present disclosure, a method for detecting a face feature is provided, including:
acquiring an image to be detected;
inputting the image to be detected into a face feature detection model and outputting face region coordinates and face key point coordinates corresponding to the image to be detected, wherein the face feature detection model comprises a target face region detection network and a face key point detection network;
and labeling the image to be detected according to the face area coordinates and the face key point coordinates to generate the image to be detected with the face area and the face key point labeled.
In some embodiments of the present disclosure, based on the foregoing scheme, inputting the image to be detected into a face feature detection model, and outputting the face region coordinates and the face key point coordinates corresponding to the image to be detected includes:
inputting the image to be detected into the target face region detection network and outputting the face region coordinates corresponding to the image to be detected;
extracting a face region feature map corresponding to the face region coordinates from the target face region detection network;
and inputting the extracted face region feature map into the face key point detection network to output the face key point coordinates corresponding to the image to be detected.
In some embodiments of the present disclosure, based on the foregoing scheme, extracting the face region feature map corresponding to the face region coordinates from the target face region detection network includes:
determining a target intermediate feature map corresponding to the target face region detection network;
and performing position mapping on the target intermediate feature map based on the face region coordinates to obtain a face region feature map matched with the image to be detected.
In some embodiments of the present disclosure, based on the foregoing, the method further comprises:
performing knowledge distillation processing on a second face region detection network at least once based on an output result of a first face region detection network to obtain a target face region detection network;
wherein the structural scale of the first face region detection network is larger than that of the second face region detection network; the knowledge distillation process can guide the training of the second face region detection network through the output result corresponding to the first face region detection network, realizing knowledge transfer.
In some embodiments of the present disclosure, based on the foregoing, the method further comprises:
acquiring face region sample data subjected to face region labeling in advance;
and carrying out network training processing on an original face area detection network through the face area sample data to obtain the first face area detection network.
In some embodiments of the present disclosure, based on the foregoing solution, performing knowledge distillation processing on the second face region detection network at least once based on the output result of the first face region detection network to obtain the target face region detection network includes:
executing the following cyclic process until the accuracy of the output result corresponding to the second face region detection network is smaller than a preset accuracy threshold, and taking the second face region detection network obtained from the previous round of training as the target face region detection network:
cutting out a second face region detection network whose structural scale is smaller than that of the first face region detection network;
performing network training on the second face region detection network according to the output result of the first face region detection network and the face region sample data to obtain a trained second face region detection network;
and calculating the accuracy of the output result of the trained second face region detection network, and when the accuracy is greater than or equal to the accuracy threshold, taking the trained second face region detection network as the first face region detection network of the next cycle.
In some embodiments of the present disclosure, based on the foregoing scheme, inputting the image to be detected into the target face region detection network and outputting the face region coordinates corresponding to the image to be detected includes:
inputting the image to be detected into the target face region detection network to output a plurality of face region coordinates and the confidence scores corresponding to the face region coordinates;
and taking the face region coordinates whose confidence score is greater than or equal to a confidence threshold as the face region coordinates corresponding to the image to be detected.
In a second aspect of the disclosed embodiments, there is provided a face feature detection apparatus, comprising:
the data acquisition module is used for acquiring an image to be detected;
the face feature detection module is used for inputting the image to be detected into a face feature detection model and outputting the face region coordinates and face key point coordinates corresponding to the image to be detected, wherein the face feature detection model comprises a target face region detection network and a face key point detection network;
and the face feature image generation module is used for labeling the image to be detected according to the face region coordinates and the face key point coordinates to generate the image to be detected with the face region and the face key points labeled.
In some embodiments of the present disclosure, based on the foregoing solution, the face feature detection module includes:
the face region coordinate detection unit is used for inputting the image to be detected into the target face region detection network and outputting the face region coordinates corresponding to the image to be detected;
the face region feature extraction unit is configured to extract a face region feature map corresponding to the face region coordinates from the target face region detection network;
and the face key point coordinate detection unit is used for inputting the extracted face region feature map into the face key point detection network and outputting the face key point coordinates corresponding to the image to be detected.
In some embodiments of the present disclosure, based on the foregoing solution, the face region feature extraction unit is further configured to:
determining a target intermediate feature map corresponding to the target face region detection network;
and performing position mapping on the target intermediate feature map based on the face region coordinates to obtain a face region feature map matched with the image to be detected.
In some embodiments of the present disclosure, based on the foregoing scheme, the face feature detection apparatus further includes a target face area detection network generation module, where the target face area detection network generation module includes:
the knowledge distillation unit is used for performing at least one round of knowledge distillation processing on a second face region detection network based on the output result of a first face region detection network to obtain the target face region detection network;
wherein the structural scale of the first face region detection network is larger than that of the second face region detection network; the knowledge distillation process can guide the training of the second face region detection network through the output result corresponding to the first face region detection network, realizing knowledge transfer.
In some embodiments of the present disclosure, based on the foregoing scheme, the target face area detection network generation module further includes a large face area detection network generation unit, where the large face area detection network generation unit is configured to:
acquiring face region sample data subjected to face region labeling in advance;
and carrying out network training processing on an original face area detection network through the face area sample data to obtain the first face area detection network.
In some embodiments of the present disclosure, based on the foregoing scheme, the knowledge distillation unit is further configured to:
executing the following cyclic process until the accuracy of the output result corresponding to the second face region detection network is smaller than a preset accuracy threshold, and taking the second face region detection network obtained from the previous round of training as the target face region detection network:
cutting out a second face region detection network whose structural scale is smaller than that of the first face region detection network;
performing network training on the second face region detection network according to the output result of the first face region detection network and the face region sample data to obtain a trained second face region detection network;
and calculating the accuracy of the output result of the trained second face region detection network, and when the accuracy is greater than or equal to the accuracy threshold, taking the trained second face region detection network as the first face region detection network of the next cycle.
In some embodiments of the present disclosure, based on the foregoing solution, the face region coordinate detecting unit is further configured to:
inputting the image to be detected into the target face region detection network to output a plurality of face region coordinates and the confidence scores corresponding to the face region coordinates;
and taking the face region coordinates whose confidence score is greater than or equal to the confidence threshold as the face region coordinates corresponding to the image to be detected.
In a third aspect of embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the face feature detection method as described in the first aspect above.
In a fourth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; and a memory having computer readable instructions stored thereon, the computer readable instructions, when executed by the processor, implementing the method of face feature detection as described in the first aspect above.
According to the technical solutions of the embodiments of the present disclosure, on one hand, a face feature detection model is obtained by combining a target face region detection network and a face key point detection network, and the face features in the image to be detected are extracted by this model, so that while the detection accuracy of the face features is ensured, the extraction efficiency of the face features can be effectively improved, the detection time delay is reduced, and the detection effect of face feature detection is improved, especially in real-time video communication scenes; on the other hand, by inputting the image to be detected into the face feature detection model, the face region coordinates and the face key point coordinates can be output directly, which avoids the problem that the face region must first be detected by a separate face region detection network and the face key points can only be obtained after the face region is input into a separate face key point detection network, further improving the face feature detection efficiency as well as the accuracy of the face region and the face key points.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 schematically illustrates a schematic block diagram of a system architecture of an exemplary application scenario, in accordance with some embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of face feature detection, in accordance with some embodiments of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a shared face region feature map, according to some embodiments of the present disclosure;
FIG. 4 schematically illustrates a flow diagram for determining a face region feature map by location mapping, according to some embodiments of the present disclosure;
FIG. 5 schematically illustrates a flow diagram for training a first face region detection network, in accordance with some embodiments of the present disclosure;
FIG. 6 schematically illustrates a flow diagram of deriving a target face region detection network by fractional knowledge distillation, according to some embodiments of the present disclosure;
FIG. 7 schematically illustrates a flow diagram for determining face region coordinates according to some embodiments of the present disclosure;
FIG. 8 schematically illustrates a flow diagram for implementing face feature detection, in accordance with some embodiments of the present disclosure;
FIG. 9 schematically illustrates an application diagram of facial features according to some embodiments of the present disclosure;
FIG. 10 schematically illustrates a schematic block diagram of a face feature detection apparatus according to some embodiments of the present disclosure;
FIG. 11 schematically shows a schematic view of a storage medium according to an example embodiment of the present disclosure; and
fig. 12 schematically shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to a number of exemplary embodiments. It is understood that these examples are given solely to enable those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the present disclosure, a face feature detection method, a face feature detection apparatus, a medium, and an electronic device are provided.
In this context, it is to be understood that the terms referred to, such as:
the Region of interest (ROI) is an image Region to be processed, which is delineated from a processed image in a form of a square, a circle, an ellipse, an irregular polygon, or the like in machine vision and image processing.
Face region detection refers to detecting video frames or images in real time and returning the face ROI in each frame of image.
Face key point detection takes a detected face ROI as input and identifies a plurality of key points of the face within that ROI, such as corner points of the facial features and face contour points.
Knowledge distillation refers to a model construction method in which a small student network learns the knowledge of a large teacher network, so that the student network can achieve or even exceed the effect of the teacher network.
Fractional (stepwise) knowledge distillation addresses the fact that distilling a very small student network directly from a large teacher network loses precision; therefore one or more intermediate networks are obtained through knowledge distillation to serve as transition networks between the large network and the very small network, realizing the knowledge transfer from the teacher network to the student network.
Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of The Invention
In related face feature detection schemes, a separate face region detection network and a separate face key point detection network are generally adopted to respectively detect the face regions and the face key points in a given image. Specifically, the face regions in the given image are determined through the separate face region detection network, and the different face regions are then input in turn into the separate face key point detection network to obtain the face key points corresponding to each face region.
This scheme can ensure the accuracy of the detected face regions and face key points. However, since the separate face region detection network and the separate face key point detection network have relatively complex network structures, extracting the face features in each frame of image requires a large amount of time, the processing efficiency is low, and the time delay of face feature detection is large. In addition, when the face features in a frame of image, namely the face region and the face key points, are extracted together, the separate face region detection network must first detect the face region in the image before the separate face key point detection network can detect the face key points based on the extracted region, so face feature detection consumes even more time, and a larger time delay may be caused, especially in real-time video communication scenes.
Based on the above, the basic idea of the present disclosure is to combine a target face region detection network and a face key point detection network to obtain a face feature detection model, and output face region coordinates and face key point coordinates corresponding to an image to be detected by inputting the image to be detected into the face feature detection model including the target face region detection network and the face key point detection network, and label the image to be detected by the face region coordinates and the face key point coordinates to generate an image to be detected with a face region and face key points labeled, so that while ensuring the detection accuracy of face features, the extraction efficiency of face features can be effectively improved, the detection time delay is reduced, and the detection effect of face feature detection in a face feature detection scene, especially a real-time video communication scene, is improved.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Application scene overview
Referring first to fig. 1, fig. 1 is a schematic block diagram illustrating a system architecture of an exemplary application scenario to which a face feature detection method and apparatus according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The face feature detection method provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, the face feature detection apparatus is generally disposed in the server 105. However, it is easily understood by those skilled in the art that the face feature detection method provided in the embodiment of the present disclosure may also be executed by the terminal devices 101, 102, and 103, and accordingly, the face feature detection apparatus may also be disposed in the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment. For example, in an exemplary embodiment, the terminal devices 101, 102, and 103 may upload an input video to be detected or an input image to be detected to the server 105, and the server outputs face region coordinates and face key point coordinates by using the face feature detection method provided by the embodiment of the present disclosure, and transmits the face region coordinates and the face key point coordinates to the terminal devices 101, 102, and 103, so that the terminal devices 101, 102, and 103 label the face region coordinates and the face key point coordinates to the image to be detected to complete face feature detection.
It should be understood that the application scenario illustrated in fig. 1 is only one example in which embodiments of the present disclosure may be implemented. The application scope of the embodiments of the present disclosure is not limited in any way by the application scenario.
Exemplary method
In the following, a face feature detection method according to an exemplary embodiment of the present disclosure is described with reference to fig. 2 in conjunction with an application scenario of fig. 1. It should be noted that the above application scenarios are merely illustrative for the convenience of understanding the spirit and principles of the present disclosure, and embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
In current face feature detection schemes, on one hand, because the separate face region detection network and the separate face key point detection network have relatively complex network structures, a large amount of time is consumed to extract the face features in each frame of image, the processing efficiency is low, and the time delay of face feature detection is large; on the other hand, when the face features in a frame of image, namely the face region and the face key points, are extracted together, the separate face region detection network must first detect the face region in the image before the separate face key point detection network can detect the face key points based on the extracted region, so face feature detection consumes more time and may cause a larger time delay, especially in real-time video communication scenes.
Therefore, in the prior art, it is difficult to achieve a satisfactory face feature detection scheme.
Therefore, an improved face feature detection method is highly needed, so that the accuracy of face feature detection can be ensured, and meanwhile, a face region and face key points can be simultaneously output, thereby effectively reducing the time consumption of face feature detection, improving the detection efficiency of face features, improving the response speed, particularly reducing the time delay in a real-time video communication scene, and improving the face feature detection effect.
The present disclosure first provides a face feature detection method, where an execution subject of the method may be a terminal device or a server, and the present disclosure is not particularly limited to this, and in this example embodiment, the server executes the method as an example for description.
Referring to fig. 2, in step S210, an image to be detected is acquired.
In an exemplary embodiment, the image to be detected refers to an image frame that needs to be subjected to face feature detection, for example, the image to be detected may be a still image that includes face information and is acquired by an image acquisition unit, or may be a video frame that corresponds to a dynamic video that includes face information, such as a real-time video in a scene such as a live video broadcast or a video conference, and of course, a person skilled in the art can easily understand that the image to be detected may also be an image that includes face information and needs to be subjected to face feature detection in any scene, which is not particularly limited in this exemplary embodiment.
In step S220, the image to be detected is input into a face feature detection model, and face region coordinates and face key point coordinates corresponding to the image to be detected are output.
In an exemplary embodiment, the face feature detection model refers to a pre-constructed neural network model for detecting a face region and face key points in an image to be detected, and the face feature detection model may be composed of a pre-trained target face region detection network for detecting the face region and a pre-trained face key point detection network for detecting the face key points.
The face region coordinates refer to the position coordinates, in the image to be detected, of the ROI corresponding to the face information. For example, the face region used for labeling the face information in the image to be detected may be a rectangular region, and the face region coordinates may then be the coordinates of the four vertices of that rectangle; of course, the face region may also be a region of another geometric shape, in which case the face region coordinates may be the coordinates of points that uniquely determine that shape, which is not particularly limited in this example embodiment.
The face key point coordinates refer to the position coordinates of the key points representing the facial features of a face in the image to be detected, for example, the key points representing key facial features such as the eyebrows, eyes, nose bridge and mouth of a face in the image to be detected; for instance, the key facial features of a given face in the image to be detected may be represented by 68 key points, which is not particularly limited in this example.
Specifically, the face key point detection network may serve as a branch network of the target face region detection network to construct the face feature detection model; of course, in practical applications, the target face region detection network may instead serve as a branch network of the face key point detection network. Either way, inputting the image to be detected into the face feature detection model directly outputs the face region coordinates and face key point coordinates corresponding to the face information in the image to be detected, realizing the detection of the face features in the image.
In step S230, the image to be detected is labeled through the coordinates of the face region and the coordinates of the face key points, and an image to be detected with the face region and the face key points labeled is generated.
In an exemplary embodiment, after the face region coordinates and the face key point coordinates are obtained, the corresponding key points are added at the corresponding position coordinates of the image to be detected according to the face key point coordinates, and the corresponding face region frames are generated at the corresponding position coordinates according to the face region coordinates, yielding the image to be detected with the face region and the face key points labeled. In this way, the detection and identification of face information in videos or images, together with the labeling of face regions and face key points, can be completed rapidly and accurately in real-time video communication scenes such as video conferences, face recognition and video telephony.
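For illustration only, a minimal sketch of this labeling step might look like the following Python snippet; the OpenCV drawing calls are standard, but the coordinate formats and the function name are assumptions rather than part of the disclosed embodiments:

```python
import cv2  # OpenCV, assumed available for drawing

def annotate_faces(image, face_boxes, face_keypoints):
    """Draw face region frames and face key points on the image to be detected.

    face_boxes:     list of (x1, y1, x2, y2) rectangles (hypothetical format).
    face_keypoints: one list of (x, y) key point coordinates per face.
    """
    for (x1, y1, x2, y2), points in zip(face_boxes, face_keypoints):
        # Face region frame at the detected region coordinates.
        cv2.rectangle(image, (x1, y1), (x2, y2), color=(0, 255, 0), thickness=2)
        for (x, y) in points:
            # One dot per key point (e.g., 68 key points per face).
            cv2.circle(image, (int(x), int(y)), 2, color=(0, 0, 255), thickness=-1)
    return image
```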
According to the technical solution of the example embodiment of fig. 2, on one hand, a face feature detection model is obtained by combining a target face region detection network and a face key point detection network, and the face features in the image to be detected are extracted by this model, so that while the detection accuracy of the face features is ensured, the extraction efficiency of the face features can be effectively improved, the detection time delay is reduced, and the detection effect of face feature detection is improved, especially in real-time video communication scenes; on the other hand, by inputting the image to be detected into the face feature detection model, the face region coordinates and the face key point coordinates can be output directly, which avoids the problem that the face region must first be detected by a separate face region detection network and the face key points can only be obtained after the face region is input into a separate face key point detection network, further improving the face feature detection efficiency as well as the accuracy of the face region and the face key points.
The following further describes steps S210 to S230 in fig. 2.
In an exemplary embodiment, inputting the image to be detected into the face feature detection model and outputting the face region coordinates and face key point coordinates corresponding to the image to be detected may be implemented through the steps in fig. 3; referring to fig. 3, the process may specifically include:
step S310, inputting the image to be detected into the target face region detection network and outputting the face region coordinates corresponding to the image to be detected;
step S320, extracting a face region feature map corresponding to the face region coordinates from the target face region detection network;
and step S330, inputting the extracted face region feature map into the face key point detection network and outputting the face key point coordinates corresponding to the image to be detected.
The face region feature map refers to a feature map output by an intermediate layer of the target face region detection network that contains the face ROI features; for example, it may be a feature map output by an intermediate convolutional layer of the target face region detection network containing the face ROI features, and of course it may also be a feature vector matrix containing the face ROI features, which is not particularly limited in this example embodiment.
The intermediate layer of the target face region detection network whose output feature map contains the complete face ROI features can be identified in advance, and the output of this intermediate layer is then used as the input of the face key point detection network; that is, the face key point detection network serves as a branch network of the target face region detection network. In this way, the target face region detection network and the face key point detection network share the face region feature map containing the complete face ROI features, so the face region coordinates and the face key point coordinates can be output quickly while their accuracy is guaranteed, the face key point detection network does not need to wait for the face region generated by the target face region detection network, the output efficiency of the face feature detection model is effectively improved, and the face feature detection period is shortened.
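For illustration, a minimal PyTorch sketch of this shared-feature structure is shown below. The layer sizes, the 15-channel detection head and the pooled key point branch are placeholder assumptions; the disclosure's actual networks (e.g., a distilled darknet53-based detector) are not reproduced here:

```python
import torch
import torch.nn as nn

class FaceFeatureDetector(nn.Module):
    """Sketch: a region detection backbone whose intermediate feature map is
    shared with a key point branch, so one forward pass yields both outputs."""

    def __init__(self, num_keypoints=68):
        super().__init__()
        self.backbone = nn.Sequential(        # downsamples the input by 8
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.region_head = nn.Conv2d(128, 15, 1)  # e.g., 3 boxes x (4 coords + 1 score)
        self.keypoint_branch = nn.Sequential(     # consumes the shared feature map
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(128 * 4 * 4, num_keypoints * 2),
        )

    def forward(self, x):
        feat = self.backbone(x)                 # shared intermediate feature map
        regions = self.region_head(feat)        # face region coordinates + confidences
        keypoints = self.keypoint_branch(feat)  # face key point coordinates
        return regions, keypoints               # no waiting between the two outputs
```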
Specifically, after the image to be detected is processed by the intermediate layers of the target face region detection network, the face ROI contained in the generated feature map may no longer correspond to the position of the face in the original image to be detected. Therefore, the face region feature map corresponding to the face region coordinates may be extracted from the target face region detection network through the steps in fig. 4, so as to ensure that the position of the face region in the face region feature map input to the face key point detection network corresponds to the image to be detected. Referring to fig. 4, the process may specifically include:
step S410, determining a target intermediate feature map corresponding to the target face region detection network;
and step S420, performing position mapping on the target intermediate feature map based on the face region coordinates to obtain a face region feature map matched with the image to be detected.
The target intermediate feature map is a feature map containing the face ROI features produced after an intermediate layer of the target face region detection network processes the image to be detected. Because the position coordinates of the face ROI features in the target intermediate feature map do not correspond to the position coordinates of the corresponding face information in the image to be detected, position mapping needs to be performed on the target intermediate feature map based on the face region coordinates to obtain the face region feature map matched with the image to be detected. For example, suppose the target intermediate feature map corresponding to a certain intermediate layer of the target face region detection network is W/32 × H/32 × C5, where W may represent the width of the image to be detected, H its height, and C5 the number of channels of this feature map; since the face region coordinates in the target intermediate feature map have been downsampled by a factor of 32, the corresponding coordinate values can be multiplied by 32 to obtain a face region feature map matching the size of the image to be detected.
Using the position-mapped face region feature map as the input of the face key point detection network can effectively improve the detection accuracy and efficiency of the face key point detection network, ensures that the face key point coordinates it outputs correspond, in the image to be detected, to the face region coordinates output by the target face region detection network, and improves the face feature labeling effect in the image to be detected.
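A minimal sketch of the position mapping described above, assuming a uniform downsampling stride and axis-aligned rectangular ROIs (both assumptions for illustration):

```python
def map_roi_to_image(roi_in_featmap, stride=32):
    """Undo the downsampling: map ROI coordinates given on a W/32 x H/32
    feature map back to image scale by multiplying by the stride."""
    x1, y1, x2, y2 = roi_in_featmap
    return (x1 * stride, y1 * stride, x2 * stride, y2 * stride)

def crop_region_features(feature_map, roi_in_image, stride=8):
    """Opposite direction: given face region coordinates in image space, cut
    the matching window out of a C x H/8 x W/8 intermediate feature map."""
    x1, y1, x2, y2 = (int(round(v / stride)) for v in roi_in_image)
    return feature_map[:, y1:y2, x1:x2]  # face region feature map for the branch
```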
In the training process, the face key point detection network can be used as a branch network of the target face region detection network: the feature map corresponding to an intermediate layer of the target face region detection network and containing the complete face region features is used as the input of the face key point detection network, the weights of the whole target face region detection network are kept fixed, and the face key point detection network is trained.
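Continuing the illustrative FaceFeatureDetector sketch above, freezing the face region detection weights and training only the key point branch might look like this (all names, shapes and hyperparameters are assumptions):

```python
import torch
import torch.nn as nn

model = FaceFeatureDetector()                  # from the sketch above
for p in model.backbone.parameters():          # fix the detection weights
    p.requires_grad = False
for p in model.region_head.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(model.keypoint_branch.parameters(), lr=1e-4)
criterion = nn.MSELoss()                       # regress 2D key point coordinates

images = torch.randn(2, 3, 256, 256)           # dummy batch, illustration only
targets = torch.randn(2, 68 * 2)               # dummy key point labels
_, keypoints = model(images)
loss = criterion(keypoints, targets)
loss.backward()                                # gradients reach only the branch
optimizer.step()
```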
In this example embodiment, because the network structure of current mature face region detection networks is complex (the networks are deep), outputting the face region coordinates involves a large amount of computation and the detection output efficiency is low; especially in real-time video communication scenes, detection takes a long time and consumes significant computing performance, making such networks impractical to apply. Therefore, the target face region detection network can be obtained by performing knowledge distillation processing on a second face region detection network at least once based on the output result of a first face region detection network.
The first face region detection network refers to a currently mature face region detection network; for example, it may be a darknet53 network, which may include 52 convolutional layers and 1 fully connected (FC) layer, and of course it may also be a YOLOv3 network, which is not limited in this example.
The second face region detection network is a face region detection network obtained by cutting the first face region detection network, and the structural scale of the first face region detection network is larger than that of the second. For example, if the first face region detection network is a darknet53 network with a structural scale of 53 layers, the second face region detection network may be a 49-layer face region detection network obtained by cutting 4 convolutional layers out of the darknet53 network; of course, this is only an illustrative example and should not be construed as limiting this example embodiment in any way.
Knowledge distillation refers to realizing knowledge transfer by introducing a soft target related to a teacher network (a complex network with superior inference performance) as part of the total loss to induce the training of a student network (a compact, low-complexity network). For example, knowledge distillation can guide the training of the second face region detection network through the output result corresponding to the first face region detection network, thereby realizing knowledge transfer.
In the present exemplary embodiment, the first face region detection network may be regarded as the teacher network and the target face region detection network as the student network. However, directly distilling a target face region detection network of much smaller structural scale from the first face region detection network of larger structural scale may cause information loss, so the accuracy of the target face region detection network may not meet the requirement. Therefore, a second face region detection network is introduced as a transition network between the first face region detection network and the target face region detection network, and the target face region detection network is obtained step by step through multiple rounds of knowledge distillation, which can effectively guarantee its accuracy.
Specifically, the first face region detection network may be obtained through the steps in fig. 5; as shown in fig. 5, the process may specifically include:
step S510, acquiring face region sample data subjected to face region labeling in advance;
step S520, network training processing is carried out on the original face area detection network through the face area sample data, and the first face area detection network is obtained.
The face region sample data refers to sample data in which various kinds of face information have been labeled; face information data can be acquired from a database or the network, and face region labeling is performed on the data to obtain the face region sample data.
The original face region detection network refers to a pre-constructed, untrained face region detection network; for example, it may be a darknet53 network. The darknet53 network is trained on the labeled face region sample data and verified on a verification set; once the accuracy of the trained darknet53 network reaches a preset accuracy threshold, the trained network is used as the first face region detection network.
Specifically, performing knowledge distillation processing on the second face region detection network at least once based on the output result of the first face region detection network to obtain the target face region detection network may be implemented through the steps in fig. 6; as shown in fig. 6, the process may specifically include:
step S610, executing the following loop process until the accuracy of the output result corresponding to the second face region detection network is smaller than the preset accuracy threshold, and taking the second face region detection network obtained from the previous round of training as the target face region detection network:
step S620, cutting out a second face region detection network whose structural scale is smaller than that of the first face region detection network;
step S630, performing network training on the second face region detection network according to the output result of the first face region detection network and the face region sample data to obtain a trained second face region detection network;
and step S640, calculating the accuracy of the output result of the trained second face region detection network, and when the accuracy is greater than or equal to the accuracy threshold, taking the trained second face region detection network as the first face region detection network of the next cycle.
The accuracy threshold is a threshold used to judge whether the accuracy of the second face region detection network meets the requirement. For example, the accuracy threshold may be 99%; if the accuracy of the output result of the second face region detection network against the expected result is 98%, it is determined that the second face region detection network does not yet meet the network training requirement and training needs to continue. Of course, the accuracy threshold may also be 95%; it can be set according to the actual situation, which is not specially limited in this example embodiment.
With continued reference to fig. 6, in one round of knowledge distillation, the first face region detection network (which may be regarded as the teacher network) is trained first; then a second face region detection network (which may be regarded as the student network) with a smaller network structure than the first is obtained by cutting. The prediction information (logits, which may be regarded as the probabilities, predicted by the network, that each object in the image belongs to each class) output by the first face region detection network is used as labels, the face region sample data is used as the input of the second face region detection network, and the second face region detection network is trained so that it acquires the generalization capability of the first face region detection network, completing one round of knowledge distillation. In this way, a data set with fewer annotations can be used while still improving the accuracy of the second face region detection network.
Of course, the knowledge distillation process in fig. 6 is a semi-supervised knowledge distillation process: the output result of the teacher network on unlabeled sample data serves as the label, i.e., the supervision information, for that data, and the student network undergoes semi-supervised learning through the unlabeled sample data and the teacher network's output. In this exemplary embodiment, offline knowledge distillation may also be adopted: when the student network is trained, the previously obtained teacher network supervises the training, the network parameters of the teacher network remain unchanged throughout, and the distillation loss, which measures the difference between the predicted values output by the teacher network and the student network, is added to the loss function value of the student network for gradient updating, yielding a student network with higher performance and precision. Alternatively, self-supervised knowledge distillation may be used: no teacher network is trained in advance; instead, the training of the student network itself completes the knowledge distillation process. There are various concrete implementations; for example, the student network can be trained first, and the student network obtained from earlier training is then used as the supervision model to distill the student network in the remaining epochs.
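As a generic sketch of the offline distillation loss described above (a standard soft-target formulation, not necessarily the exact loss used in this disclosure):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    """Soft-target term from the fixed teacher plus the ordinary hard-label
    term, mixed by alpha; the temperature softens both distributions."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                     # standard temperature scaling
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1.0 - alpha) * hard
```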
If a second face region detection network of very small structural scale is obtained through direct cutting, knowledge distillation of that network cannot be completed. Therefore, each round may cut out a second face region detection network whose structural scale is only slightly smaller than that of the first face region detection network; for example, the cut network may have 4 or 5 fewer layers than the first face region detection network, which is not limited in this example embodiment. The specific number of knowledge distillation rounds can thus be determined by the chosen accuracy threshold and the number of layers removed in each cut; of course, it can also be set according to actual needs, which is not specially limited in this example embodiment.
For example, assume that the structural scale of the first face region detection network is 53 layers and the preset accuracy threshold is 95%. The at-least-once knowledge distillation process is then as follows. A second face region detection network A with a structural scale of 49 layers is obtained by cutting, and knowledge distillation is performed on network A according to the output result of the first face region detection network and the face region sample data; if the accuracy of the output result of network A after distillation is 98%, which is greater than the preset accuracy threshold, network A has learned the knowledge of the first face region detection network and this round of distillation is complete. Network A can then serve as the new first face region detection network, i.e., the teacher network, for the next round: a second face region detection network B with a smaller structural scale than network A, for example 45 layers, is obtained by cutting, and knowledge distillation continues on network B according to the output result of network A and the face region sample data. If the accuracy of network B's output after distillation is 97%, again above the threshold, this round of distillation is complete, and network B in turn becomes the new first face region detection network used to distill an even smaller second face region detection network C. When the accuracy of the output result of the newly cut network C after knowledge distillation falls below the preset accuracy threshold, network C's structure is too small to realize knowledge distillation, so network B, obtained by the last successful round of distillation, is taken as the target face region detection network finally produced by the fractional knowledge distillation.
The target face area detection network obtained through the fractional knowledge distillation inherits the knowledge of the first face area detection network well, so that knowledge transfer is realized; and since its structural scale is smaller than that of the first face area detection network, the target face area detection network obtained through the fractional knowledge distillation can ensure the accuracy of face area detection while shortening the detection period, reducing the data calculation amount of the network and effectively improving the detection efficiency.
In an exemplary embodiment, inputting the image to be detected into the target face area detection network and outputting the face area coordinates corresponding to the image to be detected may be implemented through the steps in fig. 7; referring to fig. 7, the method may specifically include:
step S710, inputting the image to be detected into the target human face area detection network to output a plurality of human face area coordinates and confidence scores corresponding to the human face area coordinates;
and step S720, taking the face region coordinates of which the confidence score is greater than or equal to the confidence threshold value as the face region coordinates corresponding to the image to be detected.
In statistics, the confidence level is the probability that the true value of a parameter falls within an interval around the measurement result, that is, the reliability or credibility of the output result. The confidence threshold is a preset threshold used for judging whether the confidence of an output result meets the requirement. For example, the confidence threshold may be 80%; that is, if the confidence of an output result is greater than 80%, the output result may be considered reliable and may be taken as the final output result.
For example, assume that the output result of the target face region detection network is 5 × 5 × 15, where 5 × 5 means that the input image to be detected is divided into 25 regions in a 5 × 5 grid, and 15 equals 3 × 5, referring to the 3 candidate face regions that may contain face information in each grid cell, each candidate containing 4 coordinate values (for example, when the face region is labeled as a rectangle) and 1 confidence score. The 5 × 5 × 15 output result therefore contains both the real face region coordinates and the coordinates of other detected regions not containing face information. Screening the face regions in the output result through the preset confidence threshold can thus effectively improve the accuracy of the output face region coordinates and improve the detection precision.
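As a concrete illustration of this screening step, the sketch below parses an output of the assumed 5 × 5 × 15 shape and keeps only the candidates whose confidence score reaches the threshold; the array layout and names are our assumptions, not fixed by the disclosure.

```python
import numpy as np

def filter_face_regions(output: np.ndarray, conf_threshold: float = 0.8):
    # 5 x 5 grid cells, each predicting 3 candidates of 4 coords + 1 confidence.
    assert output.shape == (5, 5, 15)
    candidates = output.reshape(5, 5, 3, 5)
    coords, scores = candidates[..., :4], candidates[..., 4]
    keep = scores >= conf_threshold          # drop regions without face information
    return coords[keep], scores[keep]        # (K, 4) boxes and (K,) scores
```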
Fig. 8 schematically illustrates a flow diagram for implementing face feature detection according to some embodiments of the present disclosure.
Referring to fig. 8, in step S810, the image W × H × C to be detected is input into the face feature detection model 801, where the face feature detection model 801 may include a target face region detection network 802 and a face key point detection network 803;
the data of the image to be detected W × H × C is processed through the intermediate layers in the target face region detection network 802 to obtain a first feature map W/2 × H/2 × C1, a second feature map W/4 × H/4 × C2, a third feature map W/8 × H/8 × C3, a fourth feature map W/16 × H/16 × C4 and a fifth feature map W/32 × H/32 × C5;
step S820, since the third feature map W/8 × H/8 × C3 output by the third intermediate layer in the target face region detection network 802 is detected to contain complete face region features, the face region coordinates in the third feature map W/8 × H/8 × C3 are subjected to position mapping (for example, amplified by 8 times) to obtain a face region feature map whose size is consistent with that of the image to be detected, and this face region feature map is used as the input a × b × C3 of the face key point detection network 803 to perform face key point detection;
step S830, the face region coordinates output by the target face region detection network 802 for the input image to be detected are obtained, the face key point coordinates output by the face key point detection network 803 for the input face region feature map a × b × C3 are obtained, and the face region coordinates and the face key point coordinates are labeled in the image to be detected, thereby completing the face feature detection.
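A minimal sketch of the shared-feature pipeline of steps S810 to S830, assuming PyTorch modules: region_net, keypoint_net, the hooked attribute layer3, and the single-box output shape are illustrative assumptions, not the concrete networks of the disclosure.

```python
import torch

def detect_face_features(image: torch.Tensor, region_net, keypoint_net,
                         stride: int = 8):
    feats = {}
    # Capture the third intermediate feature map (W/8 x H/8 x C3) with a forward hook.
    handle = region_net.layer3.register_forward_hook(
        lambda module, inputs, output: feats.update(fmap=output))
    box = region_net(image)            # assumed to return (4,) pixel coordinates
    handle.remove()
    # Position-map the pixel coordinates onto the stride-8 feature map and crop
    # the a x b x C3 face region feature map (equivalent to the 8x mapping above).
    x1, y1, x2, y2 = (box / stride).round().long().tolist()
    roi = feats["fmap"][:, :, y1:y2, x1:x2]
    keypoints = keypoint_net(roi)      # face key point coordinates
    return box, keypoints
```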
Fig. 9 schematically illustrates an application diagram of face feature detection according to some embodiments of the present disclosure.
Referring to fig. 9, an image to be detected 901 containing face information is input into the face feature detection model 801, and data processing is performed on the image to be detected 901 through the target face region detection network 802 in the face feature detection model 801 to obtain the face region coordinates 902 (as those skilled in the art will readily understand, the output of the target face region detection network 802 is actually a feature vector; for convenience of comparison and observation, it is represented here illustratively as the face region coordinates). The face region feature map output by the intermediate layer of the target face region detection network 802 is extracted as the input of the face key point detection network 803, which realizes the sharing of bottom-layer features and avoids the face key point detection network 803 having to wait for the target face region detection network 802 to output the face region as its input; the detection of the face key point coordinates is thus realized quickly, and the face key point coordinates 903 can be obtained at substantially the same time as the face region coordinates 902 are output. The face region coordinates 902 and the face key point coordinates 903 are then labeled in the image to be detected, obtaining the image to be detected 904 labeled with the face region coordinates 902 and the face key point coordinates 903, and the detection of the face features is completed.
Exemplary devices
Having described the method of the exemplary embodiment of the present disclosure, next, a face feature detection apparatus of the exemplary embodiment of the present disclosure is described with reference to fig. 10.
In fig. 10, the face feature detection apparatus 1000 may include: a data acquisition module 1010, a face feature detection module 1020, and a face feature image generation module 1030. Wherein:
the data acquisition module 1010 is used for acquiring an image to be detected;
the face feature detection module 1020 is configured to input the image to be detected into a face feature detection model, and output face region coordinates and face key point coordinates corresponding to the image to be detected, where the face feature detection model includes a target face region detection network and a face key point detection network;
the face feature image generation module 1030 is configured to label the image to be detected according to the face region coordinates and the face key point coordinates, and generate an image to be detected with a face region and a face key point labeled.
In some embodiments of the present disclosure, based on the foregoing solution, the face feature detection module 1020 includes:
the face area coordinate detection unit is used for inputting the image to be detected into the target face area detection network and outputting the face area coordinate corresponding to the image to be detected;
a face region feature extraction unit, configured to extract a face region feature map corresponding to the face region coordinates from the target face region detection network;
and the face key point coordinate detection unit is used for inputting the extracted face region characteristic graph into the face key point detection network and outputting the face key point coordinate corresponding to the image to be detected.
In some embodiments of the present disclosure, based on the foregoing solution, the face region feature extraction unit is further configured to:
determining a target intermediate characteristic diagram corresponding to the target face area detection network;
and carrying out position mapping on the target intermediate characteristic graph based on the face region coordinates to obtain a face region characteristic graph matched with the image to be detected.
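Under the same stride-8 assumption as in the sketch above, the mapping performed by this unit in isolation might look like the following small helper; the names are hypothetical.

```python
import torch

def extract_region_feature_map(intermediate: torch.Tensor, box, stride: int = 8):
    # Position mapping: scale pixel-space coordinates down by the feature-map
    # stride, then crop the matching face region from the intermediate map.
    x1, y1, x2, y2 = (int(v) // stride for v in box)
    return intermediate[:, :, y1:y2, x1:x2]
```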
In some embodiments of the present disclosure, based on the foregoing solution, the facial feature detection apparatus 1000 further includes a target face area detection network generation module, where the target face area detection network generation module includes:
the knowledge distillation unit is used for carrying out at least one time of knowledge distillation processing on a second face area detection network based on an output result of a first face area detection network to obtain a target face area detection network;
wherein the structural scale of the first face region detection network is larger than that of the second face region detection network; the knowledge distillation process can guide the training of the second face area detection network through the output result corresponding to the first face area detection network, and the knowledge transfer is realized.
In some embodiments of the present disclosure, based on the foregoing scheme, the target face area detection network generation module further includes a large face area detection network generation unit, where the large face area detection network generation unit is configured to:
acquiring face sample data for carrying out face region labeling in advance;
and carrying out network training processing on an original face area detection network through the face area sample data to obtain the first face area detection network.
In some embodiments of the present disclosure, based on the foregoing scheme, the knowledge distillation unit is further configured to:
executing the following cyclic process until the accuracy of the output result corresponding to the second face area detection network is smaller than a preset accuracy threshold, and taking the second face area detection network finished by the last training as the target face area detection network:
cutting to obtain a second face area detection network with a structure scale smaller than that of the first face area detection network;
performing network training on the second face area detection network according to the output result of the first face area detection network and the face area sample data to obtain a trained second face area detection network;
and calculating the accuracy of the output result of the trained second face area detection network, and when the accuracy is greater than or equal to the accuracy threshold, taking the trained second face area detection network as the first face area detection network of the next cycle again.
In some embodiments of the present disclosure, based on the foregoing solution, the face region coordinate detecting unit is further configured to:
inputting the image to be detected into the target human face area detection network to output a plurality of human face area coordinates and confidence scores corresponding to the human face area coordinates;
and taking the face region coordinates with the confidence score larger than or equal to the confidence threshold value as the face region coordinates corresponding to the image to be detected.
The specific details of each module in the above apparatus have been described in detail in the method section; for details not disclosed here, reference may be made to the method section, so they are not repeated.
In a third aspect of embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the face feature detection method as described in the first aspect above.
Exemplary Medium
Having described the apparatuses of the exemplary embodiments of the present disclosure, a storage medium of the exemplary embodiments of the present disclosure will be described next.
In some embodiments, aspects of the present disclosure may also be implemented as a medium having program code stored thereon; when the program code is executed by a processor of a device, it implements the steps in the face feature detection method according to the various exemplary embodiments of the present disclosure described in the above section "exemplary methods" of this specification.
For example, when executing the program code, the processor of the device may implement step S210 as described in fig. 2, acquiring an image to be detected; step S220, inputting the image to be detected into a face feature detection model and outputting face region coordinates and face key point coordinates corresponding to the image to be detected, wherein the face feature detection model comprises a target face region detection network and a face key point detection network; and step S230, labeling the image to be detected through the face region coordinates and the face key point coordinates to generate the image to be detected with the face region and the face key points labeled.
Referring to fig. 11, a program product 1100 for implementing the above-described face feature detection method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. The readable signal medium may also be any readable medium other than a readable storage medium.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN).
Exemplary computing device
Having described the face feature detection method, the face feature detection apparatus, and the storage medium according to the exemplary embodiments of the present disclosure, an electronic device according to the exemplary embodiments of the present disclosure is next described.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, an electronic device according to the present disclosure may include at least one processing unit and at least one storage unit, wherein the storage unit stores program code that, when executed by the processing unit, causes the processing unit to perform the steps in the face feature detection method according to various exemplary embodiments of the present disclosure described in the above section "exemplary methods" of this specification. For example, the processing unit may execute step S210 as shown in fig. 2, acquiring an image to be detected; step S220, inputting the image to be detected into a face feature detection model and outputting face region coordinates and face key point coordinates corresponding to the image to be detected, wherein the face feature detection model comprises a target face region detection network and a face key point detection network; and step S230, labeling the image to be detected through the face region coordinates and the face key point coordinates to generate the image to be detected with the face region and the face key points labeled.
An electronic device 1200 according to an example embodiment of the disclosure is described below with reference to fig. 12. The electronic device 1200 shown in fig. 12 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 12, the electronic device 1200 is embodied in the form of a general purpose computing device. The components of the electronic device 1200 may include, but are not limited to: the at least one processing unit 1201, the at least one storage unit 1202, a bus 1203 connecting different system components (including the storage unit 1202 and the processing unit 1201), and a display unit 1207.
Bus 1203 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 1202 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 1221 and/or cache memory 1222, and may further include Read Only Memory (ROM) 1223.
Storage unit 1202 may also include a program/utility 1225 having a set (at least one) of program modules 1224, such program modules 1224 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 1200 may also communicate with one or more external devices 1204 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1200 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 1205. Also, the electronic device 1200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through the network adapter 1206. As shown, the network adapter 1206 communicates with the other modules of the electronic device 1200 over a bus 1203. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the face feature detection apparatus are mentioned, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A face feature detection method is characterized by comprising the following steps:
acquiring an image to be detected;
inputting the image to be detected into a face feature detection model and outputting face region coordinates and face key point coordinates corresponding to the image to be detected, wherein the face feature detection model comprises a target face region detection network and a face key point detection network;
and labeling the image to be detected according to the face area coordinates and the face key point coordinates to generate the image to be detected with the face area and the face key point labeled.
2. The method of claim 1, wherein inputting the image to be detected into a face feature detection model and outputting the face region coordinates and face key point coordinates corresponding to the image to be detected comprises:
inputting the image to be detected into the target human face area detection network and outputting the human face area coordinate corresponding to the image to be detected;
extracting a face area feature map corresponding to the face area coordinates from the target face area detection network;
and inputting the extracted face region characteristic graph into the face key point detection network to output the face key point coordinates corresponding to the image to be detected.
3. The method of claim 2, wherein extracting face region features corresponding to the face region coordinates from the target face region detection network comprises:
determining a target intermediate characteristic diagram corresponding to the target face area detection network;
and carrying out position mapping on the target intermediate characteristic graph based on the face region coordinates to obtain a face region characteristic graph matched with the image to be detected.
4. The method of claim 1, further comprising:
performing knowledge distillation processing on a second face region detection network at least once based on an output result of a first face region detection network to obtain a target face region detection network;
wherein the structural scale of the first face region detection network is larger than that of the second face region detection network; the knowledge distillation process can guide the training of the second face area detection network through the output result corresponding to the first face area detection network, and the knowledge transfer is realized.
5. The method of claim 4, further comprising:
acquiring face region sample data subjected to face region labeling in advance;
and carrying out network training processing on an original face area detection network through the face area sample data to obtain the first face area detection network.
6. The method according to claim 4 or 5, wherein the performing at least one knowledge distillation process on a second face region detection network based on an output result of a first face region detection network to obtain the target face region detection network comprises:
executing the following cyclic process until the accuracy of the output result corresponding to the second face area detection network is smaller than a preset accuracy threshold, and taking the second face area detection network finished by the last training as the target face area detection network:
cutting to obtain a second face area detection network with a structure scale smaller than that of the first face area detection network;
performing network training on the second face area detection network according to the output result of the first face area detection network and the face area sample data to obtain a trained second face area detection network;
and calculating the accuracy of the output result of the trained second face area detection network, and when the accuracy is greater than or equal to the accuracy threshold, taking the trained second face area detection network as the first face area detection network of the next cycle again.
7. The method according to claim 2, wherein the inputting the image to be detected into the target face area detection network and outputting the face area coordinates corresponding to the image to be detected comprises:
inputting the image to be detected into the target human face area detection network to output a plurality of human face area coordinates and confidence scores corresponding to the human face area coordinates;
and taking the face region coordinates with the confidence score larger than or equal to the confidence threshold value as the face region coordinates corresponding to the image to be detected.
8. A face feature detection apparatus, comprising:
the data acquisition module is used for acquiring an image to be detected;
the human face feature detection module is used for inputting the image to be detected into a human face feature detection model and outputting a human face area coordinate and a human face key point coordinate corresponding to the image to be detected, wherein the human face feature detection model comprises a target human face area detection network and a human face key point detection network;
and the face characteristic image generation module is used for labeling the image to be detected according to the face area coordinates and the face key point coordinates to generate the image to be detected with the face area and the face key point labeled.
9. An electronic device, comprising:
a processor; and
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of face feature detection according to any one of claims 1 to 7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for face feature detection according to any one of claims 1 to 7.
CN202110324684.0A 2021-03-26 2021-03-26 Face feature detection method, device, medium and electronic equipment Pending CN113011356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110324684.0A CN113011356A (en) 2021-03-26 2021-03-26 Face feature detection method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113011356A true CN113011356A (en) 2021-06-22

Family

ID=76407610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110324684.0A Pending CN113011356A (en) 2021-03-26 2021-03-26 Face feature detection method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113011356A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610069A (en) * 2021-10-11 2021-11-05 北京文安智能技术股份有限公司 Knowledge distillation-based target detection model training method

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018153294A1 (en) * 2017-02-27 2018-08-30 腾讯科技(深圳)有限公司 Face tracking method, storage medium, and terminal device
US20190087686A1 (en) * 2017-09-21 2019-03-21 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for detecting human face
CN107590482A (en) * 2017-09-29 2018-01-16 百度在线网络技术(北京)有限公司 information generating method and device
CN107590807A (en) * 2017-09-29 2018-01-16 百度在线网络技术(北京)有限公司 Method and apparatus for detection image quality
CN107622252A (en) * 2017-09-29 2018-01-23 百度在线网络技术(北京)有限公司 information generating method and device
CN109034095A (en) * 2018-08-10 2018-12-18 杭州登虹科技有限公司 A kind of face alignment detection method, apparatus and storage medium
CN110147703A (en) * 2018-08-20 2019-08-20 腾讯科技(深圳)有限公司 Face critical point detection method, apparatus and storage medium
CN109558864A (en) * 2019-01-16 2019-04-02 苏州科达科技股份有限公司 Face critical point detection method, apparatus and storage medium
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning
WO2021036059A1 (en) * 2019-08-29 2021-03-04 深圳云天励飞技术有限公司 Image conversion model training method, heterogeneous face recognition method, device and apparatus
CN110956082A (en) * 2019-10-17 2020-04-03 江苏科技大学 Face key point detection method and detection system based on deep learning
CN111079686A (en) * 2019-12-25 2020-04-28 开放智能机器(上海)有限公司 Single-stage face detection and key point positioning method and system
CN111325108A (en) * 2020-01-22 2020-06-23 中能国际建筑投资集团有限公司 Multitask network model, using method, device and storage medium
CN111339869A (en) * 2020-02-18 2020-06-26 北京拙河科技有限公司 Face recognition method, face recognition device, computer readable storage medium and equipment
CN111860309A (en) * 2020-07-20 2020-10-30 汪秀英 Face recognition method and system
CN112232117A (en) * 2020-09-08 2021-01-15 深圳微步信息股份有限公司 Face recognition method, face recognition device and storage medium
CN112464809A (en) * 2020-11-26 2021-03-09 北京奇艺世纪科技有限公司 Face key point detection method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Yiding; YU Yang: "Design and Implementation of Multi-task Face Detection Based on Deep Learning", Information & Computer (Theoretical Edition), no. 02, 25 January 2020 (2020-01-25) *

Similar Documents

Publication Publication Date Title
US11908244B2 (en) Human posture detection utilizing posture reference maps
US20220101654A1 (en) Method for recognizing actions, device and storage medium
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
CN110675475B (en) Face model generation method, device, equipment and storage medium
CN111414946B (en) Artificial intelligence-based medical image noise data identification method and related device
Wei et al. Deep group-wise fully convolutional network for co-saliency detection with graph propagation
CN111652974B (en) Method, device, equipment and storage medium for constructing three-dimensional face model
WO2018196718A1 (en) Image disambiguation method and device, storage medium, and electronic device
CN112132197A (en) Model training method, image processing method, device, computer equipment and storage medium
CN113642431A (en) Training method and device of target detection model, electronic equipment and storage medium
CN111185008A (en) Method and apparatus for controlling virtual character in game
CN111666919A (en) Object identification method and device, computer equipment and storage medium
CN113011320B (en) Video processing method, device, electronic equipment and storage medium
CN113326851B (en) Image feature extraction method and device, electronic equipment and storage medium
CN112036564B (en) Picture identification method, device, equipment and storage medium
CN111126515A (en) Model training method based on artificial intelligence and related device
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
CN113011356A (en) Face feature detection method, device, medium and electronic equipment
CN111429414B (en) Artificial intelligence-based focus image sample determination method and related device
CN110008922A (en) Image processing method, unit, medium for terminal device
Wang et al. Swimmer’s posture recognition and correction method based on embedded depth image skeleton tracking
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN113239915B (en) Classroom behavior identification method, device, equipment and storage medium
CN113222989A (en) Image grading method and device, storage medium and electronic equipment
CN113824989A (en) Video processing method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210930

Address after: 310000 Room 408, building 3, No. 399, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou Netease Zhiqi Technology Co.,Ltd.

Address before: 310052 Room 301, Building No. 599, Changhe Street Network Business Road, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: HANGZHOU LANGHE TECHNOLOGY Ltd.