CN107506707B - Face detection using small scale convolutional neural network module in embedded system


Info

Publication number: CN107506707B
Authority: CN (China)
Prior art keywords: image, CNN, size, sub, candidate
Legal status: Active
Application number: CN201710692173.8A
Other languages: Chinese (zh)
Other versions: CN107506707A
Inventors: 王星, 梅迪·塞伊菲, 陈明华, 吴谦伟, 梁杰
Current assignee: Altumview Systems Inc
Original assignee: Altumview Systems Inc
Priority claimed from: US 15/657,109 (granted as US10268947B2)
Application filed by: Altumview Systems Inc
Publication of application: CN107506707A
Publication of grant: CN107506707B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/32: Normalisation of the pattern dimensions

Abstract

Embodiments of the present application provide examples of face detection systems based on the use of small-scale hardware convolutional neural network (CNN) modules configured as a multitask cascaded CNN. In some embodiments, the sub-image based CNN system can be configured to be equivalent to a large-scale CNN that processes the entire input image without division, so that the output of the sub-image based CNN system is identical to the output of the large-scale CNN. Based on this equivalence, some embodiments disclosed herein apply the sub-image based CNN systems and techniques to one or more stages of a cascaded CNN or a multitask cascaded CNN (MTCNN), so that a large input image fed into a given stage of the cascaded CNN or the MTCNN can be divided into a set of smaller sub-images. As a result, each stage of the cascaded CNN or the MTCNN can reuse the same small-scale hardware CNN module, which is constrained by a maximum input image size.

Description

Face detection using small scale convolutional neural network module in embedded system
Priority claims and related patent applications
The present application is related to pending U.S. patent application 15/441,194, entitled "Convolutional Neural Network (CNN) System Based on Resource-Limited Small-Scale CNN Modules," filed February 23, 2017. The above application is incorporated herein by reference and made a part of this application.
This patent application claims priority from U.S. provisional patent application 62/428,497, entitled "Convolutional Neural Network (CNN) Based on Resource-Limited Small-Scale CNN Modules," filed November 30, 2016.
Technical Field
The present disclosure relates generally to the field of machine learning and artificial intelligence, and more particularly, to systems, apparatuses, and techniques for performing face detection on video images using a small-scale hardware Convolutional Neural Network (CNN) module.
Background
Deep learning (DL) is a branch of machine learning and artificial intelligence based on a set of algorithms that attempt to model high-level abstractions in data by using artificial neural networks with many processing layers. A typical DL architecture can include many layers of neurons and millions of parameters. These parameters can be trained on massive amounts of data using high-speed computers equipped with GPUs, guided by training techniques that work well in deep networks, such as rectified linear units (ReLUs), dropout, data augmentation, and stochastic gradient descent (SGD).
Among existing DL architectures, the convolutional neural network (CNN) is one of the most popular. Although the idea behind CNNs was introduced more than 20 years ago, their true power was recognized only after the recent development of deep learning theory. To date, CNNs have achieved great success in many areas of artificial intelligence and machine learning, such as face recognition, image classification, image caption generation, visual question answering, and self-driving cars.
Face detection is an important step in many face recognition applications. Many face detection techniques can easily detect near-frontal faces at close range. However, robust and fast face detection in unconstrained situations remains very difficult, because such situations typically involve large variations in the face, including variations in pose, occlusion, exaggerated expressions, and extreme illumination changes. Effective face detection techniques that can handle these unconstrained scenarios include: (1) the cascaded convolutional neural network (CNN) structure (hereinafter "cascaded CNN" or "cascaded CNN structure") described in "A Convolutional Neural Network Cascade for Face Detection" (H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, June 1, 2015); and (2) the multitask cascaded CNN structure (hereinafter "MTCNN" or "MTCNN structure") described in "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks" (K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, October 2016).
The cascaded CNN uses a coarse-to-fine cascaded CNN architecture for face detection. More specifically, instead of a single deep neural network, the cascaded CNN architecture uses several shallow neural networks operating on different resolutions of the input image, so that background regions can be quickly discarded at the low-resolution stages and only a small number of candidate regions need to be carefully evaluated at the final high-resolution stage. To improve localization accuracy, a calibration stage is employed after each detection/classification stage to adjust the position of the detected window (or "bounding box"). As a result, the cascaded CNN typically requires six stages, i.e., six simple CNNs: three for binary face detection/classification and three for bounding box calibration. Thanks to the cascade design and the simple CNN used at each stage, this face detection architecture is well suited to embedded environments. Note, however, that each bounding box calibration stage in the cascaded CNN incurs additional computational overhead, and that this cascaded CNN ignores the inherent correlation between face detection and face alignment.
The MTCNN integrates the face detection and face alignment operations through multitask learning, using a unified set of cascaded CNNs. In principle, the MTCNN also employs several coarse-to-fine CNN stages operating on different resolutions of the input image. In the MTCNN, however, the single CNN at each stage is jointly trained for face keypoint localization, binary face classification, and bounding box calibration, so only three stages are needed. More specifically, the first stage of the MTCNN quickly generates candidate face windows with a shallow CNN. The second stage then refines these candidates by discarding a large number of non-face windows with a more complex CNN. Finally, the third stage uses an even more powerful CNN to decide whether each remaining window contains a face and, if so, to estimate the positions of five facial keypoints. The MTCNN significantly outperforms earlier face detection systems, and its architecture is generally better suited than the cascaded CNN architecture described above for execution on resource-limited embedded systems.
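For orientation, the three-stage flow just described can be condensed into a short sketch. This is a hedged illustration rather than the reference MTCNN implementation: the three networks are passed in as callables that return a face score, candidate windows are simplified to square (x, y, side) tuples, and the score thresholds are invented for the example.

```python
import numpy as np

def nn_resize(img, side):
    """Nearest-neighbour square resize (a stand-in for a real resampler)."""
    h, w = img.shape[:2]
    return img[np.arange(side) * h // side][:, np.arange(side) * w // side]

def mtcnn_cascade(image, p_net, r_net, o_net):
    """Coarse-to-fine control flow of the three MTCNN stages."""
    def crop(box, side):
        x, y, s = box
        return nn_resize(image[y:y + s, x:x + s], side)

    boxes = p_net(image)                                    # stage 1: fast proposals
    boxes = [b for b in boxes if r_net(crop(b, 24)) > 0.7]  # stage 2: reject non-faces
    # Stage 3: final face/non-face decision; the full MTCNN additionally
    # regresses five facial keypoints for each accepted window.
    return [b for b in boxes if o_net(crop(b, 48)) > 0.9]
```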
Disclosure of Invention
The embodiments described herein provide various examples of face detection systems, techniques, and architectures based on the use of small-scale hardware convolutional neural network (CNN) modules configured as a multitask cascaded CNN. In some embodiments, the sub-image based CNN system can be configured to be equivalent to a large-scale CNN system that processes the entire input image without division, so that the output of the sub-image based CNN system is identical to the output of the large-scale CNN. Based on this equivalence, some embodiments disclosed herein apply the sub-image based CNN systems and techniques to one or more stages of a cascaded CNN or a multitask cascaded CNN (MTCNN), so that a large input image in a given stage of the cascaded CNN or the MTCNN can be divided into a set of sub-images of smaller size. As a result, each stage of the cascaded CNN or the MTCNN can reuse the same small-scale hardware CNN module, which is constrained by a maximum input image size.
In one aspect, a face detection system that uses at least one small-scale hardware convolutional neural network (CNN) module to process video images is disclosed. The face detection system comprises a motion detection module and a first processing module. The motion detection module is used to detect a candidate image region corresponding to a moving object in a video image. The first processing module is implemented with a hardware CNN module, wherein the first processing module processes the detected candidate image region with the hardware CNN module using a sliding-window approach at a first image size and generates a first set of candidate face windows within the detected candidate image region, wherein the first image size is between the minimum and maximum input sizes of the hardware CNN module. The face detection system also comprises a second processing module, implemented with the hardware CNN module and coupled to the first processing module. The second processing module processes, at a second image size, a first set of sub-regions within the candidate image region corresponding to the first set of candidate face windows, and generates a second set of candidate face windows within the detected candidate image region, wherein the second image size is between the minimum and maximum input sizes of the hardware CNN module. The face detection system also comprises a third processing module, implemented with the hardware CNN module and coupled to the second processing module. The third processing module processes, at a third image size, a second set of sub-regions within the candidate image region corresponding to the second set of candidate face windows, and generates a third set of candidate face windows within the detected candidate image region, wherein the third image size is larger than the maximum input size of the hardware CNN module.
In some embodiments, the first image size is determined based on a minimum input size of the hardware CNN module.
In some embodiments, the first image size is substantially equal to a minimum input size of the hardware CNN module.
In some embodiments, the motion detection module is to detect the candidate image region using a background elimination module within a hardware chipset or system-on-a-chip.
In some embodiments, the face detection system further comprises a pyramid generation module located between the motion detection module and the first processing module, the pyramid generation module being configured to generate a multi-resolution pyramid representation of the detected candidate image region.
In some embodiments, the first processing module processes the detected candidate image region at the first image size by: for each image in the multi-resolution representation of the detected candidate image region, generating a set of image blocks using a sliding window of the first image size; and generating the first set of candidate face windows from the sets of image blocks and the corresponding sliding-window positions within each image of the multi-resolution representation.
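A minimal sketch of this pyramid-plus-sliding-window block generation follows; the 0.7 scale factor and the stride of 4 are illustrative choices, not values fixed by this disclosure.

```python
import numpy as np

def image_pyramid(img, scale=0.7, min_side=16):
    """Yield successively downsampled copies of img (nearest neighbour)
    until a side would fall below the CNN's minimum input size."""
    while min(img.shape[:2]) >= min_side:
        yield img
        h, w = img.shape[:2]
        nh, nw = int(h * scale), int(w * scale)
        if min(nh, nw) < 1:
            break
        img = img[np.arange(nh) * h // nh][:, np.arange(nw) * w // nw]

def sliding_blocks(img, win=16, stride=4):
    """Yield (x, y, block) for every win x win block, stepping by stride."""
    h, w = img.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield x, y, img[y:y + win, x:x + win]
```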
In some embodiments, each of the first set of candidate face windows is associated with a confidence index and a set of bounding box coordinates defining the location of the candidate face window within the detected candidate image region.
In some embodiments, the second processing module has a second complexity that is higher than the first complexity of the first processing module.
In some embodiments, the second image size is larger than the first image size.
In some embodiments, the second processing module processes, at the second image size, the first set of sub-regions within the candidate image region corresponding to the first set of candidate face windows by: resizing each of the first set of sub-regions to an image block of the second image size; and then generating the second set of candidate face windows from the resulting second set of image blocks of the second image size.
In some embodiments, the third processing module has a third complexity that is higher than the first and second complexities of the first and second processing modules.
In some embodiments, the third processing module is to resize each sub-region of a second set of sub-regions within the detected candidate image region corresponding to a second set of candidate face windows to a third image block having a third image size.
In some embodiments, the third processing module processes the second set of candidate face windows by processing a resized third set of image blocks having a third image size.
In some embodiments, the third processing module processes each resized image block of the third image size by: dividing the resized image block into a set of sub-images of a fourth image size that is smaller than the maximum input size of the hardware CNN module; sequentially processing the set of sub-images with the hardware CNN module to generate an array of feature maps; merging the array of feature maps into a merged feature map corresponding to the resized image block of the third image size; and processing the merged feature map to generate the third set of candidate face windows.
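The sketch below illustrates this divide, process, and merge sequence. The concrete numbers (a 46 × 46 resized block split into four overlapping 30 × 30 sub-images at a stride of 16, matching the 46 × 46 image block example of FIG. 7) are assumptions chosen so that the output feature-map tiles meet without gaps or overlap; hw_cnn stands for the small-scale hardware CNN module.

```python
import numpy as np

def process_resized_block(block, hw_cnn, sub=30, stride=16):
    """Split an oversized image block into overlapping sub-images that
    each fit the hardware CNN, run the CNN on each sub-image in turn,
    and merge the resulting feature maps into one map for the block."""
    n = (block.shape[0] - sub) // stride + 1       # sub-images per row/column
    rows = []
    for r in range(n):
        maps = [hw_cnn(block[r * stride:r * stride + sub,
                             c * stride:c * stride + sub])
                for c in range(n)]                 # sequential calls to the module
        rows.append(np.concatenate(maps, axis=1))  # stitch one row of tiles
    return np.concatenate(rows, axis=0)            # stitch rows into the full map
```

Under these assumptions, a hardware CNN with three 3 × 3 CONV layers (no padding), each followed by 2 × 2 pooling, maps each 30 × 30 sub-image to a 2 × 2 feature map, and the four merged tiles form the same 4 × 4 map that processing the whole 46 × 46 block at once would produce.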
In some embodiments, the hardware CNN module is a small-scale hardware CNN module.
In some embodiments, the hardware CNN module is embedded within a chipset or system on a chip (SoC).
In some embodiments, the SoC comprises a HiSilicon Hi3519 system on a chip.
In accordance with another aspect of the present application, a method for performing face detection on video images based on employing at least one small-scale hardware Convolutional Neural Network (CNN) module is disclosed. The method comprises the following steps: receiving a video sequence acquired by a camera; for each video frame in the video sequence, detecting a candidate image region corresponding to a moving object within the video frame; processing, with the hardware CNN module, a detected candidate image region based on a first image size to generate a first set of candidate face windows within the detected candidate image region, wherein the first image size is between a minimum input size and a maximum input size of the hardware CNN module; processing, with the hardware CNN module, a first set of sub-regions within the candidate image region based on a second image size, wherein the first set of sub-regions corresponds to the first set of candidate face windows, to generate a second set of candidate face windows within the detected candidate image region, wherein the second image size is between a minimum and a maximum input size of the hardware CNN module; processing, with the hardware CNN module, a second set of sub-regions within the candidate image region based on a third image size, wherein the second set of sub-regions corresponds to the second set of candidate face windows, to generate a third set of candidate face windows within the detected candidate image region, wherein the third image size is larger than a maximum input size of the hardware CNN module.
In some embodiments, the step of processing the second set of sub-regions within the candidate image region based on the third image size comprises: resizing each of the second set of sub-regions, which correspond to the second set of candidate face windows, to an image block of the third image size; dividing each image block of the third image size into a set of sub-images of a fourth image size, the fourth image size being smaller than the maximum input size of the hardware CNN module; sequentially processing the set of sub-images with the hardware CNN module to generate an array of feature maps; merging the array of feature maps into a merged feature map corresponding to the resized image block of the third image size; and processing the merged feature map to generate the third set of candidate face windows.
In accordance with yet another aspect of the present application, an embedded system is disclosed that can perform face detection in acquired video images. The embedded system includes: a processor; a memory coupled to the processor; an image capture device coupled to the processor and the memory for capturing video images; a hardware CNN module connected to the processor and the memory; and a motion detection module connected to the hardware CNN module for receiving the acquired video image and detecting a candidate image region corresponding to a moving object within the video image. In the embedded system, the hardware CNN module is configured to: processing the detected candidate image region with a first image size to generate a first set of candidate face windows within the detected candidate image region, wherein the first image size is between a minimum input size and a maximum input size of the hardware CNN module; processing a first set of sub-regions within the candidate image region using a second image size, wherein the first set of sub-regions corresponds to the first set of candidate face windows, and generating a second set of candidate face windows within the detected candidate image region, wherein the second image size is between the minimum and maximum input sizes of the hardware CNN module; processing a second set of sub-regions within the candidate image region using a third image size, wherein the second set of sub-regions corresponds to the second set of candidate face windows, and generating a third set of candidate face windows within the detected candidate image region, wherein the third image size is larger than the maximum input size of the hardware CNN module.
In some embodiments, the third processing module is further configured to: resize each of the second set of sub-regions, which correspond to the second set of candidate face windows, to an image block of the third image size; divide each image block of the third image size into a set of sub-images of a fourth image size, the fourth image size being smaller than the maximum input size of the hardware CNN module; sequentially process the set of sub-images with the hardware CNN module to generate an array of feature maps; merge the array of feature maps into a merged feature map corresponding to the resized image block of the third image size; and process the merged feature map to generate the third set of candidate face windows.
Drawings
The structure and operation of the present disclosure may be understood by reading the following detailed description and the various drawings, in which like reference numerals refer to like parts, and in which:
FIG. 1A shows a block diagram of a small-scale hardware CNN module for processing low-resolution input images;
FIG. 1B shows a more detailed implementation of the hardware CNN module of FIG. 1A;
FIG. 2A shows a block diagram of a conventional full image-based CNN system for processing higher resolution input images;
fig. 2B shows a block diagram of a sub-image based CNN system;
FIG. 3 illustrates a block diagram of an exemplary face detection system based on a small-scale hardware CNN module, according to some embodiments of the present application;
fig. 4 illustrates a block diagram of an exemplary implementation of a first level CNN based on small scale hardware CNN modules, as illustrated in fig. 3, in accordance with some embodiments described herein;
fig. 5 illustrates a block diagram of an exemplary implementation of the small-scale hardware CNN-based second level CNN as shown in fig. 3, according to some embodiments described herein;
fig. 6 illustrates a block diagram of an exemplary implementation of the third level CNN, as shown in fig. 3, in accordance with some embodiments described herein;
FIG. 7 illustrates an exemplary partitioning scheme for a 46 × 46 image block, according to some embodiments described herein;
fig. 8 illustrates a block diagram of an exemplary implementation process of a third level CNN based on small-scale hardware CNN modules as illustrated in fig. 3, according to some embodiments of the present application;
FIG. 9 illustrates a block diagram of an exemplary implementation of the final decision module shown in FIG. 3, according to some embodiments of the present application;
FIG. 10 illustrates a flow chart describing an exemplary face detection process utilizing the face detection system executing on the embedded CNN enabled system disclosed herein in accordance with some embodiments of the invention;
fig. 11 shows a flowchart describing an exemplary process for processing a second set of resized image blocks (i.e., step 1014 of fig. 10) using the sub-image based CNN system, in accordance with some embodiments described herein;
FIG. 12 illustrates an exemplary embedded system within which the disclosed sub-image based face detection system functions according to some embodiments described herein.
Detailed Description
The detailed description below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The accompanying drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and may be practiced without these specific details. In some instances, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
Throughout the specification, the following terms have the meanings provided herein, unless the context clearly dictates otherwise. The terms "image resolution" and "image size" are used interchangeably to refer to the number of pixels within a given two-dimensional (2D) image.
Various examples of face detection systems, techniques, and architectures based on the use of small-scale low-cost CNN modules configured as a multitask cascaded CNN are described. In one embodiment, the small-scale low-cost CNN module is embedded in a chipset or system on a chip (SoC). Thus, the face detection systems, techniques, and architectures presented herein may be implemented on a chipset or SoC that includes such a small-scale low-cost CNN module. In a specific example, they may be implemented on the HiSilicon Hi3519 SoC (hereinafter "Hi3519" or "Hi3519 SoC"), which was developed for smart cameras by HiSilicon Semiconductor Co., Ltd., a subsidiary of Huawei Technologies Co., Ltd. Notably, the Hi3519 SoC includes both an embedded hardware CNN module and a CPU that can execute software CNN functions.
Most of the existing CNN-based DL architectures and systems are not cost-effective for many embedded system applications. At the same time, some low-cost CNN-enabled embedded systems based on low-cost chipsets have begun to appear. One notable example is the Hi3519 SoC, whose cost is significantly lower than that of the Nvidia TK1/TX1 chipsets. The Hi3519 SoC includes an embedded hardware CNN module with many desirable capabilities. For example, the parameters of the embedded CNN module of the Hi3519 SoC are reconfigurable, i.e., users can modify the network architecture and parameters, which can be pre-trained for different applications. In addition, the embedded CNN module is extremely fast; for example, it can process a 32 × 40 input image within 1 ms.
Small-scale low-cost CNN modules such as the one in the Hi3519 SoC are often limited in capability and, for cost reasons, subject to many design constraints. For example, in the Hi3519 SoC, the input image of the embedded CNN module may contain at most 1280 pixels. In the coarse-to-fine MTCNN architecture described above, however, the input image size grows rapidly from stage to stage. For example, in some embodiments of the MTCNN, the input image size of the second stage is 24 × 24 × 3 = 1728, and the input image size of the third stage is 48 × 48 × 3 = 6912; both exceed the input size limit of the embedded CNN module within the Hi3519 SoC. To implement the MTCNN on the Hi3519 SoC, the MTCNN would have to be modified to accept smaller input images, with the input video downsampled accordingly. In doing so, however, the image quality of the faces in the video may degrade significantly, seriously harming face detection performance.
The related U.S. patent application 15/441,194, which is incorporated herein by reference, provides a solution for implementing the MTCNN on small-scale low-cost CNN modules such as the Hi3519 SoC. To address the problem of input images larger than the maximum input size of the CNN module, the related patent application provides embodiments of a sub-image based CNN system that first divides a larger input image into a set of smaller sub-images with a properly designed overlap between adjacent sub-images. Each sub-image is then processed by a small-scale hardware CNN module, such as the embedded CNN module within the Hi3519 SoC. The corresponding outputs for the set of sub-images can then be merged, and the merged result can be further processed by the next stage. The sub-image based CNN system described in the related patent application can be configured to be equivalent to a large-scale CNN system that processes the entire input image without division, so that its output is identical to that of the large-scale CNN. Based on this equivalence, some embodiments disclosed in the related patent application apply the sub-image based CNN systems and techniques to one or more stages of a cascaded CNN or MTCNN, so that a larger input image in a given stage of the cascaded CNN or MTCNN can be divided into a set of sub-images of smaller size. As a result, each stage of the cascaded CNN or the MTCNN can reuse the same small-scale hardware CNN module, which is constrained by a maximum input image size.
In some embodiments, to improve real-time face detection performance, the face detection techniques and systems presented herein first detect moving regions in each video frame/image, for example using the embedded background elimination module of the Hi3519. Next, the face detection techniques and systems use coarse-to-fine multi-stage CNNs to detect most or all faces in the video frame. More specifically, the sub-image based CNN structure can be applied to each stage of the multi-stage CNN that has an input image size limitation. For example, some embodiments of the face detection techniques presented herein only need to apply the sub-image based CNN structure to the last stage of the multi-stage CNN structure.
In some embodiments, to improve real-time face detection efficiency, the face detection techniques and systems can also identify the facial keypoints (e.g., eyes, nose, and mouth) of each detected face. This information allows the system to track each face, select the best-posed image (also referred to as the "best face") of each person, for example the image closest to a frontal view, and then send only the best face to a server for further processing, such as face recognition. Transmitting only the face regions of a video frame, rather than the entire frame, reduces the demands on network bandwidth and on the server's computing resources. This reduction is particularly important for systems equipped with a large number of cameras acquiring multiple channels of video simultaneously.
In the following discussion, the embedded hardware CNN module in the Hi3519 SoC is used as an example to describe some exemplary embodiments of the face detection CNN systems and techniques proposed in this application. It should be understood, however, that these face detection CNN systems and techniques are not limited to a particular chipset or SoC such as the Hi3519. The face detection CNN systems and techniques use small-scale hardware CNN modules in place of larger, more complex CNN modules in some or all stages of the cascaded CNN or the MTCNN, and can therefore be applied to any small-scale hardware CNN module, or to any other chipset or SoC that includes an embedded small-scale hardware CNN module. In addition, the face detection systems and techniques can be implemented on a single field-programmable gate array (FPGA) and integrated into an embedded platform.
Description of the sub-image based CNN structure
The sub-image based CNN system described in related U.S. patent application 15/441,194 is built on small-scale low-cost hardware CNN modules. Such a sub-image based CNN system can be implemented in resource-limited systems, such as embedded systems and mobile devices, enabling them to perform tasks that would normally require a large-scale, high-complexity, expensive CNN system. The sub-image based CNN system can also replace a large-scale high-complexity CNN module in an existing DL system, thereby significantly reducing system cost. For example, it allows a low-cost CNN-enabled embedded system to be used in applications that require high-complexity CNNs, such as processing high-resolution input images, which resource-limited embedded systems could not otherwise handle. In some embodiments, the sub-image based CNN system reuses one or more small-scale hardware CNN modules designed for low-resolution input images, such as the embedded hardware CNN module within the Hi3519 SoC, so that it can be applied to high-resolution input images and to more challenging tasks that would normally require an expensive large-scale hardware CNN module.
The sub-image based CNN system is a hierarchical system that handles a complex task using a divide-and-conquer approach. In some embodiments of the related patent application, the sub-image based CNN system is constructed with two or more stages, each implemented either with one or more small-scale low-cost hardware CNN modules that operate on low-resolution inputs or with software that operates on low-resolution inputs. Each stage therefore has very low complexity. More specifically, to use this sub-image based CNN system, an initial high-resolution input image is divided into a set of sub-images of the same (or substantially the same) size, significantly smaller than the initial input image, with properly designed overlaps between adjacent sub-images. These sub-images are fed to the first stage of the sub-image based CNN system, which includes at least one small-scale low-cost hardware CNN module designed to process low-resolution input images, and the processed outputs for the set of sub-images are subsequently merged. More specifically, the set of sub-images can be processed by repeatedly invoking the one or more small-scale hardware CNN modules. In this manner, a high-resolution input image can be processed by one or more small-scale hardware CNN modules applied repeatedly to the set of sub-images.
The outputs of the first stage for the set of sub-images can then be merged. In some embodiments, the sub-image based CNN system includes provisions on the sizes of the input image and the sub-images that ensure the merged result is substantially or exactly the same as the output obtained by directly processing the entire high-resolution input image with a large-scale high-complexity CNN module (i.e., without dividing the input image). The merged result is then processed by the second stage of the sub-image based CNN system, which can likewise be implemented with one or more small-scale hardware CNN modules or with software. In this way, the disclosed CNN system can carry out high-complexity tasks, such as processing high-resolution input images, without large-scale, high-complexity, expensive hardware modules, thereby improving the trade-off between performance and cost. The sub-image based CNN system is therefore highly suitable for resource-limited embedded systems, such as surveillance cameras, machine vision cameras, drones, robots, self-driving vehicles, and mobile phones.
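As a compact summary of this two-stage flow, the sketch below fixes only the order of operations; the four callables (the splitter, the two CNN stages, and the merge rule) are caller-supplied placeholders rather than parts of any specific hardware API.

```python
def subimage_cnn(image, split, cnn1, merge, cnn2):
    """Two-stage sub-image CNN: divide the high-resolution input, run the
    small hardware CNN1 on each sub-image sequentially, merge the
    per-sub-image feature maps, flatten, and finish with the FC-based CNN2."""
    feature_maps = [cnn1(s) for s in split(image)]  # first processing stage
    merged = merge(feature_maps)                    # one map for the whole image
    return cnn2(merged.reshape(-1))                 # flatten to 1-D for the FC stage
```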
Small-scale low-cost hardware CNN module
Fig. 1A shows a block diagram of a small-scale hardware CNN module 100 for processing a low-resolution input image. In some embodiments, CNN module 100 may be used to extract features of input images with limited resolution and perform various DL inferences, depending on application requirements. As can be seen in fig. 1A, CNN module 100 includes at least two sub-modules, denoted CNN1 and CNN2, respectively. In some embodiments, the CNN module 100 is configured to limit the size of the input image 102 to no more than 1280 pixels, e.g., an image resolution of 32 × 40 pixels. This limitation on the input image size also severely limits the types of applications to which CNN module 100 can be adapted.
Fig. 1B shows a more detailed implementation of the hardware CNN module 100. As can be seen in fig. 1B, the first sub-module CNN1 in fig. 1A further includes, in series, multiple alternating sets of convolution (CONV) layers, rectified linear unit (ReLU) layers (not shown), and pooling layers. Moreover, in each of the CONV layers, such as the CONV(1) layer, the sub-module CNN1 uses a set of convolution filters to extract a particular set of features from the input image 102. Each CONV layer in sub-module CNN1 is followed by a corresponding ReLU layer (not shown) and a pooling layer, such as the POOL(1) layer, which reduces the size of the filtered images generated by the corresponding CONV layer while preserving some of the extracted features.
As also shown in fig. 1B, the second sub-module CNN2 in fig. 1A further includes, in series, sets of alternating fully connected (FC) layers and ReLU layers (not shown). Each of the FC layers in sub-module CNN2, such as the FC(1) layer, is configured to perform matrix multiplication. Each FC layer (except the last one) is followed by a corresponding ReLU layer (not shown). Although not explicitly shown in fig. 1B, each of the ReLU layers in CNN1 and CNN2 is configured to provide nonlinearity to the CNN system. Finally, at the output of the last FC layer (e.g., the FC(n) layer), a decision module (also not shown) predicts the output based on the last FC layer, thereby generating the output 104 of the CNN module 100. In some embodiments, the first sub-module CNN1 includes 1 to 8 sets of CONV, ReLU, and pooling layers, while the second sub-module CNN2 includes 3 to 8 sets of fully connected (FC) and ReLU layers.
In some embodiments, the number of convolution filters in each CONV layer is at most 50, and only 3 × 3 filters are allowed. In addition, the convolution stride is fixed at 1, and no zero padding is used. In some embodiments, the pooling layers in CNN1 may use a max-pooling technique to select the maximum value from each 2 × 2 region of the filtered images. In some embodiments, both max pooling and average pooling may be used; however, the pooling window size is fixed at 2 × 2 and the stride is fixed at 2. In other words, each pooling layer halves the width and height of the image.
Taking the hardware CNN module within the Hi3519 SoC as an example, the maximum input size of the first FC layer is 1024, the number of neurons in the intermediate FC layers is at most 256, and the size of the CNN module output is at most 256. Because of these constraints, the hardware CNN module within the Hi3519 SoC is generally suitable only for simple applications such as handwritten digit recognition and car license plate recognition. For more challenging applications such as face recognition, directly applying a small-scale CNN module such as CNN module 100 is unsatisfactory for at least the following reasons. First, the maximum input resolution of 1280 pixels (such as 40 × 32) is very limiting, because a face image downsampled to this resolution loses too much important facial information. Second, the learning capacity of the small-scale CNN module 100 is also very limited.
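The constraints quoted above can be collected into a small checker. This helper is purely illustrative (it is not a vendor API); it assumes every CONV layer uses 3 × 3 filters with stride 1 and no padding and is followed by 2 × 2 max pooling with stride 2, as described above, and the example filter counts are made up for the demonstration.

```python
def check_hi3519_cnn(input_hw, conv_filters, fc_widths):
    """Validate a proposed network against the quoted hardware limits and
    return the final feature-map height, width, and flattened size."""
    h, w = input_hw
    assert h * w <= 1280, "input exceeds the 1280-pixel limit"
    assert min(h, w) >= 16, "input below the 16-pixel minimum side"
    for nf in conv_filters:
        assert nf <= 50, "at most 50 filters per CONV layer"
        h, w = (h - 2) // 2, (w - 2) // 2   # 3x3 conv (no pad), then 2x2/2 pool
    flat = h * w * conv_filters[-1]
    assert flat <= 1024, "first FC layer input exceeds 1024"
    assert all(n <= 256 for n in fc_widths), "FC layer wider than 256"
    return h, w, flat

# Example: a 32 x 40 input with three CONV layers of 18, 36, and 48 filters.
print(check_hi3519_cnn((32, 40), [18, 36, 48], [256, 128, 16]))   # (2, 3, 288)
```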
Layered CNN architecture and system based on sub-images
Fig. 2A shows a block diagram of a conventional full-image based CNN system 200 for processing high-resolution input images. As can be seen, the conventional CNN system 200 receives the entire high-resolution input image 202 at the first convolution layer CONV(1) and begins feature extraction directly on the high-resolution input image 202. Thus, the conventional CNN system 200 can process the entire high-resolution input image 202 without dividing it. However, the conventional CNN system 200 also requires large-scale, expensive chips capable of handling such high-resolution input images, such as the Nvidia chips mentioned earlier.
Fig. 2B shows a block diagram of a sub-image based CNN system 210. In the disclosed CNN system 210, a small-scale CNN module with limited input resolution, such as the CNN module 100 described in connection with figs. 1A and 1B or the hardware CNN module within the Hi3519 SoC, can serve as a building block. As mentioned above, such a small-scale CNN module has a limit on the maximum size of its input image, e.g., at most 1280 pixels. To be able to process a high-resolution input image 202 (e.g., one with more than 1280 pixels) with this small-scale CNN module, the disclosed CNN system 210 includes an input module 212 that divides the high-resolution input image 202 into a set of smaller sub-images 204, where each sub-image 204 has a size less than or equal to the maximum input size allowed by the small-scale CNN module used as a building block of the CNN system 210. In some embodiments, the input module 212 divides the high-resolution input image 202 with properly designed overlaps between adjacent sub-images 204, as shown in fig. 2B. Note that the set of four sub-images 204 arranged in two rows and two columns with gaps and overlaps in fig. 2B is drawn for ease of understanding and does not represent an actual division.
As shown in fig. 2B, CNN system 210 includes a two-tier processing architecture based on the use and/or reuse of one or both of the two hardware sub-modules CNN1 and CNN2 of small-scale CNN module 100 depicted in fig. 1A and 1B. In addition to the input module 212, the CNN system 210 includes a first processing stage 220, a merging module 222, and a second processing stage 224. More specifically, the first processing stage 220 of the CNN system 210 includes at least one CNN1 processing module, such as the CNN1 module 214. In some embodiments, the CNN1 module 214 is implemented by the hardware sub-module CNN1 depicted in fig. 1A and 1B. In other embodiments, the CNN1 module 214 is implemented by the entire CNN module 100 described in fig. 1A and 1B including two sub-modules CNN1 and CNN 2. It is noted that the multiple instances of the CNN1 module 214 shown within the first processing stage 220 represent the use of the same CNN1 module 214 at different times t1, t2, t3, …, and tn, as noted for each such instance. Thus, "CNN 1214 at t 1", "CNN 1214 at t 2", "CNN 1214 at t 3", …, and "CNN 1214 at tn" shown in fig. 2B correspond to the same CNN1 module 214 at different processing times, and should not be construed as a plurality of CNN1 modules having the same number 214. Although not shown, the first processing stage 220 may include additional CNN1 modules similar to CNN module 214. For example, the first processing stage 220 may include two or more identical CNN1 modules.
The second processing stage 224 of the CNN system 210 includes at least one CNN2 module 216. In some embodiments, the CNN2 module 216 is implemented by the hardware sub-module CNN2 depicted in fig. 1A and 1B. In other embodiments, the CNN2 module 216 is implemented by the entire CNN module 100 described in fig. 1A and 1B including two sub-modules CNN1 and CNN 2. In certain other embodiments, the CNN2 module 216 within the second processing stage 224 may be implemented in software rather than in hardware.
In particular, to process the set of sub-images 204 generated by the input module 212, the same CNN1 module 214 may be used multiple times to sequentially process the set of sub-images 204, one sub-image at a time. That is, each instance of the CNN1 module 214 within the first processing stage 220 of the CNN system 210 represents one of multiple applications of the same CNN1 module 214 to one of the sub-images 204 at a different processing time. Because the CNN1 module 214 processes each sub-image 204 very quickly, the total time needed to process the whole set of sub-images 204 remains short. The output of the multiple applications of the CNN1 module 214 is an array of feature maps 206 corresponding to the set of sub-images 204 after the multi-layer convolution, ReLU, and pooling operations.
It is noted that while the embodiment shown in fig. 2B is based on reusing the same hardware CNN1 module 214 in the first processing stage 220 of the CNN system 210, other embodiments may use additional hardware CNN1 modules similar or identical to the CNN1 module 214 in the first processing stage 220 of the CNN system 210 so that multiple hardware CNN1 modules process the group of sub-images 204 in parallel. The actual number of CNN1 modules used for a given design may be determined based on a tradeoff between hardware cost constraints and speed requirements for the given design. For example, some variations of CNN system 210 may include 3 to 5 CNN1 modules in the first processing stage.
As mentioned above, the CNN1 module 214 may be implemented either by the dedicated hardware sub-module CNN1 described in connection with figs. 1A and 1B, or by the entire CNN module 100, including both the CNN1 and CNN2 sub-modules, described in connection with figs. 1A and 1B. In the first case, the CNN1 module 214 within the CNN system 210 includes only the CONV, ReLU, and pooling layers. In the second case, implementing the CNN1 module 214 in the CNN system 210 additionally requires skipping the FC layers and their corresponding ReLU layers, i.e., skipping sub-module CNN2 within the CNN module 100. When the CNN2 sub-module is skipped, the CNN1 module 214 generally needs to preserve spatial location information in its output feature maps, because the outputs of the CNN1 module 214 will be merged and processed further. For some embedded hardware CNN modules, such as the one within the Hi3519 SoC, the parameters of the embedded CNN module are reconfigurable. Using this property, sub-module CNN2 can effectively be skipped by forcing each FC layer within the CNN module 100 to be an identity matrix, so that the output of each FC layer is simply a reorganization of the two-dimensional feature maps into a one-dimensional vector. The ReLU layer following each FC layer can still be applied as usual. In this embodiment, for a CNN2 sub-module with three FC-ReLU pairs, the last two ReLU layers do not change any data, because a concatenation of multiple ReLU layers is equivalent to a single ReLU layer.
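This identity-matrix trick can be checked numerically. The sketch below uses illustrative shapes (it is not Hi3519 code) to show that a stack of FC layers whose weights are forced to the identity, each followed by a ReLU, reduces to a flatten followed by a single ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
fmap = rng.standard_normal((2, 2, 48))   # one CNN1 output feature-map stack
x = fmap.reshape(-1)                     # what "skipping" CNN2 should deliver

identity = np.eye(x.size)
y = x.copy()
for _ in range(3):                       # three FC + ReLU pairs
    y = y @ identity                     # FC weights forced to the identity
    y = np.maximum(y, 0.0)               # the hardware still applies the ReLU

# Stacked ReLUs collapse to a single ReLU, so y is just ReLU(flatten(fmap)).
assert np.array_equal(y, np.maximum(x, 0.0))
```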
Returning to FIG. 2B, after each sub-image 204 in the set of sub-images 204 is sequentially processed by the CNN1 module 214, the output from the CNN1 module 214 containing the array of feature maps 206 becomes the input to the merge module 222, which merge module 222 is configured to merge the array of feature maps 206 to form a complete feature map for the entire input image 202. The merged feature map may then be used as an input to the second processing stage 224 of the CNN system 210. In some embodiments, the output 228 from the second processing stage 224 is the output from the last FC layer of the CNN2 module 216. Ideally, the output 228 is the same as the output 226 of the conventional CNN system 200 in fig. 2A.
In some embodiments, the array of feature maps 206 comprises a set of three-dimensional (3D) matrices (i.e., two-dimensional feature maps stacked along the number of feature maps). For example, the array of feature maps 206 may consist of nine 3D matrices, each of size 2 × 2 × 48, where nine is the number of sub-images 204 with indices 0, 1, 2, …, 8 (i.e., 3 rows and 3 columns of sub-images), 2 × 2 is the size of each output feature map for each sub-image after processing by the CNN1 module 214, and 48 is the number of feature maps per sub-image. In some embodiments, the merging module 222 is configured to merge the array of feature maps 206 by concatenating the 3D output matrices according to their output matrix indices, forming a merged 3D feature-map matrix that preserves the spatial relationship of the set of sub-images 204. In the example above, this step generates a 6 × 6 × 48 3D matrix. Next, the merged 3D matrix may be flattened into a one-dimensional (1D) vector; in the example above, this produces a 1D vector with 1728 elements. Finally, the flattened 1D vector is fed to the second processing stage 224.
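That merge can be written out directly. A short NumPy sketch, assuming the nine sub-image outputs arrive in row-major order:

```python
import numpy as np

# Nine sub-image outputs (a 3 x 3 grid of sub-images), each a 2 x 2 x 48
# stack of feature maps, as in the example above.
maps = [np.random.rand(2, 2, 48) for _ in range(9)]

# Concatenate each row of three tiles side by side, then stack the rows,
# preserving the spatial relationship of the sub-images.
rows = [np.concatenate(maps[r * 3:(r + 1) * 3], axis=1) for r in range(3)]
merged = np.concatenate(rows, axis=0)
assert merged.shape == (6, 6, 48)

# Flatten into the 1-D vector fed to the second processing stage.
flat = merged.reshape(-1)
assert flat.size == 1728
```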
Fig. 2B shows that the merged feature map 208 generated by the merging module 222 is fed to the second processing stage 224 of the CNN system 210 for further processing. More specifically, the second processing stage 224 of the CNN system 210 includes at least one CNN2 module 216, which further includes sets of FC layers and ReLU layers as described above. As mentioned above, the CNN2 module 216 in the CNN system 210 may be implemented by the dedicated hardware sub-module CNN2 described in connection with figs. 1A and 1B; in such embodiments, the CNN2 module 216 includes only FC layers and ReLU layers. In other embodiments, the CNN2 module 216 may be implemented by the entire hardware CNN module 100, including both sub-modules CNN1 and CNN2, described in figs. 1A and 1B; in these embodiments, implementing the CNN2 module 216 in the CNN system 210 additionally requires skipping the CONV-ReLU-pooling layers, i.e., skipping sub-module CNN1 within the CNN module 100. In some systems, such as the Hi3519, it may be difficult to skip the CONV-ReLU-pooling layers in order to use the FC and ReLU layers directly. In these cases, the CNN2 module 216, i.e., the FC and ReLU layers, may be implemented in software. Because most of the complex computation of the CNN system 210 occurs in the CONV layers, implementing the FC and ReLU layers in software typically has little impact on overall system speed. Furthermore, systems such as the Hi3519 provide additional tools to optimize the speed of a software implementation of the CNN2 module 216.
As mentioned above, the CNN2 module 216 within the second processing stage 224 may be implemented in software rather than by a hardware CNN module. Because the complexity of the FC and ReLU layers is generally much lower than that of the convolutional layers, most of the complex computation of the CNN system 210 resides in the convolutional layers implemented by the CNN1 module 214. Based on this observation, the low-complexity computations assigned to the hardware CNN2 module 216 in the CNN system 210 may instead be implemented in software, rather than by the hardware CNN2 or CNN modules mentioned above. Such a software implementation also provides more flexibility than embodiments based on hardware CNN modules.
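A software CNN2 of the kind described here is only a few lines. The sketch below assumes the layer weights have been exported from the trained network; it illustrates the FC-plus-ReLU stack, not vendor code.

```python
import numpy as np

def cnn2_forward(x, weights, biases):
    """Software CNN2 stage: fully connected layers with a ReLU after every
    layer except the last. x is the flattened merged feature map (e.g. the
    1728-element vector from the example above)."""
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        if i < len(weights) - 1:         # no ReLU after the final FC layer
            x = np.maximum(x, 0.0)
    return x
```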
Face detection CNN architecture provided by the present application
Of the two face detection structures described above, the MTCNN has a simpler structure than the cascaded CNN, because the MTCNN uses three CNN stages while the cascaded CNN uses six. Moreover, the MTCNN can detect facial keypoint locations, which facilitates tracking each person and determining the pose of each face. Consequently, the examples of face detection CNN systems and techniques described below are based on the three-stage MTCNN structure. It should be noted, however, that these face detection systems and techniques are equally applicable to the cascaded CNN structure.
As mentioned above, the embedded CNN module of the Hi3519 SoC cannot be used directly to implement each stage of the originally designed MTCNN without addressing its input image size limitation. In fact, the original MTCNN design conflicts with many of the constraints of the embedded CNN module of the Hi3519 SoC. These conflicts include, but are not limited to:
maximum input image size: as mentioned above, the input image of Hi3519 has a maximum pixel value of 1280. In contrast, the input image size of the second level of the MTCNN as originally designed is 24 × 24 × 3 ═ 1728, and the input image size of the third level is 48 × 48 × 3 ═ 6912. Both of these input sizes exceed the upper limit of the input image size of Hi 3519.
Minimum input image size: the minimum width or height of an input image for the Hi3519 is 16 pixels. In contrast, the input image size of the first stage of the originally designed MTCNN is 12 × 12, which is too small for the Hi3519.
Number of filters: in the embedded CNN module of the Hi3519, each convolution (CONV) layer may have at most 50 filters. In contrast, several CONV layers in the originally designed MTCNN have 64 or 128 filters.
CNN architecture: in the embedded CNN module of the Hi3519, each CONV layer is followed by a max-pooling (MP) layer. The MTCNN, however, typically has two or three consecutive CONV layers with no MP layer between them.
Pooling window size: in the embedded CNN module of the Hi3519, the MP layers support a pooling window size of 2 × 2 pixels, whereas the MTCNN typically uses 3 × 3 max-pooling windows.
CONV layer filter size: in the embedded CNN module of the Hi3519, the CONV layers use 3 × 3 filters, while the CONV layers of the MTCNN typically use 5 × 5 and 2 × 2 filters.
Nonlinear function: the MTCNN uses parametric rectified linear units (PReLU) as its nonlinearity, while the embedded CNN module of the Hi3519 uses rectified linear units (ReLU).
Fully connected (FC) layers: the first stage of the originally designed MTCNN is a fully convolutional network (FCN) that avoids FC layers in order to reduce the runtime of the sliding-window approach at test time. In contrast, the Hi3519 requires at least three FC layers in a CNN.
The examples of face detection CNN systems and techniques presented herein are designed to resolve the above conflicts, so that the original CNN within each stage of the MTCNN can be implemented by a small-scale low-cost CNN module, such as the embedded CNN module of the Hi3519.
Fig. 3 illustrates a block diagram of an exemplary face detection system 300 based on a small-scale hardware CNN module, according to some embodiments of the present application. In some embodiments, the face detection system 300 is implemented on a CNN-enabled embedded system that includes a small-scale low-cost SoC, such as the Hi3519 SoC. As shown in fig. 3, the face detection system 300 receives a video image 302 as input and generates a face detection decision 316 as output. In some embodiments, the input video image 302 is a video frame of a video captured by a camera. Note that the face detection system 300 includes at least a motion detection module 304, a pyramid and block generation module 306, a first-level CNN 308, a second-level CNN 310, a third-level CNN 312, and a final decision module 314. The face detection system 300 may also include other modules not shown in fig. 3. Each of the modules of the face detection system 300 is described in more detail below.
As can be seen, the motion detection module 304 first receives the input video image 302. In some embodiments, faces within a given video are assumed to be associated with motion. Thus, to reduce computational complexity, the motion detection module 304 may locate and identify regions within each video frame that are associated with motion, based on a comparison with previously received video frames. It should be noted that these moving regions may contain human or non-human objects, such as a moving automobile. Moreover, when a person is moving, the moving region may include both the face and the body of that person. When the face detection system 300 is implemented on Hi3519, the motion detection module 304 may be implemented by the embedded motion detection hardware module of Hi3519. The output of the motion detection module 304 includes a set of identified moving regions having different sizes. Each identified moving region, as part of the input video image 302, is sent to the subsequent face detection modules within the face detection system 300 to detect most or all faces within the moving region. In this embodiment, the non-moving regions within the input video image 302 are generally not considered for face detection. However, some other embodiments of the face detection system presented herein may not include a motion detection module.
In some embodiments, a face tracking module (not shown) may be used in place of or in conjunction with the motion detection module 304. The face tracking module is used to track faces already detected by the face detection system 300. More specifically, the face tracking module computes a motion trajectory based on the face positions in previous video frames, predicts the new positions of the detected faces in a new video frame based on the computed trajectory, and then searches for the faces in the vicinity of the predicted positions. It should be noted that by combining motion detection and face tracking within the face detection system 300, the speed of face detection can be significantly increased.
In some embodiments, the size of a given moving region 318 generated by the motion detection module 304, by the face tracking module, or by a combination of motion detection and face tracking has a minimum value. The minimum size of the moving region may be determined based on one or more design parameters as well as the limitations of the small-scale hardware CNN module employed in the face detection system 300, such as when the face detection system 300 is implemented on Hi3519. In some embodiments, the one or more design parameters include a preliminary downsampling factor of the pyramid and block generation module 306 and a minimum input image size of the first level CNN 308. For example, if the preliminary downsampling factor of the pyramid and block generation module 306 is 2:1 and the minimum input image size of the first level CNN 308 is 16 × 16, then the minimum size of a detectable face is 32 × 32. In another example, if the preliminary downsampling factor is 3:1 and the minimum input image size of the first level CNN 308 is 16 × 16, then the minimum size of a detectable face is 48 × 48. To reduce complexity, the minimum size of a moving region sent to the face detection modules is typically larger than the minimum detectable face size. In some embodiments, the maximum size of a moving region generated by the motion detection module 304 may be as large as the entire input video image 302; for example, such a moving region may correspond to an input image that is substantially completely covered by a human face.
As can be seen in fig. 3, the detected moving regions generated by the motion detection module 304 (or by the face tracking module, or by a combination of motion detection and face tracking) are each processed in the same manner by the other modules within the face detection system 300, including the pyramid and block generation module 306, the first level CNN 308, the second level CNN 310, the third level CNN 312, and the final decision module 314. Thus, the operations described below with respect to these modules are performed repeatedly for each detected moving region 318. This per-region processing loop is indicated by the dashed box surrounding these modules in fig. 3. Accordingly, the following discussion of the face detection system 300 is directed to, and applies equally to, all of the detected moving regions 318.
In the face detection system 300, each detected moving region 318 is received by the pyramid and block generation module 306 as part of the input video image 302. The pyramid and block generation module 306 downsamples the moving region 318 using different downsampling factors to convert the moving region 318 into a "pyramid" of multi-resolution representations of the moving region 318, thereby allowing the subsequent face detection modules to detect faces of different sizes within the moving region 318. More specifically, a higher-resolution representation of the moving region 318 in the "pyramid" may be used to detect smaller faces in the initial input image 302, while a lower-resolution representation of the moving region 318 in the "pyramid" may be used to detect larger faces in the initial input image 302.
In some embodiments, the highest-resolution representation of the moving region 318 in the pyramid is determined by the input size of the first level CNN 308 and the minimum desired size of detectable faces. Note that the input size of the first level CNN 308 may be a user-defined parameter, whose minimum value is limited by the minimum input size of the first level CNN 308, which in turn is constrained by the particular device. For example, for the embedded CNN module of Hi3519, the minimum input size is 16 × 16; this constraint means that the input size of the first level CNN 308 needs to be at least 16 × 16. The highest-resolution representation also determines the smallest face that the face detection system 300 can detect. More particularly, the smallest detectable face can be determined by multiplying the input size of the first level CNN 308 by the initial downsampling factor employed by the pyramid and block generation module 306. For example, if the input size of the first level CNN 308 is 16 × 16 and the initial downsampling factor of the pyramid and block generation module 306 is 3, then the smallest detectable face is 48 × 48. If the initial downsampling factor is 2 and the input size is 16 × 16, then the smallest detectable face is 32 × 32.
It should be noted that the initial downsampling factor used by the pyramid and block generation module 306 needs to be determined in light of the trade-off between face detection accuracy and speed. In practice, the initial downsampling factor may be determined as the ratio of the minimum size of detectable faces to the input size of the first level CNN 308. For example, if the input size of the first level CNN 308 is 16 × 16 and the minimum size of detectable faces is about 48 × 48, the initial downsampling factor should be 3. In some embodiments, the user-specified input size of the first level CNN 308 may be greater than its minimum input size, i.e., 16 × 16.
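As a minimal sketch of this arithmetic (the function names are illustrative, not from the patent):

```python
def min_detectable_face(input_size: int, downsample_factor: int) -> int:
    """Smallest detectable face = CNN input size x initial downsampling factor."""
    return input_size * downsample_factor

def initial_downsample_factor(min_face: int, input_size: int) -> int:
    """Initial downsampling factor = minimum face size / CNN input size."""
    return min_face // input_size

assert min_detectable_face(16, 2) == 32    # 2:1 pyramid -> 32 x 32 faces
assert min_detectable_face(16, 3) == 48    # 3:1 pyramid -> 48 x 48 faces
assert initial_downsample_factor(48, 16) == 3
```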
In some embodiments, the lowest-resolution representation of the moving region in the pyramid should be equal to or close to, but not smaller than, the minimum input size of the first level CNN 308, i.e., 16 × 16 for Hi3519. For example, the lowest-resolution representation of the moving region 318 may be a 24 × 24 image. The other resolution representations of the moving region 318 lie between the lowest and highest resolutions of the pyramid, with adjacent resolution representations typically spaced by a factor of 2:1 or 3:1.
For each received moving region 318, the pyramid and block generation module 306 generates a pyramid of multi-resolution representations of that moving region 318. In other words, the pyramid and block generation module 306 generates a set of images with different resolutions for the same portion of the initial video image 302. In some embodiments, the first level CNN 308 does not process whole pyramid images; instead, each image in the pyramid is divided into image blocks based on the user-specified input size described above. For example, if a 16 × 16 input size is used, each image in the pyramid is further divided into a set of 16 × 16 image blocks.
In some embodiments, the pyramid and block generation module 306 divides each image in the pyramid into a set of image blocks using a sliding window approach. More specifically, a sliding window of user-specified size, such as 16 × 16, is moved across each image in the pyramid in user-specified steps, such as 2 or 4 pixels in both row and column directions, and an image block is generated at each sliding window position. In this way, the pyramid and block generation module 306 generates and outputs sets of image blocks 320 of the same size, corresponding to the set of multi-resolution representations of the moving region. It should be noted that a higher-resolution representation of the moving region 318 produces more image blocks than a lower-resolution representation. Next, the sets of image blocks 320 are received by the first level CNN 308. Depending on the hardware configuration, the first level CNN 308 may process the received image blocks sequentially, block by block, or process multiple image blocks in parallel to increase throughput. Some embodiments of the first level CNN 308 are described in more detail below.
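A rough Python sketch of the pyramid and block generation just described is given below. The nearest-neighbor decimation stands in for whatever resampling the real module uses, and all names are illustrative assumptions:

```python
import numpy as np

def build_pyramid(region, min_side=16, scale=2):
    """Multi-resolution pyramid by repeated 2:1 decimation, stopping
    before any side would drop below the CNN's minimum input size."""
    levels = [region]
    while min(levels[-1].shape[:2]) // scale >= min_side:
        levels.append(levels[-1][::scale, ::scale])
    return levels

def sliding_blocks(image, win=16, step=4):
    """Yield (row, col, block) for each win x win sliding-window position."""
    h, w = image.shape[:2]
    for r in range(0, h - win + 1, step):
        for c in range(0, w - win + 1, step):
            yield r, c, image[r:r + win, c:c + win]

region = np.zeros((96, 96), dtype=np.uint8)        # a detected moving region
for level in build_pyramid(region):
    blocks = list(sliding_blocks(level))           # image blocks 320 per level
```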
The first level CNN 308 is used to process each received image block corresponding to each sliding window position within each pyramid representation of the moving region 318. Fig. 4 shows a block diagram of an exemplary implementation 400 of the first level CNN 308 based on a small-scale hardware CNN module, according to some embodiments described herein.
As can be seen in fig. 4, the first level CNN 400 includes two stages of CONV and MP layers (i.e., CONV(1)/MP(1) and CONV(2)/MP(2)), followed by two FC layers (i.e., FC(1) and FC(2)). In some embodiments, each CONV layer and each FC layer (except the last FC layer) is followed by a ReLU layer (not shown in fig. 4). The input to the first level CNN 400 is a three-channel (i.e., R/G/B) input image block 402 (i.e., one of the sets of image blocks 320 in fig. 3) of size 16 × 16. In the embodiment shown in the figure, the CONV(1) layer comprises 10 3 × 3 filters with a step size of 1; therefore, the output of the CONV(1) layer has a size of 14 × 14 × 10. The MP(1) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(1) layer has a size of 7 × 7 × 10. The CONV(2) layer includes 16 3 × 3 filters with a step size of 1; therefore, the output of the CONV(2) layer has a size of 5 × 5 × 16. The MP(2) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(2) layer has a size of 3 × 3 × 16. The outputs of the first and last FC layers are 32 × 1 and 16 × 1 vectors, respectively. In some embodiments, within the final 16 × 1 output vector, the first two outputs are used to generate a face detection confidence index (also referred to as a "face classifier"); the next 4 outputs are the bounding box coordinates (also referred to as the "bounding box regression operator") of the face in the image block 402 (if a face is detected in the image block 402); and the last 10 outputs represent the positions of 5 facial keypoints of the detected face, i.e., the left eye, right eye, nose, and two mouth corners (also referred to as the "keypoint localization operator"). Thus, the output of the first level CNN 400 is a set of candidate face windows/bounding boxes corresponding to a subset of the image blocks 320 shown in fig. 3.
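The layer sizes quoted above follow from standard valid-convolution and pooling size arithmetic; note that the quoted MP output sizes (e.g., 5 × 5 pooled to 3 × 3) imply ceiling-mode pooling. The following sketch, with illustrative helper names, reproduces them:

```python
import math

def conv_out(n, k=3, stride=1):
    """Output side length of a valid (unpadded) convolution."""
    return (n - k) // stride + 1

def pool_out(n, k=2, stride=2):
    """Output side length of max pooling (ceiling mode, per the quoted sizes)."""
    return math.ceil((n - k) / stride) + 1

n = 16                         # 16 x 16 input block 402
n = pool_out(conv_out(n))      # CONV(1) -> 14 x 14 x 10, MP(1) -> 7 x 7 x 10
assert n == 7
n = pool_out(conv_out(n))      # CONV(2) -> 5 x 5 x 16,   MP(2) -> 3 x 3 x 16
assert n == 3                  # 3*3*16 features feed FC(1): 32 -> FC(2): 16
```

The same two helpers reproduce the per-layer sizes quoted below for the second level CNN 500 (24 → 22 → 11 → 9 → 5 → 3 → 2) and the third level CNN 600 (46 → 44 → 22 → 20 → 10 → 8 → 4).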
It is noted that the combination of the number of layers and filters, the input image size, the filter and pooling window sizes, the FC layer output sizes, and other parameters shown in the first level CNN 400 is only one exemplary configuration of the first level CNN 308. The first level CNN 308 may be constructed with other configurations having one or more parameter values different from those shown in fig. 4 without departing from the scope of the technology described herein. In some embodiments, such as the exemplary first level CNN 400 shown in fig. 4, the first level CNN 308 satisfies the constraints of a small-scale hardware CNN module, such as the embedded hardware CNN module within Hi3519, so that it may be implemented by that embedded hardware CNN module.
In some embodiments, to eliminate more "false alarms," i.e., image blocks that the first level CNN 308 classifies as faces but that do not actually contain a face, a filter may be applied to the face detection confidence index at the detection output. The filter only retains image blocks whose face detection confidence index is greater than a threshold (typically set between 0.5 and 0.7). In some embodiments, this filtering operation is implemented after the last FC layer of the first level CNN 308.
It should be noted that, because the multi-resolution representations are generated by the pyramid technique and the image blocks are generated by the sliding window technique, multiple overlapping but distinct bounding boxes can be generated around each face in the input image. In some embodiments, for each image block classified as a face by the first level CNN 308, a corresponding image region is identified in the initial input video image 302. Next, those highly overlapping bounding boxes are merged using the non-maximum suppression (NMS) technique, as described in MTCNN. The NMS operation may be performed after the filtering operation on the candidate face windows described above. In some embodiments, this NMS operation is implemented within the first level CNN 308 of the face detection system 300. After the NMS operation, the remaining bounding boxes may be adjusted through a bounding box regression operation to refine their locations, as also described in MTCNN. This bounding box regression operation may likewise be performed within the first level CNN 308 of the face detection system 300. Thus, after these filtering, NMS, and bounding box regression operations, the first level CNN 308 outputs a set of candidate face bounding boxes, or "candidate face windows."
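A standard greedy NMS of the kind referenced here, combined with the confidence filtering described above, can be sketched as follows. The thresholds and the [x1, y1, x2, y2] box format are illustrative assumptions; the patent does not fix them beyond the 0.5-0.7 confidence range:

```python
import numpy as np

def filter_and_nms(boxes, scores, conf_threshold=0.6, iou_threshold=0.5):
    """Drop low-confidence candidates, then greedily suppress overlaps."""
    keep_mask = scores > conf_threshold            # confidence filtering
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(scores)[::-1]               # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection-over-union of box i with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * \
                (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]         # discard heavy overlaps
    return boxes[keep], scores[keep]
```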
In some embodiments, for each candidate face window 322 output by the first level CNN 308, a corresponding image block is located in and cropped from the initial input video image 302, and the cropped image block is then resized to the user-specified input size of the second level CNN 310. Following this coarse-to-fine approach, the user-specified input size of the second level CNN 310 should be larger than the input size of the first level CNN 308. In some embodiments, the input size of the second level CNN 310 is 24 × 24, so the resized image blocks are also 24 × 24 in size. However, in other embodiments, input sizes similar to, but slightly different from, 24 × 24 may also be employed without departing from the scope of the described techniques. The process of generating resized image blocks from the candidate face windows 322 may be implemented in hardware, software, or a combination of hardware and software, and the corresponding processing module, which is not explicitly shown in the figure, may be located between the first level CNN 308 and the second level CNN 310. Next, the second level CNN 310 receives the resized image blocks 324. Depending on the hardware configuration, the second level CNN 310 may process the received image blocks 324 sequentially, block by block, or process multiple image blocks in parallel to increase throughput. Some embodiments of the second level CNN 310 are described in more detail below.
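The crop-and-resize step between stages can be sketched as follows. The nearest-neighbor resampling and all names are stand-ins, since the patent does not specify the interpolation method:

```python
import numpy as np

def crop_and_resize(frame, box, out_size=24):
    """Cut the candidate window from the full input image and resize it
    to the next stage's input size (nearest-neighbor, for illustration)."""
    x1, y1, x2, y2 = (int(v) for v in box)
    patch = frame[y1:y2, x1:x2]
    rows = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
    cols = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
    return patch[np.ix_(rows, cols)]

frame = np.zeros((720, 1280), dtype=np.uint8)          # grayscale input frame
block_24 = crop_and_resize(frame, (100, 100, 164, 164))
assert block_24.shape == (24, 24)                      # input to second level CNN
```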
Fig. 5 shows a block diagram of an exemplary implementation 500 of the second level CNN 310 based on a small-scale hardware CNN module, according to some embodiments described herein.
As can be seen in fig. 5, the second level CNN 500 includes three stages of CONV and MP layers (i.e., CONV(1)/MP(1), CONV(2)/MP(2), and CONV(3)/MP(3)), followed by two FC layers (i.e., FC(1) and FC(2)). In some embodiments, each CONV layer and each FC layer (except the last FC layer) is followed by a ReLU layer (not shown in fig. 5). In some embodiments, the second level CNN 500 satisfies the constraints of the embedded hardware CNN module of Hi3519. For example, the input to the second level CNN 500 is a single-channel grayscale image 502 (i.e., one of the resized image blocks 324 in fig. 3) of size 24 × 24, rather than the 24 × 24 × 3 RGB image used in the second stage of the MTCNN. This is because the maximum input size supported by Hi3519 is 1280 pixels, whereas 24 × 24 × 3 = 1728. However, experimental results show that using grayscale images instead of color images does not significantly affect performance. Thus, the second level CNN 500 can be efficiently implemented on a small-scale hardware CNN module such as the embedded CNN module within Hi3519.
In the embodiment shown, the CONV(1) layer includes 28 3 × 3 filters with a step size of 1. Thus, the output of the CONV(1) layer has a size of 22 × 22 × 28 (based on the 24 × 24 input size). The MP(1) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(1) layer has a size of 11 × 11 × 28. The CONV(2) layer includes 32 3 × 3 filters with a step size of 1; therefore, the output of the CONV(2) layer has a size of 9 × 9 × 32. The MP(2) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(2) layer has a size of 5 × 5 × 32. The CONV(3) layer comprises 48 3 × 3 filters with a step size of 1; therefore, the output of the CONV(3) layer has a size of 3 × 3 × 48. The MP(3) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(3) layer has a size of 2 × 2 × 48. The outputs of the first and last FC layers are 128 × 1 and 16 × 1 vectors, respectively. It should be noted that although each CONV layer uses more filters than in the first level CNN 400 and the FC layers are also larger than those of the first level CNN 400, the design of the second level CNN 500 still satisfies the constraints of the embedded CNN module of Hi3519.
As can be seen, the output of the last FC layer of the second level CNN 500 is again a 16 × 1 vector, in which the first two outputs are used to generate a face detection confidence index, or face classifier; the next 4 outputs are the bounding box coordinates, or bounding box regression operator, of the face in the image block 502 (if a face is detected in the image block 502); and the last 10 outputs represent the positions of the 5 facial keypoints of the detected face, i.e., the left eye, the right eye, the nose, and the two mouth corners, forming the keypoint localization operator. However, since the input image resolution of the second level CNN 500 is higher than that of the first level CNN 400, and the CNN 500 has stronger processing power than the CNN 400, the face detection accuracy of the CNN 500 is also higher than that of the CNN 400. Thus, the second level CNN 500 outputs a set of candidate face windows/bounding boxes (such as the candidate face windows shown in fig. 3) corresponding to a subset of the input image blocks 502.
Similar to the first level CNN 308, a confidence index threshold may be applied to the face detection confidence index at the detection output of the second level CNN 310, retaining only input image blocks with face detection confidence indices greater than the threshold. In some embodiments, this filtering operation is implemented after the last FC layer of the second level CNN 310. Similarly, after filtering the candidate bounding boxes, the highly overlapping candidate bounding boxes may be merged using the NMS technique mentioned above. In some embodiments, this NMS operation is also implemented within the second level CNN 310. Typically, only a small subset of candidate face windows remains after the filtering and NMS operations. After the NMS operation, the locations of the remaining bounding boxes may be refined by the bounding box regression operator, and this refinement may also be implemented within the second level CNN 310.
It is noted that the combination of the number of layers and filters, the input image size, the filter and pooling window sizes, the FC layer output sizes, and other parameters shown in the second level CNN 500 is only one exemplary configuration of the second level CNN 310. The second level CNN 310 may be constructed with other configurations having one or more parameter values different from those shown in fig. 5 without departing from the scope of the technology described herein. For example, the input size of the second level CNN 310 need not be exactly 24 × 24; other similar sizes, e.g., 32 × 32, may also be used. In some embodiments, such as the exemplary second level CNN 500 shown in fig. 5, the second level CNN 310 satisfies the constraints of a small-scale hardware CNN module, such as the embedded hardware CNN module within Hi3519, so that it may be implemented by that embedded hardware CNN module.
In some embodiments, for each candidate face window 326 output by the second level CNN 310, a corresponding image block is located in and cropped from the initial input video image 302, and the cropped image block is then resized to the user-specified input size of the third level CNN 312. Following the coarse-to-fine approach, the user-specified input size of the third level CNN 312 should be larger than the input sizes of the first and second level CNNs 308 and 310. In some embodiments, the input size of the third level CNN 312 is 46 × 46, so the resized image blocks are also 46 × 46 in size. However, in other embodiments, input sizes similar to, but slightly different from, 46 × 46 may also be employed without departing from the scope of the described techniques. The process of generating resized image blocks from the candidate bounding boxes may be implemented in hardware, software, or a combination of hardware and software, and the corresponding processing module, which is not explicitly shown in the figure, may be located between the second level CNN 310 and the third level CNN 312. Next, the third level CNN 312 receives the resized image blocks 328 for final refinement. Depending on the hardware configuration, the third level CNN 312 may process the received image blocks 328 sequentially, block by block, or process multiple image blocks in parallel to increase throughput.
In principle, the third level CNN 312 processes the input image blocks 328 in a manner similar to the first level CNN 308 and the second level CNN 310. For example, fig. 6 illustrates a block diagram of an exemplary implementation 600 of the third level CNN 312, according to some embodiments described herein.
As can be seen in fig. 6, the third level CNN 600 also includes three stages of CONV and MP layers (i.e., CONV(1)/MP(1), CONV(2)/MP(2), and CONV(3)/MP(3)), followed by two FC layers (i.e., FC(1) and FC(2)). In the embodiment shown, the CONV(1) layer includes 32 3 × 3 filters with a step size of 1. Thus, the output of the CONV(1) layer has a size of 44 × 44 × 32 (based on the 46 × 46 input size). The MP(1) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(1) layer has a size of 22 × 22 × 32. The CONV(2) layer includes 50 3 × 3 filters with a step size of 1; therefore, the output of the CONV(2) layer has a size of 20 × 20 × 50. The MP(2) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(2) layer has a size of 10 × 10 × 50. The CONV(3) layer includes 50 3 × 3 filters with a step size of 1; therefore, the output of the CONV(3) layer has a size of 8 × 8 × 50. The MP(3) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(3) layer has a size of 4 × 4 × 50. The outputs of the first and last FC layers are 256 × 1 and 16 × 1 vectors, respectively.
It is noted that the size of the input image block 602 (i.e., one of the resized image blocks 328 in fig. 3) is 46 × 46 × 1 = 2116 (i.e., a single-channel grayscale image), so the maximum input size of the CNN module used to implement the third level CNN 600 discussed above needs to be greater than 2116. If the maximum input size of a CNN module is less than 2116, that module cannot be used to directly implement the third level CNN 600. Thus, in the embodiment shown in fig. 6, the embedded hardware CNN module of Hi3519, which only supports a maximum input size of 1280 pixels, cannot directly implement this third level CNN 600, although the design of CNN 600 remains useful for optimizing the network parameters at the design stage.
To solve the above problem, the sub-image based CNN system and technique described in the related patent application of the present application may be employed. More specifically, with the sub-image based CNN system and technique, the input image block 602 may be divided into a set of overlapping sub-images. For example, fig. 7 illustrates an exemplary input image partitioning scheme for 46 × 46 image blocks according to some embodiments described herein. As can be seen on the left side of fig. 7, the input image block 602 may be divided into a set of 4 overlapping sub-images, each of size 30 × 30, with an offset (or step size, or overlap) of 16 pixels between adjacent sub-images. Note that in fig. 7, the 4 sub-images are drawn with small artificial offsets so that they can be visualized and distinguished more easily. These offsets exist only in the figure and should not be understood as actual offsets between the sub-images. In practice, the row coordinates of the 4 sub-images start at 1 and 17, respectively, and the column coordinates of the 4 sub-images likewise start at 1 and 17, respectively. The set of 4 overlapping sub-images without the artificial offsets is shown as a smaller inset in the upper right corner of the main image.
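The partitioning just described can be reproduced with a few lines (0-indexed here, whereas the text above uses 1-indexed coordinates 1 and 17; the function name is illustrative):

```python
def subimage_origins(input_size=46, sub_size=30, stride=16):
    """Top-left corners of the overlapping sub-images (0-indexed)."""
    starts = range(0, input_size - sub_size + 1, stride)
    return [(r, c) for r in starts for c in starts]

print(subimage_origins(46))        # [(0, 0), (0, 16), (16, 0), (16, 16)]
print(len(subimage_origins(62)))   # 9 sub-images for the 62 x 62 case noted below
```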
It is noted that the specific values used here (i.e., the 46 × 46 input image size, the 30 × 30 sub-image size, and the 16-pixel step size) are based on the theoretical design described in related patent application 15/441,194, which is incorporated herein by reference. As described above and demonstrated in the related patent application, these design values ensure that the merged outputs of the 4 sub-images are equivalent to the output of the third level CNN 600 processing the entire input image block without the sub-image based CNN technique.
Fig. 8 illustrates a block diagram of an exemplary implementation 800 of the third level CNN 312 based on a small-scale hardware CNN module, according to some embodiments of the present application. As can be seen in fig. 8, the third level CNN 800 also includes three stages of CONV and MP layers (i.e., CONV(1)/MP(1), CONV(2)/MP(2), and CONV(3)/MP(3)) having the same parameters as the corresponding CONV and MP layers of the third level CNN 600. The third level CNN 800 further comprises an input module 802 that receives the 46 × 46 input image block 602. The input module 802 is used to divide the image block 602 into 4 sub-images of size 30 × 30, each of which is smaller than the maximum input image size of the embedded hardware CNN within Hi3519. More detailed operation of the input module 802 can be found in related patent application 15/441,194 (e.g., input module 212 shown in fig. 2B thereof), the contents of which are incorporated herein by reference.
In some embodiments, the three stages of CONV and MP layers of the third level CNN 800 are used to process the 4 sub-images 804 sequentially. As can be seen in fig. 8, for a given 30 × 30 sub-image 804 (which is a portion of the image block 602), the CONV(1) layer comprises 32 3 × 3 filters with a step size of 1; therefore, the output of the CONV(1) layer has a size of 28 × 28 × 32. The MP(1) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(1) layer has a size of 14 × 14 × 32. The CONV(2) layer includes 50 3 × 3 filters with a step size of 1; therefore, the output of the CONV(2) layer has a size of 12 × 12 × 50. The MP(2) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(2) layer has a size of 6 × 6 × 50. The CONV(3) layer includes 50 3 × 3 filters with a step size of 1; therefore, the output of the CONV(3) layer has a size of 4 × 4 × 50. The MP(3) layer uses a 2 × 2 pooling window with a step size of 2; therefore, the output of the MP(3) layer has a size of 2 × 2 × 50, i.e., a set of 50 2 × 2 feature maps 806. For the set of 4 sub-images 804, the MP(3) layer thus generates 4 sets of 2 × 2 × 50 feature maps 806.
As shown in fig. 8, the third level CNN 800 further comprises a merging module 808, which is configured to receive and merge the 4 sets of 2 × 2 × 50 feature maps 806 to form the complete feature maps of the complete input image block 602, which is the input of the third level CNN 800. More detailed operation of the merging module 808 can be found in related patent application 15/441,194 (e.g., merging module 222 shown in fig. 2B thereof), the contents of which are incorporated herein by reference. As described in the related patent application, the output feature maps associated with the set of 4 sub-images 804 have no overlapping portions and can therefore be merged directly, before the first FC layer, to generate the same output as the third level CNN 600 in fig. 6. The merged result, i.e., the output of the third level CNN 800, is a set of 50 4 × 4 feature maps 810, one of which is shown on the right side of fig. 7.
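The merging step can be illustrated with numpy: because the 2 × 2 outputs of the 4 sub-images do not overlap, they tile directly into the 4 × 4 feature maps of the whole block. The array shapes and variable names below are illustrative (height × width × channels):

```python
import numpy as np

# One 2 x 2 x 50 output per sub-image, keyed by its (row, col) grid position.
sub_outputs = {(r, c): np.random.rand(2, 2, 50)
               for r in range(2) for c in range(2)}

merged = np.zeros((4, 4, 50))
for (r, c), fmap in sub_outputs.items():
    merged[2 * r:2 * r + 2, 2 * c:2 * c + 2, :] = fmap
# merged now holds the 50 4 x 4 feature maps of the full 46 x 46 block,
# matching what the third level CNN 600 would produce on the undivided input.
```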
In some embodiments, the embedded hardware CNN module of Hi3519 is used to implement the three stages of CONV and MP layers shown in the third level CNN 800. However, the embedded hardware CNN module of Hi3519 also requires at least three FC layers. In one embodiment, to accommodate the FC layers required by Hi3519, the third level CNN 800 further includes two virtual FC layers (not explicitly shown in the figure) whose weight matrices are identity matrices. Furthermore, in Hi3519 there is one ReLU layer after each FC layer; however, as disclosed in the related patent application, these ReLU layers do not affect the output of the virtual FC layers, because multiple serially connected ReLU layers are equivalent to a single ReLU layer.
It should be noted that the input image size of the third level CNN 800 is not necessarily 46 × 46. It may be another size smaller than the maximum input size of the embedded hardware CNN of Hi3519 (in which case the input need not be divided into sub-images), or another larger feasible size; the requirements for such feasible sizes can be found in related patent application 15/441,194, which is incorporated herein by reference. For example, another possible input image size for the third level CNN 800 is 62 × 62. With this image size, the input image block may be divided into 9 overlapping sub-images, each of size 30 × 30, with adjacent sub-images offset by a step size of 16 in both the horizontal and vertical directions.
Returning to fig. 3, if the third level CNN 312 of the face detection system 300 is implemented using the third level CNN 800, the third level CNN 312 outputs 50 sets of 4 × 4 feature maps 810, which are the input to the final decision module 314. In some embodiments, the final decision module 314 includes multiple FC layers that operate on the received feature maps and generate a final decision for the input video image 302, such as the face detection decision 316 shown in fig. 3.
FIG. 9 illustrates a block diagram of an exemplary implementation 900 of the final decision module 314, according to some embodiments of the present application. As can be seen in fig. 9, the 50 sets of 4 × 4 feature maps 810 are received and processed by a reorganization module configured to merge and reorganize the set of two-dimensional feature maps into a one-dimensional vector of size 800 × 1. The one-dimensional vector is further processed by two FC layers, FC(1) and FC(2), the second of which outputs the face detection decision 316 for the given detected moving region 318. In the illustrated embodiment, the face detection decision 316 may include a face classifier 904, a bounding box regression operator 906, and a facial keypoint localization operator 908. As described above, the keypoint localization operator 908 within the face detection decision 316 may include the 5 facial keypoints of the detected face, i.e., the left eye, the right eye, the nose, and the two mouth corners. Although the sizes of the two FC layers within the final decision module 900 are 256 and 16, respectively, in other embodiments the final decision module 314 may have FC layer sizes different from those of the final decision module 900. It should be noted that the final decision module 900 can be implemented in software and executed on the CPU of Hi3519, since its computational complexity is much lower than that of any of the three levels of CNN 308, 310 and 312.
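A minimal sketch of the final decision computation follows, with untrained placeholder weights (the real FC parameters come from training; the output ordering follows the classifier/bounding-box/keypoint convention quoted above, and all names are illustrative):

```python
import numpy as np

def final_decision(feature_maps, w1, b1, w2, b2):
    """Reorganize 50 4x4 feature maps into an 800-vector, apply two FC layers."""
    x = feature_maps.reshape(-1)              # 50 * 4 * 4 = 800-element vector
    h = np.maximum(0.0, w1 @ x + b1)          # FC(1): 800 -> 256, with ReLU
    out = w2 @ h + b2                         # FC(2): 256 -> 16
    return out[0:2], out[2:6], out[6:16]      # classifier, bbox, 5 keypoints

rng = np.random.default_rng(0)
fm = rng.random((50, 4, 4))                   # the 50 sets of 4 x 4 maps 810
w1, b1 = rng.random((256, 800)), np.zeros(256)
w2, b2 = rng.random((16, 256)), np.zeros(16)
face_score, bbox_reg, keypoints = final_decision(fm, w1, b1, w2, b2)
```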
Fig. 10 presents a flowchart describing an exemplary face detection process 1000 performed with the face detection system 300 disclosed herein on a CNN-enabled embedded system, in accordance with some embodiments of the present invention. In some embodiments, the CNN-enabled embedded system comprises a small-scale, low-cost system-on-chip, such as the Hi3519 system-on-chip. The face detection process 1000 starts when a video image/frame is received at the input of the disclosed face detection system (step 1002). In some embodiments, the video image is acquired by a high-resolution camera, such as a surveillance camera, a machine vision camera, a camera on an autonomous vehicle, or a mobile phone camera.
Next, in the face detection process 1000, a motion detection operation may be performed on the input video image/frame to locate and identify a set of moving regions within the video frame, i.e., image blocks within the video frame that are associated with motion (step 1004). In some embodiments, the motion detection operation may be implemented using an embedded background removal module within the CNN-enabled embedded system to detect moving regions within the video image/frame. The output of the motion detection operation includes a set of identified moving regions within the video frame. In some embodiments, the motion detection operation may be replaced by or combined with a face tracking operation. It should be noted that by combining motion detection and face tracking in the face detection process 1000, the face detection speed can be significantly increased. In some embodiments, the face detection process 1000 omits this motion detection operation.
Next, in the face detection process 1000, for each detected moving region, a pyramid generation operation may be performed on the detected moving region to generate a multi-resolution representation of the detected moving region (step 1006). More specifically, a higher-resolution representation of the detected moving region may be used to detect smaller faces in the initial input video image, while a lower-resolution representation of the detected moving region may be used to detect larger faces in the initial input video image.
Next, in the face detection process 1000, a sliding window operation is performed on each image of the multi-resolution representation to generate a set of image blocks for that image (step 1008). In some embodiments, the size of the sliding window is determined by a first input size of a first CNN processing stage configured with a first complexity.
Next, in the face detection process 1000, the first CNN processing stage is used to process all image blocks corresponding to each sliding window position of each multi-resolution representation of the detected moving region to generate a first set of candidate face windows (step 1010). In some embodiments, each window in the first set of candidate face windows is associated with a confidence index and a set of bounding box coordinates. In some embodiments, each candidate face window is also associated with 5 facial keypoints, namely the left eye, the right eye, the nose, and the two mouth corners. In some embodiments, the first CNN processing stage satisfies the limitations of a small-scale hardware CNN module, such as the embedded hardware CNN module within Hi3519, so that this CNN processing stage may be implemented by the embedded hardware CNN module within Hi3519.
Next, in the face detection process 1000, a second CNN processing stage is used to process a first set of resized image blocks corresponding to the first set of candidate face windows to generate a second set of candidate face windows (step 1012). In some embodiments, the second CNN processing stage has a second complexity that is higher than the first complexity. In some embodiments, the first set of resized image blocks has a size equal to a second input size of the second CNN processing stage, wherein the second input size is larger than the first input size of the first CNN processing stage. Thus, the second CNN processing stage processes higher-resolution input image blocks with higher face detection accuracy than the first CNN processing stage. In some embodiments, each window in the second set of candidate face windows is associated with a confidence index and a set of bounding box coordinates. In some embodiments, each candidate face window is also associated with 5 facial keypoints, namely the left eye, the right eye, the nose, and the two mouth corners. In some embodiments, the second CNN processing stage satisfies the limitations of a small-scale hardware CNN module, such as the embedded hardware CNN module within Hi3519, so that this CNN processing stage may be implemented by the embedded hardware CNN module within Hi3519.
Next, in the face detection process 1000, a third CNN processing stage is used to process a second set of resized image blocks corresponding to the second set of candidate face windows to generate a third set of candidate face windows (step 1014). In some embodiments, the third CNN processing stage has a third complexity that is higher than the first and second complexities. In some embodiments, the second set of resized image blocks has a size equal to a third input size of the third CNN processing stage, wherein the third input size is larger than the first and second input sizes of the first and second CNN processing stages. Thus, the third CNN processing stage processes higher-resolution input image blocks with higher face detection accuracy than the first and second CNN processing stages. In some embodiments, each window in the third set of candidate face windows is associated with a confidence index and a set of bounding box coordinates. In some embodiments, each window in the third set of candidate face windows is also associated with 5 facial keypoints, namely the left eye, the right eye, the nose, and the two mouth corners. It should be noted that steps 1006 through 1014 are repeated for each detected moving region within the initial input video frame.
In some embodiments, this third CNN processing stage is also ideally implemented using a small-scale hardware CNN module, such as the embedded hardware CNN module within Hi3519. However, since the input size of the third CNN processing stage may be larger than the maximum input size of the small-scale hardware CNN module, the sub-image based CNN method needs to be adopted.
Fig. 11 presents a flowchart describing an exemplary process 1100 for processing a second set of resized image blocks (i.e., step 1014 of process 1000) using a sub-image based CNN system, in accordance with some embodiments described herein.
Initially, a given resized image block is divided into a set of sub-images having a smaller image size (step 1102). In some embodiments, the set of sub-images comprises a two-dimensional array of overlapping sub-images. For example, a 46 × 46 image block may be divided into a set of 4 overlapping sub-images, where each sub-image is 30 × 30 in size and there is a 16-pixel offset between adjacent sub-images. Further, the sub-image size is smaller than the maximum input size of the small-scale hardware CNN module, such as the embedded hardware CNN module of Hi3519.
Next, the small-scale hardware CNN module sequentially processes the set of sub-images to generate an array of feature maps (step 1104). In some embodiments, processing each sub-image with the small-scale hardware CNN module comprises applying multiple stages of CONV and MP layers to the sub-image.
Next, the array of feature maps output by the small-scale hardware CNN module is merged into a set of merged feature maps (step 1106). More specifically, the merged feature maps are equivalent to the complete feature maps that a large-scale CNN would generate by processing the entire high-resolution resized image block directly, without partitioning. Next, a second CNN module processes the merged feature maps to predict whether the resized image block contains a face (step 1108). In some embodiments, processing the set of merged feature maps comprises applying multiple FC layers to the set of merged feature maps.
It should be noted that although the embodiments of the face detection system disclosed above apply the sub-image based CNN technique to the last CNN stage of a cascaded CNN system, in other embodiments the face detection system may apply the sub-image based CNN technique to more than one stage of the cascaded CNN system, for example, to the last two stages of the cascaded system.
FIG. 12 illustrates an exemplary embedded system within which the disclosed sub-image based face detection system operates, according to some embodiments described herein. The embedded system 1200 may be integrated into or implemented as a surveillance camera, a machine vision camera, an unmanned aerial vehicle, a robot, or an autonomous vehicle. As can be seen in fig. 12, the embedded system 1200 may include a bus 1202, a processor 1204, a memory 1206, a storage device 1208, a camera 1210, a CNN subsystem 1212, an output device interface 1214, and a network interface 1216.
Bus 1202 collectively represents all system, peripheral, and chipset buses to which the various components of embedded system 1200 may be connected. For example, the bus 1202 may communicatively connect the processor 1204 with memory 1206, storage 1208, camera 1210, CNN subsystem 1212, output device interface 1214, and network interface 1216.
The processor 1204 retrieves instructions and data from the memory 1206 for execution and processing in order to control the various components of the embedded system 1200. The processor 1204 may include any type of processor, including but not limited to microprocessors, mainframe computers, digital signal processors, personal organizers, device controllers, and computing engines within an appliance, as well as any other processor now known or later developed. Further, the processor 1204 may include one or more cores. The processor 1204 itself may include a cache memory for storing code and data for execution by the processor 1204.
Memory 1206 may include any type of memory that can store code and data for execution by processor 1204. This includes, but is not limited to, Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash memory, Read Only Memory (ROM), and any other type of memory now known or later developed.
The storage 1208 may include any type of non-volatile memory device that may be integrated with the embedded system 1200. This includes, but is not limited to, magnetic, optical, and magneto-optical storage devices, as well as storage devices based on flash memory and/or storage with battery backup power.
The bus 1202 is also connected to the camera 1210. The camera 1210 is configured to capture still images and/or video at a predetermined resolution and to transfer the captured image or video data over the bus 1202 to various components within the embedded system 1200, such as to the memory 1206 for buffering and to the CNN subsystem 1212 for deep-learning (DL) face detection. The camera 1210 may be a digital camera. In some embodiments, the camera 1210 is a digital camera equipped with a wide-angle lens. The images acquired by the camera 1210 may have different resolutions, including high resolutions such as 1280 × 720, 1920 × 1080, or other high resolutions.
The CNN subsystem 1212 is configured to receive an acquired video image, such as a high-resolution video image acquired via the bus 1202, perform the aforementioned face detection operation on the received video image, and then generate a face detection result of the acquired video image. In particular, CNN subsystem 1212 may include one or more small-scale hardware CNN modules. For example, CNN subsystem 1212 may include one or more Hi3519 systems on chip, where each Hi3519 system on chip includes a CPU with embedded hardware CNN and executable software CNN functionality. In some embodiments, the CNN subsystem 1212 performs functions according to one of the embodiments of the face detection system 300 disclosed herein.
Also connected to bus 1202 is an output device interface 1214, which output device interface 1214 may display results generated by CNN subsystem 1212, for example. Output devices used with output device interface 1214 include, for example, printers and display devices, such as cathode ray tube displays (CRTs), light emitting diode displays (LEDs), Liquid Crystal Displays (LCDs), organic light emitting diode displays (OLEDs), plasma displays, or electronic paper.
Finally, as shown in FIG. 12, the bus 1202 also connects the embedded system 1200 to a network (not shown) through a network interface 1216. As such, the embedded system 1200 may be part of a network, such as a local area network ("LAN"), a wide area network ("WAN"), or an intranet, or a network of networks, such as the internet. Any or all of the components of embedded system 1200 may be used with the disclosed subject matter.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, units, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The hardware used to implement the various illustrative logics, logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Alternatively, some steps or methods may be performed by circuitry that is specific to a given function.
In one or more exemplary aspects, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable storage medium or a non-transitory processor-readable storage medium. The steps of a method or algorithm disclosed herein may be embodied in processor-executable instructions that can reside on a non-transitory computer-readable or processor-readable storage medium. A non-transitory computer-readable or processor-readable storage medium may be any storage medium that can be accessed by a computer or a processor. By way of example, and not limitation, such non-transitory computer-readable or processor-readable storage media can comprise RAM, ROM, EEPROM, flash memory, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of non-transitory computer-readable and processor-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable storage medium and/or computer-readable storage medium, which may be incorporated into a computer program product.
Although this patent document contains many specifics, these should not be construed as limitations on the scope of any disclosed or claimed technology, but rather as descriptions of features specific to particular embodiments of particular technologies. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.
This patent document describes only a few implementations and examples, and other implementations, enhancements, and variations can be made based on what is described and illustrated in this patent document.

Claims (17)

1. A method for performing face detection on a video image using at least one small-scale hardware convolutional neural network module, the method comprising:
receiving a video sequence acquired by a camera;
for each video frame in the video sequence, detecting a candidate image area corresponding to a moving object in the video frame;
processing, by a hardware CNN module, a detected candidate image region using a sliding window method having a first image size to generate a first set of candidate face windows within the detected candidate image region, wherein the first image size is between a minimum input size and a maximum input size of the hardware CNN module;
processing, with the hardware CNN module, a first set of sub-regions within the candidate image region based on a second image size, wherein the first set of sub-regions corresponds to the first set of candidate face windows, to generate a second set of candidate face windows within the detected candidate image region, wherein the second image size is between a minimum and a maximum input size of the hardware CNN module;
processing, with the hardware CNN module, a second set of sub-regions within the candidate image region based on a third image size, wherein the second set of sub-regions corresponds to the second set of candidate face windows, to generate a third set of candidate face windows within the detected candidate image region, wherein the third image size is larger than a maximum input size of the hardware CNN module;
and analyzing each candidate face window in the third set of candidate face windows to determine whether it contains a face.
2. The method of claim 1, wherein the first image size is determined based on a minimum input size of the hardware CNN module.
3. The method of claim 1, wherein the first image size is a minimum input size of the hardware CNN module.
4. The method of claim 1, wherein detecting the candidate image region corresponding to the moving object within the video frame comprises performing a background removal operation.
5. The method of claim 1, wherein prior to said processing the detected candidate image regions using a sliding window method having a first image size, the method further comprises generating a pyramidal multi-resolution representation of the detected candidate image regions.
6. The method of claim 5, wherein processing the detected candidate image regions based on the first image size comprises:
for each image in the multi-resolution representation of the detected candidate image region, generating a set of image blocks for the image using a sliding window having a first image size;
generating the first set of candidate face windows from each set of image blocks corresponding to the sliding window positions of each multi-resolution representation of the detected candidate image region.
7. The method of claim 1, wherein each window in the first set of candidate face windows is associated with a confidence index and a set of bounding box coordinates defining a location of the candidate face window within the detected candidate image region.
8. The method of claim 1, wherein the second image size is larger than the first image size.
9. The method of claim 1, wherein processing the first set of sub-regions within the candidate image region based on the second image size, wherein the first set of sub-regions corresponds to the first set of candidate face windows, comprises:
resizing each of a first set of sub-regions within the candidate image region corresponding to the first set of candidate face windows to a second image block having the second image size;
generating the second set of candidate face windows from the second set of image blocks having the second image size.
10. The method of claim 1, wherein processing a second set of sub-regions within the candidate image region based on a third image size, wherein the second set of sub-regions corresponds to the second set of candidate face windows comprises resizing each sub-region in the second set of sub-regions to a third image block having a third image size.
11. The method of claim 10, further comprising processing the second set of candidate face windows by processing a third set of resized image blocks having a third image size.
12. The method of claim 11, wherein each resized image block in the third set of image blocks having the third image size is processed by:
dividing the resized image block into a set of sub-images having a fourth image size that is smaller than the maximum input size of the hardware CNN module;
sequentially processing the set of sub-images with the hardware CNN module to generate an array of feature maps;
merging the array of feature maps into a set of merged feature maps corresponding to the resized image block having the third image size; and
processing the set of merged feature maps to generate the third set of candidate face windows.
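A simplified numpy sketch of the divide-and-merge scheme in claim 12. hardware_cnn is a hypothetical stand-in for the embedded module, assumed to map a square tile of side sub to feature maps downsampled by stride; the tile size is assumed to be chosen so that adjacent tiles' feature maps abut without boundary effects, which in practice requires aligning tile boundaries with the CNN's strides (and, for exact equivalence, overlapping tiles by the filter margins).

import numpy as np

def divide_process_merge(block, sub, stride, hardware_cnn):
    h, w = block.shape[:2]
    rows, cols = h // sub, w // sub      # assume sub divides both dimensions
    fsub = sub // stride                 # feature-map side length of one tile
    merged = None
    for r in range(rows):
        for c in range(cols):
            tile = block[r * sub:(r + 1) * sub, c * sub:(c + 1) * sub]
            fmap = hardware_cnn(tile)    # expected shape: (fsub, fsub, channels)
            if merged is None:           # allocate once the channel count is known
                merged = np.zeros((rows * fsub, cols * fsub, fmap.shape[2]), dtype=fmap.dtype)
            merged[r * fsub:(r + 1) * fsub, c * fsub:(c + 1) * fsub] = fmap
    return merged

The merged feature maps can then be fed to the final classification layers exactly as if the oversized block had been processed in a single pass.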
13. The method of claim 1, wherein the hardware CNN module is a small-scale hardware CNN module.
14. The method of claim 1, wherein the hardware CNN module is embedded within a chipset or a system-on-a-chip.
15. The method of claim 14, wherein the system-on-chip comprises a HiSilicon Hi3519 system-on-chip.
16. An embedded system for performing face detection in an acquired video image, the system comprising:
a processor;
a memory connected to the processor;
image capture means coupled to said processor and said memory for capturing video images;
a hardware CNN module connected to the processor and the memory;
wherein the processor is configured to receive the acquired video image and detect a candidate image region corresponding to a moving object within the video image;
wherein the hardware CNN module is configured to:
processing a detected candidate image region with a first image size to generate a first set of candidate face windows within the detected candidate image region, wherein the first image size is between a minimum input size and a maximum input size of the hardware CNN module;
processing a first set of sub-regions within the candidate image region with a second image size, wherein the first set of sub-regions corresponds to the first set of candidate face windows, and generating a second set of candidate face windows within the detected candidate image region, wherein the second image size is between the minimum and maximum input sizes of the hardware CNN module;
processing a second set of sub-regions within the candidate image region with a third image size, wherein the second set of sub-regions corresponds to the second set of candidate face windows, to generate a third set of candidate face windows within the detected candidate image region, wherein the third image size is larger than the maximum input size of the hardware CNN module;
wherein the processor is further configured to analyze each candidate face window in the third set of candidate face windows to determine whether that candidate face window contains a face.
17. The system of claim 16, wherein the hardware CNN module generates the third set of candidate face windows by:
resizing each sub-region in the second set of sub-regions within the candidate image region into a third set of image blocks having the third image size, wherein the second set of sub-regions corresponds to the second set of candidate face windows;
dividing each image block in the third set of image blocks into a set of sub-images having a fourth image size that is smaller than the maximum input size of the hardware CNN module;
sequentially processing the set of sub-images with the hardware CNN module to generate an array of feature maps;
merging the array of feature maps into a set of merged feature maps corresponding to the resized image block having the third image size; and
processing the set of merged feature maps to generate the third set of candidate face windows.
CN201710692173.8A 2016-11-30 2017-08-14 Face detection using small scale convolutional neural network module in embedded system Active CN107506707B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201662428497P 2016-11-30 2016-11-30
US62/428,497 2016-11-30
US15/657,109 US10268947B2 (en) 2016-11-30 2017-07-21 Face detection using small-scale convolutional neural network (CNN) modules for embedded systems
US15/657,109 2017-07-21

Publications (2)

Publication Number Publication Date
CN107506707A (en) 2017-12-22
CN107506707B (en) 2021-05-25

Family

ID=60690968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710692173.8A Active CN107506707B (en) 2016-11-30 2017-08-14 Face detection using small scale convolutional neural network module in embedded system

Country Status (1)

Country Link
CN (1) CN107506707B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108109010A (en) * 2017-12-26 2018-06-01 南京开为网络科技有限公司 A kind of intelligence AR advertisement machines
CN108345832A (en) * 2017-12-28 2018-07-31 新智数字科技有限公司 A kind of method, apparatus and equipment of Face datection
CN108288280A (en) * 2017-12-28 2018-07-17 杭州宇泛智能科技有限公司 Dynamic human face recognition methods based on video flowing and device
US10943096B2 (en) * 2017-12-31 2021-03-09 Altumview Systems Inc. High-quality training data preparation for high-performance face recognition systems
CN108446602B (en) * 2018-02-28 2021-08-20 中科视拓(北京)科技有限公司 Device and method for detecting human face
CN108460362B (en) * 2018-03-23 2021-11-30 成都品果科技有限公司 System and method for detecting human body part
US11205274B2 (en) * 2018-04-03 2021-12-21 Altumview Systems Inc. High-performance visual object tracking for embedded vision systems
CN109614841B (en) * 2018-04-26 2023-04-18 杭州智诺科技股份有限公司 Rapid face detection method in embedded system
CN108710839A (en) * 2018-05-08 2018-10-26 中山大学 A kind of sentry's drowsiness intelligent monitor system based on deep learning computer vision
CN110555345B (en) * 2018-06-01 2022-06-28 赛灵思电子科技(北京)有限公司 Intelligent image analysis system and method
CN109274883B (en) * 2018-07-24 2022-02-01 广州虎牙信息科技有限公司 Posture correction method, device, terminal and storage medium
CN109034119A (en) * 2018-08-27 2018-12-18 苏州广目信息技术有限公司 A kind of method for detecting human face of the full convolutional neural networks based on optimization
CN109472193A (en) * 2018-09-21 2019-03-15 北京飞搜科技有限公司 Method for detecting human face and device
CN109359556B (en) * 2018-09-21 2021-08-06 四川长虹电器股份有限公司 Face detection method and system based on low-power-consumption embedded platform
US10733742B2 (en) 2018-09-26 2020-08-04 International Business Machines Corporation Image labeling
US11176427B2 (en) 2018-09-26 2021-11-16 International Business Machines Corporation Overlapping CNN cache reuse in high resolution and streaming-based deep learning inference engines
CN109711343A (en) * 2018-12-27 2019-05-03 北京思图场景数据科技服务有限公司 Behavioral structure method based on the tracking of expression, gesture recognition and expression in the eyes
CN109886286B (en) * 2019-01-03 2021-07-23 武汉精测电子集团股份有限公司 Target detection method based on cascade detector, target detection model and system
CN109815868B (en) * 2019-01-15 2022-02-01 腾讯科技(深圳)有限公司 Image target detection method and device and storage medium
CN112149463B (en) * 2019-06-27 2024-04-23 京东方科技集团股份有限公司 Image processing method and device
CN113111679A (en) * 2020-01-09 2021-07-13 北京君正集成电路股份有限公司 Design method of human-shaped upper half monitoring network structure
CN113673666B (en) * 2020-05-13 2023-05-23 北京君正集成电路股份有限公司 Design method of two-layer labeling network structure based on pet detection
CN112580475A (en) * 2020-12-14 2021-03-30 上海明略人工智能(集团)有限公司 Face detection method and system
CN112668465A (en) * 2020-12-25 2021-04-16 秒影工场(北京)科技有限公司 Film face extraction method based on multistage CNN

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826530B (en) * 2014-11-15 2023-06-30 北京旷视科技有限公司 Face detection using machine learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824054A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascaded depth neural network-based face attribute recognition method
CN105447473A (en) * 2015-12-14 2016-03-30 江苏大学 PCANet-CNN-based arbitrary attitude facial expression recognition method

Also Published As

Publication number Publication date
CN107506707A (en) 2017-12-22

Similar Documents

Publication Publication Date Title
CN107506707B (en) Face detection using small scale convolutional neural network module in embedded system
CN107895150B (en) Human face detection and head attitude angle evaluation based on embedded system small-scale convolution neural network module
US10268947B2 (en) Face detection using small-scale convolutional neural network (CNN) modules for embedded systems
US10467458B2 (en) Joint face-detection and head-pose-angle-estimation using small-scale convolutional neural network (CNN) modules for embedded systems
CN107909026B (en) Small-scale convolutional neural network based age and/or gender assessment method and system
US11361546B2 (en) Action recognition in videos using 3D spatio-temporal convolutional neural networks
CN107368886B (en) Neural network system based on repeatedly used small-scale convolutional neural network module
US11315253B2 (en) Computer vision system and method
US20180114071A1 (en) Method for analysing media content
US11157764B2 (en) Semantic image segmentation using gated dense pyramid blocks
WO2021169852A1 (en) System and method for real-time, simultaneous object detection and semantic segmentation
Xu et al. Fast vehicle and pedestrian detection using improved Mask R‐CNN
CN111696110B (en) Scene segmentation method and system
EP4092572A1 (en) Method for depth estimation for a variable focus camera
US11704894B2 (en) Semantic image segmentation using gated dense pyramid blocks
Moseva et al. Development of a System for Fixing Road Markings in Real Time
Kheder et al. Transfer learning based traffic light detection and recognition using CNN inception-V3 model
CN111292331B (en) Image processing method and device
KR20210147673A (en) Progressive multi-task learning method and apparatus for salient object detection
CN111476190A (en) Target detection method, apparatus and storage medium for unmanned driving
Diaz-Zapata et al. Yolo-based panoptic segmentation network
KR102681262B1 (en) Method, system and non-transitory computer-readable recording medium for managing training of model for object detection
Trinh et al. SeaDSC: A video-based unsupervised method for dynamic scene change detection in unmanned surface vehicles
EP3719702A1 (en) Training of a neural network for segmantic segmentation using additional transformed images.
JP5693670B2 (en) Image processing apparatus and image processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant