CN113034457B - Face detection device based on FPGA - Google Patents

Face detection device based on FPGA

Info

Publication number: CN113034457B
Application number: CN202110292634.9A
Authority: CN (China)
Prior art keywords: module, convolution, submodule, data, calculation
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN113034457A
Inventors: 刘亚军, 王志雄, 梁永宁
Current and original assignee: Guangzhou Suotu Intelligent Electronics Co., Ltd.
Priority date / filing date: 2021-03-18
Application filed by Guangzhou Suotu Intelligent Electronics Co., Ltd.; priority to CN202110292634.9A; publication of application CN113034457A, then grant and publication of CN113034457B.

Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06N 3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/047 — Neural networks; probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; learning methods
    • G06T 1/20 — General purpose image data processing; processor architectures, processor configuration, e.g. pipelining
    • G06T 7/90 — Image analysis; determination of colour characteristics
    • G06T 2207/20016 — Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T 2207/30201 — Subject of image: human being, person; face
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an FPGA-based face detection device, comprising a main control instruction analysis module, a neural network computation module, an image scaling module, a target frame processing module and an output module, all of which are implemented on an FPGA using a hardware description language. With the FPGA as the hardware implementation platform, the embodiments of the invention correctly detect faces in continuously acquired image frames while achieving low power consumption and high circuit design flexibility.

Description

Face detection device based on FPGA
Technical Field
The invention relates to the technical field of face detection, and in particular to a face detection device based on an FPGA (field-programmable gate array).
Background
Current face detection technology is mainly implemented in software on computing platforms such as cloud servers, where algorithm deployment and operation are costly. Convolutional-neural-network-based face detection and recognition algorithms are mostly built on frameworks such as PyTorch, TensorFlow and Caffe, running on a CPU (central processing unit) or GPU (graphics processing unit). An x86 CPU cannot fully exploit the inherent parallelism of a convolutional neural network, which limits the processing speed of the network. A GPU can fully exploit that parallelism, but its high cost and power consumption are prohibitive for the rapidly growing class of portable devices such as Internet-of-Things equipment.
Disclosure of Invention
The embodiment of the invention provides an FPGA-based face detection device. Based on MTCNN (multi-task cascaded convolutional neural network), it selects the FPGA as the implementation platform of the related algorithms, fully exploiting the parallelism of the convolutional neural network with lower power consumption and higher circuit design flexibility.
A first aspect of an embodiment of the present application provides an FPGA-based face detection device, comprising: a main control instruction analysis module, a neural network computation module, an image scaling module, a target frame processing module and an output module; all five modules are implemented on an FPGA using a hardware description language;
the main control instruction analysis module is used for parsing the configuration parameters of the current layer of the neural network, configuring the convolutional neural network required by the neural network computation module, and coordinating the target frame processing module and the image scaling module;
the neural network computation module comprises three convolutional neural networks: P-Net, R-Net and O-Net; the neural network computation module is used for screening face target frames and computing their offset data; a face target frame is obtained by mapping the coordinates whose values in the score matrix output by the convolutional neural network exceed a threshold back onto the original image;
the image scaling module is used for constructing an image pyramid of the input image and for scaling the images passed between P-Net and R-Net and between R-Net and O-Net;
the target frame processing module is used for obtaining new face target frames from the screened face target frames and the offset data;
the output module is used for generating a normalized face detection result according to the output result of the neural network computation module.
In a possible implementation manner of the first aspect, the face detection device further includes an image enhancement module, where the image enhancement module is configured to enhance the input image.
In a possible implementation manner of the first aspect, the image enhancement module specifically includes a convolution calculation sub-module and a convolution kernel selection sub-module;
the convolution calculation submodule comprises a first convolution submodule and a second convolution submodule; the first convolution submodule comprises a horizontal cell processing algorithm and a corresponding convolution kernel; the second convolution submodule comprises a bipolar cell processing algorithm and a corresponding convolution kernel;
and the convolution kernel selection submodule is used for selecting a corresponding convolution kernel according to the value range of the central pixel point of the convolution window.
In a possible implementation manner of the first aspect, the image scaling module specifically includes a main control sub-module, an output coordinate parsing sub-module, a bilinear interpolation data processing sub-module, and an odd-even BRAM group address data processing sub-module;
the bilinear interpolation data processing sub-module comprises a bilinear interpolation integer and fraction extraction sub-module and a bilinear interpolation data calculation sub-module;
the odd-even BRAM group address data processing sub-module comprises an odd-even BRAM group address parsing sub-module and an odd-even BRAM group data output sub-module.
In a possible implementation manner of the first aspect, the output module includes a Softmax calculation sub-module; the Softmax calculation submodule is a classification module and is used for normalizing input data into real numbers between 0 and 1.
In a possible implementation manner of the first aspect, the neural network computation module specifically includes a convolution data calculation sub-module, a convolution feature map storage sub-module, and a convolution data accumulation and cache sub-module;
the convolution data calculation sub-module adopts a calculation mode in which a single convolution kernel parameter is multiplied with multiple rows of the input feature map and the products are summed;
the convolution feature map storage sub-module stores feature maps in a double BRAM group form;
and the convolution data accumulation and cache sub-module is used for accumulating and summing the data of the convolution feature map storage sub-module and outputting new multi-channel feature map data.
In a possible implementation manner of the first aspect, the neural network computation module further includes a pooling calculation sub-module and a fully connected calculation sub-module;
the pooling calculation sub-module is used for compressing the size of the feature map;
the fully connected calculation sub-module is used for recombining the local features obtained by the convolution data calculation sub-module into a complete image through matrix calculation.
In a possible implementation manner of the first aspect, the target frame processing module includes an NMS calculation sub-module, and the NMS calculation sub-module is configured to select the face target frame with the maximum Score.
Compared with the prior art, the embodiment of the invention provides an FPGA-based face detection device that realizes neural-network-based face detection on an FPGA hardware platform: an image pyramid is constructed in the image scaling module from the input image; after computation by the P-Net layer in the neural network computation module, face target frames are screened and processed by NMS (non-maximum suppression); after NMS processing, R-Net performs further face target frame screening, and O-Net generates the final face target frames, with the target frame processing module participating throughout the screening process.
In addition, an image enhancement module is introduced, so that the face detection network can process face images under low-light conditions without spending extra resources on retraining the neural network, improving the detection rate of the face detection network.
Drawings
Fig. 1 is a schematic structural diagram of a face detection apparatus based on an FPGA according to an embodiment of the present invention;
fig. 2 is a schematic view of a working flow of a face detection device based on an FPGA according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a neural network computing module according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an image scaling module according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a target frame processing module according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides an FPGA-based face detection device, comprising: a main control instruction analysis module 10, a neural network computation module 20, an image scaling module 30, a target frame processing module 40 and an output module 50; all five modules are implemented on an FPGA using a hardware description language;
the main control instruction analysis module 10 is configured to parse the configuration parameters of the current layer of the neural network, configure the convolutional neural network required in the neural network computation module 20, and coordinate the target frame processing module 40 and the image scaling module 30;
the image scaling module 30 is configured to construct an image pyramid of the input image and to scale the images passed between P-Net and R-Net and between R-Net and O-Net;
the neural network computation module 20 comprises the three convolutional neural networks P-Net, R-Net and O-Net; the neural network computation module 20 is configured to screen face target frames and compute their offset data; a face target frame is obtained by mapping the coordinates whose values in the score matrix output by the convolutional neural network exceed a threshold back onto the original image;
the target frame processing module 40 is configured to obtain new face target frames from the screened face target frames and the offset data;
the output module is configured to generate a normalized face detection result according to the output result of the neural network computation module 20.
The MTCNN main network architecture is a cascade of three convolutional neural networks, namely P-Net, R-Net and O-Net, with a bounding-box target frame processing algorithm arranged after each network.
The MTCNN convolutional neural network part is designed under the guiding principles of high resource utilization, reconfigurability and instruction-based configuration, aiming at low cost and high-speed processing, so that it can be applied to miniaturized, low-cost, real-time intelligent equipment such as future 5G Internet-of-Things devices. For the three different networks of MTCNN (P-Net, R-Net and O-Net), a configurable instruction mode is adopted, so that different neural networks can be configured according to the current requirements. According to the actual design requirements, the hardware processing architecture of the whole MTCNN is divided into two major computational parts (the neural network computation module 20 and the target frame processing module 40) and one image scaling part (the image scaling module 30).
In the embodiment of the invention, BRAM (block RAM) is adopted as the on-chip storage resource of the FPGA. The overall hardware architecture also adopts innovative designs such as image line-block storage, a convolution mode whose kernel size can be configured at will, and an odd-even RAM storage scheme for image resizing, so that the whole MTCNN-based face detector can dynamically and flexibly configure its internal architecture to realize the face detection function.
The first part is the processing architecture of the neural network, with convolution and pooling operations as the main bodies; the whole neural network processing is accelerated by handling the input channels in parallel. For weight storage, a channel-parallel scheme is adopted in which the weight data of the same channel are stored in the same BRAM, accelerating the extraction of weight data. A 49-bit instruction is also defined in the top-level instruction module.
The network configuration parameters of each layer of the three neural networks P-Net, R-Net and O-Net of MTCNN are stored in BRAM, so that the whole hardware architecture can generate the required layers according to the instruction configuration.
The second part is the bounding-box data processing and modification part, which follows each layer of the neural network. Since many bounding boxes have to be processed in this part, block-BRAM access is likewise adopted to accelerate the processing. Illustratively, the image scaling module 30 is used to construct the image pyramid of the input image and to scale the input images of R-Net and O-Net to 24×24 and 48×48 respectively. The FPGA hardware architecture of the whole MTCNN face detection part thus comprises four parts: the neural network computation module 20 (mainly convolution, together with pooling, full connection and so on), the target frame processing module 40, the image scaling module 30 and the main control instruction analysis module 10.
In the hardware design of the face detection device of this embodiment, hardware processing is divided into the following main operating states according to the modules of the device, as shown in fig. 3.
S_0: waiting state. When the enable signal is pulled high, the device enters the working state and transfers to S_1.
S_1: pyramid construction state. The input image is scaled to different sizes to construct the multi-scale MTCNN input image pyramid; when construction is complete, the state transfers to S_2.
S_2: working state of the P-Net neural network in MTCNN. If the current P-Net computation is complete but not all pyramid images have been processed, the state returns to S_2; otherwise it transfers to S_3.
S_3: bounding-box target frame processing state of P-Net. When the computation is complete, the state transfers to S_4; otherwise it waits.
S_4: working state of R-Net and O-Net, including their neural networks and corresponding bounding-box parts, as well as the scaling of the image input. When the computation is complete, the state transfers to S_5; otherwise it waits.
S_5: result computation and output.
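As an illustrative software reference (not part of the hardware design), the following Python sketch models this state flow; the helper callables (enable, build_pyramid, run_pnet and so on) are hypothetical stand-ins for hardware signals and sub-modules.

```python
# Minimal software model of the face-detection control FSM (states S_0..S_5).
# All callables are hypothetical stand-ins for hardware signals/sub-modules.

def face_detection_fsm(enable, build_pyramid, run_pnet,
                       pnet_bbox_done, run_rnet_onet, output_result):
    state = "S_0"
    while True:
        if state == "S_0":                  # wait for the enable signal
            state = "S_1" if enable() else "S_0"
        elif state == "S_1":                # build the multi-scale image pyramid
            build_pyramid()
            state = "S_2"
        elif state == "S_2":                # P-Net; repeat until every scale is done
            all_scales_done = run_pnet()    # processes one pyramid scale per call
            state = "S_3" if all_scales_done else "S_2"
        elif state == "S_3":                # P-Net bounding-box processing (incl. NMS)
            state = "S_4" if pnet_bbox_done() else "S_3"
        elif state == "S_4":                # R-Net and O-Net, incl. input rescaling
            state = "S_5" if run_rnet_onet() else "S_4"
        elif state == "S_5":                # compute and output the final result
            output_result()
            return
```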
Illustratively, the face detection device further comprises an image enhancement module, and the image enhancement module is used for enhancing the input image.
The image enhancement part at the front end is a hardware implementation whose main function is to enhance dark face images before they are input to MTCNN. The whole part is written in Verilog HDL and implemented on the FPGA, and comprises the overall hardware architecture, an image preprocessing module, a convolution calculation module and a feedback regulation module. The BRAM adopted in the design is the on-chip storage resource of the FPGA. The input of this part is an R, G, B image, and the three channels are computed in parallel in an identical manner.
The hardware design of the image enhancement part follows the software algorithm, analysing which points of the algorithm can be parallelized. Exploiting the advantages of the FPGA, parallel and pipelined design ideas are adopted throughout, so that the whole image enhancement module achieves higher processing speed and lower power consumption.
Referring to fig. 3, the image enhancement module specifically includes a convolution calculation sub-module and a convolution kernel selection sub-module;
the convolution calculation submodule comprises a first convolution submodule and a second convolution submodule; the first convolution submodule comprises a horizontal cell processing algorithm and a corresponding convolution kernel; the second convolution submodule comprises a bipolar cell processing algorithm and a corresponding convolution kernel;
and the convolution kernel selection submodule is used for selecting a corresponding convolution kernel according to the value range of the central pixel point of the convolution window.
The core structure of the image enhancement module is a 15×15 first convolution sub-module and a 7×7 second convolution sub-module, corresponding respectively to the horizontal cell processing part and the bipolar cell processing part of the software algorithm.
The BRAM storage part of each layer's convolution module is partitioned into blocks according to the convolution kernel size, which improves the processing speed of the convolution part. Besides the two main convolution modules, the feedback adjustment part of the software algorithm is embodied in hardware as a group of PE computing arrays, so that this part can be designed as a pipelined architecture, increasing the processing speed of the whole hardware design. The 15×15 convolution differs from ordinary convolution in that it uses 4 candidate convolution kernels, and the kernel actually applied is selected according to the value range of the centre pixel of the current convolution window; the kernel data therefore need to be pre-cached on chip. The value ranges are determined from the mean and variance of all pixel values of the input image, so the mean and variance must be computed before the input image from the external image acquisition equipment is stored in BRAM group 1; the input data are then stored in the corresponding BRAMs of BRAM group 1.
It should be noted that the convolution calculation sub-module is the most important part of the image enhancement module. Unlike the convolution kernels of the neural network computation module 20, the two convolutions used in the retina-mechanism-based image enhancement algorithm, 15×15 and 7×7, are both larger than the kernel sizes commonly used in neural networks. Moreover, the input of each calculation is a convolution kernel corresponding to a single channel, only 15×1 (or 7×1), so the design can be simplified into a structure different from the neural network computation module 20 of MTCNN.
While the data in the sliding window are being updated, the convolution kernel selection sub-module outputs the selected convolution kernel to the data input port of the 15×15 convolution calculation array, completing the convolution calculation.
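This selection rule can be sketched in software as follows. The sketch is illustrative only: the exact partition of the value ranges (here mean ± one standard deviation) is an assumption, since the description states only that the ranges are derived from the mean and variance of the input image.

```python
import numpy as np

def select_kernel(center_pixel, mean, std, kernels):
    """Pick one of four pre-cached kernels from the centre pixel's value range.
    The mean +/- std partition is an assumed placeholder for the real ranges."""
    thresholds = [mean - std, mean, mean + std]
    for i, t in enumerate(thresholds):
        if center_pixel < t:
            return kernels[i]
    return kernels[3]

def enhance_convolve(image, kernels, ksize=15):
    # Mean and variance are computed before the image is stored in BRAM group 1.
    mean, std = image.mean(), image.std()
    pad = ksize // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            window = padded[y:y + ksize, x:x + ksize]
            kernel = select_kernel(image[y, x], mean, std, kernels)
            out[y, x] = float(np.sum(window * kernel))
    return out
```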
Referring to fig. 4, exemplarily, the image scaling module 30 specifically includes a main control sub-module, an output coordinate parsing sub-module, a bilinear interpolation data processing sub-module, and an odd-even BRAM group address data processing sub-module;
the bilinear interpolation data processing sub-module comprises a bilinear interpolation integer and fraction extraction sub-module and a bilinear interpolation data calculation sub-module;
the odd-even BRAM group address data processing sub-module comprises an odd-even BRAM group address parsing sub-module and an odd-even BRAM group data output sub-module.
Image scaling serves two major purposes: as in the previous embodiment, the first is scaling the original input image to construct the image pyramid, and the second is scaling the screened bounding boxes to the 24×24 and 48×48 input sizes of R-Net and O-Net. The image scaling module 30 therefore has two input modes, an original-image scaling mode and a bounding-box scaling mode; the specific mode is selected by the upper-layer module.
The image scaling module 30 is mainly divided into three parts: the coordinate parsing sub-module, the bilinear interpolation data processing sub-module and the odd-even BRAM group address data processing sub-module.
MTCNN differs from some other object detection algorithms in constructing an input image pyramid. Building an image pyramid of the input image at different scales allows faces of different sizes to be detected more accurately at each scale, and the key to constructing face images of different scales is image resize scaling. Image resize has two main functions: scaling the input image of P-Net to construct the image pyramid, and, in the bounding-box data processing part of each layer of the network, scaling the bounding boxes fed to R-Net and O-Net to 24×24 and 48×48. Common interpolation algorithms include nearest-neighbour interpolation and bilinear interpolation; this module adopts the bilinear interpolation algorithm, whose result is better than the alternatives while its computational complexity remains low. Its basic principle is to extend single linear interpolation to two directions, interpolating linearly along each axis, thereby obtaining the pixel value of a given point after the image is scaled.
The calculation of bilinear interpolation is governed by the ratio of the target image size to the original image size. Let src_x, src_y, src_w and src_h denote the horizontal and vertical coordinates of the original image and its width and height, and dst_x, dst_y, dst_w and dst_h the corresponding quantities of the target (scaled) image. The correspondence between target image coordinates and original image coordinates can then be expressed as:

src_x = dst_x × (src_w / dst_w)
src_y = dst_y × (src_h / dst_h)
According to the above formulas, the pixel value at a given point of the target image is mapped to a coordinate in the original image according to the ratio between the original image and the target scaled image. Suppose the target image is 6×6 and the original image is 3×3, and the pixel value of point (0, 0) of the scaled image is required. The two formulas give (0 × (3/6), 0 × (3/6)) = (0 × 0.5, 0 × 0.5) = (0, 0), i.e. the pixel value of point (0, 0) of the scaled image equals the pixel value of point (0, 0) of the original image. If the coordinate of the image to be scaled is non-zero, for example (0, 1), the corresponding original image coordinate contains a fraction, i.e. (0, 0.5). The conventional nearest-neighbour algorithm rounds this fractional coordinate to obtain an actual original image coordinate, i.e. (0, 1), but this method has a large error. The bilinear interpolation algorithm instead calculates the pixel value of the coordinate of the image to be scaled from the relation between the fractional coordinate computed above and the surrounding pixels of the actual original image, reducing the rounding error. The specific calculation is shown below.
f(p+m, q+n) = (1−m)(1−n)·f(p, q) + m(1−n)·f(p+1, q) + (1−m)n·f(p, q+1) + mn·f(p+1, q+1)
Let the floating-point coordinate obtained from the mapping formulas be (p+m, q+n), where p and q are the integer parts of the two coordinates and m and n are the corresponding fractional parts. The pixel value of the scaled-image coordinate, i.e. the pixel value corresponding to (p+m, q+n) in the original image, is computed from the values of the four original-image points (p, q), (p+1, q), (p, q+1) and (p+1, q+1), as shown in the formula. Once the virtual floating-point original-image coordinate has been calculated by the mapping formulas, the pixel value of the corresponding point of the scaled image follows directly.
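As a software reference for these formulas, the sketch below resizes a single-channel image exactly as described: each target pixel is mapped back to a fractional source coordinate, split into integer parts (p, q) and fractional parts (m, n), and blended from the four neighbouring source pixels. The clamping at the image border is an implementation assumption.

```python
import numpy as np

def bilinear_resize(src, dst_h, dst_w):
    """Scale a 2-D image to (dst_h, dst_w) by bilinear interpolation."""
    src_h, src_w = src.shape
    dst = np.empty((dst_h, dst_w), dtype=np.float64)
    for dst_y in range(dst_h):
        for dst_x in range(dst_w):
            fy = dst_y * (src_h / dst_h)        # src_y = dst_y * (src_h / dst_h)
            fx = dst_x * (src_w / dst_w)        # src_x = dst_x * (src_w / dst_w)
            p, q = int(fy), int(fx)             # integer parts
            m, n = fy - p, fx - q               # fractional parts
            p1 = min(p + 1, src_h - 1)          # clamp at the border (assumption)
            q1 = min(q + 1, src_w - 1)
            dst[dst_y, dst_x] = ((1 - m) * (1 - n) * src[p, q]
                                 + m * (1 - n) * src[p1, q]
                                 + (1 - m) * n * src[p, q1]
                                 + m * n * src[p1, q1])
    return dst
```

For example, bilinear_resize(patch, 24, 24) would produce a 24×24 R-Net input patch, and bilinear_resize(patch, 48, 48) a 48×48 O-Net input patch.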
Illustratively, the output module includes a Softmax calculation sub-module; the Softmax calculation sub-module is a classification module used to normalize input data into real numbers between 0 and 1.
The Softmax calculation sub-module normalizes the input data into real numbers between 0 and 1, i.e. the probability of each class. The Softmax principle is to apply the exponential function with base e to each input, use the sum of these exponentials as the denominator and the exponential of each individual input as the numerator, which yields the probability value of the current class.
However, the exponential function with base e cannot be computed directly in FPGA hardware, so an approximate method such as Taylor series expansion, the CORDIC algorithm or a look-up table is needed. The invention adopts the common Taylor series expansion:

e^x = 1 + x + x²/2! + x³/3! + …

Describing this formula in hardware with Verilog HDL introduces error. In MTCNN, Softmax follows the Score layer, which is obtained from the last output layer of each convolutional neural network and gives the scores of the face frame obtained from the current input feature map: a score for the face class and a score for the background class. The formula computed in this embodiment is therefore

P(face) = e^f / (e^f + e^b)

where f is the face score output by the convolutional neural network and b is the background score. Since the Softmax layer here has only two inputs, the formula can be simplified to reduce the number of calculations and the error:

P(face) = e^f / (e^f + e^b) = 1 / (1 + e^(b−f))

This transformation reduces the number of multiplications in the Taylor expansion, saving resources and reducing error: the Taylor expansion module only needs to compute e^(b−f). When applying the Taylor expansion, only its first four terms are used; to reduce the calculation error, each of the four terms is scaled by 6 (giving 6 + 6x + 3x² + x³) and the final sum is divided by 6. In this way the probability that the face frame obtained from the current feature map corresponds to a face or to the background is calculated.
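The two-class simplification and the scaled four-term Taylor expansion can be modelled in software as follows. This is an illustrative floating-point sketch (the hardware works on fixed-point data), and like any truncated Taylor series it is accurate only while |b − f| stays small.

```python
def exp_taylor4(x):
    """First four Taylor terms of e^x, scaled by 6 and divided back:
    6 * (1 + x + x^2/2 + x^3/6) = 6 + 6x + 3x^2 + x^3."""
    return (6 + 6 * x + 3 * x * x + x * x * x) / 6.0

def face_probability(f, b):
    """P(face) = 1 / (1 + e^(b - f)); f: face score, b: background score."""
    return 1.0 / (1.0 + exp_taylor4(b - f))
```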
Referring to fig. 1, exemplarily, the neural network computation module 20 specifically includes a convolution data calculation sub-module 200, a convolution feature map storage sub-module, and a convolution data accumulation and cache sub-module;
the convolution data calculation sub-module 200 adopts a calculation mode in which a single convolution kernel parameter is multiplied with multiple rows of the input feature map and the products are summed;
the convolution feature map storage sub-module stores feature maps in a double BRAM group form;
and the convolution data accumulation and cache sub-module accumulates and sums the data of the convolution feature map storage sub-module and outputs new multi-channel feature map data.
The convolution data calculation sub-module 200 performs the overall convolution operation mainly by multiplying feature map data with weight data and accumulating the results at the output stage, and it uses two large BRAM block groups to store the feature map data.
On the hardware architecture, the convolution data calculation sub-module 200 therefore adopts a calculation mode in which a single convolution kernel parameter is multiplied with multiple rows of the input feature map: the feature map storage BRAM outputs a single column of 6 rows at a time, and the weight storage BRAM outputs one weight at a time, which is multiplied with the feature map output to produce a single column of 6 rows of the output feature map.
Taking a 3×3 convolution kernel as an example, the 9 weights of the kernel are marked with 9 colours, corresponding to 9 sliding windows, and the sliding window of each colour slides to the right 6 times. Each of the 6 window positions is multiplied by the weight data of the corresponding colour to obtain 6 new rows of data, which are stored in the six Add_cache convolution data accumulation and cache sub-modules; finally the convolution outputs a 6×6 data block of the image. After this 6×6 block has been output, the first calculation of the convolution window returns to the seventh row of the feature map, the sliding windows of every colour continue sliding rightwards six times according to the same convolution sliding method, each is multiplied by the kernel data of the corresponding colour to obtain a new 6×6 block of convolution output data, and the results are again stored in the 6 Add_cache sub-modules.
This cycle repeats until the convolution of the whole feature map is complete. The traditional convolution mode, partly similar to the image enhancement convolution, slides a window of the same size as the current convolution kernel over the feature map to be convolved. That approach has a limitation: the convolution data output is bounded by the maximum kernel size. If the maximum kernel size configured in the current convolutional neural network design is 7×7, the output ports of the kernel weight data module are limited to at most 49 data outputs; if a new neural network uses a convolution larger than 7×7, the number of data ports of the currently designed module becomes a bottleneck. The method adopted in this section is not limited by kernel size, because the convolution kernel outputs only one weight datum at a time; it can adapt to kernels of any size, which greatly extends the reconfigurability of the convolution architecture. The BRAM blocking scheme further accelerates the convolution operation: each clock outputs six data and updates one column of feature map data, a clear speed improvement over a single BRAM outputting one feature map datum per clock.
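The following Python sketch models this "one weight at a time" convolution for a single channel: each kernel weight is multiplied against a 6×6 patch of the feature map and accumulated into the Add_cache accumulators, so the scheme works for any kernel size. The function names and the zero padding of partial edge blocks are illustrative assumptions.

```python
import numpy as np

def conv_block_6x6(fmap, kernel, top, left):
    """Accumulate one 6x6 output block, one kernel weight per step
    (software stand-in for the six Add_cache accumulators)."""
    kh, kw = kernel.shape
    acc = np.zeros((6, 6))
    for i in range(kh):
        for j in range(kw):
            patch = fmap[top + i: top + i + 6, left + j: left + j + 6]
            acc += kernel[i, j] * patch      # a single weight live per step
    return acc

def convolve(fmap, kernel):
    """Valid convolution of a 2-D feature map, produced in 6x6 blocks."""
    kh, kw = kernel.shape
    oh, ow = fmap.shape[0] - kh + 1, fmap.shape[1] - kw + 1
    ph, pw = (-oh) % 6, (-ow) % 6            # zero-pad so edge blocks are full
    padded = np.pad(fmap, ((0, ph), (0, pw)))
    out = np.zeros((oh + ph, ow + pw))
    for top in range(0, oh + ph, 6):         # advance six rows per band
        for left in range(0, ow + pw, 6):
            out[top:top + 6, left:left + 6] = conv_block_6x6(padded, kernel, top, left)
    return out[:oh, :ow]
```

Because only one weight is read per step, swapping in a 3×3, 7×7 or larger kernel requires no change to the data path, which is the kernel-size reconfigurability the text describes.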
Illustratively, the neural network computation module 20 further includes a pooling calculation sub-module 201 and a fully connected calculation sub-module 202;
the pooling calculation sub-module 201 is used for compressing the size of the feature map;
the fully connected calculation sub-module 202 is configured to recombine the local features obtained by the convolution data calculation sub-module into a complete image through matrix calculation.
As shown in fig. 5, the hardware design of the pooling calculation sub-module 201 consists mainly of comparators. The BRAM is stored in blocks, 6 BRAMs in total, so the pooling module can take in two comparison values per clock; the whole comparison operation completes in three clocks, and three comparators are needed in total.
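One possible software model of this comparator arrangement is sketched below; the exact pairing schedule of the three comparators across the three clocks is an assumption.

```python
def max_pool_6(values):
    """6-input maximum, reduced pairwise as three comparator stages would."""
    assert len(values) == 6
    a = max(values[0], values[1])   # pair arriving in clock 1
    b = max(values[2], values[3])   # pair arriving in clock 2
    c = max(values[4], values[5])   # pair arriving in clock 3
    return max(a, b, c)             # running maximum across the stages
```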
In the MTCNN hardware architecture designed in the embodiment of the invention, P-Net is a fully convolutional neural network architecture with no fully connected layer, while the neural networks of R-Net and O-Net each end with fully connected layers; the fully connected calculation sub-module 202 is therefore designed for the characteristics of that part. The principle of a fully connected layer is to expand the input feature map one-dimensionally into a 1×N input vector, multiply the feature map data element-wise with weight data of the same dimensionality, and sum the products to output one new feature point. In MTCNN the fully connected layers are, respectively: 64-dimensional input with 128-dimensional output; 128-dimensional input with 256-dimensional output; 128-dimensional input with 2- and 4-dimensional outputs; and 256-dimensional input with 2- and 4-dimensional outputs. Two fully-connected operation architectures are therefore designed for these network characteristics. For the first fully connected layers of the R-Net and O-Net neural networks, the feature map data of the input channels are stored in block BRAM: the maximum input channel count of the first fully connected layer is 64 for R-Net and 128 for O-Net, so the BRAM is partitioned into 128 blocks serving as the input of the fully connected feature map data. The second part covers the fully connected layers of the second stages of R-Net and O-Net, whose output results are all of size 1×N; for this part of the fully connected design, the feature map data are stored in a register group, which avoids the power consumed by writing data into BRAM and reading it back out, and reduces the consumption of BRAM resources.
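A software reference for the fully connected computation described above; the optional bias term is an illustrative assumption, since the description does not detail it.

```python
import numpy as np

def fully_connected(feature_map, weights, bias=None):
    """Flatten the feature map to a 1xN vector and take dot products
    with each weight column: (1, N) @ (N, M) -> (1, M)."""
    x = feature_map.reshape(1, -1)
    y = x @ weights
    return y + bias if bias is not None else y
```

For example, the first fully connected layer of R-Net would use weights of shape (N, 128) and that of O-Net weights of shape (N, 256), matching the 128- and 256-dimensional outputs listed above.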
Illustratively, the target frame processing module includes an NMS calculation sub-module for computing the maximum Score among the face target frames.
The criterion of area overlap commonly used by NMS in face detection is IOU, i.e. the intersection-over-union of areas. In the NMS calculation sub-module designed here, two modes are selectable for comparing area overlap: IOU and IOM (intersection over minimum).
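The two overlap criteria, and the Score-ordered suppression they drive, can be modelled as follows. The box layout (x1, y1, x2, y2, score) and the threshold value are illustrative assumptions.

```python
def overlap(a, b, mode="iou"):
    """Area overlap of two boxes (x1, y1, x2, y2, score)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    denom = min(area_a, area_b) if mode == "iom" else area_a + area_b - inter
    return inter / denom if denom > 0 else 0.0

def nms(boxes, threshold=0.7, mode="iou"):
    """Keep the highest-Score boxes, suppressing any box whose overlap
    with an already-kept box exceeds the threshold."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(overlap(box, k, mode) <= threshold for k in kept):
            kept.append(box)
    return kept
```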
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (6)

1. An FPGA-based face detection device, characterized by comprising: a main control instruction analysis module, a neural network computation module, an image scaling module, a target frame processing module and an output module; the main control instruction analysis module, the neural network computation module, the image scaling module, the target frame processing module and the output module are all implemented on an FPGA using a hardware description language;
the main control instruction analysis module is used for parsing the configuration parameters of the current layer of the neural network, configuring the convolutional neural network required by the neural network computation module, and coordinating the target frame processing module and the image scaling module;
the image scaling module is used for constructing an image pyramid of the input image and for scaling the images passed between P-Net and R-Net and between R-Net and O-Net;
the image scaling module specifically comprises a main control sub-module, an output coordinate parsing sub-module, a bilinear interpolation data processing sub-module and an odd-even BRAM group address data processing sub-module; the bilinear interpolation data processing sub-module comprises a bilinear interpolation integer and fraction extraction sub-module and a bilinear interpolation data calculation sub-module; the odd-even BRAM group address data processing sub-module comprises an odd-even BRAM group address parsing sub-module and an odd-even BRAM group data output sub-module;
the neural network computation module comprises the three convolutional neural networks P-Net, R-Net and O-Net; the neural network computation module is used for screening face target frames and computing their offset data; a face target frame is obtained by mapping the coordinates whose values in the score matrix output by the convolutional neural network exceed a threshold back onto the original image;
the neural network computation module specifically comprises a convolution data calculation sub-module, a convolution feature map storage sub-module and a convolution data accumulation and cache sub-module; the convolution data calculation sub-module adopts a calculation mode in which a single convolution kernel parameter is multiplied with multiple rows of the input feature map and the products are summed; the convolution feature map storage sub-module stores feature maps in a double BRAM group form; the convolution data accumulation and cache sub-module is used for accumulating and summing the data of the convolution feature map storage sub-module and outputting new multi-channel feature map data;
the target frame processing module is used for obtaining new face target frames from the screened face target frames and the offset data;
the output module is used for generating a normalized face detection result according to the output result of the neural network computation module.
2. The FPGA-based face detection device of claim 1, further comprising an image enhancement module for enhancing said input image.
3. The FPGA-based face detection device according to claim 2, wherein the image enhancement module specifically includes a convolution calculation sub-module and a convolution kernel selection sub-module;
the convolution calculation submodule comprises a first convolution submodule and a second convolution submodule; the first convolution submodule comprises a horizontal cell processing algorithm and a corresponding convolution kernel; the second convolution submodule comprises a bipolar cell processing algorithm and a corresponding convolution kernel;
and the convolution kernel selection submodule is used for selecting a corresponding convolution kernel according to the value range of the central pixel point of the convolution window.
4. The FPGA-based face detection device of claim 1, wherein said output module comprises a Softmax calculation sub-module; the Softmax calculation sub-module is a classification module used to normalize input data into real numbers between 0 and 1.
5. The FPGA-based face detection device of claim 1, wherein said neural network computation module further comprises a pooling calculation sub-module and a fully connected calculation sub-module;
the pooling calculation sub-module is used for compressing the size of the feature map;
the fully connected calculation sub-module is used for recombining the local features obtained by the convolution data calculation sub-module into a complete image through matrix calculation.
6. The FPGA-based face detection device of claim 1, wherein said target frame processing module comprises an NMS calculation sub-module for computing the maximum Score of said face target frames.
CN202110292634.9A (priority date 2021-03-18, filed 2021-03-18) — Face detection device based on FPGA — Active — CN113034457B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110292634.9A (CN113034457B) | 2021-03-18 | 2021-03-18 | Face detection device based on FPGA

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110292634.9A (CN113034457B) | 2021-03-18 | 2021-03-18 | Face detection device based on FPGA

Publications (2)

Publication Number | Publication Date
CN113034457A (en) | 2021-06-25
CN113034457B (en) | 2023-04-07

Family

ID=76471597

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110292634.9A (CN113034457B, Active) | Face detection device based on FPGA | 2021-03-18 | 2021-03-18

Country Status (1)

Country Link
CN (1) CN113034457B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984189A (en) * 2020-07-22 2020-11-24 深圳云天励飞技术有限公司 Neural network computing device, data reading method, data storage method and related equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10282589B2 (en) * 2017-08-29 2019-05-07 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
US10558849B2 (en) * 2017-12-11 2020-02-11 Adobe Inc. Depicted skin selection
CN109034119A (en) * 2018-08-27 2018-12-18 苏州广目信息技术有限公司 A kind of method for detecting human face of the full convolutional neural networks based on optimization
CN109543606B (en) * 2018-11-22 2022-09-27 中山大学 Human face recognition method with attention mechanism
CN111860077A (en) * 2019-04-30 2020-10-30 北京眼神智能科技有限公司 Face detection method, face detection device, computer-readable storage medium and equipment
CN110321844B (en) * 2019-07-04 2021-09-03 北京万里红科技股份有限公司 Fast iris detection method based on convolutional neural network


Also Published As

Publication number Publication date
CN113034457A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
KR102476343B1 (en) Apparatus and method for supporting neural network calculation of fixed-point numbers with relatively few digits
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
Ji et al. A hardware implementation of a radial basis function neural network using stochastic logic
Chang et al. An efficient implementation of 2D convolution in CNN
CN110780923B (en) Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN111401380B (en) RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization
Alawad et al. Stochastic-based deep convolutional networks with reconfigurable logic fabric
CN107392930A (en) A kind of quantum Canny edge detection methods
Li et al. A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
Li et al. FPGA-based hardware design for scale-invariant feature transform
CN104978749A (en) FPGA (Field Programmable Gate Array)-based SIFT (Scale Invariant Feature Transform) image feature extraction system
CN113744136A (en) Image super-resolution reconstruction method and system based on channel constraint multi-feature fusion
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
CN115272670A (en) SAR image ship instance segmentation method based on mask attention interaction
CN112017159B (en) Ground target realism simulation method under remote sensing scene
CN113034457B (en) Face detection device based on FPGA
CN113052299A (en) Neural network memory computing device based on lower communication bound and acceleration method
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN113657587B (en) Deformable convolution acceleration method and device based on FPGA
CN111062274A (en) Context-aware embedded crowd counting method, system, medium, and electronic device
Xu et al. Design and implementation of an efficient CNN accelerator for low-cost FPGAs
Cai et al. FPGA accelerator design for license plate recognition based on 1BIT convolutional neural network
CN110765413B (en) Matrix summation structure and neural network computing platform
SP et al. Evaluating Winograd Algorithm for Convolution Neural Network using Verilog

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant