CN111814814B - Single-stage target detection method based on image super-resolution network

Single-stage target detection method based on image super-resolution network

Info

Publication number
CN111814814B
CN111814814B
Authority
CN
China
Prior art keywords
network
target
convolution
targets
image
Prior art date
Legal status
Active
Application number
CN201910286446.8A
Other languages
Chinese (zh)
Other versions
CN111814814A (en)
Inventor
刘怡光
畅青
薛凯
史雪蕾
杨艳
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201910286446.8A
Publication of CN111814814A publication Critical patent/CN111814814A/en
Application granted granted Critical
Publication of CN111814814B publication Critical patent/CN111814814B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention innovatively introduces image super-resolution reconstruction technology into a target detection network and provides a novel, robust single-stage target detection method. First, a convolutional neural network performs super-resolution reconstruction on the original picture to generate a clear, high-resolution reconstructed picture; a target detection network is then built on top of the super-resolution reconstruction network; finally, large and small targets are detected on the original picture and the reconstructed picture respectively: medium and large targets, which occupy sufficient pixels, are still recognized and detected on the original picture by the neural network, while small targets are detected on the reconstructed picture and the detection results are then mapped back to the original picture. Experiments show that the method significantly improves detection precision and recall for small, blurred, and occluded targets while preserving the detection performance on large and medium targets.

Description

Single-stage target detection method based on image super-resolution network
Technical Field
The invention relates to a single-stage target detection method based on an image super-resolution network, which improves the recognition efficiency and localization accuracy of a target detection model for targets in a picture, particularly tiny targets. The invention belongs to the field of computer vision.
Background
Target detection, as a fundamental task of computer vision, has important research value in fields such as pedestrian detection, license plate recognition, and autonomous driving, and has therefore long received wide attention. At present, top-performing target detection methods almost all adopt deep convolutional network architectures and fall into two main categories. One is the two-stage target detection method, represented by Faster R-CNN, based on the candidate-region paradigm: such detectors first generate candidate regions (region proposals) and then perform object classification and position refinement on those regions. The other is the end-to-end single-stage target detection method, represented by RetinaNet, SSD, and the like, which requires no region proposal stage and instead directly predicts the class probabilities and position coordinates of targets. Whether single-stage or two-stage, these methods are developed and improved in pursuit of higher detection precision and faster detection speed.
Although target detection models and methods have developed rapidly and detection precision and recall have improved greatly, small, blurred, and occluded targets occupy too few pixels in the picture, so their information is scarce to begin with and is further eroded by the successive convolution and pooling operations of the neural network. As a result, detection precision and recall for small targets have failed to improve, which has become an important factor restricting the performance of target detection frameworks.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: use a convolutional neural network to enrich the pixel information of small, blurred, and occluded targets, and thereby greatly improve the precision and recall of target detection.
The solution of the invention is to innovatively introduce image super-resolution reconstruction technology into a target detection network. First, super-resolution reconstruction is performed on the original image with a convolutional neural network to generate a clear, high-resolution reconstructed image. Then, large and small targets are detected on the original image and the reconstructed image respectively: medium and large targets, which occupy sufficient pixels, are still recognized and detected on the original image with the neural network, while small targets are detected on the reconstructed image, and finally the results are mapped back to the original image.
The invention realizes the above solution through the following steps:
1. Build and train a super-resolution reconstruction network. After passing through this network, the original image yields a reconstructed image with richer pixel information.
2. Build a target detection skeleton network module on the feature map generated by the super-resolution reconstruction network to further extract features of the original image.
3. Perform feature extraction on the reconstructed image with a convolutional neural network to generate a feature map of the reconstructed image.
4. Build a feature pyramid module on the skeleton network and the feature map of the reconstructed image, and fuse features of different scales with the feature pyramid.
5. Build a target classification and coordinate regression network on the fused feature maps.
6. Train the network with a multi-task loss function, keeping the parameters of the image super-resolution reconstruction network fixed during training; a sketch of one possible loss follows this list.
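The patent does not specify the exact form of the multi-task loss, so the following is a minimal PyTorch sketch assuming a RetinaNet-style combination of a focal classification loss and a smooth-L1 box-regression loss; the function name `multitask_loss` and all hyperparameter values are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch of the step-6 multi-task loss; the focal/smooth-L1
# combination is an assumption, as the patent only names "a multitask loss".
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, box_deltas, cls_targets, box_targets,
                   alpha=0.25, gamma=2.0):
    """cls_logits: (N, K*A) raw scores; cls_targets: (N, K*A) 0/1 labels;
    box_deltas/box_targets: (N, 4*A) offsets for matched default boxes."""
    p = torch.sigmoid(cls_logits)
    # Focal loss: down-weights easy negatives so rare small-object positives
    # dominate the classification gradient.
    pt = torch.where(cls_targets == 1, p, 1 - p)
    w = alpha * cls_targets + (1 - alpha) * (1 - cls_targets)
    cls_loss = (-w * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).sum()
    # Smooth-L1 regression of the predicted offsets toward the target offsets.
    reg_loss = F.smooth_l1_loss(box_deltas, box_targets, reduction="sum")
    num_pos = cls_targets.sum().clamp(min=1)     # normalize by positive count
    return (cls_loss + reg_loss) / num_pos
```

Under the same assumptions, the frozen super-resolution branch of step 6 could be realized with `for p in sr_net.parameters(): p.requires_grad = False`, where `sr_net` is a hypothetical handle to the reconstruction network.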
Drawings
FIG. 1 is a detailed architecture diagram of the image super-resolution reconstruction network.
FIG. 2 is the overall network architecture diagram of the invention.
Detailed description of the preferred embodiments
The method is described in further detail below with reference to the accompanying drawings:
1. Referring to FIG. 1, the method first builds and trains an image super-resolution reconstruction network. The original image first passes through 56 5×5 convolution kernels and 64 3×3 convolution kernels for feature extraction, generating a feature map B0. Next, 12 1×1 convolution kernels compress B0 (the feature map is reduced from 64 channels to 12) to reduce network parameters. The feature mapping stage follows: 3 successive convolution operations are applied to the compressed feature map using 12 convolution kernels of size 3×3. To provide richer information for the reconstruction stage, 56 1×1 convolution kernels then expand the features (the feature map grows from 12 channels back to 56). Finally, a deconvolution kernel performs image reconstruction to generate the reconstructed image.
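For illustration, here is a minimal PyTorch sketch of this reconstruction network following the stated layer sizes; the class name `SRNet`, the PReLU activations, the 9×9 deconvolution kernel, and the 2× upscaling factor are assumptions not fixed by the patent.

```python
# Sketch of the step-1 super-resolution network, under the assumptions above.
import torch.nn as nn

class SRNet(nn.Module):
    def __init__(self, scale=2):
        super().__init__()
        self.extract = nn.Sequential(          # 56 5x5 then 64 3x3 kernels -> B0
            nn.Conv2d(3, 56, 5, padding=2), nn.PReLU(),
            nn.Conv2d(56, 64, 3, padding=1), nn.PReLU())
        self.shrink = nn.Sequential(           # 12 1x1 kernels: 64 -> 12 channels
            nn.Conv2d(64, 12, 1), nn.PReLU())
        self.map = nn.Sequential(*[            # 3 x (12 3x3 kernels) mapping stage
            layer for _ in range(3)
            for layer in (nn.Conv2d(12, 12, 3, padding=1), nn.PReLU())])
        self.expand = nn.Sequential(           # 56 1x1 kernels: 12 -> 56 channels
            nn.Conv2d(12, 56, 1), nn.PReLU())
        self.deconv = nn.ConvTranspose2d(      # deconvolution reconstructs the image
            56, 3, 9, stride=scale, padding=4, output_padding=scale - 1)

    def forward(self, x):
        b0 = self.extract(x)                   # B0 also feeds the detection backbone
        sr = self.deconv(self.expand(self.map(self.shrink(b0))))
        return sr, b0
```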
2. On B0, a residual network is built as the skeleton module of the overall detection method. The residual network (ResNet) uses skip connections, allowing the network to be deeper yet easier to optimize. Successive convolution and pooling operations in the residual blocks generate a 4-level skeleton feature map: {B1, B2, B3, B4}.
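A sketch of this backbone under stated assumptions: the patent only requires "a residual network" producing {B1, B2, B3, B4}, so torchvision's ResNet-50 stages and the stride-2 adapter convolution used here are illustrative choices.

```python
# Sketch of the step-2 residual backbone; ResNet-50 stages are an assumption.
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        # Adapter from the 64-channel B0 to the first residual stage; the
        # stride-2 downsampling here is an assumption, not stated in the patent.
        self.adapt = nn.Conv2d(64, 64, 3, stride=2, padding=1)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])

    def forward(self, b0):
        x, feats = self.adapt(b0), []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                    # collected as [B1, B2, B3, B4]
        return feats
```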
3. Referring to FIG. 2, the feature pyramid module is generated by top-down connections to the residual network. P4 is formed from B4 by a 1×1 convolution. P3 is formed by upsampling P4, adding it element-wise to B3, and applying a 3×3 convolution; P2 and P1 are generated from P3 and P2 in the same way. For P0, features are first extracted from the reconstructed picture with 64 7×7 convolution kernels, followed by a 3×3 convolution and a 2×2 pooling operation to generate S0; P1 is then upsampled, added element-wise to S0, and passed through 256 3×3 convolution kernels.
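The following sketch assembles this pyramid. The patent describes element-wise addition directly with B3, B2, B1; since channel counts must match for that addition, standard FPN-style 1×1 lateral convolutions are inserted here as an implementation assumption, along with ResNet-50 channel widths (256/512/1024/2048) and a 256-channel pyramid.

```python
# Sketch of the step-3 feature pyramid, with lateral 1x1 convs assumed.
import torch.nn as nn
import torch.nn.functional as F

class Pyramid(nn.Module):
    def __init__(self, in_chs=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_chs])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                     for _ in range(3)])
        # S0 branch on the reconstructed image: 64 7x7 kernels, a 3x3 conv
        # (assumed to output 256 channels so the addition is valid), 2x2 pool.
        self.s0 = nn.Sequential(
            nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, out_ch, 3, padding=1), nn.MaxPool2d(2))
        self.p0_conv = nn.Conv2d(out_ch, out_ch, 3, padding=1)  # 256 3x3 kernels

    def forward(self, feats, sr_img):
        b1, b2, b3, b4 = feats
        p4 = self.lateral[3](b4)                                  # P4 = 1x1 conv(B4)
        p3 = self.smooth[2](self.lateral[2](b3) + F.interpolate(p4, scale_factor=2))
        p2 = self.smooth[1](self.lateral[1](b2) + F.interpolate(p3, scale_factor=2))
        p1 = self.smooth[0](self.lateral[0](b1) + F.interpolate(p2, scale_factor=2))
        s0 = self.s0(sr_img)
        p0 = self.p0_conv(s0 + F.interpolate(p1, scale_factor=2))  # fuse S0, up(P1)
        return [p0, p1, p2, p3, p4]
```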
4. To make target localization more accurate, 9 types of default boxes are set at every position of the feature maps {P0, P1, P2, P3, P4}, corresponding to 3 different scales {2^0, 2^(1/3), 2^(2/3)} and 3 different aspect ratios {1:1, 1:2, 2:1}. Across the five levels the default boxes cover areas of {32², 64², 128², 256², 512²}. Localization of a target is in fact achieved by predicting its offset relative to the default box coordinates.
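A sketch of this default-box generation: 9 boxes per feature-map position (3 scales × 3 aspect ratios), with base areas 32² through 512² on P0 through P4. The per-level image strides are an assumption; the patent does not state them.

```python
# Sketch of the step-4 default boxes; strides 4..64 for P0..P4 are assumed.
import itertools
import torch

def default_boxes(level, feat_h, feat_w):
    base = 32 * 2 ** level                  # base size: 32, 64, ..., 512 for P0..P4
    stride = 4 * 2 ** level                 # assumed image stride of each level
    boxes = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        for s in (2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)):   # 3 scales
            for ar in (1.0, 0.5, 2.0):                   # ratios 1:1, 1:2, 2:1
                w = base * s * ar ** 0.5                 # w*h = (base*s)^2
                h = base * s / ar ** 0.5
                boxes.append([cx, cy, w, h])
    return torch.tensor(boxes)              # (feat_h * feat_w * 9, 4) as cx,cy,w,h
```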
5. A fully convolutional network predicts the target's class and its coordinate offsets relative to the default boxes. Class prediction: 256 3×3 convolution kernels further extract features from {P0, P1, P2, P3, P4}, a convolution with K×A 3×3 kernels follows, and a sigmoid activation yields the final class scores, where A = 9 is the number of default box types at each level and K is the number of target classes. Position prediction: 256 3×3 convolution kernels further extract features from {P0, P1, P2, P3, P4}, and a convolution with 4×A kernels yields the coordinate offsets of the target relative to each default box. Adding these offsets to the corresponding default box coordinates gives the network's predicted boxes for targets in the image; note that prediction boxes generated on feature map P0 must be mapped back to the original image.
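A sketch of these prediction heads follows. Sharing the head weights across pyramid levels is the usual single-stage convention and, like the example class count K = 80, is an assumption rather than something the patent specifies.

```python
# Sketch of the step-5 classification and regression heads.
import torch
import torch.nn as nn

class Heads(nn.Module):
    def __init__(self, in_ch=256, num_classes=80, A=9):  # K = num_classes (assumed)
        super().__init__()
        self.cls = nn.Sequential(              # 256 3x3 kernels, then K*A 3x3 kernels
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_classes * A, 3, padding=1))
        self.reg = nn.Sequential(              # 256 3x3 kernels, then 4*A kernels
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 4 * A, 3, padding=1))

    def forward(self, pyramid):
        scores, deltas = [], []
        for p in pyramid:                      # same heads applied to P0..P4
            scores.append(torch.sigmoid(self.cls(p)))  # sigmoid class scores
            deltas.append(self.reg(p))                  # offsets vs. default boxes
        return scores, deltas
```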
6. Result filtering. The prediction boxes generated by the convolutional network include a large number of invalid boxes. The class-prediction scores are therefore sorted, only the 300 highest-scoring prediction boxes are kept, and non-maximum suppression with a threshold of 0.5 is applied to produce the final prediction result.
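A sketch of this filtering step, keeping the 300 highest-scoring boxes and applying non-maximum suppression at IoU 0.5; torchvision's `nms` is used here for brevity.

```python
# Sketch of the step-6 filtering: top-300 selection followed by NMS at 0.5.
import torch
from torchvision.ops import nms

def filter_predictions(boxes, scores, top_k=300, iou_thresh=0.5):
    """boxes: (N, 4) in x1,y1,x2,y2 format; scores: (N,) class scores."""
    scores, order = scores.sort(descending=True)
    keep_top = order[:top_k]                  # only the 300 best candidates
    boxes, scores = boxes[keep_top], scores[:top_k]
    keep = nms(boxes, scores, iou_thresh)     # suppress overlapping duplicates
    return boxes[keep], scores[keep]
```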

Claims (1)

1. A single-stage target detection method based on an image super-resolution network, characterized by comprising the following steps: first, super-resolution reconstruction is performed on the original image with a convolutional neural network to generate a clear, high-resolution reconstructed image; then, large and small targets are detected on the original image and the reconstructed image respectively: medium and large targets, which occupy sufficient pixels, are still recognized and detected on the original image with the neural network, small targets are detected on the reconstructed image, and finally the results are mapped back to the original image; the specific steps are as follows:
(1) build and train an image super-resolution reconstruction network: extract features from the original image with 56 5×5 convolution kernels and 64 3×3 convolution kernels to generate a feature map B0; next, compress B0 with 12 1×1 convolution kernels to reduce network parameters; then enter the feature mapping stage: apply 3 convolution operations to the compressed feature map with 12 convolution kernels of size 3×3; expand the features with 56 1×1 convolution kernels; finally, perform image reconstruction with a deconvolution kernel to generate the reconstructed image;
(2) build a residual network on the feature map generated by the super-resolution reconstruction network, generating a 4-level skeleton feature map: {B1, B2, B3, B4};
(3) extract features from the reconstructed image to generate a feature map S0; through top-down and lateral connections, build a feature pyramid structure {P0, P1, P2, P3, P4} on {S0, B1, B2, B3, B4}, wherein: P4 is formed from B4 by a 1×1 convolution; P3 is formed by upsampling P4, adding it element-wise to B3, and applying a 3×3 convolution; P2 and P1 are generated in the same way; for P0, first extract features from the reconstructed picture with 64 7×7 convolution kernels, then apply a 3×3 convolution and a 2×2 pooling operation to generate S0, then upsample P1, add it element-wise to S0, and apply 256 3×3 convolution kernels;
(4) set 9 types of default boxes at every position of the feature maps {P0, P1, P2, P3, P4}, corresponding to 3 different scales {2^0, 2^(1/3), 2^(2/3)} and 3 different aspect ratios {1:1, 1:2, 2:1}, the default boxes covering areas of {32², 64², 128², 256², 512²}; predict the target's class and its coordinate offsets relative to the default boxes with a fully convolutional network; class prediction: further extract features from {P0, P1, P2, P3, P4} with 256 3×3 convolution kernels, convolve with K×A 3×3 kernels, and obtain the final class scores with a sigmoid activation, where A = 9 is the number of default box types at each level and K is the number of target classes; position prediction: further extract features from {P0, P1, P2, P3, P4} with 256 3×3 convolution kernels, then convolve with 4×A kernels to obtain the coordinate offsets of the target relative to each default box; add the offsets to the corresponding default box coordinates to obtain the network's predicted boxes for targets in the image, and map the prediction boxes generated on feature map P0 back to the original image;
(5) sort the class-prediction scores, keep only the 300 highest-scoring prediction boxes, and then apply non-maximum suppression with a threshold of 0.5 to generate the final prediction result;
combining steps (1), (2), (3), (4), and (5) completes the construction of the whole method.
CN201910286446.8A 2019-04-10 2019-04-10 Single-stage target detection method based on image super-resolution network Active CN111814814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910286446.8A CN111814814B (en) 2019-04-10 2019-04-10 Single-stage target detection method based on image super-resolution network


Publications (2)

Publication Number Publication Date
CN111814814A CN111814814A (en) 2020-10-23
CN111814814B (en) 2022-04-12

Family

ID=72844221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910286446.8A Active CN111814814B (en) 2019-04-10 2019-04-10 Single-stage target detection method based on image super-resolution network

Country Status (1)

Country Link
CN (1) CN111814814B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1804657A (en) * 2006-01-23 2006-07-19 武汉大学 Small target super resolution reconstruction method for remote sensing image
CN105139339A (en) * 2015-07-27 2015-12-09 中国人民解放军陆军军官学院 Polarization image super-resolution reconstruction method based on multi-level filtering and sample matching
CN105389797A (en) * 2015-10-16 2016-03-09 西安电子科技大学 Unmanned aerial vehicle video small-object detecting method based on super-resolution reconstruction
CN108171656A (en) * 2018-01-12 2018-06-15 西安电子科技大学 Adaptive Global Dictionary remote sensing images ultra-resolution method based on rarefaction representation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070083114A1 (en) * 2005-08-26 2007-04-12 The University Of Connecticut Systems and methods for image resolution enhancement


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Scale-Transferrable Object Detection; Peng Zhou et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-17; pp. 528-537 *
Development of Deep Convolutional Neural Networks and Their Applications in Computer Vision; Zhang Shun et al.; Chinese Journal of Computers; 2017-09-18; vol. 42, no. 3; pp. 453-482 *
Research on Enhancement and Super-Resolution Reconstruction of Spatial Motion Image Sequences; Cao Shouxin; China Master's Theses Full-text Database, Information Science and Technology; 2015-08-15; no. 8; I138-1144 *

Also Published As

Publication number Publication date
CN111814814A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN114202672A (en) Small target detection method based on attention mechanism
CN112270280B (en) Open-pit mine detection method in remote sensing image based on deep learning
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN111626176B (en) Remote sensing target rapid detection method and system based on dynamic attention mechanism
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN110991444B (en) License plate recognition method and device for complex scene
CN112069868A (en) Unmanned aerial vehicle real-time vehicle detection method based on convolutional neural network
CN111738344A (en) Rapid target detection method based on multi-scale fusion
Li et al. Automatic bridge crack identification from concrete surface using ResNeXt with postprocessing
CN111797846B (en) Feedback type target detection method based on characteristic pyramid network
Wang et al. Dual transfer learning for event-based end-task prediction via pluggable event to image translation
US20220366682A1 (en) Computer-implemented arrangements for processing image having article of interest
CN113111875A (en) Seamless steel rail weld defect identification device and method based on deep learning
CN111932511A (en) Electronic component quality detection method and system based on deep learning
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN116994047A (en) Small sample image defect target detection method based on self-supervision pre-training
CN115294563A (en) 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability
CN116152226A (en) Method for detecting defects of image on inner side of commutator based on fusible feature pyramid
CN114972780A (en) Lightweight target detection network based on improved YOLOv5
CN111814814B (en) Single-stage target detection method based on image super-resolution network
CN113673478B (en) Port large-scale equipment detection and identification method based on deep learning panoramic stitching
CN112001479B (en) Processing method and system based on deep learning model and electronic equipment
Bao et al. YED-YOLO: An object detection algorithm for automatic driving
Sun et al. CGCANet: Context-Guided Cost Aggregation Network for Robust Stereo Matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant