CN113111736A - Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN - Google Patents

Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN

Info

Publication number
CN113111736A
Authority
CN
China
Prior art keywords
output
feature
layer
convolution
pyramid
Prior art date
Legal status
Pending
Application number
CN202110325504.0A
Other languages
Chinese (zh)
Inventor
包晓安
马铉钧
包梓群
邵一鸣
马云龙
许铭洋
张娜
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110325504.0A priority Critical patent/CN113111736A/en
Publication of CN113111736A publication Critical patent/CN113111736A/en
Pending legal-status Critical Current

Classifications

    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V2201/07 Target detection

Abstract

The invention discloses a multi-level feature pyramid target detection method based on depthwise separable convolution and a fused PAN structure, belonging to the field of target detection. The method comprises the following steps: 1) data acquisition: acquire video of the target to be detected and slice it, converting the continuous video into a sequence of images; 2) preprocess the images; 3) perform target detection on the preprocessed images, obtaining a multi-scale fused feature map of each image using a multi-level feature pyramid network built from depthwise separable convolutions and a fused PAN structure; 4) predefine detection boxes of multiple aspect ratios and scales according to the receptive-field sizes of the multi-scale fused feature map, and use these boxes to locate and classify targets, achieving high-precision detection of multi-scale targets. The invention improves the feature pyramid: the network is made deeper while the parameter count and computation are reduced, and multi-scale fused features are obtained, improving both the accuracy and the efficiency of target detection.

Description

Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN
Technical Field
The invention belongs to the field of target detection, and particularly relates to a multi-level feature pyramid target detection method based on depthwise separable convolution and a fused PAN structure.
Background
With the growth of computing power, in particular the adoption of graphics processors and the development of deep learning, convolutional neural networks have advanced rapidly in the field of target detection. Since the beginning of the 21st century, video image processing, which depends heavily on computing power, has likewise developed greatly. However, enormous computation and parameter counts weigh heavily on image processing, and traditional convolutional neural networks and feature pyramid networks struggle with images containing multiple target objects. For images with severe occlusion, blur, or targets of widely varying size due to viewing distance, a traditional pyramid network not only requires many parameters and much computation, but its detections are not necessarily accurate.
To address the problem of target detection in video images, the multi-level feature pyramid detection method based on depthwise separable convolution and a fused PAN structure proposed by the invention can greatly reduce the computation and parameter counts while also improving the performance of detecting target objects.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a multi-level feature pyramid target detection method based on depthwise separable convolution and a fused PAN structure for handling target detection in videos and images.
To achieve this purpose, the invention adopts the following technical scheme:
A multi-level feature pyramid target detection method based on depthwise separable convolution and a fused PAN structure comprises the following steps:
1) data acquisition: acquire video of the target to be detected and slice it, converting the continuous video into a sequence of images;
2) preprocess the images;
3) perform target detection on the preprocessed images, obtaining a multi-scale fused feature map of each image using a multi-level feature pyramid network with depthwise separable convolution and a fused PAN structure;
the multi-level feature pyramid network with depthwise separable convolution and fused PAN comprises a backbone network and a multi-level FPN with a PAN structure; the backbone downsamples the input image to obtain feature maps of different sizes, using depthwise separable convolution in each downsampling step; the feature maps are fused via upsampling into a fused feature map containing features of different depths, which is sent to the multi-level FPN with the PAN structure;
the multi-level FPN with the PAN structure is formed by connecting several structurally identical feature pyramids in series; the downsampling layers of each pyramid consist of depthwise separable convolutions, and the upsampling layers consist of depthwise separable convolutions plus upsampling convolutions; the input of the first pyramid is the fused feature map output by the backbone, and for each subsequent pyramid this fused feature map is concatenated along the channel dimension with the output of the last upsampling layer of the preceding pyramid to form its input; different pyramids extract features of different depths, and the outputs of all pyramid levels are concatenated along the channel dimension to obtain the multi-scale fused feature map;
4) predefine detection boxes of multiple aspect ratios and scales according to the receptive-field sizes of the multi-scale fused feature map, and use them to locate and classify targets, achieving high-precision detection of multi-scale targets.
Compared with the prior art, the invention has the beneficial effects that:
the invention discloses a method for detecting a target by using a multi-stage feature pyramid fused with a PAN structure based on depth separable convolution. And extracting a feature map by using downsampling, detecting a target by using a multi-stage feature pyramid network, and adding a PAN structure behind the FPN of each stage. The network structure of the deep separable convolution replaces the original convolution network structure, so that the network depth can be deepened, and the parameter quantity and the calculated quantity are reduced. The multi-stage feature pyramid network is composed of a plurality of feature pyramids which are identical in structure and use depth separable convolution, the feature pyramids are connected in series, features with the same size and obtained by different pyramids are fused, and detection is carried out by utilizing the fused feature pyramids. The PAN structure is characterized in that a bottom-up characteristic pyramid is added behind the FPN layer, so that the accuracy and efficiency of target detection can be improved.
Drawings
FIG. 1 is a schematic diagram of the depthwise separable convolution structure employed in the present invention;
FIG. 2 is a schematic diagram of the multi-level feature pyramid structure employed in the present invention;
FIG. 3 is a schematic diagram of target detection with the multi-level feature pyramid of depthwise separable convolution and fused PAN in the present invention.
Detailed Description
The invention is further explained below with reference to the drawings.
The invention comprises two main parts: a backbone network and an improved multi-level feature pyramid network. Backbone network: the input image is downsampled to obtain feature maps of different sizes, each downsampling step using a depthwise separable convolution; the features are then fused via upsampling into feature maps containing features of different depths, and the fused feature maps are fed into the improved multi-level pyramid. Improved multi-level pyramid network: this part consists of several structurally identical feature pyramids, each of which outputs three features of different sizes; the features of the different pyramids are fused and the target object is detected.
Depthwise separable convolution:
A depthwise separable convolution proceeds in two steps: a channel-by-channel (depthwise) convolution followed by a point-by-point (pointwise) convolution. Consider an M × M, three-channel color input image. The first step is the channel-by-channel convolution: the number of convolution kernels equals the number of input channels, so with 3 × 3 kernels this step produces three feature maps. The second step is the point-by-point convolution, whose input is the output of the first step; its kernels have size 1 × 1 × 3, where 3 is the number of input channels of this step, and the number of output channels is L (the pointwise part has exactly as many kernels as output channels). The parameter count of the depthwise separable convolution is therefore 3 × 3 × 3 + 1 × 1 × 3 × L = 27 + 3L, whereas an ordinary convolution with K × K kernels would require K × K × 3 × L parameters (27L for K = 3). As the number of channels and the kernel size grow, the savings become increasingly significant. See FIG. 1.
Improved multi-level feature pyramid network:
The main role of the multi-level feature pyramid is to fuse multiple processed feature maps, enhancing detection performance and reducing the miss rate. This part is formed by connecting several structurally identical feature pyramid networks in series; the specific serial connection is shown in FIG. 2. A bottom-up pyramid structure (the PAN structure) is then appended behind each feature pyramid to further process the feature maps. Except for the first feature pyramid, the input feature map of each pyramid is obtained by concatenating, along the channel dimension, the last layer of the preceding FPN-structured pyramid with the output of the backbone network; different pyramids extract features of different depths, and every pyramid is built from depthwise separable convolutions. Each convolution layer is followed by batch normalization and a rectified linear unit (ReLU) as the activation function.
The invention provides a multi-level feature pyramid target detection method based on depthwise separable convolution and a fused PAN structure, comprising the following steps:
1) data acquisition: acquire video of the target to be detected and slice it, converting the continuous video into a sequence of images;
2) preprocess the images;
3) perform target detection on the preprocessed images, obtaining a multi-scale fused feature map of each image using a multi-level feature pyramid network with depthwise separable convolution and a fused PAN structure;
the multi-level feature pyramid network with depthwise separable convolution and fused PAN comprises a backbone network and a multi-level FPN with a PAN structure; the backbone downsamples the input image to obtain feature maps of different sizes, using depthwise separable convolution in each downsampling step; the feature maps are fused via upsampling into a fused feature map containing features of different depths, which is sent to the multi-level FPN with the PAN structure;
the multi-level FPN with the PAN structure is formed by connecting several structurally identical feature pyramids in series; the downsampling layers of each pyramid consist of depthwise separable convolutions, and the upsampling layers consist of depthwise separable convolutions plus upsampling convolutions; the input of the first pyramid is the fused feature map output by the backbone, and for each subsequent pyramid this fused feature map is concatenated along the channel dimension with the output of the last upsampling layer of the preceding pyramid to form its input; different pyramids extract features of different depths, and the outputs of all pyramid levels are concatenated along the channel dimension to obtain the multi-scale fused feature map;
4) predefine detection boxes of multiple aspect ratios and scales according to the receptive-field sizes of the multi-scale fused feature map, and use them to locate and classify targets, achieving high-precision detection of multi-scale targets.
In one embodiment of the present invention, the preprocessing in step 2) is a filtering method (median filtering): a square region is taken centered on each pixel of the picture, the gray values of the pixels in the region are sorted, the middle value of the sorted sequence becomes the new gray value of the center pixel, and the image is traversed with this sliding window.
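A minimal sketch of this preprocessing step in NumPy. The 3 × 3 window size and the replicate padding at the borders are assumptions; the patent does not fix the size of the square region or the border handling.

```python
import numpy as np

def median_filter(img: np.ndarray, size: int = 3) -> np.ndarray:
    """Slide a size x size window over img; each center pixel becomes the
    sorted middle (median) gray value of its square neighbourhood."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")   # replicate-pad the borders
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            window = padded[y:y + size, x:x + size]
            out[y, x] = np.median(window)    # sorted middle value of the region
    return out

noisy = np.array([[10, 10, 10],
                  [10, 255, 10],    # a single impulse-noise pixel
                  [10, 10, 10]], dtype=np.uint8)
clean = median_filter(noisy)
print(clean[1, 1])   # the outlier is replaced by the neighbourhood median
```

The double Python loop is for clarity; a production version would use `scipy.ndimage.median_filter` or a vectorized sliding-window view.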
In one embodiment of the present invention, the depthwise separable convolution comprises an input layer, a channel-by-channel convolution layer, a point-by-point convolution layer, and an output layer. The input is a three-channel image. First the channel-by-channel convolution is applied: three kernels convolve the three channels separately, producing three feature maps. Then the point-by-point convolution applies three-dimensional 1 × 1 kernels to these three feature maps, combining them into one output feature map per pointwise kernel.
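The block described above maps naturally onto PyTorch's `groups=` mechanism for the channel-by-channel step and a 1 × 1 convolution for the point-by-point step. This is a sketch, not the patent's exact layer: the placement of the batch normalization and ReLU after each convolution follows the earlier description, and the stride and channel counts are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            # channel-by-channel: one 3x3 kernel per input channel (groups=c_in)
            nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1,
                      groups=c_in, bias=False),
            nn.BatchNorm2d(c_in),
            nn.ReLU(inplace=True),
            # point-by-point: 1x1 kernels mixing the channels into c_out maps
            nn.Conv2d(c_in, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

x = torch.randn(1, 3, 64, 64)            # a three-channel input image
y = DepthwiseSeparableConv(3, 64)(x)
print(y.shape)                           # spatial size preserved, 64 channels out
```

The convolution weights here total 3·3·3 + 3·64 = 219 parameters, matching the count derived in the description.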
In one embodiment of the present invention, the backbone network consists of four convolution layers and two upsampling layers. The preprocessed image passes through the four convolution layers in sequence; the output of the fourth convolution layer feeds the first upsampling layer; the output of the first upsampling layer is concatenated along the channel dimension with the output of the third convolution layer and used as the input of the second upsampling layer; and the output of the second upsampling layer is concatenated along the channel dimension with the output of the third convolution layer to form the output of the backbone network.
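A shape-level sketch of this backbone. The text names "the third convolution layer" for both concatenations; for the spatial sizes to match after upsampling, the second concatenation is taken here to use the second convolution layer instead, which is an assumption. Plain `Conv2d`/`Upsample` layers and the channel width are stand-ins for the depthwise separable blocks of the patent.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    def __init__(self, c: int = 32):
        super().__init__()
        # four stride-2 convolution layers: 1/2, 1/4, 1/8, 1/16 resolution
        self.convs = nn.ModuleList(
            nn.Conv2d(3 if i == 0 else c, c, 3, stride=2, padding=1)
            for i in range(4)
        )
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        for conv in self.convs:
            x = conv(x)
            outs.append(x)
        u1 = self.up(outs[3])                            # 1/16 -> 1/8
        u2 = self.up(torch.cat([u1, outs[2]], dim=1))    # concat at 1/8, up to 1/4
        # assumed: second concat uses conv layer 2 so spatial sizes agree
        return torch.cat([u2, outs[1]], dim=1)           # fused map at 1/4

y = Backbone()(torch.randn(1, 3, 64, 64))
print(y.shape)   # channels accumulate: 32 + 32 from u1/out3, + 32 from out2
```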
In one specific implementation of the present invention, the multi-level FPN with the PAN structure is formed by connecting several structurally identical feature pyramids in series, each comprising an input layer, four downsampling layers, and two upsampling layers;
after the input layer of a feature pyramid receives an image, it is processed by the first and second downsampling layers in sequence; the output of the second downsampling layer is the input of the first upsampling layer; the output of the first upsampling layer is added element-wise to the output of the first downsampling layer, and the sum is the input of the second upsampling layer; the output of the second upsampling layer is added element-wise to the image received by the input layer, and the sum is both output as the first feature map and used as the input of the third downsampling layer; the output of the third downsampling layer is added element-wise to the input of the second upsampling layer, and the sum is both output as the second feature map and used as the input of the fourth downsampling layer; the output of the fourth downsampling layer is added element-wise to the output of the second downsampling layer, and the sum is output as the third feature map;
the output of the last upsampling layer of one pyramid is concatenated along the channel dimension with the output of the backbone network and used as the input image of the next pyramid's input layer; each pyramid outputs three feature maps of different sizes, and the maps of corresponding size are concatenated along the channel dimension to obtain the final multi-scale fused feature map.
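The wiring of one pyramid stage described above (a top-down path followed by a bottom-up PAN path with element-wise additions) can be sketched at the shape level as follows. A constant channel width is assumed so the additions are valid, and plain convolutions stand in for the depthwise separable blocks of the patent.

```python
import torch
import torch.nn as nn

class PyramidStage(nn.Module):
    """One feature pyramid: 2 top-down downsampling layers, 2 upsampling
    layers, then a bottom-up (PAN) path of 2 more downsampling layers."""
    def __init__(self, c: int):
        super().__init__()
        def down():
            return nn.Conv2d(c, c, 3, stride=2, padding=1)
        self.d1, self.d2, self.d3, self.d4 = down(), down(), down(), down()
        self.u1 = nn.Upsample(scale_factor=2)
        self.u2 = nn.Upsample(scale_factor=2)

    def forward(self, x):
        d1 = self.d1(x)          # 1/2 resolution
        d2 = self.d2(d1)         # 1/4 resolution
        s = self.u1(d2) + d1     # top-down fusion at 1/2 (input of 2nd upsample)
        f1 = self.u2(s) + x      # first output feature map, full resolution
        f2 = self.d3(f1) + s     # PAN path: second output at 1/2
        f3 = self.d4(f2) + d2    # third output at 1/4
        return f1, f2, f3

f1, f2, f3 = PyramidStage(16)(torch.randn(1, 16, 32, 32))
print(f1.shape, f2.shape, f3.shape)   # three scales: 1/1, 1/2, 1/4
```

Each addition pairs tensors of matching resolution, which is why the PAN downsampling layers halve the size exactly as the top-down layers did.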
In one embodiment of the present invention, the detection boxes in step 4) are handled by a target detector such as Mask R-CNN or RetinaNet.
In one embodiment of the invention, the loss function is optimized during training with the stochastic gradient descent (SGD) algorithm.
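A minimal illustration of SGD optimization in PyTorch, on a stand-in linear model with an MSE loss; the actual network, loss function, and hyperparameters (learning rate, number of steps) are not fixed by the patent and are assumptions here.

```python
import torch

torch.manual_seed(0)                      # make the toy run deterministic
model = torch.nn.Linear(4, 2)             # stand-in for the detection network
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, target = torch.randn(8, 4), torch.randn(8, 2)
losses = []
for _ in range(20):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()                       # gradients of the loss w.r.t. parameters
    opt.step()                            # one stochastic gradient descent update
    losses.append(loss.item())
print(losses[0], losses[-1])              # the loss decreases over the run
```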
In one embodiment of the present invention, as shown in fig. 3, the implementation process is as follows:
(1) The input image is downsampled by factors of 8, 16, and 32 to obtain feature maps, each downsampling step using a depthwise separable convolution network; the figure takes these three scales as an example.
(2) The features are fused via upsampling to obtain a feature map containing features of different depths.
(3) The fused feature map is fed into the improved multi-level pyramid.
(4) The first feature pyramid extracts features from the feature map and outputs three feature maps of different sizes; the output of the last layer of its FPN structure is concatenated along the channel dimension with the output of the backbone to form the input of the next feature pyramid. Every feature pyramid is built from depthwise separable convolutions.
(5) When the serially connected feature pyramids have finished feature extraction, each pyramid has output three feature maps of different sizes. The maps of corresponding size are concatenated along the channel dimension, and the channel-wise concatenated and fused features are:
Xi = Concat(Xi1, Xi2, Xi3, ..., Xin), i = 1, 2, 3
where Xin denotes the ith feature map output by the nth feature pyramid (so Xi1 comes from the first pyramid), Xi is the ith fused feature map, and n is the number of pyramids in the multi-level feature pyramid. In this way the features of all pyramids are fused, improving the performance of detecting the target object.
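The channel-direction fusion Xi = Concat(Xi1, ..., Xin) is a single `torch.cat` per scale. Below, n = 2 pyramids each produce i = 3 feature maps; the channel width and spatial sizes are illustrative only.

```python
import torch

sizes = [(32, 32), (16, 16), (8, 8)]                       # three output scales
pyramid1 = [torch.randn(1, 16, h, w) for h, w in sizes]    # X11, X21, X31
pyramid2 = [torch.randn(1, 16, h, w) for h, w in sizes]    # X12, X22, X32

# Xi = Concat(Xi1, Xi2) along the channel dimension, for i = 1, 2, 3
fused = [torch.cat([a, b], dim=1) for a, b in zip(pyramid1, pyramid2)]
print([tuple(t.shape) for t in fused])   # channels add: 16 + 16 = 32 per scale
```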
The foregoing merely illustrates specific embodiments of the invention. Obviously, the invention is not limited to these embodiments, and many variations are possible. Any modification that a person skilled in the art can derive or infer directly from the disclosure of the present invention is considered within the scope of the invention.

Claims (7)

1. A multi-level feature pyramid target detection method based on depthwise separable convolution and a fused PAN structure, characterized by comprising the following steps:
1) data acquisition: acquiring video of the target to be detected and slicing it, converting the continuous video into a sequence of images;
2) preprocessing the images;
3) performing target detection on the preprocessed images, obtaining a multi-scale fused feature map of each image using a multi-level feature pyramid network with depthwise separable convolution and a fused PAN structure;
the multi-level feature pyramid network with depthwise separable convolution and fused PAN comprising a backbone network and a multi-level FPN with a PAN structure; the backbone downsampling the input image to obtain feature maps of different sizes, using depthwise separable convolution in each downsampling step; the feature maps being fused via upsampling into a fused feature map containing features of different depths, which is sent to the multi-level FPN with the PAN structure;
the multi-level FPN with the PAN structure being formed by connecting several structurally identical feature pyramids in series; the downsampling layers of each pyramid consisting of depthwise separable convolutions, and the upsampling layers consisting of depthwise separable convolutions plus upsampling convolutions; the input of the first pyramid being the fused feature map output by the backbone, and for each subsequent pyramid this fused feature map being concatenated along the channel dimension with the output of the last upsampling layer of the preceding pyramid to form its input; different pyramids extracting features of different depths, and the outputs of all pyramid levels being concatenated along the channel dimension to obtain the multi-scale fused feature map;
4) predefining detection boxes of multiple aspect ratios and scales according to the receptive-field sizes of the multi-scale fused feature map, and using them to locate and classify targets, achieving high-precision detection of multi-scale targets.
2. The multi-level feature pyramid target detection method based on depthwise separable convolution and fused PAN according to claim 1, wherein the preprocessing in step 2) is a filtering method: a square region is taken centered on each pixel of the picture, the gray values of the pixels in the region are sorted, the middle value of the sorted sequence becomes the new gray value of the center pixel, and the image is traversed with this sliding window.
3. The multi-level feature pyramid target detection method based on depthwise separable convolution and fused PAN according to claim 1, wherein the depthwise separable convolution comprises an input layer, a channel-by-channel convolution layer, a point-by-point convolution layer, and an output layer; the input of the input layer is a three-channel image; first the channel-by-channel convolution is applied, three kernels convolving the three channels separately and producing three feature maps; then the point-by-point convolution applies three-dimensional 1 × 1 kernels to the three feature maps, combining them into one output feature map per pointwise kernel.
4. The multi-level feature pyramid target detection method based on depthwise separable convolution and fused PAN according to claim 1, wherein the backbone network consists of four convolution layers and two upsampling layers; the preprocessed image passes through the four convolution layers in sequence; the output of the fourth convolution layer feeds the first upsampling layer; the output of the first upsampling layer is concatenated along the channel dimension with the output of the third convolution layer and used as the input of the second upsampling layer; and the output of the second upsampling layer is concatenated along the channel dimension with the output of the third convolution layer to form the output of the backbone network.
5. The multi-level feature pyramid target detection method based on depthwise separable convolution and fused PAN according to claim 1, wherein the multi-level FPN with the PAN structure is formed by connecting several structurally identical feature pyramids in series, each comprising an input layer, four downsampling layers, and two upsampling layers;
after the input layer of a feature pyramid receives an image, it is processed by the first and second downsampling layers in sequence; the output of the second downsampling layer is the input of the first upsampling layer; the output of the first upsampling layer is added element-wise to the output of the first downsampling layer, and the sum is the input of the second upsampling layer; the output of the second upsampling layer is added element-wise to the image received by the input layer, and the sum is both output as the first feature map and used as the input of the third downsampling layer; the output of the third downsampling layer is added element-wise to the input of the second upsampling layer, and the sum is both output as the second feature map and used as the input of the fourth downsampling layer; the output of the fourth downsampling layer is added element-wise to the output of the second downsampling layer, and the sum is output as the third feature map;
the output of the last upsampling layer of one pyramid is concatenated along the channel dimension with the output of the backbone network and used as the input image of the next pyramid's input layer; each pyramid outputs three feature maps of different sizes, and the maps of corresponding size are concatenated along the channel dimension to obtain the final multi-scale fused feature map:
Xi = Concat(Xi1, Xi2, Xi3, ..., Xin), i = 1, 2, 3
where Xin denotes the ith feature map output by the nth feature pyramid, Xi is the ith fused feature map, and n is the number of pyramids in the multi-level feature pyramid.
6. The multi-level feature pyramid target detection method based on depthwise separable convolution and fused PAN according to claim 1, wherein the detection boxes in step 4) are handled by a target detector, the target detector being Mask R-CNN or RetinaNet.
7. The multi-level feature pyramid target detection method based on depthwise separable convolution and fused PAN according to claim 1, wherein the loss function is optimized during training with the stochastic gradient descent algorithm.
CN202110325504.0A 2021-03-26 2021-03-26 Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN Pending CN113111736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110325504.0A CN113111736A (en) 2021-03-26 2021-03-26 Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN

Publications (1)

Publication Number Publication Date
CN113111736A true CN113111736A (en) 2021-07-13

Family

ID=76712337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110325504.0A Pending CN113111736A (en) 2021-03-26 2021-03-26 Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN

Country Status (1)

Country Link
CN (1) CN113111736A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511515A (en) * 2022-01-17 2022-05-17 山东高速路桥国际工程有限公司 Bolt corrosion detection system and detection method based on BoltCorrDetNet network
CN114694003A (en) * 2022-03-24 2022-07-01 湖北工业大学 Multi-scale feature fusion method based on target detection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472298A (en) * 2018-10-19 2019-03-15 天津大学 Depth binary feature pyramid for the detection of small scaled target enhances network
KR102160682B1 (en) * 2019-11-05 2020-09-29 인천대학교 산학협력단 Method and apparatus for processing remote images using multispectral images
CN112102241A (en) * 2020-08-11 2020-12-18 中山大学 Single-stage remote sensing image target detection algorithm
CN112487862A (en) * 2020-10-28 2021-03-12 南京云牛智能科技有限公司 Garage pedestrian detection method based on improved EfficientDet model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MJIANSUN: "[YOLOv4] The FPN+PAN Structure", CSDN, page 1 *
SHU LIU: "Path Aggregation Network for Instance Segmentation", arXiv:1803.01534v4 [cs.CV], pages 1-11 *
JIANG YICHENG et al.: "Pedestrian Detection Based on Depthwise Separable Convolution and a Multi-level Feature Pyramid Network", Journal of Automotive Safety and Energy, pages 1-8 *
JIANG DABAI: "How to Evaluate the Newly Released YOLOv4?", Zhihu, pages 1-4 *

Similar Documents

Publication Publication Date Title
CN112287940B (en) Semantic segmentation method of attention mechanism based on deep learning
CN112329800B (en) Salient object detection method based on global information guiding residual attention
CN110136063B (en) Single image super-resolution reconstruction method based on condition generation countermeasure network
CN109035149B (en) License plate image motion blur removing method based on deep learning
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112861729B (en) Real-time depth completion method based on pseudo-depth map guidance
CN111899168B (en) Remote sensing image super-resolution reconstruction method and system based on feature enhancement
CN110070574B (en) Binocular vision stereo matching method based on improved PSMNet
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN113111736A (en) Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN
CN112365511B (en) Point cloud segmentation method based on overlapped region retrieval and alignment
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN112529908B (en) Digital pathological image segmentation method based on cascade convolution network and model thereof
CN114627290A (en) Mechanical part image segmentation algorithm based on improved DeepLabV3+ network
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN117392496A (en) Target detection method and system based on infrared and visible light image fusion
CN115293986A (en) Multi-temporal remote sensing image cloud region reconstruction method
CN117576402B (en) Deep learning-based multi-scale aggregation Transformer remote sensing image semantic segmentation method
CN114926734A (en) Solid waste detection device and method based on feature aggregation and attention fusion
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
CN113888505A (en) Natural scene text detection method based on semantic segmentation
CN107358625B (en) SAR image change detection method based on SPP Net and region-of-interest detection
CN117593187A (en) Remote sensing image super-resolution reconstruction method based on meta-learning and Transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination