CN113392960B - Target detection network and method based on a hybrid dilated convolution pyramid (Google Patents)


Info

Publication number: CN113392960B (granted publication; earlier application publication CN113392960A)
Application number: CN202110646653.7A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active
Inventors: 殷光强, 殷康宁, 候少麒, 梁杰, 丁一寅, 曾宇昊
Original and current assignee: University of Electronic Science and Technology of China
Prior art keywords: layer, network, module, feature, pyramid

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N 3/00 — Computing arrangements based on biological models
        • G06N 3/02 — Neural networks
        • G06N 3/04 — Architecture, e.g. interconnection topology
        • G06N 3/045 — Combinations of networks
        • G06N 3/08 — Learning methods
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
        • G06F 18/00 — Pattern recognition
        • G06F 18/20 — Analysing
        • G06F 18/25 — Fusion techniques
        • G06F 18/253 — Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of digital image processing, and in particular to a target detection network and method based on a hybrid dilated convolution pyramid. The target detection network comprises a backbone network, a hybrid receptive field module, a low-level embedded feature pyramid module and a detection module. The backbone network extracts target picture features using a hierarchical cascade network structure; the hybrid receptive field module performs feature enhancement on the highest-level feature map output from the top of the backbone network; the low-level embedded feature pyramid module fuses high-level features downwards on the basis of the feature pyramid and generates the final feature maps to be detected in a low-level embedding manner; the detection module locates and classifies the feature maps to be detected and outputs the result. The target detection network and method can effectively alleviate the missed and false detections caused by scale variation and occlusion.

Description

Target detection network and method based on a hybrid dilated convolution pyramid
Technical Field
The invention relates to the technical field of digital image processing, and in particular to a target detection network and method based on a hybrid dilated convolution pyramid.
Background
Object detection is one of the most widespread applications of computer vision in real life; its task is to locate specific objects of interest in a picture. Conventional target detection methods can be divided into single-stage and two-stage methods. The core of the two-stage method is region proposal: the input image is selectively searched to generate region proposal boxes, a convolutional neural network then extracts features from each proposal box, and a classifier performs the classification. The single-stage method instead outputs the target detection result directly through a convolutional neural network.
Through a series of developments, the two families of methods gradually converged on a common point: a large number of anchor boxes must be generated in advance during detection, and such algorithms are collectively called Anchor-based target detection algorithms. Anchor boxes are a group of rectangular boxes obtained by running a clustering algorithm on the training set before training; they represent the dominant width and height distributions of the targets in the dataset. During inference, n candidate rectangular boxes are derived from the anchor boxes on the feature map, and these rectangular boxes are then further classified and regressed. As in the two-stage algorithm, the processing of the candidate boxes still passes through two steps: coarse foreground classification and fine multi-class classification.
The single-stage target detection algorithm lacks the fine processing of the two-stage algorithm and performs poorly when faced with problems such as multi-scale targets and occlusion. In addition, although Anchor-based algorithms alleviate to some extent the explosion in candidate-box computation caused by selective search, generating a large number of anchor boxes of different sizes in each grid cell still causes computational redundancy. Most importantly, anchor-box generation depends on a large number of hyper-parameter settings, and manual parameter tuning seriously affects the localization accuracy and classification performance.
In the prior art, the patent with publication number CN110222712A discloses "a multi-item target detection algorithm based on deep learning". The proposed algorithm obtains an augmented RoI set through a multi-scale sliding window and selective search, generating a dense RoI set exhaustively with the multi-scale sliding window, which is computationally expensive and inefficient.
The patent with publication number CN112115883A discloses "a non-maximum suppression method and apparatus based on an Anchor-free target detection algorithm". It proposes using a CenterNet network model to detect targets by predicting the top-left corner, bottom-right corner and center point of an object, and uses non-maximum suppression to avoid multiple detection boxes on the same target; however, comparatively complicated post-processing is required to group each pair of corners belonging to the same target, which is inefficient.
The patent with publication number CN112101153A discloses "a remote sensing target detection method based on a receptive field module and a multi-feature pyramid". It extracts features from a visible-light remote sensing image through a VGG network to obtain feature maps of different sizes, cascades and fuses these feature maps, obtains an optimized feature map through a strided-convolution feature pyramid, and then performs multi-scale output detection through receptive field information mining. The method utilizes feature maps of different sizes, but its feature-map fusion is redundant and the backbone network performs poorly, which degrades the final detection result.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a target detection network and method based on a hybrid dilated convolution pyramid, which can effectively alleviate the missed and false detections caused by scale variation and occlusion.
The invention is realized by adopting the following technical scheme:
a target detection network based on a hybrid void convolution pyramid is characterized in that: the system comprises a backbone network, a mixed reception field module, a low-level embedded characteristic pyramid module and a detection module; the backbone network extracts the characteristics of the target picture by using a hierarchical cascading network structure; the mixed receptive field module is used for carrying out feature enhancement on the highest layer feature map output from the topmost end of the backbone network; the low-layer embedded feature pyramid module is used for fusing high-layer features downwards on the basis of a feature pyramid and generating a final feature graph to be detected in a low-layer embedding mode; the detection module is used for positioning and classifying the characteristic diagram to be detected and outputting a result.
The low-level embedded feature pyramid module generates the final feature maps to be detected through the following specific steps:
a. the low-level embedded feature pyramid module fuses the current-level feature map with the higher-level feature map, after channel compression and upsampling, to form a composite feature map, completing the embedding of high-level semantic information;
b. the composite feature map is fused with the downsampled lower-level feature map to form a mixed feature map, completing the embedding of low-level detail information;
c. each mixed feature map passes through a composite convolutional layer to generate a final feature map to be detected.
The fusion in steps a and b is element-wise, channel-wise addition.
The composite convolutional layer in step c is formed by cascading a 3 × 3 convolutional layer, a BN layer and a LeakyReLU activation layer.
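As a hedged illustration, the 3 × 3 conv → BN → LeakyReLU cascade described above could be sketched in PyTorch as follows; the channel counts and the LeakyReLU negative slope are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class CompositeConv(nn.Module):
    """Composite convolutional layer: 3x3 conv -> BN -> LeakyReLU cascade."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            # padding=1 preserves the spatial size of the feature map
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),  # slope 0.1 is an assumption
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```

A mixed feature map of shape (N, C, H, W) keeps its spatial size H × W after this layer; only the channel count changes.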
The hybrid receptive field module comprises four parallel branches: a 1 × 1 convolutional-layer branch and three 3 × 3 convolutional-layer branches with dilation rates of 1, 2 and 4 respectively. The hybrid receptive field module concatenates the feature maps obtained in parallel by the dilated convolutional layers with different dilation rates, then fuses the feature information with a 1 × 1 convolutional layer and reduces the channel dimension to a specified number.
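A minimal PyTorch sketch of such a four-branch module, under the assumption that every branch outputs the same channel count before the 1 × 1 fusion (the per-branch widths are not specified in the text):

```python
import torch
import torch.nn as nn

class HybridReceptiveField(nn.Module):
    """Four parallel branches (1x1 conv; 3x3 convs with dilation 1, 2, 4),
    concatenated and fused by a 1x1 conv that reduces channels to out_ch."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # padding == dilation keeps the spatial size for 3x3 kernels
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=1, dilation=1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2)
        self.branch4 = nn.Conv2d(in_ch, out_ch, 3, padding=4, dilation=4)
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.branch1(x), self.branch2(x),
                 self.branch3(x), self.branch4(x)]
        # concatenate along channels, then fuse and reduce with 1x1 conv
        return self.fuse(torch.cat(feats, dim=1))
```

Because padding equals dilation in every 3 × 3 branch, all four branch outputs share the input's spatial size and can be concatenated directly.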
The backbone network is a single-stage detection network based on the Res2Net50 network; the Anchor-free mechanism of FCOS is introduced for target prediction, performing pixel-by-pixel prediction, and a Center-ness branch network is added to the loss function part.
The feature maps output by the backbone network comprise C3, C4 and C5, with feature map sizes of 100 × 100, 50 × 50 and 25 × 25 respectively.
A target detection method based on a hybrid dilated convolution pyramid, characterized in that it comprises the following steps:
i. constructing a backbone network based on the Anchor-free mechanism, obtaining feature maps C3, C4 and C5 through the backbone network, and, after feature enhancement by the hybrid receptive field module, outputting the highest-level feature map C5 of the backbone network to the low-level embedded feature pyramid module;
ii. the low-level embedded feature pyramid module forms composite features with the feature maps C4 and C3 output by the backbone network through upsampling and downsampling operations; the composite features pass through the composite convolutional layer to generate the feature maps to be detected, which are delivered to the detection module for the target localization and classification tasks;
iii. training the network, testing the model of each round, saving the best model weights, testing the real-time performance of the hybrid receptive field module and the low-level embedded feature pyramid module with the corresponding test set, and obtaining the trained network model;
iv. detecting targets with the trained network model and outputting the detection results.
In the training of the network in step iii, the loss function is as follows:
$$L(\{p_{x,y}\},\{t_{x,y}\}) = \frac{1}{N}\sum_{x,y} L_{cls}\big(p_{x,y},\, c^{*}_{x,y}\big) + \frac{1}{N}\sum_{x,y} K_{\{c^{*}_{x,y}>0\}}\, L_{reg}\big(t_{x,y},\, t^{*}_{x,y}\big)$$

wherein $p_{x,y}$ represents the class prediction probability, $t_{x,y}$ represents the regression prediction coordinates, and $N$ represents the number of positive samples; $K$ is an indicator function equal to 1 if the current prediction is judged a positive sample and 0 otherwise;
$L_{cls}$ takes the specific form of the Focal Loss function:

$$L_{cls} = \begin{cases} -(1-y')^{\gamma}\,\log y', & y = 1 \\ -\,y'^{\gamma}\,\log(1-y'), & y = 0 \end{cases}$$

wherein $y$ is the sample label, $y'$ is the predicted probability that the sample is a positive case, and $\gamma$ is the focusing parameter;
$L_{reg}$ is the GIoU Loss function, calculated as follows:

$$IoU = \frac{|A \cap B|}{|A \cup B|}$$

$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$$

$$L_{reg} = 1 - GIoU$$

where A and B denote the prediction box and the ground-truth box, and IoU is their intersection-over-union; the minimum convex set C, i.e. the smallest bounding box enclosing A and B, is computed first, and GIoU, and hence $L_{reg}$, is then computed from C.
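The GIoU loss above can be sketched for axis-aligned boxes; the (x1, y1, x2, y2) tensor layout is an illustrative assumption, not the patent's data format.

```python
import torch

def giou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """GIoU loss for boxes given as (..., 4) tensors in (x1, y1, x2, y2) form:
    IoU = |A∩B|/|A∪B|; GIoU = IoU - |C \\ (A∪B)|/|C| with C the smallest
    enclosing box; L_reg = 1 - GIoU."""
    # intersection area
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / union
    # smallest enclosing (convex) box C
    lt_c = torch.min(pred[..., :2], target[..., :2])
    rb_c = torch.max(pred[..., 2:], target[..., 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[..., 0] * wh_c[..., 1]
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou
```

Identical boxes give a loss of 0; disjoint boxes are still penalized (loss above 1), which is the point of GIoU over plain IoU.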
Compared with the prior art, the invention has the beneficial effects that:
1. The invention improves the structure of the feature pyramid and provides a low-level embedded feature pyramid module, which can effectively remedy the insufficient handling of multi-scale variation in target detection; it fuses shallow-level feature information with high-level feature information, and adds normalization and an activation function to the fused output to optimize model training.
The invention designs a hybrid receptive field module which, while controlling the model's parameter count, uses multi-scale dilated convolution combined with the multi-scale outputs of the feature pyramid to enlarge the receptive field and acquire more global feature detail, so as to address target occlusion.
The invention introduces the Anchor-free mechanism and combines the low-level embedded feature pyramid module with the hybrid receptive field module, which reduces the invalid computation caused by redundant candidate boxes, improves localization accuracy, and effectively alleviates problems such as missed detection.
2. The target detection network of the invention can handle the multi-scale and occlusion problems of target detection scenes and can be used in a plug-and-play manner. It introduces an Anchor-free algorithm and combines the low-level embedded feature pyramid module with the hybrid receptive field module, which reduces the invalid computation caused by redundant candidate boxes, improves localization accuracy, and addresses the large parameter counts, heavy redundant computation, low applicability, low efficiency and frequent misses that existing target detection methods exhibit in practical situations.
3. The backbone network introduces the Anchor-free mechanism of FCOS (Fully Convolutional One-Stage object detection) to perform pixel-by-pixel prediction, detecting targets without relying on predefined anchor boxes or proposal regions; this reduces the invalid computation caused by redundant candidate boxes, improves localization accuracy and effectively alleviates problems such as missed detection. The Center-ness mechanism is used to quickly filter negative samples, suppress low-quality prediction boxes far from the target center, and increase the weight of prediction boxes close to the target center, improving detection performance. The Res2Net50 network is introduced, replacing the single 3 × 3 convolutional layer used in ResNet50 with a hierarchically cascaded feature group inside the residual block, which is better optimized in terms of network width, depth and resolution.
4. Unlike other networks, which perform feature processing after the multi-level (C3, C4, C5) features have been fused, the hybrid receptive field module is embedded before feature fusion, between C5 of the backbone network and the feature pyramid level P5; it improves the characterization capability of the C5 features, and the final detection prediction is made after the low-level embedded feature pyramid module. Using convolutional layers with different dilation rates improves the model's adaptability to targets of different scales; after the feature maps are concatenated, a 1 × 1 convolutional layer fuses the feature information and reduces the channel dimension to a specified number, improving the flexibility of the hybrid receptive field module.
5. Compared with the plain feature pyramid, the features output by the low-level embedded feature pyramid module of the invention contain not only rich semantic information but also concrete detail information, achieving a double improvement in multi-scale target detection performance and localization accuracy.
Drawings
The invention will be described in further detail below with reference to the accompanying drawings and the detailed description, in which:
FIG. 1 is a schematic diagram of the overall structure of a target detection network according to the present invention;
FIG. 2 is a schematic flow chart of a target detection method according to the present invention;
FIG. 3 is a schematic diagram of a hybrid receptive field module according to the present invention;
FIG. 4 is a schematic diagram of a low-level embedded feature pyramid module according to the present invention;
FIG. 5 is a schematic view of a composite convolution layer according to the present invention.
Detailed Description
Example 1
As a basic embodiment of the invention, a target detection network based on a hybrid dilated convolution pyramid comprises a backbone network, a hybrid receptive field module, a low-level embedded feature pyramid module and a detection module. The backbone network extracts target picture features using a hierarchical cascade network structure; the hybrid receptive field module performs feature enhancement on the highest-level feature map output from the top of the backbone network; the low-level embedded feature pyramid module fuses high-level features downwards on the basis of the feature pyramid and generates the final feature maps to be detected in a low-level embedding manner; the detection module locates and classifies the feature maps to be detected and outputs the result.
The backbone network can be a single-stage detection network based on the Res2Net50 network, with stronger feature extraction capability at no extra computational cost; for target prediction it introduces the Anchor-free mechanism of FCOS to predict per pixel, and a Center-ness branch network is added to the loss function part to suppress low-quality detection boxes and improve detection performance.
A target detection method based on a hybrid dilated convolution pyramid comprises the following steps:
i. constructing a backbone network based on the Anchor-free mechanism, obtaining feature maps C3, C4 and C5 through the backbone network, and, after feature enhancement by the hybrid receptive field module, outputting the highest-level feature map C5 of the backbone network to the low-level embedded feature pyramid module;
ii. the low-level embedded feature pyramid module forms composite features with the feature maps C4 and C3 output by the backbone network through upsampling and downsampling operations; the composite features pass through the composite convolutional layer to generate the feature maps to be detected, which are delivered to the detection module for the target localization and classification tasks;
iii. training the network, testing the model of each round, saving the best model weights, testing the real-time performance of the hybrid receptive field module and the low-level embedded feature pyramid module with the corresponding test set, and obtaining the trained network model;
iv. detecting targets with the trained network model and outputting the detection results.
Example 2
As a preferred embodiment of the invention, and with reference to FIG. 1 of the description, the target detection network based on a hybrid dilated convolution pyramid comprises a backbone network, a hybrid receptive field module, a low-level embedded feature pyramid module and a detection module.
The backbone network adopts a single-stage detection network structure and introduces the Anchor-free mechanism of FCOS to perform pixel-by-pixel prediction, detecting targets without relying on predefined anchor boxes or proposal regions; this reduces the invalid computation caused by redundant candidate boxes, improves localization accuracy and effectively alleviates problems such as missed detection. The Center-ness mechanism is used to quickly filter negative samples, suppress low-quality prediction boxes far from the target center, and increase the weight of prediction boxes close to the target center, improving detection performance. The expression of Center-ness is shown in formula (1), where l*, r*, t*, b* represent the distances from the pixel point to the left, right, top and bottom edges of the prediction box; the value lies between 0 and 1, so that the Center-ness value is larger the closer the point is to the true target center, and smaller the farther away it is.
$$centerness^{*} = \sqrt{\frac{\min(l^{*},\,r^{*})}{\max(l^{*},\,r^{*})} \times \frac{\min(t^{*},\,b^{*})}{\max(t^{*},\,b^{*})}} \qquad (1)$$
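Formula (1), the Center-ness expression, can be sketched as a small function; the tensor-based signature is an illustrative assumption:

```python
import torch

def centerness_target(l, r, t, b):
    """Center-ness target of formula (1): the square root of the product of
    the min/max ratios of the left/right and top/bottom distances."""
    return torch.sqrt(
        (torch.min(l, r) / torch.max(l, r))
        * (torch.min(t, b) / torch.max(t, b))
    )
```

A pixel at the exact box center (l = r, t = b) yields 1; the value shrinks toward 0 as the pixel moves away from the center, which is what lets the branch down-weight off-center prediction boxes.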
The backbone network introduces the Res2Net50 network, which replaces the single 3 × 3 convolutional layer used in ResNet50 with a hierarchically cascaded feature group inside the residual block and is better optimized in terms of network width, depth and resolution. At C3, C4 and C5, the feature map sizes are 100 × 100, 50 × 50 and 25 × 25 respectively.
The hybrid receptive field module concatenates the feature maps obtained in parallel by dilated convolutional layers with different dilation rates, improving the network's ability to capture global features and compensating for the gridding effect caused by a single dilated convolution. The hybrid receptive field module of the present application uses dilated convolutional layers throughout to effectively address the target occlusion problem.
Referring to FIG. 3 of the description, in order to exploit the hybrid receptive field module fully, the module differs from other networks, which process features only after the multi-level (C3, C4, C5) features have been fused: it is instead embedded before feature fusion, between C5 of the backbone network and the feature pyramid level P5, to improve the characterization capability of the C5 features, with the final detection prediction made after the low-level embedded feature pyramid module. The hybrid receptive field module of the invention consists of four parallel branches: one 1 × 1 convolutional-layer branch, and three 3 × 3 convolutional-layer branches with dilation rates of 1, 2 and 4 respectively. The 3 × 3 dilated convolution with dilation rate 4 can capture more global context feature detail, enhancing inference capability and addressing target occlusion; using convolutional layers with different dilation rates improves the model's adaptability to targets of different scales.
The high-level features output by C5 carry rich semantic information. Unlike the conventionally adopted cascaded feature combination, the parallel feature combination adopted by the invention can train network parameters better suited to the current dataset. Parallel branch 1, with its 1 × 1 convolutional layer, preserves as much image detail as possible without changing the feature map size, while also controlling the number of feature map channels to reduce subsequent computation. The 3 × 3 convolution kernels have few parameters, processing the feature information while keeping network computation low. Dilated convolution captures more global feature detail and enhances inference capability, so that occluded targets are better recognized; setting different dilation rates eliminates the gridding effect while improving the model's adaptability to multi-scale targets. Parallel branch 2 is a 3 × 3 convolution with dilation rate 1, suited to detecting small and medium targets; parallel branch 3 is a 3 × 3 convolution with dilation rate 2, suited to medium targets; and parallel branch 4 is a 3 × 3 convolution with dilation rate 4, suited to medium and large targets.
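The small-to-large ordering of the branches follows from the effective receptive field of a dilated kernel: for a k × k kernel with dilation rate d, the effective size is k + (k − 1)(d − 1). A quick check:

```python
def effective_kernel_size(k: int, d: int) -> int:
    """Effective receptive-field size of a kxk convolution with dilation d."""
    return k + (k - 1) * (d - 1)

# The three 3x3 branches with dilation rates 1, 2 and 4:
sizes = [effective_kernel_size(3, d) for d in (1, 2, 4)]  # -> [3, 5, 9]
```

So the dilation-4 branch covers a 9 × 9 neighborhood with only 3 × 3 = 9 weights, which is how the module enlarges the receptive field without growing the parameter count.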
After the feature maps are concatenated, a 1 × 1 convolutional layer fuses the feature information and reduces the channel dimension to a specified number, improving the flexibility of the hybrid receptive field module.
The feature pyramid fuses high-level features downwards so that the feature map of every level carries strong semantic information and can be predicted from separately. Compared with the plain feature pyramid, the features output by the low-level embedded feature pyramid module of the present application contain not only rich semantic information but also concrete detail information, achieving a double improvement in multi-scale target detection performance and localization accuracy.
Referring to FIG. 4 of the description, C5' is the feature map after the low-level embedded feature pyramid module; referring to FIG. 5, the composite convolutional layer (formed by cascading a 3 × 3 convolutional layer, a BN layer and a LeakyReLU activation layer) processes the fused features, optimizing model training and improving the nonlinear expressive power of the features.
The low-level embedded feature pyramid module first fuses the current-level feature map with the higher-level feature map, after channel compression and upsampling, by element-wise, channel-wise addition to form a composite feature map, completing the embedding of high-level semantic information; it then fuses the composite feature map with the downsampled lower-level feature map to form a mixed feature map, completing the embedding of low-level detail information; finally, each mixed feature map passes through the designed composite convolutional layer to generate a final feature map to be detected, which enters the next module.
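One fusion step of this procedure might be sketched as below; the `compress_high`, `down_low` and `composite_conv` operators (channel compression, downsampling and the composite convolution) are passed in as modules because the text does not pin down their exact form, so the concrete choices in the usage are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def embed_fuse(cur, high, low, compress_high, down_low, composite_conv):
    """Low-level embedded fusion for one pyramid level: add the compressed,
    upsampled higher-level map (high-level semantic embedding), then the
    downsampled lower-level map (low-level detail embedding); both fusions
    are element-wise, channel-wise additions."""
    high_up = F.interpolate(compress_high(high), size=cur.shape[-2:],
                            mode="nearest")
    composite = cur + high_up          # composite feature map
    mixed = composite + down_low(low)  # mixed feature map
    return composite_conv(mixed)       # final feature map to be detected
```

For example, with a 1 × 1 convolution as the channel compressor and a stride-2 convolution as the downsampler, the output keeps the current level's shape.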
A target detection method based on a hybrid dilated convolution pyramid, with reference to FIG. 1 of the description, comprises the following steps:
i. building a backbone network based on the Anchor-free mechanism, obtaining feature maps C3, C4 and C5 through the backbone network, and, after feature enhancement by the hybrid receptive field module, outputting the highest-level feature map C5 of the backbone network to the low-level embedded feature pyramid module;
ii. the low-level embedded feature pyramid module forms composite features with the feature maps C4 and C3 output by the backbone network through upsampling and downsampling operations; the composite features pass through the composite convolutional layer to generate the feature maps to be detected, which are delivered to the detection module for the target localization and classification tasks;
iii. training the network, testing the model of each round, saving the best model weights, testing the real-time performance of the hybrid receptive field module and the low-level embedded feature pyramid module with the corresponding test set, and obtaining the trained network model;
iv. detecting targets with the trained network model and outputting the detection results.
Wherein, in the process of training the network, the loss function is as follows:
$$L(\{p_{x,y}\},\{t_{x,y}\}) = \frac{1}{N}\sum_{x,y} L_{cls}\big(p_{x,y},\, c^{*}_{x,y}\big) + \frac{1}{N}\sum_{x,y} K_{\{c^{*}_{x,y}>0\}}\, L_{reg}\big(t_{x,y},\, t^{*}_{x,y}\big)$$

wherein $p_{x,y}$ represents the class prediction probability, $t_{x,y}$ represents the regression prediction coordinates, and $N$ represents the number of positive samples; $K$ is an indicator function equal to 1 if the current prediction is judged a positive sample and 0 otherwise;
$L_{cls}$ takes the specific form of the Focal Loss function:

$$L_{cls} = \begin{cases} -(1-y')^{\gamma}\,\log y', & y = 1 \\ -\,y'^{\gamma}\,\log(1-y'), & y = 0 \end{cases}$$

wherein $y$ is the sample label, $y'$ is the predicted probability that the sample is a positive case, and $\gamma$ is the focusing parameter. Compared with the ordinary cross-entropy loss function, Focal Loss adds the $\gamma$ factor; by controlling the value of $\gamma$, the influence of easy samples is reduced so that training focuses more on hard samples.
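As a hedged sketch, this piecewise Focal Loss can be written per-sample (the α balancing factor is omitted, as in the text; γ = 2.0 is an illustrative default):

```python
import torch

def focal_loss(y: torch.Tensor, y_pred: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """Focal Loss: -(1-y')^gamma * log(y') for positives (y = 1) and
    -y'^gamma * log(1-y') for negatives (y = 0); gamma is the focusing
    parameter."""
    pos = -((1.0 - y_pred) ** gamma) * torch.log(y_pred)
    neg = -(y_pred ** gamma) * torch.log(1.0 - y_pred)
    return torch.where(y == 1, pos, neg)
```

With γ = 0 this reduces to the ordinary cross-entropy; raising γ shrinks the loss contribution of well-classified (easy) samples, which is the focusing behavior the text describes.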
$L_{reg}$ is the GIoU Loss function, calculated as follows:

$$IoU = \frac{|A \cap B|}{|A \cup B|}$$

$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$$

$$L_{reg} = 1 - GIoU$$

where A and B denote the prediction box and the ground-truth box, and IoU is their intersection-over-union; the minimum convex set C, i.e. the smallest bounding box enclosing A and B, is computed first, and GIoU, and hence $L_{reg}$, is then computed from C.
In summary, after reading the present disclosure, those skilled in the art may make various changes and modifications according to the technical solutions and concepts of the present disclosure without creative effort, and all such modifications fall within the protection scope of the present disclosure.

Claims (5)

1. A target detection network based on a mixed hole convolution pyramid, characterized in that: the network comprises a backbone network, a mixed receptive field module, a low-level embedded feature pyramid module and a detection module; the backbone network extracts target picture features using a layered cascade network structure; the mixed receptive field module performs feature enhancement on the highest-level feature map output from the top of the backbone network; the low-level embedded feature pyramid module fuses high-level features downwards on the basis of a feature pyramid and generates the final feature maps to be detected by low-level embedding; the detection module locates and classifies the feature maps to be detected and outputs the result; the backbone network is a single-stage detection network based on the Res2Net50 network, the Anchor-free mechanism of FCOS is introduced for target prediction, prediction is performed pixel by pixel, and a Center-ness branch network is added to the loss function part; the mixed receptive field module comprises four parallel branches: a 1×1 convolution branch and three 3×3 convolution branches with hole (dilation) rates of 1, 2 and 4 respectively; the mixed receptive field module concatenates the feature maps obtained by the parallel hole convolution layers of different hole rates, performs feature information fusion with a 1×1 convolution layer, and reduces the channel dimension to a specified number;
the low-level embedded feature pyramid module generates the final feature maps to be detected as follows:
a. the low-level embedded feature pyramid module fuses the current-level feature map with the higher-level feature map after channel compression and upsampling to form a composite feature map, completing the embedding of high-level semantic information;
b. the composite feature map is fused with the downsampled lower-level feature map to form a mixed feature map, completing the embedding of low-level detail information;
c. each mixed feature map passes through a composite convolution layer to generate a final feature map to be detected; the composite convolution layer consists of a 3×3 convolution layer, a BN layer and a LeakyReLU activation layer connected in sequence.
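The point of combining hole rates 1, 2 and 4 in claim 1 can be checked with simple arithmetic: a single k×k convolution with dilation rate d covers d·(k−1)+1 input positions per side, so the parallel branches observe 1×1, 3×3, 5×5 and 9×9 windows of the input. A small sketch (the branch list mirrors the claim; the formula is the standard dilated-convolution receptive field):

```python
def dilated_rf(k, d):
    # Effective receptive field (one side) of a single k x k convolution
    # with dilation (hole) rate d: d * (k - 1) + 1.
    return d * (k - 1) + 1

# The four parallel branches of the mixed receptive field module:
# a 1x1 conv plus three 3x3 convs with hole rates 1, 2 and 4.
branches = [(1, 1), (3, 1), (3, 2), (3, 4)]  # (kernel, dilation)
for k, d in branches:
    print(f"{k}x{k} conv, rate {d}: receptive field {dilated_rf(k, d)}")
```

Concatenating the four branch outputs thus mixes four receptive field sizes at the same spatial resolution, which is what gives the module its multi-scale feature enhancement without any downsampling.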
2. The mixed hole convolution pyramid-based target detection network of claim 1, wherein the fusion in step a and step b is element-wise, channel-wise addition.
3. The mixed hole convolution pyramid-based target detection network of claim 1, wherein the feature maps output by the backbone network comprise C3, C4 and C5, with sizes of 100×100, 50×50 and 25×25 respectively.
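The 100/50/25 pyramid of claim 3 is consistent with an 800×800 input and the usual C3/C4/C5 strides of 8, 16 and 32; the input resolution is an assumption here, since the claim only fixes the output sizes:

```python
# Hypothetical input resolution; the claim itself states only the map sizes.
input_size = 800
strides = {"C3": 8, "C4": 16, "C5": 32}
sizes = {name: input_size // s for name, s in strides.items()}
print(sizes)  # each level halves the previous one: 100 -> 50 -> 25
```

The factor-of-two relation between adjacent levels is what lets the pyramid module of claim 1 align maps with a single 2× up- or down-sampling step.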
4. A target detection method based on a mixed hole convolution pyramid, characterized by comprising the following steps: i. constructing a backbone network based on the Anchor-free mechanism, obtaining feature maps C3, C4 and C5 through the backbone network, and outputting the highest-level feature map C5, after feature enhancement by the mixed receptive field module, to the low-level embedded feature pyramid module; the backbone network is a single-stage detection network based on the Res2Net50 network, the Anchor-free mechanism of FCOS is introduced for target prediction, prediction is performed pixel by pixel, and a Center-ness branch network is added to the loss function part; the mixed receptive field module consists of four parallel branches: a 1×1 convolution branch and three 3×3 convolution branches with hole (dilation) rates of 1, 2 and 4 respectively; the mixed receptive field module concatenates the feature maps obtained by the parallel hole convolution layers of different hole rates, performs feature information fusion with a 1×1 convolution layer, and reduces the channel dimension to a specified number;
ii. the low-level embedded feature pyramid module forms composite features from the feature maps C4 and C3 output by the backbone network through up-sampling and down-sampling operations; the composite features pass through a composite convolution layer to generate the feature maps to be detected, which are delivered to the detection module for the target localization and classification tasks; the composite convolution layer consists of a 3×3 convolution layer, a BN layer and a LeakyReLU activation layer connected in sequence;
iii. training the network, testing the model after each round, saving the best model weights, testing the real-time performance of the mixed receptive field module and the low-level embedded feature pyramid module with the corresponding test set, and obtaining the trained network model;
iv. detecting targets with the trained network model and outputting the detection results.
5. The mixed hole convolution pyramid-based target detection method of claim 4, wherein in the process of training the network in step iii, the loss function is as follows:
$$L(\{p_{x,y}\},\{t_{x,y}\}) = \frac{1}{N}\sum_{x,y} L_{cls}\left(p_{x,y}, c^{*}_{x,y}\right) + \frac{1}{N}\sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y}>0\}}\, L_{reg}\left(t_{x,y}, t^{*}_{x,y}\right)$$
where p_{x,y} denotes the class prediction probability, t_{x,y} the regression prediction coordinates, and N the number of positive samples; the indicator function 𝟙 is 1 if the current prediction is determined to be a positive sample and 0 otherwise;
L_{cls} takes the specific form of the Focal Loss function:
$$L_{cls} = \begin{cases} -(1-y')^{\gamma}\log(y'), & y = 1 \\ -\,y'^{\gamma}\log(1-y'), & y = 0 \end{cases}$$
where y is the sample label, y' is the predicted probability that the sample is positive, and γ is the focusing parameter;
L_{reg} is the GIoU Loss function, computed as follows:
$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$$
$$L_{reg} = 1 - GIoU$$
where A and B denote the predicted box and the ground-truth box, and IoU is their intersection-over-union. The minimum convex set C of A and B, i.e. the smallest bounding box enclosing both A and B, is computed first; GIoU is then calculated from C, and L_{reg} is obtained in turn.
CN202110646653.7A 2021-06-10 2021-06-10 Target detection network and method based on mixed hole convolution pyramid Active CN113392960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110646653.7A CN113392960B (en) 2021-06-10 2021-06-10 Target detection network and method based on mixed hole convolution pyramid


Publications (2)

Publication Number Publication Date
CN113392960A CN113392960A (en) 2021-09-14
CN113392960B true CN113392960B (en) 2022-08-30

Family

ID=77620186


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947774B (en) * 2021-10-08 2024-05-14 东北大学 Lightweight vehicle target detection system
CN113887455B (en) * 2021-10-11 2024-05-28 东北大学 Face mask detection system and method based on improved FCOS
CN113963177A (en) * 2021-11-11 2022-01-21 电子科技大学 CNN-based building mask contour vectorization method
CN113989498B (en) * 2021-12-27 2022-07-12 北京文安智能技术股份有限公司 Training method of target detection model for multi-class garbage scene recognition
CN114339049A (en) * 2021-12-31 2022-04-12 深圳市商汤科技有限公司 Video processing method and device, computer equipment and storage medium
CN114283488B (en) * 2022-03-08 2022-06-14 北京万里红科技有限公司 Method for generating detection model and method for detecting eye state by using detection model
CN114693939B (en) * 2022-03-16 2024-04-30 中南大学 Method for extracting depth features of transparent object detection under complex environment
CN115984105B (en) * 2022-12-07 2023-08-01 深圳大学 Hole convolution optimization method and device, computer equipment and storage medium
CN115861855B (en) * 2022-12-15 2023-10-24 福建亿山能源管理有限公司 Operation and maintenance monitoring method and system for photovoltaic power station
CN117132761A (en) * 2023-08-25 2023-11-28 京东方科技集团股份有限公司 Target detection method and device, storage medium and electronic equipment

Citations (5)

Publication number Priority date Publication date Assignee Title
CN108985269A (en) * 2018-08-16 2018-12-11 东南大学 Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure
CN109543672A (en) * 2018-10-15 2019-03-29 天津大学 Object detecting method based on dense characteristic pyramid network
CN111260630A (en) * 2020-01-16 2020-06-09 高新兴科技集团股份有限公司 Improved lightweight small target detection method
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network
CN112819748A (en) * 2020-12-16 2021-05-18 机科发展科技股份有限公司 Training method and device for strip steel surface defect recognition model

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN112070729B (en) * 2020-08-26 2023-07-07 西安交通大学 Anchor-free remote sensing image target detection method and system based on scene enhancement
CN112183649A (en) * 2020-09-30 2021-01-05 佛山市南海区广工大数控装备协同创新研究院 Algorithm for predicting pyramid feature map
CN112419237B (en) * 2020-11-03 2023-06-30 中国计量大学 Deep learning-based automobile clutch master cylinder groove surface defect detection method
CN112446327B (en) * 2020-11-27 2022-06-07 中国地质大学(武汉) Remote sensing image target detection method based on non-anchor frame
CN112651351B (en) * 2020-12-29 2022-01-04 珠海大横琴科技发展有限公司 Data processing method and device
CN112801117B (en) * 2021-02-03 2022-07-12 四川中烟工业有限责任公司 Multi-channel receptive field guided characteristic pyramid small target detection network and detection method



Similar Documents

Publication Publication Date Title
CN113392960B (en) Target detection network and method based on mixed hole convolution pyramid
CN111967305B (en) Real-time multi-scale target detection method based on lightweight convolutional neural network
CN114627360B (en) Substation equipment defect identification method based on cascade detection model
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN114049584A (en) Model training and scene recognition method, device, equipment and medium
CN111768388B (en) Product surface defect detection method and system based on positive sample reference
CN111126472A (en) Improved target detection method based on SSD
CN113052834B (en) Pipeline defect detection method based on convolution neural network multi-scale features
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN114943963A (en) Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN112906718A (en) Multi-target detection method based on convolutional neural network
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN115035295B (en) Remote sensing image semantic segmentation method based on shared convolution kernel and boundary loss function
CN113850324B (en) Multispectral target detection method based on Yolov4
CN116503318A (en) Aerial insulator multi-defect detection method, system and equipment integrating CAT-BiFPN and attention mechanism
CN112528904A (en) Image segmentation method for sand particle size detection system
CN112183649A (en) Algorithm for predicting pyramid feature map
CN115223009A (en) Small target detection method and device based on improved YOLOv5
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN112507849A (en) Dynamic-to-static scene conversion method for generating countermeasure network based on conditions
CN113901928A (en) Target detection method based on dynamic super-resolution, and power transmission line component detection method and system
CN114782298A (en) Infrared and visible light image fusion method with regional attention
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN112700450A (en) Image segmentation method and system based on ensemble learning
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant