CN115995020A - Small target detection algorithm based on full convolution

Small target detection algorithm based on full convolution

Info

Publication number
CN115995020A
Authority
CN
China
Prior art keywords
target
label
loss
prediction
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211705281.1A
Other languages
Chinese (zh)
Inventor
高明
苗功勋
熊英超
徐家伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongfu Safety Technology Co Ltd
Original Assignee
Zhongfu Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongfu Safety Technology Co Ltd filed Critical Zhongfu Safety Technology Co Ltd
Priority to CN202211705281.1A priority Critical patent/CN115995020A/en
Publication of CN115995020A publication Critical patent/CN115995020A/en
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a small target detection algorithm based on full convolution, which comprises a convolutional network model part, a loss calculation part and a training parameter adjustment part. The convolutional network model part is used for extracting image features and predicting targets; the loss calculation part is used for calculating the prediction loss during training to obtain gradients that guide the network model in weight learning; the training parameter adjustment part is used for feeding labelled data into the network model for forward inference, back-propagating gradients of the loss function, and adjusting the network learning rate and the data set according to validation accuracy to obtain the optimal model weights. The invention uses multi-scale feature fusion to improve the extraction of target features at different scales, uses double-scale target prediction to solve the problem that small targets occupy little information in the feature map and are easily influenced by large-target features, and uses the true target boundary to calculate the loss, guiding the network to learn target boundary features more accurately.

Description

Small target detection algorithm based on full convolution
Technical Field
The invention belongs to the technical field of small target detection, and particularly relates to a small target detection algorithm based on full convolution.
Background
Image-based object detection is an important research focus in the field of computer vision. In the COCO object detection dataset, targets with a pixel area smaller than 32×32 are regarded as small targets; this type of target has long been a research difficulty because it carries little information. In recent years, thanks to progress in computing equipment and deep learning theory, the accuracy of object detection tasks has greatly improved, and research on small target detection has gradually begun to show results.
Research on the characteristics of small targets follows four main directions. First, to address the small pixel proportion of small targets, data enhancement strategies increase their presence in images, e.g., random cropping, random scaling, copy-pasting of target regions, and GAN-generated small targets. Second, for multi-scale feature extraction of small targets with small areas, feature matrices of different depths and scales are fused in the backbone network to improve the extraction of information at different image scales, e.g., FPN and its many variants. Third, for global feature fusion of images, researchers observe that objects generally relate to their scenes and to one another; for example, a fish is more likely to appear in water than in the sky. Such methods typically use channel and spatial attention mechanisms to learn and exploit global feature information. Fourth, anchor-free mechanisms target small objects: in anchor-based detection, anchor sizes are usually set from prior experience and the contents of each anchor are classified or regressed. These methods require positive/negative sample division, and a slight deviation in the intersection-over-union (IoU) calculation causes large accuracy fluctuations for small targets, making them difficult to learn. With the advent of anchor-free detection, researchers have improved small target detection, e.g., by using an enhanced feature extraction network to directly predict the center point and size of the target box.
Methods based on data enhancement, multi-scale feature extraction and global feature fusion bring broad performance improvements in the field of object detection and can be used flexibly as plug-in modules in various detection models.
Anchor-free object detection algorithms break away from predefined anchor boxes and, at the level of the prediction mechanism, alleviate the positive-sample imbalance of small targets during detection. However, anchor-free detection methods still generally compute the IoU of the target's bounding rectangle to generate the loss that guides network learning.
Disclosure of Invention
The invention provides a small target detection algorithm based on full convolution, which uses multi-scale feature fusion to improve the extraction of target features at different scales, uses double-scale target prediction to solve the problem that small targets occupy little information in the feature map and are easily influenced by large-target features, and uses the true target boundary to calculate the loss, guiding the network to learn target boundary features more accurately.
The invention adopts the following technical scheme for solving the technical problems:
the small target detection algorithm based on full convolution comprises a convolution network model part, a loss calculation part and a training parameter adjustment part;
the convolution network model part comprises a backbone network module, a multi-scale feature fusion module and a double-scale prediction module, wherein the backbone network module sequentially performs feature extraction of different scales on the image by using a backbone network; the multi-scale feature fusion module fuses the feature matrix information of different scales; the double-scale prediction module predicts a large-scale target and a small-scale target respectively and finally outputs a result;
the convolution network model part is used for extracting image features and predicting targets; the loss calculation part is used for calculating the prediction loss during training to obtain gradients that guide the network model in weight learning; the training parameter adjustment part is used for feeding labelled data into the network model for forward inference, back-propagating gradients of the loss function, and finally adjusting the network learning rate and the data set according to validation accuracy; the algorithm comprises the following specific steps:
s1, constructing a network model;
s2, constructing a loss function based on the target boundary distance;
and S3, training and adjusting parameters.
Further, the backbone network module is specifically as follows:
inputting a batch image pixel matrix I into a backbone network module:
I=[B,C,H,W]
wherein B is the number of images in the batch, C is the number of channels (an input image typically has 3 channels, the R red, G green and B blue color features), H is the image height, and W is the image width;
the backbone module outputs three features of different scales, C3, C4 and C5, where 3, 4 and 5 denote the power of 2 by which the feature matrix scale is downsampled (8×, 16× and 32×).
Further, the multi-scale feature fusion module is specifically as follows:
the input feature matrix is split into two parts along the channel dimension; the first part undergoes convolution and related operations, while the second part is shortcut-connected directly to the tail of the module output and concatenated with the result of the first part; feature matrices P3 and P4 of two scales are finally obtained.
Further, the dual-scale prediction module is specifically as follows:
a general semantic segmentation prediction structure is applied to P3 and P4 to obtain prediction results at two different scales, implemented as follows:
a target whose connected-domain area is larger than 32×32 is treated as a large target and predicted by the P4 feature matrix, giving the large-target result R4; a target whose connected-domain area is smaller than 32×32 is treated as a small target and predicted by the P3 feature matrix, giving the small-target result R3;
when computing the loss between the prediction results and the labels, different label maps are generated according to target size:
the loss of the P3 prediction is computed with the small-target label, and the loss of the P4 prediction is computed with the large-target label;
in the small-target label map, small targets form the primary label area and large targets form the secondary label area; in the large-target label map, vice versa;
in the training stage, the primary label area contributes a normal per-pixel loss, while in the secondary label area pixels predicted as background contribute no loss; targets outside the scale are thus not forced to be predicted, preventing interference from feature conflicts;
and the connected domains in the small-target result map R3 that satisfy the small-target rule are extracted and overlaid onto the large-target result map R4 to obtain the final prediction result R.
Further, the step S2 specifically includes: since the full convolution network performs class prediction for each pixel, it is more likely that scattered speckle errors occur, so the loss function based on the target boundary distance is:
$$L_{pixelloss} = -\sum_{c \in Class} w\left[\, y_{true}\log\left(y_{pred}\right) + \left(1 - y_{true}\right)\log\left(1 - y_{pred}\right)\right]$$

wherein $L_{pixelloss}$ represents the loss of each pixel, $Class$ represents the set of all classes, $y_{true}$ represents the label of the pixel for a certain class, and $y_{pred}$ represents the predicted score of the pixel for that class; $w$ is a weight coefficient determined by the distance from the pixel to the nearest connected domain of that class in the label, calculated as:

$$w = \begin{cases} 1, & i \in \Omega_{c} \\ 1 + dis\left(i, \Omega_{c}\right), & i \notin \Omega_{c} \end{cases}$$

where $i \in \Omega_{c}$ indicates that pixel $i$ lies within the label region $\Omega_{c}$ of the class to which $y_{true}$ belongs, $i \notin \Omega_{c}$ indicates that it does not, and $dis\left(i, \Omega_{c}\right)$ represents the distance from pixel $i$ to the nearest label region of that class; the weight further constrains mispredictions far from the correct label region.
Further, the step S3 is specifically as follows:
step S31, collecting the target image data required by the task and labelling the data in the label format of semantic segmentation to obtain the data set required for training;
step S32, dividing the data set into a training set, a validation set and a test set in proportion; a typical ratio is 7:1:2, which may be adjusted according to the amount of data;
step S33, feeding the training set into the network model constructed in step S1 for forward computation to obtain prediction results, computing gradients with the loss function constructed in step S2, and back-propagating them to adjust the model parameters;
step S34, after training for a number of batches, adjusting the learning rate according to the accuracy on the validation set, while observing whether the downward trend of the model loss is positively correlated with the upward trend of the validation accuracy, so as to avoid overfitting;
and step S35, finally, after several rounds of training, testing with the test set and selecting the optimal network model as the result model, which is saved for subsequent small target detection inference.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. The small target detection algorithm based on full convolution provided by the invention uses double-scale prediction, letting the network model concentrate on small target detection at the small scale, which reduces interference between large and small targets; it constructs label maps at different scales, effectively improving small-target accuracy.
2. To reduce the speckle prediction errors caused by migrating a semantic segmentation network to the object detection task, the small target detection algorithm based on full convolution provided by the invention proposes a loss function aimed at the target boundary, which strengthens the loss weight of mispredictions outside the target, guides the network to focus on the target area, and reduces speckle prediction errors.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a network model architecture of the present invention;
FIG. 3 is a diagram of a multi-scale information fusion architecture of the present invention;
fig. 4 is a diagram of the different label maps of the present invention.
Detailed Description
The present invention is described below through examples, but it is not limited to these examples. In the following detailed description certain specific details are set forth; those skilled in the art will fully understand the invention without the details described herein. Well-known methods, procedures, flows, components and circuits are not described in detail so as not to obscure the essence of the invention.
The invention is divided into three parts: first, the convolutional network model part, which extracts image features and predicts targets; second, the loss calculation part, which computes the prediction loss during training to obtain gradients that guide the network model in weight learning; third, the training parameter adjustment part, which feeds labelled data into the network model for forward inference, back-propagates gradients of the loss function, and finally adjusts the network learning rate, data set and so on according to validation accuracy. The optimal weight parameters are then selected for target detection inference.
The network model structure diagram in the algorithm is shown in fig. 2:
In fig. 2 the model is divided into three modules: the first is the backbone network module, which uses a backbone network to extract features of different scales from the image in turn; the second is the multi-scale feature fusion module, which fuses feature matrix information of different scales; the third is the double-scale prediction module, which predicts large-scale and small-scale targets separately and outputs the final result.
Backbone network module: depending on the difficulty of the task and the computing capability of the device, different general-purpose backbone network modules can be used flexibly. Different backbone networks may compute different numbers of feature scales; the feature matrices of the last three scales are used as output.
The invention selects ESNet, proposed in PP-PicoDet; backbone networks such as the ResNet series, SENet, the MobileNet series and the EfficientNet series can be substituted, and different backbones differ in computational complexity and feature extraction capability.
This backbone uses an advanced network search algorithm and a channel information fusion structure to improve its feature extraction capability, and uses lightweight network blocks, namely the ShuffleNet block and the Ghost block, to effectively reduce computational complexity.
Inputting a batch image pixel matrix I into a backbone network module:
I=[B,C,H,W]
where B is the number of images in the batch, C is the number of channels (an input image typically has 3 channels, the R (red), G (green) and B (blue) color features), H is the image height, and W is the image width.
The backbone module outputs three features of different scales, C3, C4 and C5, where 3, 4 and 5 denote the power of 2 by which the feature matrix scale is downsampled (8×, 16× and 32×).
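As a concrete illustration, the sketch below shows only the input/output contract just described: a batch tensor I = [B, C, H, W] goes in, and three feature matrices at strides 8, 16 and 32 come out. The toy three-stage network is a stand-in for ESNet (whose internals are not reproduced in this text); only the shapes are the point.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Minimal stand-in for the backbone: three stages whose outputs C3, C4
    and C5 are downsampled 8x, 16x and 32x relative to the input."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.stem = nn.Sequential(  # stride 4 overall
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Conv2d(width, width * 2, 3, stride=2, padding=1)      # stride 8  -> C3
        self.stage4 = nn.Conv2d(width * 2, width * 4, 3, stride=2, padding=1)  # stride 16 -> C4
        self.stage5 = nn.Conv2d(width * 4, width * 8, 3, stride=2, padding=1)  # stride 32 -> C5

    def forward(self, x):
        x = self.stem(x)
        c3 = self.stage3(x)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c3, c4, c5

# I = [B, C, H, W]: a batch of 2 RGB images of size 256x256
i = torch.randn(2, 3, 256, 256)
c3, c4, c5 = ToyBackbone()(i)
print(c3.shape, c4.shape, c5.shape)  # [2,64,32,32] [2,128,16,16] [2,256,8,8]
```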
Three feature matrices are acquired: C3 downsampled 8-fold, C4 downsampled 16-fold, and C5 downsampled 32-fold. To further fuse the feature information, the algorithm adopts the feature fusion structure shown in fig. 3:
the CSP structure is from CSPNet, the input feature matrix is divided into two parts in the channel dimension, the first part is subjected to convolution and other operations, the second part is directly connected to the tail part of the module output in a short way, and the second part is spliced with the result of the first part, so that the diversity of the gradient link is enriched under the condition of reducing the calculated amount.
Finally, feature matrixes P3 and P4 of two scales are obtained.
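A minimal sketch of the CSP split-transform-merge pattern just described, assuming a single fusion block: the channel split, the convolution branch, the shortcut branch and the final concatenation are shown, while the up/downsampling that combines C3-C5 into P3 and P4 is omitted.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """CSP-style fusion: split the input along the channel dimension,
    convolve the first half, shortcut the second half to the output tail,
    then concatenate the two."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.convs = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.BatchNorm2d(half), nn.ReLU(),
            nn.Conv2d(half, half, 3, padding=1), nn.BatchNorm2d(half), nn.ReLU())

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                      # split in the channel dimension
        return torch.cat([self.convs(a), b], dim=1)   # conv branch + shortcut branch
```

Because only half of the channels pass through the convolutions, computation drops, while the shortcut half preserves an extra gradient path — the gradient-diversity benefit noted above.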
After the feature matrices P3 and P4 of two scales are obtained, in order to reduce the loss imbalance caused by targets of different scales, a general semantic segmentation prediction structure is applied to P3 and P4 to obtain prediction results at two different scales, implemented as follows:
in the algorithm, a target with the target connected domain area larger than 32 multiplied by 32 is regarded as a large target, and a P4 feature matrix is responsible for prediction to obtain a large target result graph R4; and taking the target with the target connected domain area smaller than 32 multiplied by 32 as a small target, and taking charge of prediction by a P3 feature matrix to obtain a small target result graph R3.
In real object detection tasks, imaging conditions vary, so the size at which an object appears in an image is not fixed: the same object may be a large target in one image and a small target in another. Forcibly constraining the network to learn it at a single scale would harm practical performance. Therefore, when computing the loss between the prediction results and the labels, different label maps are generated according to target size, as shown in fig. 4:
The loss of the P3 prediction is computed with the small-target label, and the loss of the P4 prediction is computed with the large-target label.
In the small-target label map, small targets form the primary label area and large targets form the secondary label area; in the large-target label map, vice versa.
In the training phase, the primary label area contributes a normal per-pixel loss, while in the secondary label area pixels predicted as background contribute no loss. Each prediction head is thus not constrained to predict targets outside its scale, preventing interference from feature conflicts; a sketch of this label construction and masking follows.
In the prediction stage, the connected domains in the small-target result map R3 that satisfy the small-target rule are extracted and overlaid onto the large-target result map R4 to obtain the final prediction result R.
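The merge step might look like the sketch below, assuming binary result maps; connected domains of R3 whose area satisfies the small-target rule are copied onto R4.

```python
import numpy as np
from scipy import ndimage

def merge_predictions(r3, r4, thresh=32 * 32):
    """Overlay connected domains of the small-target map R3 whose area is
    below the threshold onto the large-target map R4, giving the final R."""
    lbl, n = ndimage.label(r3)
    r = r4.copy()
    for comp_id in range(1, n + 1):
        comp = lbl == comp_id
        if comp.sum() < thresh:   # the small-target rule
            r[comp] = 1
    return r
```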
Constructing a loss function based on target boundary distances
Because the full convolution network performs class prediction for each pixel, scattered speckle errors are more likely to occur. To further constrain this type of error, the algorithm proposes a loss function based on the target boundary distance:
$$L_{pixelloss} = -\sum_{c \in Class} w\left[\, y_{true}\log\left(y_{pred}\right) + \left(1 - y_{true}\right)\log\left(1 - y_{pred}\right)\right]$$

wherein $L_{pixelloss}$ represents the loss of each pixel, $Class$ represents the set of all classes, $y_{true}$ represents the label of the pixel for a certain class, and $y_{pred}$ represents the predicted score of the pixel for that class; $w$ is a weight coefficient determined by the distance from the pixel to the nearest connected domain of that class in the label, calculated as:

$$w = \begin{cases} 1, & i \in \Omega_{c} \\ 1 + dis\left(i, \Omega_{c}\right), & i \notin \Omega_{c} \end{cases}$$

where $i \in \Omega_{c}$ indicates that pixel $i$ lies within the label region $\Omega_{c}$ of the class to which $y_{true}$ belongs, $i \notin \Omega_{c}$ indicates that it does not, and $dis\left(i, \Omega_{c}\right)$ represents the distance from pixel $i$ to the nearest label region of that class. The weight further constrains mispredictions far from the correct label region.
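Since the published formula images are not reproduced in this text, the sketch below implements the form reconstructed above: a per-pixel, per-class cross-entropy weighted by w, with w growing linearly in the distance outside the label region. The linear form and the `alpha` scale are assumptions, not the patent's exact coefficients; scipy's Euclidean distance transform supplies dis.

```python
import numpy as np
import torch
from scipy import ndimage

def boundary_weights(class_mask, alpha=1.0):
    """w = 1 inside the class label region, 1 + alpha*dis outside it.
    distance_transform_edt gives each nonzero pixel its distance to the
    nearest zero pixel, so the mask is inverted before the transform."""
    dis = ndimage.distance_transform_edt(~class_mask.astype(bool))
    return 1.0 + alpha * dis

def pixel_loss(pred, target, weights, eps=1e-7):
    """Distance-weighted per-pixel binary cross-entropy summed over classes.
    pred, target, weights: [B, C, H, W] tensors; pred holds per-class
    scores in (0, 1)."""
    bce = -(target * torch.log(pred + eps)
            + (1 - target) * torch.log(1 - pred + eps))
    return (weights * bce).sum(dim=1).mean()
```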
The training and parameter adjusting steps are as follows:
step S31, collecting the target image data required by the task and labelling the data in the label format of semantic segmentation to obtain the data set required for training;
step S32, dividing the data set into a training set, a validation set and a test set in proportion; a typical ratio is 7:1:2, which may be adjusted according to the amount of data;
step S33, feeding the training set into the network model constructed in step S1 for forward computation to obtain prediction results, computing gradients with the loss function constructed in step S2, and back-propagating them to adjust the model parameters;
step S34, after training for a number of batches, adjusting the learning rate according to the accuracy on the validation set, while observing whether the downward trend of the model loss is positively correlated with the upward trend of the validation accuracy, so as to avoid overfitting;
and step S35, finally, after several rounds of training, testing with the test set and selecting the optimal network model as the result model, which is saved for subsequent small target detection inference.
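Steps S33-S35 might be wired together as in the sketch below. The `evaluate` function (validation accuracy) and the data loaders are assumed to exist, and ReduceLROnPlateau stands in for the manual learning-rate adjustment of step S34.

```python
import torch

def train(model, loss_fn, train_loader, val_loader, evaluate, epochs=100, lr=1e-3):
    """Sketch of S33-S35: forward pass, boundary-weighted loss, backward
    pass, LR adjustment on validation accuracy, best-weight selection."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='max')
    best_acc, best_state = 0.0, None
    for epoch in range(epochs):
        model.train()
        for images, targets, weights in train_loader:   # S33: forward + backward
            opt.zero_grad()
            loss = loss_fn(model(images), targets, weights)
            loss.backward()                              # reverse gradient feedback
            opt.step()
        acc = evaluate(model, val_loader)                # S34: validation accuracy
        sched.step(acc)                                  # adjust the learning rate
        if acc > best_acc:                               # S35: keep the best model
            best_acc = acc
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    return best_state
```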
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (6)

1. The small target detection algorithm based on full convolution is characterized by comprising a convolution network model part, a loss calculation part and a training parameter adjustment part;
the convolution network model part comprises a backbone network module, a multi-scale feature fusion module and a double-scale prediction module, wherein the backbone network module sequentially performs feature extraction of different scales on the image by using a backbone network; the multi-scale feature fusion module fuses the feature matrix information of different scales; the double-scale prediction module predicts a large-scale target and a small-scale target respectively and finally outputs a result;
the convolution network model part is used for extracting image features and predicting targets; the loss calculation part is used for calculating the prediction loss during training to obtain gradients that guide the network model in weight learning; the training parameter adjustment part is used for feeding labelled data into the network model for forward inference, back-propagating gradients of the loss function, and finally adjusting the network learning rate and the data set according to validation accuracy; the algorithm comprises the following specific steps:
s1, constructing a network model;
s2, constructing a loss function based on the target boundary distance;
and S3, training and adjusting parameters.
2. The small target detection algorithm based on full convolution according to claim 1, wherein the backbone network module is specifically as follows:
inputting a batch image pixel matrix I into a backbone network module:
I=[B,C,H,W]
wherein B is the number of images in the batch, C is the number of channels (an input image typically has 3 channels, the R red, G green and B blue color features), H is the image height, and W is the image width;
the backbone module outputs three features of different scales, C3, C4 and C5, where 3, 4 and 5 denote the power of 2 by which the feature matrix scale is downsampled (8×, 16× and 32×).
3. The small target detection algorithm based on full convolution according to claim 1, wherein the multi-scale feature fusion module is specifically as follows:
the input feature matrix is split into two parts along the channel dimension; the first part undergoes convolution and related operations, while the second part is shortcut-connected directly to the tail of the module output and concatenated with the result of the first part; feature matrices P3 and P4 of two scales are finally obtained.
4. The full convolution based small target detection algorithm according to claim 1, wherein the dual-scale prediction module is specifically as follows:
a general semantic segmentation prediction structure is applied to P3 and P4 to obtain prediction results at two different scales, implemented as follows:
a target whose connected-domain area is larger than 32×32 is treated as a large target and predicted by the P4 feature matrix, giving the large-target result R4; a target whose connected-domain area is smaller than 32×32 is treated as a small target and predicted by the P3 feature matrix, giving the small-target result R3;
when computing the loss between the prediction results and the labels, different label maps are generated according to target size:
the loss of the P3 prediction is computed with the small-target label, and the loss of the P4 prediction is computed with the large-target label;
in the small-target label map, small targets form the primary label area and large targets form the secondary label area; in the large-target label map, vice versa;
in the training stage, the primary label area contributes a normal per-pixel loss, while in the secondary label area pixels predicted as background contribute no loss; targets outside the scale are thus not forced to be predicted, preventing interference from feature conflicts;
and the connected domains in the small-target result map R3 that satisfy the small-target rule are extracted and overlaid onto the large-target result map R4 to obtain the final prediction result R.
5. The small target detection algorithm based on full convolution according to claim 1, wherein step S2 is specifically: since the full convolution network performs class prediction for each pixel, it is more likely that scattered speckle errors occur, so the loss function based on the target boundary distance is:
$$L_{pixelloss} = -\sum_{c \in Class} w\left[\, y_{true}\log\left(y_{pred}\right) + \left(1 - y_{true}\right)\log\left(1 - y_{pred}\right)\right]$$

wherein $L_{pixelloss}$ represents the loss of each pixel, $Class$ represents the set of all classes, $y_{true}$ represents the label of the pixel for a certain class, and $y_{pred}$ represents the predicted score of the pixel for that class; $w$ is a weight coefficient determined by the distance from the pixel to the nearest connected domain of that class in the label, calculated as:

$$w = \begin{cases} 1, & i \in \Omega_{c} \\ 1 + dis\left(i, \Omega_{c}\right), & i \notin \Omega_{c} \end{cases}$$

where $i \in \Omega_{c}$ indicates that pixel $i$ lies within the label region $\Omega_{c}$ of the class to which $y_{true}$ belongs, $i \notin \Omega_{c}$ indicates that it does not, and $dis\left(i, \Omega_{c}\right)$ represents the distance from pixel $i$ to the nearest label region of that class; the weight further constrains mispredictions far from the correct label region.
6. The small target detection algorithm based on full convolution according to claim 1, wherein step S3 is specifically as follows:
step S31, collecting the target image data required by the task and labelling the data in the label format of semantic segmentation to obtain the data set required for training;
step S32, dividing the data set into a training set, a validation set and a test set in proportion; a typical ratio is 7:1:2, which may be adjusted according to the amount of data;
step S33, feeding the training set into the network model constructed in step S1 for forward computation to obtain prediction results, computing gradients with the loss function constructed in step S2, and back-propagating them to adjust the model parameters;
step S34, after training for a number of batches, adjusting the learning rate according to the accuracy on the validation set, while observing whether the downward trend of the model loss is positively correlated with the upward trend of the validation accuracy, so as to avoid overfitting;
and step S35, finally, after several rounds of training, testing with the test set and selecting the optimal network model as the result model, which is saved for subsequent small target detection inference.
CN202211705281.1A 2022-12-29 2022-12-29 Small target detection algorithm based on full convolution Pending CN115995020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211705281.1A CN115995020A (en) 2022-12-29 2022-12-29 Small target detection algorithm based on full convolution


Publications (1)

Publication Number Publication Date
CN115995020A 2023-04-21

Family

ID=85989944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211705281.1A Pending CN115995020A (en) 2022-12-29 2022-12-29 Small target detection algorithm based on full convolution

Country Status (1)

Country Link
CN (1) CN115995020A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168033A (en) * 2023-04-25 2023-05-26 厦门福信光电集成有限公司 Wafer lattice dislocation image detection method and system based on deep learning
CN116168033B (en) * 2023-04-25 2023-08-22 厦门福信光电集成有限公司 Wafer lattice dislocation image detection method and system based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination