CN113011442A - Target detection method and system based on bidirectional adaptive feature pyramid - Google Patents

Target detection method and system based on bidirectional adaptive feature pyramid

Info

Publication number
CN113011442A
CN113011442A
Authority
CN
China
Prior art keywords
target detection
image
feature pyramid
layer
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110326343.7A
Other languages
Chinese (zh)
Inventor
Li Xin (李新)
Li Hehe (李贺贺)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202110326343.7A
Publication of CN113011442A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The invention provides a target detection method and system based on a bidirectional adaptive feature pyramid. The method comprises: acquiring an image to be detected; and performing target detection with a pre-trained target detection model, in which the feature map output by each layer of the feature pyramid through the bottom-up enhancement path is finally adaptively weight-fused with the feature maps output by the other layers to serve as that layer's final output. The invention can fully utilize feature information of different scales, thereby obtaining richer features and improving detection precision.

Description

Target detection method and system based on bidirectional adaptive feature pyramid
Technical Field
The invention belongs to the technical field of image target detection, and particularly relates to a target detection method and system based on a bidirectional adaptive feature pyramid.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Target detection is an important direction in computer vision and is widely applied in fields such as video surveillance, traffic control, personnel security, autonomous driving, safety systems, and medical care, where it reduces manual effort. Although image target detection algorithms are widely used, existing methods still have shortcomings, such as poor performance, or difficulty in balancing detection speed and accuracy.
To the inventors' knowledge, many target detection methods with good detection performance currently use a feature pyramid. However, the existing feature pyramid structure fuses features only through two paths, top-down and bottom-up; although this improves the detection capability of a one-stage target detection network, features at the same pyramid level are not fully exploited. In addition, the existing structure performs target detection after obtaining each layer's output from the bottom-up path, without fully accounting for the mutual influence between different layers. Both issues affect target detection results.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a target detection method and system based on a bidirectional adaptive feature pyramid, which can fully utilize feature information of different scales, thereby obtaining richer features and improving detection precision.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
a target detection method based on a bidirectional adaptive feature pyramid comprises the following steps:
acquiring an image to be detected;
performing target detection with a pre-trained target detection model;
wherein, in the feature pyramid of the model, the feature map output by each layer through the bottom-up enhancement path is finally adaptively weight-fused with the feature maps output by the other layers to serve as that layer's final output.
Further, the bidirectional adaptive feature pyramid comprises a top-down fusion path and a bottom-up enhancement path, through which feature maps of different scales are fused and enhanced.
Further, in each layer of the feature pyramid, the layer's input features are fused again with the fused features obtained through the layer's lateral connection.
Further, the top-down fusion path and the bottom-up enhancement path are repeatedly performed a plurality of times.
Further, the target detection model training method comprises the following steps:
acquiring an image data set containing a target to be detected and preprocessing the image data set to obtain an image training set;
and training the constructed deep learning neural network on the image training set to obtain the target detection model, wherein the deep learning neural network comprises a backbone network, a bidirectional adaptive feature pyramid, a spatial pyramid pooling layer, and a fully connected layer.
Further, an image training set is obtained after the image data set containing the target to be detected is acquired and preprocessed:
taking n images as a group, each image is subjected to operations such as flipping, scaling, and color-space transformation; each image is randomly cropped; and the n randomly cropped images are stitched together to obtain a training image.
Further, an image test set is also obtained after the image data set containing the target to be detected is acquired and preprocessed, and is used for testing and optimizing the target detection model.
One or more embodiments provide a target detection system based on a bidirectional adaptive feature pyramid, including:
the data acquisition module is used for acquiring an image to be detected;
the target detection module is used for performing target detection with a pre-trained target detection model,
wherein the feature map output by each layer of the model's feature pyramid through the bottom-up enhancement path is finally adaptively weight-fused with the feature maps output by the other layers to serve as that layer's final output.
One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the bi-directional adaptive feature pyramid-based object detection method when executing the program.
One or more embodiments provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the bi-directional adaptive feature pyramid-based object detection method.
The above one or more technical solutions have the following beneficial effects:
By introducing the bidirectional adaptive feature pyramid, the precision of a one-stage target detection network can be effectively improved, giving higher target detection accuracy on images, higher speed, and lower miss and false detection rates.
On the basis of the output features obtained for each layer through the top-down fusion path and the bottom-up enhancement path, the bidirectional adaptive feature pyramid adds an adaptive feature fusion module: by introducing adaptive weights, each layer's output is adaptively weighted to produce its final output. Features from different layers are thus adaptively fused, the mutual influence between layers is fully exploited, feature information of different scales is better balanced, and the precision of a one-stage target detection network is improved.
Compared with the existing feature pyramid, the bidirectional adaptive feature pyramid adds an extra edge to each layer and fuses the layer's input features again with the fusion features obtained through the layer's lateral connection, so that richer features can be obtained through fusion.
Moreover, the feature fusion module is repeated multiple times, so that feature information of different levels is utilized more fully.
In model training, training images rich in target scales and categories are used, so the model adapts well to different image target detection tasks, performs well in practical applications, and has strong robustness and generalization ability.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain, not limit, the invention.
FIG. 1 is an exemplary illustration of an enhanced target image in accordance with one or more embodiments of the invention;
FIG. 2 is a diagram of the PANet feature pyramid structure;
FIG. 3 is a diagram of a feature pyramid BAFPN structure in accordance with one or more embodiments of the invention;
FIG. 4 is an overall block diagram of an object detection model in one or more embodiments of the invention;
FIG. 5 is a flow diagram illustrating the training of an object detection model in accordance with one or more embodiments of the invention;
FIG. 6 is an image target detection result using the original YOLOv5 model;
FIG. 7 is an image target detection result using the target detection model in one or more embodiments of the invention;
FIG. 8 is a comparison of the performance of the original YOLOv5 and a target detection model on the MS COCO data set in one or more embodiments of the invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The invention provides a one-stage image target detection method based on a feature pyramid structure; the process of performing target detection on an image is described below with reference to the accompanying drawings.
Step 1: acquiring an image to be detected;
step 2: and carrying out target detection by adopting a target detection model.
The method for acquiring the target detection model specifically comprises the following steps:
s1, acquiring an image training set and preprocessing the image training set; the method specifically comprises the following steps:
s1.1, acquiring an image data set containing a target object and labeling the image data set;
specifically, in this embodiment, a target object included in the picture is selected by using a rectangular frame through a labeling tool, and is labeled and classified, and each labeled picture generates a corresponding txt file, where the txt file includes a target category, a center coordinate of a labeled rectangular frame, and a width and a height of the rectangular frame.
S1.2, preprocessing an image data set;
the pretreatment specifically comprises: taking n images as a group, and respectively carrying out processes such as turning, scaling, color gamut change and the like on each image; randomly cutting each image; and splicing the n randomly cut images to obtain a training image.
And S1.3, dividing the preprocessed image data set into a training set and a testing set.
In this embodiment, four pictures are selected at a time and respectively subjected to flipping, scaling, color-space transformation, etc.; the four pictures are randomly cropped and then stitched into one picture. Finally, the data-enhanced image data set is divided into a training set and a test set at a ratio of 7:3. Images before and after data enhancement are shown in FIG. 1.
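The four-image preprocessing above is essentially mosaic augmentation. A simplified numpy sketch of the cropping-and-stitching step, assuming fixed quadrant placement and omitting the color-space transform and the bounding-box remapping a full implementation would need:

```python
import numpy as np

def mosaic(images, out_size=128, rng=None):
    """Stitch four augmented images into one mosaic training image.

    images: list of four H x W x 3 uint8 arrays. Each image is randomly
    flipped, a random crop of half the output size is taken, and the four
    crops are placed in the four quadrants of the output canvas.
    (Sketch only; a real mosaic also jitters the split point and remaps boxes.)
    """
    rng = rng or np.random.default_rng(0)
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]  # (y, x) of quadrants
    for img, (y, x) in zip(images, corners):
        if rng.random() < 0.5:                 # random horizontal flip
            img = img[:, ::-1]
        h, w = img.shape[:2]
        ty = int(rng.integers(0, max(h - half, 0) + 1))  # random crop origin
        tx = int(rng.integers(0, max(w - half, 0) + 1))
        crop = img[ty:ty + half, tx:tx + half]
        canvas[y:y + crop.shape[0], x:x + crop.shape[1]] = crop
    return canvas
```

Each of the four inputs lands in one quadrant, so a batch element exposes the network to four images' content at once.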
S2: training the built deep learning neural network based on the image training set to obtain a target detection model, wherein the deep learning neural network adopts a bidirectional self-adaptive feature pyramid to perform feature fusion.
The deep learning neural network comprises a backbone network, a bidirectional adaptive feature pyramid, a spatial pyramid pooling layer, and a fully connected layer.
Specifically, the backbone network extracts feature maps of different scales and inputs them to the corresponding-scale layers of the bidirectional adaptive feature pyramid; the bidirectional adaptive feature pyramid comprises a top-down fusion path and a bottom-up enhancement path and is used to fuse and enhance feature maps of different scales.
The top-down fusion path upsamples the high-level feature map and laterally connects the upsampled features to the features of the preceding (shallower) level, enhancing each level with high-level feature information; the bottom-up enhancement path max-pools the shallow feature map and connects it to the features of the next (deeper) level, better preserving shallow feature information.
In order to fully utilize feature information of different scales, obtain richer features, and improve prediction accuracy, in this embodiment each layer of the feature pyramid fuses the layer's input features again with the fusion features obtained through the layer's lateral connection. The output of the previous layer is fused with these re-fused features to form the current layer's output, which is in turn connected to the next layer's features to form that layer's fused features, i.e., the next layer's output.
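The extra-edge fusion just described can be sketched in a few lines. The single-channel numpy illustration below shows the upsample-plus-lateral connection and the re-fusion with the layer's own input; a real implementation would operate on multi-channel tensors in a deep learning framework, with convolutions after each fusion, and the max-pool helper is the operation the bottom-up path would use:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of an H x W feature map
    return x.repeat(2, axis=0).repeat(2, axis=1)

def maxpool2x(x):
    # 2x2 max pooling with stride 2 (H and W assumed even), as in the bottom-up path
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def bafpn_layer(c_in, top_down_feat):
    """One BAFPN-style node (single-channel sketch).

    lateral:    the usual FPN fusion, the upsampled higher-level feature
                added to this layer's input.
    extra edge: the layer's input is fused *again* with that lateral
                result, the extra connection this embodiment adds.
    """
    lateral = c_in + upsample2x(top_down_feat)   # lateral connection
    return c_in + lateral                         # extra-edge re-fusion
```

With all-ones inputs the node output is 3 everywhere: one contribution from the lateral sum and a second from the extra edge, which is exactly the "fused again" behaviour.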
In order to utilize feature information of different layers more fully, the top-down fusion path and the bottom-up enhancement path are executed repeatedly (three times in this embodiment) to obtain each layer's output.
To fuse each layer's feature information more fully, adaptive weighted fusion is further performed on the layer outputs obtained through the bottom-up enhancement path. Specifically, at the end of each layer, the feature map produced by the bottom-up enhancement path is adaptively weight-fused with the feature maps output by the other layers to serve as that layer's final output.
The deep learning neural network built in the embodiment is improved on the basis of the original Yolov5 network model.
The original YOLOv5 network model adopts the PANet feature pyramid. As shown in FIG. 2, three feature maps of different scales, extracted by the backbone network from different layers (C3, C4, C5), are input into PANet, connected through the top-down path, and then integrated through the bottom-up enhancement path to obtain P3, P4, P5. PANet uses adaptive feature pooling to pool regions over multiple levels simultaneously, finally performs fully connected fusion, adds bottom-up path augmentation, and integrates multi-level information before prediction. However, image features of different levels are insufficiently utilized, and features of different levels are not fully fused.
As shown in FIG. 3, BAFPN builds on PANet. Where a layer's input and output nodes are at the same level, BAFPN adds an extra edge (the dotted line on the left of the figure) to the feature pyramid of the original YOLOv5 and fuses the layer's input features again with the fusion features obtained through its lateral connection, so that more features are fused. Weight information is introduced (the dotted line on the right of the figure) to better balance feature information of different scales; the feature fusion module is repeated three times so that feature information of different levels is utilized more fully; and finally an adaptive feature fusion module is added to further balance feature information across scales.
The deep learning neural network finally constructed in this embodiment replaces the PANet feature pyramid of the original YOLOv5 with the bidirectional adaptive feature pyramid (BAFPN), i.e., it optimizes YOLOv5. Specifically, CSPDarknet53 serves as the backbone network responsible for feature extraction, with BAFPN and spatial pyramid pooling as the neck, as shown in FIG. 4.
The optimized YOLOv5 obtains YOLO Head feature maps of different scales from the input image, denoted P3, P4, P5, whose resolutions are 1/8, 1/16, and 1/32 of the input image, respectively. Since YOLO Head feature maps of different scales contain different feature information, this embodiment adds adaptive feature fusion to fuse feature information across levels; the adaptive feature fusion module applies different weights to the features of different layers before fusing them, as in formula (1):

y_l = α_l · X_{1→l} + β_l · X_{2→l} + γ_l · X_{3→l}    (1)

α_l + β_l + γ_l = 1    (2)

where y_l is a feature vector on the output feature map of layer l, and X_{1→l}, X_{2→l}, X_{3→l} are the features of layers 1, 2, and 3 after feature conversion, i.e., resized to the same spatial size and channel count as the layer-l features. X_{1→l}, X_{2→l}, X_{3→l} are multiplied by the fusion weight parameters α_l, β_l, γ_l of layers 1, 2, and 3 for layer l, and the products are summed to obtain the new fused features output by layer l. Because fusion is performed by addition, the features of different layers are upsampled or downsampled and their channel counts adjusted, ensuring that all summands have identical spatial sizes and channel counts.
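Formula (1) with constraint (2) is typically realized by passing three learnable scalars through a softmax, which guarantees positive weights summing to 1. The patent does not specify how the constraint is enforced, so the softmax here is an assumption; a numpy sketch:

```python
import numpy as np

def adaptive_fuse(feats_to_l, logits):
    """Adaptive weighted fusion of formulas (1)-(2), single layer l.

    feats_to_l: three feature maps already resized and channel-matched to
                layer l (the X_{1->l}, X_{2->l}, X_{3->l} of formula (1)).
    logits:     three learnable scalars; the softmax yields weights
                alpha_l, beta_l, gamma_l that are positive and sum to 1,
                satisfying constraint (2).
    """
    logits = np.asarray(logits, dtype=float)
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w /= w.sum()                        # alpha_l + beta_l + gamma_l = 1
    fused = sum(wi * xi for wi, xi in zip(w, feats_to_l))
    return fused, w
```

With equal logits the three levels contribute equally; during training the logits would be learned so each layer finds its own balance of the three scales.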
The image training set is input into the optimized YOLOv5 for training; the training process, shown in FIG. 5, is as follows:
Step A: the images in the enhanced training set are fed to the input of the optimized YOLOv5, and the image scale is adjusted to the maximum resolution suitable for feature extraction by the backbone network;
Step B: the images are input into the backbone network CSPDarknet53 for feature extraction, generating feature maps at three scales: 76 × 76, 38 × 38, and 19 × 19;
Step C: the three feature maps of different scales generated in step B are input into the BAFPN, features of different scales are fused, and the fused features are output;
Step D: the fused features of different scales output in step C are each input into a YOLO Head for target detection to obtain the detection results;
Step E: the network model is trained continuously on the training-set images, and the network parameters are optimized by stochastic gradient descent in combination with the test set, finally yielding the trained network model.
Thus, a target detection model is obtained.
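The 76 × 76, 38 × 38, and 19 × 19 feature maps in step B correspond to the stated 1/8, 1/16, and 1/32 resolutions, which implies a 608 × 608 network input (an assumption; the input size is not stated explicitly). A quick consistency check of the stride arithmetic:

```python
def yolo_head_sizes(input_size, strides=(8, 16, 32)):
    # P3/P4/P5 resolutions are the input size divided by the detection strides
    return [input_size // s for s in strides]
```

For a 608-pixel input this reproduces exactly the three map sizes listed in step B.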
Using this target detection model, FIG. 6 shows the detection result of the original YOLOv5, which detects 17 airplanes, while FIG. 7 shows the result of the YOLOv5 optimized in this embodiment, which detects 80 airplanes; the detection effect is clearly improved. As the experimental results in FIG. 8 show, when tested on the MS COCO target detection data set built by Microsoft, the improved YOLOv5 significantly raises overall detection accuracy compared with the original YOLOv5, improving mean average precision (mAP) by 4.3% while maintaining high speed. This fully demonstrates that the BAFPN with the added adaptive feature fusion module proposed in this embodiment can effectively optimize a one-stage target detection network.
In the target detection method provided by one or more embodiments, introducing BAFPN and the adaptive feature fusion module effectively improves a one-stage target detection network's ability to fuse multi-layer features. Compared with the original one-stage model, the model using BAFPN and adaptive feature fusion detects small targets better, so overall performance is optimized, target detection accuracy improves, and the requirements of image target detection are better met.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, this does not limit the scope of the invention; those skilled in the art should understand that various modifications and variations made on the basis of the technical solution of the invention without inventive effort remain within its protection scope.

Claims (10)

1. A target detection method based on a bidirectional self-adaptive feature pyramid is characterized by comprising the following steps:
acquiring an image to be detected;
performing target detection with a pre-trained target detection model;
wherein, in the feature pyramid of the model, the feature map output by each layer through the bottom-up enhancement path is finally adaptively weight-fused with the feature maps output by the other layers to serve as that layer's final output.
2. The target detection method based on the bidirectional adaptive feature pyramid according to claim 1, wherein the bidirectional adaptive feature pyramid comprises a top-down fusion path and a bottom-up enhancement path, through which feature maps of different scales are fused and enhanced.
3. The target detection method based on the bidirectional adaptive feature pyramid according to claim 2, wherein, in each layer of the feature pyramid, the layer's input features are fused again with the fused features obtained through the layer's lateral connection.
4. The target detection method based on the bidirectional adaptive feature pyramid according to claim 2 or 3, wherein the top-down fusion path and the bottom-up enhancement path are performed repeatedly a plurality of times.
5. The target detection method based on the bidirectional adaptive feature pyramid according to claim 1, wherein the training method of the target detection model comprises:
acquiring an image data set containing the target to be detected and preprocessing it to obtain an image training set;
and training the constructed deep learning neural network on the image training set to obtain the target detection model, wherein the deep learning neural network comprises a backbone network, a bidirectional adaptive feature pyramid, a spatial pyramid pooling layer, and a fully connected layer.
6. The target detection method based on the bidirectional adaptive feature pyramid according to claim 5, wherein, when the image data set containing the target to be detected is acquired and preprocessed to obtain the image training set:
taking n images as a group, each image is subjected to operations such as flipping, scaling, and color-space transformation; each image is randomly cropped; and the n randomly cropped images are stitched together to obtain a training image.
7. The target detection method based on the bidirectional adaptive feature pyramid according to claim 5, wherein an image test set is also obtained after the image data set containing the target to be detected is acquired and preprocessed, and is used for testing and optimizing the target detection model.
8. A target detection system based on a bidirectional adaptive feature pyramid is characterized by comprising:
the data acquisition module is used for acquiring an image to be detected;
the target detection module is used for performing target detection with a pre-trained target detection model,
wherein the feature map output by each layer of the model's feature pyramid through the bottom-up enhancement path is finally adaptively weight-fused with the feature maps output by the other layers to serve as that layer's final output.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the bi-directional adaptive feature pyramid based object detection method of any one of claims 1-7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the bi-directional adaptive feature pyramid based object detection method according to any one of claims 1-7.
CN202110326343.7A 2021-03-26 2021-03-26 Target detection method and system based on bidirectional adaptive feature pyramid Pending CN113011442A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110326343.7A CN113011442A (en) 2021-03-26 2021-03-26 Target detection method and system based on bidirectional adaptive feature pyramid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110326343.7A CN113011442A (en) 2021-03-26 2021-03-26 Target detection method and system based on bidirectional adaptive feature pyramid

Publications (1)

Publication Number Publication Date
CN113011442A true CN113011442A (en) 2021-06-22

Family

ID=76407765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110326343.7A Pending CN113011442A (en) 2021-03-26 2021-03-26 Target detection method and system based on bidirectional adaptive feature pyramid

Country Status (1)

Country Link
CN (1) CN113011442A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049572A (en) * 2021-10-30 2022-02-15 西南电子技术研究所(中国电子科技集团公司第十研究所) Detection method for identifying small target

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472298A (en) * 2018-10-19 2019-03-15 天津大学 Depth binary feature pyramid for the detection of small scaled target enhances network
WO2019223254A1 (en) * 2018-05-21 2019-11-28 北京亮亮视野科技有限公司 Construction method for multi-scale lightweight face detection model and face detection method based on model
CN111797846A (en) * 2019-04-08 2020-10-20 四川大学 Feedback type target detection method based on characteristic pyramid network
CN111898570A (en) * 2020-08-05 2020-11-06 盐城工学院 Method for recognizing text in image based on bidirectional feature pyramid network
CN111914937A (en) * 2020-08-05 2020-11-10 湖北工业大学 Lightweight improved target detection method and detection system
CN112270366A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Micro target detection method based on self-adaptive multi-feature fusion
CN112487862A (en) * 2020-10-28 2021-03-12 南京云牛智能科技有限公司 Garage pedestrian detection method based on improved EfficientDet model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mingxing Tan et al., "EfficientDet: Scalable and Efficient Object Detection," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination