CN115082688A - Multi-scale feature fusion method based on target detection - Google Patents

Multi-scale feature fusion method based on target detection

Info

Publication number
CN115082688A
Authority
CN
China
Prior art keywords
network
scale
fusion
feature
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210620848.9A
Other languages
Chinese (zh)
Other versions
CN115082688B (en)
Inventor
闫连山 (Yan Lianshan)
董高照 (Dong Gaozhao)
姚涛 (Yao Tao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yantai New Generation Information Technology Research Institute Of Southwest Jiaotong University
Aidian Shandong Technology Co ltd
Original Assignee
Yantai New Generation Information Technology Research Institute Of Southwest Jiaotong University
Aidian Shandong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yantai New Generation Information Technology Research Institute Of Southwest Jiaotong University, Aidian Shandong Technology Co ltd filed Critical Yantai New Generation Information Technology Research Institute Of Southwest Jiaotong University
Priority to CN202210620848.9A priority Critical patent/CN115082688B/en
Publication of CN115082688A publication Critical patent/CN115082688A/en
Application granted granted Critical
Publication of CN115082688B publication Critical patent/CN115082688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/34Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale feature fusion method based on target detection. Computer vision image samples are collected over a network to establish a multi-scale target detection data set, which is divided into a training set and a test set. The YOLOv5 algorithm, as a representative one-stage detector, is responsible for detecting target objects in images. Multi-scale image features are extracted through the multi-stage, multi-level convolution operations of a backbone network. One branch is connected to the neck network in the conventional feature fusion manner, another branch is connected by a shortcut to the neck network layer with the same sampling rate, and the last branch is connected by a shortcut to the prediction structure with the same sampling rate. The three-branch backbone structure is trained by deep learning, and feature maps of different scales in the backbone network are propagated backwards through the three branches, realizing forward and backward propagation of the neural network. The method offers high target detection accuracy, is easily applied to large-scale data sets and various network model structures, and is simple to implement, so it has broad application prospects and great market value.

Description

Multi-scale feature fusion method based on target detection
Technical Field
The invention relates to the field of deep learning, in particular to a multi-scale feature fusion method based on target detection.
Background
Before the various feature fusion networks appeared, most large network structures adopted a one-way, one-dimensional, end-to-end design, such as the earliest AlexNet, VGGNet and ResNet; the early YOLOv1 and YOLOv2 also adopted this structure. It was not until the FPN feature pyramid network structure was published at CVPR 2017 that people gradually realized that, besides continuously stacking backbone structures simply to pursue feature extraction gains, the connection mode, stacking mode and overall topology of a network could also be changed, and the overall structure could be presented in a two-dimensional form; the later YOLOv3 adopted this structure, and the Neck network appeared as an independent component. Later, at CVPR 2018, The Chinese University of Hong Kong together with Tencent YouTu proposed PANet, a path aggregation network improved on the basis of FPN, which adds a one-dimensional Bottom-up Path Augmentation structure to the FPN fusion scheme from the perspective of network output; the main consideration is that the shallow features of the network contain a large number of fine-grained features that play a vital role in pixel-level classification tasks, namely target detection at different scales and instance segmentation. Then the Google Brain team published NAS-FPN, a feature pyramid network based on neural architecture search, at CVPR 2019, applying AutoML (automatic machine learning) to the PAN network structure, i.e., automatically searching for the optimal connection mode and parameters based on the PAN structure through machine learning. However, these three network structures are all constrained to a two-dimensional plane, so they cannot achieve good feature transfer and information fusion with the backbone network when the network model folds back and forth many times; in particular, after multiple foldbacks, the connection between the feature information of the deep network and the backbone network is weakened. In addition, AutoML approaches such as NAS-FPN are extremely demanding in computation: even with good GPUs, the search time can reach several hundred days. Later, at CVPR 2020, the Google Brain team published BiFPN, the bidirectional feature pyramid network structure, which treats each layer module of the FPN network model as a node and introduces a three-dimensional stereo connection mode; the feature transfer and feature fusion of the whole network were improved from a three-dimensional perspective, making the whole network model jump off the original two-dimensional planar (on-paper) connection mode and adding stereo, third-dimension connections.
Currently, the YOLO algorithm only uses the FPN and PAN network structures: YOLOv3 adopts the FPN network structure, while YOLOv4 and YOLOv5, published at about the same time, adopt the PAN network structure; both structures date from CVPR 2018 or earlier. Next, the latest BiFPN connection mode is adopted on YOLOv5, the performance improvement this mode brings to YOLOv5 is analyzed, the remaining defects are improved, and a new network structure, AS-BiFPN, is designed.
Disclosure of Invention
The present invention aims to provide a multi-scale feature fusion method based on target detection to solve the problems proposed in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
The invention discloses a multi-scale feature fusion method based on target detection. Computer vision image samples are collected over a network to establish a multi-scale target detection data set, which is divided into a training set and a test set. The YOLOv5 (You Only Look Once version 5) algorithm, as a representative one-stage detector, is responsible for detecting target objects in images. Multi-scale image features are extracted through the multi-stage, multi-level convolution operations of a backbone network. One branch is connected to the neck network in the conventional feature fusion manner, another branch is connected by a shortcut to the neck network layer with the same sampling rate, and the last branch is connected by a shortcut to the prediction structure with the same sampling rate. The three-branch backbone structure is trained by deep learning, and feature maps of different scales in the backbone network are propagated backwards through the three branches, realizing forward and backward propagation of the neural network. The method has the characteristics of high target detection accuracy, ease of application to large-scale data sets and to various network model structures, ease of implementation, and better fidelity to feature image information of different scales, and therefore has broad application prospects and great market value.
The multi-scale feature fusion method based on target detection disclosed by the invention is characterized in that it comprises the following steps:
step S1, collecting computer vision image samples through a network to establish a multi-scale target detection data set, and dividing the data set into a training set and a test set;
step S2, extracting the characteristics of the image by using a Backbone network (Backbone) of a YOLOv5 target detection algorithm;
step S3, realizing multi-scale fusion by using a three-branch feature fusion method of a Backbone network (Backbone), a Neck network (Neck) and a Prediction structure (Prediction), repeatedly learning weight parameters on each structural branch through deep learning, and continuously reducing the difference between the target value and the predicted value during training in the deep-learning training manner, i.e., obtaining an optimized network structure under the target domain data set by taking the minimized loss function as the learning criterion, the fusion mode being based on FPN (Feature Pyramid Networks);
step S4, improving FPN to form PAN, and using the accurate localization signals stored in low-level features to enhance the feature pyramid architecture;
and step S5, improving and forming BiFPN on the basis of PAN, and enabling the network to learn the weights of different input features by itself through BiFPN.
Further, the FPN in step S3 is decomposed into three progressive stages, which includes the following steps:
step S31, the Backbone feature generation stage: the task in the deep learning computer vision field is to generate abstract semantic features based on a commonly used, pre-trained Backbone network, and then to adapt the image morphological features extracted by the Backbone to different application scenarios; the features generated by the Backbone network are divided by stage and denoted C_n, where the natural number n equals the stage number and indicates how many stages of downsampling the image morphological features have undergone, i.e., how many times the resolution has been halved; for example, C_2 denotes the feature map output by stage 2, whose resolution is 1/4 of the input picture, and C_5 denotes the feature map output by stage 5, whose resolution is 1/32 of the input picture;
Step S32, feature fusion stage, FPN using the different resolution features generated in step S31 as input, outputting the fused features, the output features being marked with P as number, the input of FPN being
Figure 906911DEST_PATH_IMAGE002
Figure 137035DEST_PATH_IMAGE006
Figure 994133DEST_PATH_IMAGE007
Figure 626103DEST_PATH_IMAGE004
Figure 277664DEST_PATH_IMAGE008
After being fused, the output is
Figure 260663DEST_PATH_IMAGE009
Figure 655873DEST_PATH_IMAGE010
Figure 142349DEST_PATH_IMAGE011
Figure 230390DEST_PATH_IMAGE012
Figure 497424DEST_PATH_IMAGE013
Expressed by a mathematical formula:
Figure 368428DEST_PATH_IMAGE014
and step S33, outputting bounding boxes through the detection head: the fused features output by the FPN are input to the detection head for specific object detection.
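To make the three stages of steps S31 to S33 concrete, the following PyTorch-style sketch shows a minimal FPN-style fusion of C_2 to C_5 into P_2 to P_5. The class name SimpleFPN, the channel widths and the nearest-neighbour upsampling are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN-style fusion: takes backbone features C2..C5 (fine to coarse)
    and returns fused maps P2..P5. Channel widths here are only an assumption."""
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align every C_n to a common channel width
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 output convolutions smooth the fused maps
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):            # feats = [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pass: upsample the coarser map and add it to the finer one
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [conv(p) for conv, p in zip(self.output, laterals)]  # [P2..P5]

if __name__ == "__main__":
    c2 = torch.randn(1, 64, 160, 160)    # 1/4 of a 640x640 input
    c3 = torch.randn(1, 128, 80, 80)
    c4 = torch.randn(1, 256, 40, 40)
    c5 = torch.randn(1, 512, 20, 20)     # 1/32 of the input
    for p in SimpleFPN()([c2, c3, c4, c5]):
        print(p.shape)
```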
Further, the fusion strategies used by the BiFPN in step S5 specifically comprise the following steps:
step S51, the unbounded fusion strategy, whose formula is:

O = Σ_i ( w_i · I_i )

This formula is the first strategy of deep learning feature fusion, where w_i are learnable weight parameters representing the weight proportion of data among the nodes of a single deep learning neural network, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
in step S52, the Softmax-based fusion strategy, whose formula is:

O = Σ_i ( e^{w_i} / Σ_j e^{w_j} ) · I_i

This formula is the second strategy of deep learning feature fusion, where w_i and w_j are learnable weight parameters representing the weight proportion of data among multiple deep learning neural network nodes, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
in step S53, the fast normalized fusion strategy, whose formula is:

O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i

This formula is the third strategy of deep learning feature fusion, where w_i and w_j are learnable weight parameters representing the weight proportion of data among multiple deep learning neural network nodes, ε is a very small number to ensure that the denominator is not 0, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
step S54, integrating bidirectional cross-scale connection and fast normalized fusion, taking level 6 as an example:

P6^td = Conv( (w1 · P6^in + w2 · Resize(P7^in)) / (w1 + w2 + ε) )

P6^out = Conv( (w1' · P6^in + w2' · P6^td + w3' · Resize(P5^out)) / (w1' + w2' + w3' + ε) )

In the formula, P^td denotes the intermediate nodes on the top-down and bottom-up edges, P^out denotes the lateral output nodes on those edges, P6^td is the intermediate node at level 6 in the top-down path, and P6^out is the lateral node at level 6 in the bottom-up path; all other feature nodes are constructed in a similar manner.
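A minimal sketch of the three fusion strategies of steps S51 to S53 follows. The function names and the example tensors are illustrative assumptions; keeping the weights non-negative with a ReLU in the fast normalized variant is the usual practice for this strategy and is likewise an assumption here.

```python
import torch

def unbounded_fusion(weights, inputs):
    # step S51: O = sum_i w_i * I_i, with unconstrained learnable scalars w_i
    return sum(w * x for w, x in zip(weights, inputs))

def softmax_fusion(weights, inputs):
    # step S52: O = sum_i (e^{w_i} / sum_j e^{w_j}) * I_i
    norm = torch.softmax(torch.stack(list(weights)), dim=0)
    return sum(w * x for w, x in zip(norm, inputs))

def fast_normalized_fusion(weights, inputs, eps=1e-4):
    # step S53: O = sum_i (w_i / (eps + sum_j w_j)) * I_i
    w = torch.relu(torch.stack(list(weights)))      # keep fusion weights non-negative
    return sum(wi * x for wi, x in zip(w, inputs)) / (eps + w.sum())

if __name__ == "__main__":
    feats = [torch.randn(1, 256, 40, 40) for _ in range(2)]   # two same-scale inputs
    ws = [torch.tensor(0.7), torch.tensor(1.3)]               # learnable parameters in practice
    print(fast_normalized_fusion(ws, feats).shape)
```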
Further, the method also comprises the following steps:
step S61, adding an out-of-plane (off-the-paper) skip connection structure on the lateral path to span the bottom-up path, so that the backbone network and the prediction structure fuse information directly, and improving the network by updating the weight ratios of the different paths during training, so as to enhance the feature information acquisition capability and the information fusion capability of the prediction structure;
step S62, retaining the edge feature layers and the head and tail nodes on the basis of BiFPN, exerting weight parameter influence on each structural fusion path during training, avoiding the weakening of the edge feature fusion structures that occurs in BiFPN, and connecting all feature layers that need to be used across channels in the same way;
step S63, integrating bidirectional cross-scale connection and fast normalization fusion based on steps S61 and S62:
Taking level 6 as an example, and fusing the additional cross-structure shortcut inputs of steps S61 and S62 with the same fast-normalized weighting:

P6^td = Conv( (w1 · P6^in + w2 · Resize(P7^in)) / (w1 + w2 + ε) )

P6^out = Conv( (w1' · P6^in + w2' · P6^td + w3' · Resize(P5^out)) / (w1' + w2' + w3' + ε) )

In the formula, P^td denotes the intermediate nodes on the top-down and bottom-up edges, P^out denotes the lateral output nodes on those edges, P6^td is the intermediate node at level 6 in the top-down path, and P6^out is the lateral node at level 6 in the bottom-up path; all other feature nodes are constructed in a similar manner;
in step S64, to further improve efficiency, the two-dimensional image tensor convolution operations may use depthwise separable convolution for feature fusion, with batch normalization and activation added after each convolution operation. Whether this step is used depends on the application scenario and is unrelated to the structure of the present invention.
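Step S64 mentions depthwise separable convolution followed by batch normalization and activation; a minimal PyTorch sketch of such a block is given below. The SiLU activation is an assumption (YOLOv5 commonly uses it), since the patent does not fix the activation function.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv + 1x1 pointwise conv, each followed by BN and activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()   # activation choice is an assumption, not fixed by the patent

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

if __name__ == "__main__":
    y = DepthwiseSeparableConv(256, 256)(torch.randn(1, 256, 40, 40))
    print(y.shape)   # torch.Size([1, 256, 40, 40])
```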
Further, step S1 includes obtaining multi-scale target images from channels on the network such as personal multi-scale images and the Kaggle target detection competitions. Following the image target scale division standard given by MS COCO, a small target of the image morphology has an area smaller than 32×32 pixels, a medium target has an area between 32×32 and 96×96 pixels, and a large target has an area larger than 96×96 pixels. When all images are scaled to one size by a Resize function as they are input into the network structure, scale features of different target sizes are formed within images of uniform size, thereby establishing an image data set of multi-scale information.
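Following the MS COCO convention referenced in step S1, a small helper can bucket ground-truth boxes by area while building the multi-scale data set. The 32×32 and 96×96 thresholds come from the standard quoted above; the function names and the 640×640 resize target are only illustrative assumptions.

```python
def scale_bucket(box_w, box_h):
    """Classify a target by area using the MS COCO thresholds (in pixels)."""
    area = box_w * box_h
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"

def resize_box(box, src_size, dst_size=(640, 640)):
    """Rescale an (x, y, w, h) box when the whole image is resized to a uniform size."""
    sx, sy = dst_size[0] / src_size[0], dst_size[1] / src_size[1]
    x, y, w, h = box
    return (x * sx, y * sy, w * sx, h * sy)

print(scale_bucket(24, 30))                             # small
print(resize_box((100, 50, 200, 120), (1920, 1080)))    # box after resizing to 640x640
```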
Further, in step S2, a one-stage target detection algorithm, YOLOv5 (You Only Look Once version 5), is adopted as the basic research model; the multi-scale feature fusion method is a hot-pluggable modular method that is effectively migrated to and used on different models; for different models, the process described in step S3 is adopted, i.e., the structural models of the target detection algorithm are improved in turn through the evolution FPN → PAN → BiFPN → AS-BiFPN, with each improvement step going deeper than the last.
Compared with the prior art, the invention has the advantages that:
the multi-scale cross-structure node connection mode is utilized to realize effective fusion of multi-scale characteristic information, and deep specialized semantic information loss caused by the increase of the depth of a deep learning network structure is avoided. If learning weight parameters are introduced to the feature fusion structures, the multi-scale feature information fusion can be influenced by different iteration times and learning rates in the training process of the network model, and the weight proportion in the multi-scale feature fusion can be learned in the training process. Since the extracted upstream feature information is used without generating deeper feature information, the network structure has a computational complexity of
Figure 536497DEST_PATH_IMAGE031
I.e. by
Figure 758531DEST_PATH_IMAGE032
And furthermore, the feature fusion capability of the traditional network structure represented by FPN and PAN is enhanced under the condition of not changing the computational complexity of the network structure. The invention has the modular design mode of hot plugging, simple realization mode and easy application to various typesThe large-scale data set is easy to be applied in practice, so that the method has wide application prospect and huge market value.
Drawings
Fig. 1 is a diagram of an improved basic BiFPN network structure of the present invention.
FIG. 2 is a schematic diagram of the AS-BiFPN structure of the present invention.
FIG. 3 is a graph of the mean average precision (mAP) experimental results of the AS-BiFPN structure.
Fig. 4 is a comparison of experimental results of the optimized network model on large-scale target detection.
Fig. 5 is a comparison of experimental results of the optimized network model on small-scale target detection.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
Example 1
The invention relates to a multi-scale feature fusion method based on target detection. Computer vision image samples are collected over a network to establish a multi-scale target detection data set, which is divided into a training set and a test set. The YOLOv5 (You Only Look Once version 5) algorithm, as a representative one-stage detector, is responsible for detecting target objects in images. Multi-scale image features are extracted through the multi-stage, multi-level convolution operations of a backbone network. One branch is connected to the neck network in the conventional feature fusion manner, another branch is connected by a shortcut to the neck network layer with the same sampling rate, and the last branch is connected by a shortcut to the prediction structure with the same sampling rate. The three-branch backbone structure is trained by deep learning, and feature maps of different scales in the backbone network are propagated backwards through the three branches, realizing forward and backward propagation of the neural network. The method has the characteristics of high target detection accuracy, ease of application to large-scale data sets and to various network model structures, ease of implementation, and better fidelity to feature image information of different scales, and therefore has broad application prospects and great market value.
The invention realizes the following steps through a computer device:
step S1, collecting computer vision image samples through a network to establish a multi-scale target detection data set, and dividing the data set into a training set and a test set;
Step S1 includes obtaining multi-scale target images from channels on the network such as personal multi-scale images and the Kaggle target detection competitions. Following the image target scale division standard given by MS COCO, a small target of the image morphology has an area smaller than 32×32 pixels, a medium target has an area between 32×32 and 96×96 pixels, and a large target has an area larger than 96×96 pixels. When all images are scaled to one size by a Resize function as they are input into the network structure, scale features of different target sizes are formed within images of uniform size, thereby establishing an image data set of multi-scale information.
Step S2, extracting the morphological matrix characteristics of the input image by using a Backbone network (Backbone) of a YOLOv5 target detection algorithm;
In step S2, a one-stage target detection algorithm, YOLOv5 (You Only Look Once version 5), is adopted as the basic research model; the multi-scale feature fusion method is a hot-pluggable modular method that can be effectively migrated to and used on different models; for different models, the process described in the following step S3 is adopted, i.e., the structural models of the target detection algorithm are improved in turn through the evolution FPN → PAN → BiFPN → AS-BiFPN, with each improvement step going deeper than the last.
Step S3, realizing multi-scale fusion by using a three-branch feature fusion method of a Backbone network (Backbone), a Neck network (Neck) and a Prediction structure (Prediction), repeatedly learning weight parameters on each structural branch through deep learning, and obtaining an optimized network structure under the target domain data set by taking the minimized loss function as the learning criterion, the fusion mode being based on FPN (Feature Pyramid Networks);
FPN can be decomposed into three progressive stages, which include the following steps:
Step S31, the Backbone feature generation stage: the task in the deep learning computer vision field is to generate abstract semantic features based on a commonly used, pre-trained Backbone network, and then to perform fine-tuning for the specific task; the features generated by the Backbone are generally divided by stage and denoted C_n, where the natural number n equals the stage number and indicates how many stages of downsampling the image morphological features have undergone, i.e., how many times the resolution has been halved; for example, C_2 denotes the feature map output by stage 2, whose resolution is 1/4 of the input picture, and C_5 denotes the feature map output by stage 5, whose resolution is 1/32 of the input picture.
Step S32, the feature fusion stage, which is specific to the FPN: the FPN generally takes the features of different resolutions generated in the previous step S31 as input and outputs the fused features. The output features are generally numbered with P. The input of FPN is C_2, C_3, C_4, C_5, and after fusion the output is P_2, P_3, P_4, P_5, P_6; expressed as a mathematical formula:

(P_2, P_3, P_4, P_5, P_6) = FPN(C_2, C_3, C_4, C_5)
Step S33, the detection head outputs bounding boxes: after the fused features are output by the FPN, they can be input to the detection head for specific object detection.
Step S4, on the basis of FPN, PAN (Path Aggregation Network) is formed as an improvement, creating a bottom-up augmentation path to shorten the information path and using the accurate localization signals stored in low-level features to enhance the feature pyramid architecture.
Step S5, on the basis of PAN, BiFPN is formed as an improvement. BiFPN is a new architecture that, compared with PANet, adds cross-layer links; one major characteristic of BiFPN is weighted feature fusion, i.e., giving different scales different weight values. The traditional method directly stacks features of different scales, whereas BiFPN lets the network learn the weights of the different input features;
BiFPN mainly uses three fusion strategies, specifically comprising the following steps:
step S51, the unbounded fusion strategy, whose formula is:

O = Σ_i ( w_i · I_i )

This formula is the first strategy of deep learning feature fusion, where w_i are learnable weight parameters representing the weight proportion of data among the nodes of a single deep learning neural network, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
step S52, the Softmax-based fusion strategy, whose formula is:

O = Σ_i ( e^{w_i} / Σ_j e^{w_j} ) · I_i

This formula is the second strategy of deep learning feature fusion, where w_i and w_j are learnable weight parameters representing the weight proportion of data among multiple deep learning neural network nodes, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
step S53, the fast normalized fusion strategy, whose formula is:

O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i

This formula is the third strategy of deep learning feature fusion, where w_i and w_j are learnable weight parameters representing the weight proportion of data among multiple deep learning neural network nodes, ε is a very small number to ensure that the denominator is not 0, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
Step S54, integrating bidirectional cross-scale connection and fast normalized fusion, as shown in fig. 1, taking level 6 as an example:

P6^td = Conv( (w1 · P6^in + w2 · Resize(P7^in)) / (w1 + w2 + ε) )

P6^out = Conv( (w1' · P6^in + w2' · P6^td + w3' · Resize(P5^out)) / (w1' + w2' + w3' + ε) )

According to the node relationship in the figure, P^td denotes the intermediate nodes on the top-down and bottom-up edges, P^out denotes the lateral output nodes on those edges, P6^td is the intermediate node at level 6 in the top-down path, and P6^out is the lateral node at level 6 in the bottom-up path; all other feature nodes are constructed in a similar manner.
Step S6, based on the design ideas of the two network structures BiFPN and AS-BiFPN, specifically includes the following steps:
step S61, adding an out-of-plane (off-the-paper) skip connection structure on the lateral path to span the bottom-up path, so that the backbone network and the prediction structure fuse information directly, and improving the network by updating the weight ratios of the different paths during training, so as to enhance the feature information acquisition capability and the information fusion capability of the prediction structure;
step S62, retaining the edge feature layers and the head and tail nodes on the basis of BiFPN, exerting weight parameter influence on each structural fusion path during training, avoiding the weakening of the edge feature fusion structures that occurs in BiFPN, and connecting all feature layers that need to be used across channels in the same way;
Step S63, integrating bidirectional cross-scale connection and fast normalized fusion on the basis of steps S61 and S62, as shown in fig. 2; taking level 6 as an example, and fusing the additional cross-structure shortcut inputs of steps S61 and S62 with the same fast-normalized weighting:

P6^td = Conv( (w1 · P6^in + w2 · Resize(P7^in)) / (w1 + w2 + ε) )

P6^out = Conv( (w1' · P6^in + w2' · P6^td + w3' · Resize(P5^out)) / (w1' + w2' + w3' + ε) )

According to the node relationship in the figure, P^td denotes the intermediate nodes on the top-down and bottom-up edges, P^out denotes the lateral output nodes on those edges, P6^td is the intermediate node at level 6 in the top-down path, and P6^out is the lateral node at level 6 in the bottom-up path; all other feature nodes are constructed in a similar manner;
in step S64, to further improve efficiency, the two-dimensional image tensor convolution operations may use depthwise separable convolution for feature fusion, with batch normalization and activation added after each convolution operation. Whether this step is used depends on the application scenario and is unrelated to the structure of the present invention.
In the attached figure 1, the basic BiFPN network structure is improved. The BiFPN network structure makes the front-to-back feature fusion of the model jump off the two-dimensional paper plane, adding cross-structure connections where the network transmission folds back; these connections can effectively transmit and fuse the features of the front and back structures and contribute multi-scale feature information. Since the YOLOv5 network adopts a two-dimensional PAN structure, i.e., multiple feature transfers and foldbacks are formed during the upsampling and downsampling feature processing, a three-dimensional network-structure feature processing method based on BiFPN is applied to the YOLOv5 network model.
Because of the bottom-up path in the YOLOv5 neck network, an information gap is formed between the backbone network, which uses a top-down path, and the prediction structure, and feature transfer and feature fusion do not interact well. Therefore, an out-of-plane skip connection structure is added on the lateral path to span the bottom-up path, so that the backbone network and the prediction structure fuse information directly, and the weight ratios of the different paths are updated during training to improve the network and enhance the feature information acquisition and fusion capabilities of the prediction structure. It should be noted that BiFPN was designed to enhance the feature information processing capability of the EfficientDet network; it therefore emphasizes the feature information of the main feature layers, weakens the edge feature layers, and does not consider the fusion contribution of the head and tail node layers. BiFPN needs to be stacked many times to form the reinforced feature network structure that strengthens EfficientDet, but repeated stacking brings a huge amount of computation, so the contribution of the head and tail nodes and the feature information of the edge feature layers are not considered, and head-and-tail shortcut operations are adopted in BiFPN. EfficientDet calls this practice an efficiency trade-off, whose purpose is to balance computational resource consumption against network performance improvement. The EfficientDet network stacks such modules many times to form the reinforced feature network structure, which brings more model parameters, more GFLOPs and higher model complexity, and therefore the network structure has to be simplified.
Fig. 2 is a schematic structural diagram of this embodiment. The network model does not need to stack the network structure multiple times; the network itself realizes the PAN structure only once. The YOLOv5 network with the BiFPN structure is therefore further improved: based on the characteristics of the YOLOv5 structure and the multi-scale characteristics of the smoke-and-flame detection data set, the edge feature layers and the head and tail nodes are retained on the basis of BiFPN, and weight parameter influence is exerted on each structural fusion path during training. This modification takes into account the edge feature fusion structures omitted by BiFPN and connects all feature layers that need to be used across channels in the same way, which is more beneficial to multi-scale feature fusion. The starting point of this structural improvement is therefore, on the one hand, the need for a more comprehensive range of scales for the detection targets under study and, on the other hand, to make up for the scale coverage that BiFPN loses by selectively skipping some nodes.
In addition, since the YOLOv5 network does not stack the structure multiple times, the network structure shows no particularly large change in model parameters, GFLOPs, model complexity and the like; it is named AS-BiFPN, i.e., the All-Scale Bidirectional Feature Pyramid Network structure.
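To illustrate the cross-structure shortcut of steps S61 and S62, the sketch below fuses, at one prediction level, the backbone feature delivered by the out-of-plane shortcut, the neck feature and the same-level path feature with fast-normalized learnable weights. The module name ASShortcutFusion and the three-input layout are illustrative assumptions, not the literal AS-BiFPN implementation.

```python
import torch
import torch.nn as nn

class ASShortcutFusion(nn.Module):
    """Illustrative cross-structure fusion node: weights one backbone shortcut
    feature, one neck feature and one same-level path feature (all at the same
    sampling rate) with fast-normalized learnable weights."""
    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))        # one learnable weight per input path
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.eps = eps

    def forward(self, backbone_feat, neck_feat, path_feat):
        w = torch.relu(self.w)                      # keep the fusion weights non-negative
        fused = (w[0] * backbone_feat + w[1] * neck_feat + w[2] * path_feat) / (w.sum() + self.eps)
        return self.conv(fused)

if __name__ == "__main__":
    x = [torch.randn(1, 256, 40, 40) for _ in range(3)]
    print(ASShortcutFusion(256)(*x).shape)
```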
The YOLOv5 target detection algorithm was improved with the BiFPN and AS-BiFPN network structures respectively and compared with the original FPN network structure. The experimental results are shown in Table 1, and the mean-average-precision graph of the experiments is shown in fig. 3. From the experimental results, the YOLOv5s using the BiFPN network structure is 0.8% higher in mAP than the original YOLOv5s, rising from 78.4% to 79.2% mAP, with only a slight loss in precision; the number of network layers does not change, the number of network parameters stays at the same level, and the GFLOPs of the model increase by 0.2. In general, the YOLOv5s using the BiFPN network structure obtains a small improvement in capability.
[Table 1: experimental comparison results, presented as an image in the original publication]
Figs. 3, 4 and 5 show the experimental results of the original YOLOv5 and of YOLOv5 with the AS-BiFPN network structure on multi-scale target detection, showing respectively the mean-average-precision graph, the large-scale target detection results and the small-scale target detection results. Combining the experimental graphs, the YOLOv5s network model using the AS-BiFPN all-scale bidirectional feature pyramid network structure improves further on the experimental performance of BiFPN: the mAP of the network using AS-BiFPN is 1% higher than that of the BiFPN network and 1.5% higher than that of the original YOLOv5s network, and the AP values on the smoke and flame targets rise by 1.4% and 0.6% respectively; however, the network inference time increases from 7.6 ms with BiFPN to 8.4 ms, an increase of 0.8 ms. Compared with the original YOLOv5s and the YOLOv5s using BiFPN, the network model using AS-BiFPN has 213 layers overall, which is unchanged; only the parameters and GFLOPs of the AS-BiFPN structure change slightly relative to BiFPN: the parameter count changes but stays at the same level, and the GFLOPs rise by 0.2 compared with BiFPN and by 0.4 compared with the original YOLOv5s. In short, with the optimized AS-BiFPN network structure, the network improves the capability of the original network in multi-scale target detection without changing the network depth, the algorithm modules or the network infrastructure.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment contains only a single independent technical solution; this manner of description is only for clarity, and those skilled in the art should take the description as a whole; the technical solutions in the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims (6)

1. A multi-scale feature fusion method based on target detection is characterized in that,
step S1, collecting computer vision image samples through a network to establish a multi-scale target detection data set, and dividing the data set into a training set and a test set;
step S2, extracting the morphological matrix characteristics of the input image by using a Backbone network (Backbone) of the YOLOv5 target detection algorithm;
step S3, realizing multi-scale fusion by using a three-branch feature fusion method of a Backbone network Backbone, a Neck network Neck and a Prediction structure Prediction, repeatedly learning weight parameters on each structure branch through deep learning, continuously reducing the difference between a target value and a predicted value during training according to a deep learning training mode, namely obtaining an optimized network structure under a target domain data set by taking a minimized loss function as a learning criterion, wherein the fusion mode is based on FPN;
step S4, improving and forming PAN on the basis of FPN, and utilizing accurate positioning signals stored in low-level features to promote a feature pyramid framework;
and step S5, improving and forming BiFPN on the basis of PAN, and enabling the network to learn the weights of different input features by itself through BiFPN.
2. The method of claim 1, wherein the FPN decomposition in step S3 is in three progressive stages, and comprises the following steps:
step S31, the Backbone feature generation stage: the task in the deep learning computer vision field is to generate abstract semantic features based on a commonly used, pre-trained Backbone network, and then to adapt the image morphological features extracted by the Backbone to different application scenarios; the features generated by the Backbone network are divided by stage and denoted C_n, where the natural number n equals the stage number and indicates how many stages of downsampling the image morphological features have undergone, i.e., how many times the resolution has been halved; for example, C_2 denotes the feature map output by stage 2, whose resolution is 1/4 of the input picture, and C_5 denotes the feature map output by stage 5, whose resolution is 1/32 of the input picture;
Step S32, feature fusion stage, FPN using the different resolution features generated in step S31 as input, outputting the fused features, the output features being marked with P as number, the input of FPN being
Figure 961529DEST_PATH_IMAGE004
Figure DEST_PATH_IMAGE012
Figure DEST_PATH_IMAGE014
Figure 940986DEST_PATH_IMAGE008
Figure DEST_PATH_IMAGE016
After being fused, the output is
Figure DEST_PATH_IMAGE018
Figure DEST_PATH_IMAGE020
Figure DEST_PATH_IMAGE022
Figure DEST_PATH_IMAGE024
Figure DEST_PATH_IMAGE026
Expressed by a mathematical formula:
Figure DEST_PATH_IMAGE028
and step S33, outputting bounding boxes through the detection head: the fused features output by the FPN are input to the detection head for specific object detection.
3. The method of claim 2, wherein the fusion strategies used by the BiFPN in step S5 specifically comprise the following steps:
step S51, the unbounded fusion strategy, whose formula is:

O = Σ_i ( w_i · I_i )

This formula is the first strategy of deep learning feature fusion, where w_i are learnable weight parameters representing the weight proportion of data among the nodes of a single deep learning neural network, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
step S52, the Softmax-based fusion strategy, whose formula is:

O = Σ_i ( e^{w_i} / Σ_j e^{w_j} ) · I_i

This formula is the second strategy of deep learning feature fusion, where w_i and w_j are learnable weight parameters representing the weight proportion of data among multiple deep learning neural network nodes, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
step S53, the fast normalized fusion strategy, whose formula is:

O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i

This formula is the third strategy of deep learning feature fusion, where w_i and w_j are learnable weight parameters representing the weight proportion of data among multiple deep learning neural network nodes, ε is a very small number to ensure that the denominator is not 0, and I_i denotes the image morphological feature matrix input to the neural network in the computer vision field, used to input feature information;
step S54, integrating bidirectional cross-scale connection and fast normalized fusion, taking level 6 as an example:

P6^td = Conv( (w1 · P6^in + w2 · Resize(P7^in)) / (w1 + w2 + ε) )

P6^out = Conv( (w1' · P6^in + w2' · P6^td + w3' · Resize(P5^out)) / (w1' + w2' + w3' + ε) )

In the formula, P^td denotes the intermediate nodes on the top-down and bottom-up edges, P^out denotes the lateral output nodes on those edges, P6^td is the intermediate node at level 6 in the top-down path, and P6^out is the lateral node at level 6 in the bottom-up path; all other feature nodes are constructed in a similar manner.
4. The method of claim 3, further comprising the steps of:
step S61, adding an out-of-plane (off-the-paper) skip connection structure on the lateral path to span the bottom-up path, so that the backbone network and the prediction structure fuse information directly, and improving the network by updating the weight ratios of the different paths during training, so as to enhance the feature information acquisition capability and the information fusion capability of the prediction structure;
step S62, retaining the edge feature layers and the head and tail nodes on the basis of BiFPN, exerting weight parameter influence on each structural fusion path during training, avoiding the weakening of the edge feature fusion structures that occurs in BiFPN, and connecting all feature layers that need to be used across channels in the same way;
step S63, integrating bidirectional cross-scale connection and fast normalization fusion based on steps S61 and S62:
Taking level 6 as an example, and fusing the additional cross-structure shortcut inputs of steps S61 and S62 with the same fast-normalized weighting:

P6^td = Conv( (w1 · P6^in + w2 · Resize(P7^in)) / (w1 + w2 + ε) )

P6^out = Conv( (w1' · P6^in + w2' · P6^td + w3' · Resize(P5^out)) / (w1' + w2' + w3' + ε) )

In the formula, P^td denotes the intermediate nodes on the top-down and bottom-up edges, P^out denotes the lateral output nodes on those edges, P6^td is the intermediate node at level 6 in the top-down path, and P6^out is the lateral node at level 6 in the bottom-up path; all other feature nodes are constructed in a similar manner.
5. The multi-scale feature fusion method based on target detection according to claim …, characterized in that step S1 includes obtaining multi-scale target images from channels on the network such as personal multi-scale images and the Kaggle target detection competitions; following the image target scale division standard given by MS COCO, a small target of the image morphology has an area smaller than 32×32 pixels, a medium target has an area between 32×32 and 96×96 pixels, and a large target has an area larger than 96×96 pixels; when all images are scaled to one size by a Resize function as they are input into the network structure, scale features of different target sizes are formed within images of uniform size, thereby establishing an image data set of multi-scale information.
6. The multi-scale feature fusion method based on target detection according to claim 5, characterized in that in step S2 a one-stage target detection algorithm, YOLOv5, is adopted as the basic research model; the multi-scale feature fusion method is a hot-pluggable modular method that is effectively migrated to and used on different models; for different models, the process described in step S3 is adopted, i.e., the structural models of the target detection algorithm are improved in turn through the evolution FPN → PAN → BiFPN → AS-BiFPN, with each improvement step going deeper than the last.
CN202210620848.9A 2022-06-02 2022-06-02 Multi-scale feature fusion method based on target detection Active CN115082688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210620848.9A CN115082688B (en) 2022-06-02 2022-06-02 Multi-scale feature fusion method based on target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210620848.9A CN115082688B (en) 2022-06-02 2022-06-02 Multi-scale feature fusion method based on target detection

Publications (2)

Publication Number Publication Date
CN115082688A true CN115082688A (en) 2022-09-20
CN115082688B CN115082688B (en) 2024-07-05

Family

ID=83249674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210620848.9A Active CN115082688B (en) 2022-06-02 2022-06-02 Multi-scale feature fusion method based on target detection

Country Status (1)

Country Link
CN (1) CN115082688B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084124A (en) * 2019-03-28 2019-08-02 北京大学 Feature based on feature pyramid network enhances object detection method
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
US20210224581A1 (en) * 2020-09-25 2021-07-22 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, and device for fusing features applied to small target detection, and storage medium
AU2021104243A4 (en) * 2021-07-16 2021-09-09 Ziteng Li Method of Pedestrian detection based on multi-layer feature fusion
CN114078209A (en) * 2021-10-27 2022-02-22 南京航空航天大学 Lightweight target detection method for improving small target detection precision
CN114399633A (en) * 2022-01-19 2022-04-26 北京石油化工学院 Mobile electronic equipment position detection method based on YOLOv5s model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUN CHEN ET AL.: "Effective Feature Fusion Network in BiFPN for Small Object Detection", 2021 IEEE International Conference on Image Processing (ICIP), 22 September 2021 (2021-09-22) *
QUAN Yu; LI Zhixin; ZHANG Canlong; MA Huifang: "An object detection model fusing deep dilated networks and lightweight networks", Acta Electronica Sinica, no. 02, 15 February 2020 (2020-02-15) *

Also Published As

Publication number Publication date
CN115082688B (en) 2024-07-05

Similar Documents

Publication Publication Date Title
CN113011499B (en) Hyperspectral remote sensing image classification method based on double-attention machine system
CN101315663B (en) Nature scene image classification method based on area dormant semantic characteristic
CN108764281A (en) A kind of image classification method learning across task depth network based on semi-supervised step certainly
CN111860693A (en) Lightweight visual target detection method and system
CN108256636A (en) A kind of convolutional neural networks algorithm design implementation method based on Heterogeneous Computing
CN103116766A (en) Increment neural network and sub-graph code based image classification method
CN114118012B (en) Personalized font generation method based on CycleGAN
CN111046917B (en) Object-based enhanced target detection method based on deep neural network
CN113486190A (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN114973011A (en) High-resolution remote sensing image building extraction method based on deep learning
CN114330516A (en) Small sample logo image classification based on multi-graph guided neural network model
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN113836319B (en) Knowledge completion method and system for fusion entity neighbors
CN112308089A (en) Attention mechanism-based capsule network multi-feature extraction method
CN113902753A (en) Image semantic segmentation method and system based on dual-channel and self-attention mechanism
CN117454971A (en) Projection type knowledge distillation method based on self-adaptive mask weighting
CN115082688A (en) Multi-scale feature fusion method based on target detection
CN115375922B (en) Light-weight significance detection method based on multi-scale spatial attention
CN115049786B (en) Task-oriented point cloud data downsampling method and system
CN116420174A (en) Full scale convolution for convolutional neural networks
Fang et al. Sketch recognition based on attention mechanism and improved residual network
CN113793627B (en) Attention-based multi-scale convolution voice emotion recognition method and device
Huynh et al. Light-weight Sketch Recognition with Knowledge Distillation
Chen et al. FPAN: fine-grained and progressive attention localization network for data retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant