CN111461217B - Aerial image small target detection method based on feature fusion and up-sampling - Google Patents

Aerial image small target detection method based on feature fusion and up-sampling

Info

Publication number
CN111461217B
Authority
CN
China
Prior art keywords
feature
channel
output
layer
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010247656.9A
Other languages
Chinese (zh)
Other versions
CN111461217A (en
Inventor
林沪
刘琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010247656.9A priority Critical patent/CN111461217B/en
Publication of CN111461217A publication Critical patent/CN111461217A/en
Application granted granted Critical
Publication of CN111461217B publication Critical patent/CN111461217B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an aerial image small target detection method based on feature fusion and up-sampling. The method comprises the following steps: extracting a feature set of an input image by using a backbone network; constructing a channel standardization module and standardizing the channel dimension of the features; constructing a learning-based upsampling layer and upsampling the features in resolution to obtain a feature set with uniform resolution; performing group normalization on the features by channel grouping; splicing the feature set to generate a fusion feature; downsampling the fusion feature several times and constructing a feature pyramid for detection; and classifying and localizing targets with the head detection network. The invention provides a feature fusion and feature up-sampling method for the training and testing stages of target detection, which can significantly improve the detection precision of small targets in aerial images while only slightly increasing the computational cost.

Description

Aerial image small target detection method based on feature fusion and up-sampling
Technical Field
The invention relates to the field of aerial image target detection, in particular to an aerial image small target detection method based on feature fusion and up-sampling.
Background
Compared with a surveillance camera with a fixed position and field of view, the camera mounted on an unmanned aerial vehicle has natural advantages such as convenient deployment, strong maneuverability and a wide field of view. These advantages can serve many applications, such as security monitoring, search and rescue, and crowd monitoring. In many unmanned aerial vehicle applications, target detection in aerial images is a critical component; it is essential for the development of fully autonomous systems and is therefore an urgent need in the industry.
Although convolutional neural networks have achieved remarkable results in general target detection, their performance in unmanned aerial vehicle aerial photography scenarios is not satisfactory. The main reason is that, compared with ordinary scenes, the relative scale and absolute resolution of targets in aerial images are smaller. As a result, the corresponding feature response regions in the extracted convolutional feature maps are smaller, which leads to a higher miss rate. More specifically, the length and width of the feature maps extracted by a convolutional neural network are often only 1/4 or 1/8 of those of the input image, which further weakens the ability of the feature maps to characterize small-scale targets. Therefore, how to strengthen the feature expression of small-scale targets becomes a key point of system design.
Most existing convolutional neural network methods adopt the FPN feature fusion network to improve the feature expression of small-scale targets. The typical flow is as follows: extract a feature set of the input image with a backbone network; upsample the high-level, low-resolution feature maps by bilinear interpolation and fuse them with the adjacent lower-level feature maps in turn; and detect on the fused feature set. However, the existing FPN feature fusion network cannot sufficiently fuse the information of feature maps with different resolutions, and bilinear interpolation is not an efficient upsampling method. These two defects limit the effectiveness of FPN in the detection of small-sized targets.
In summary, the key to improving small target detection under aerial viewing angles is to improve the feature fusion strategy and the upsampling method. The invention provides an aerial image small target detection method based on feature fusion and upsampling, which comprises the following steps: extracting a feature set of an input image by using a backbone network; constructing a channel standardization module and standardizing the channel dimension of the features; constructing a learning-based upsampling layer and upsampling the features in resolution to obtain a feature set with uniform resolution; performing group normalization on the features by channel grouping; splicing the feature set to generate a fusion feature; downsampling the fusion feature several times and constructing a feature pyramid for detection; and classifying and localizing targets with the head detection network, finally outputting the detection results.
The present invention relates to the following prior art documents:
Prior document 1: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Prior document 2: Wu, Y., He, K. "Group Normalization." Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19.
Prior document 3: Lin, T. Y., Goyal, P., Girshick, R., et al. "Focal Loss for Dense Object Detection." Proceedings of the IEEE International Conference on Computer Vision. 2017: 2980-2988.
Prior document 1 proposes a feature extraction network whose basic building block is a residual module based on residual connections; it reduces the training difficulty of deep networks and allows deeper features with stronger characterization capability to be learned. Prior document 2 proposes a feature normalization method that alleviates the problems of the original batch normalization when the network is trained with small batches, namely degraded effectiveness and difficulty in converging to a good solution. Prior document 3 trains a high-performance one-stage dense object detector based on an FPN network and the Focal Loss function. The present invention extracts the feature set of the input image using prior document 1, normalizes the feature maps with the channel-grouped group normalization of prior document 2, improves the feature fusion network on the basis of prior document 3, and trains the network with the loss function of prior document 3.
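For reference, the following is a minimal PyTorch sketch of the Focal Loss of prior document 3 as it would be applied to the per-anchor classification branch. It assumes sigmoid-based binary classification per anchor and the default parameters α = 0.25 and γ = 2 from that paper; it is an illustrative sketch, not code from the patent.

```python
import torch


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal Loss, FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), per anchor."""
    ce = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


# Example: 8 images, 100 anchors, binary labels given as floats.
loss = focal_loss(torch.randn(8, 100), torch.randint(0, 2, (8, 100)).float())
```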
Disclosure of Invention
The invention aims to improve the detection precision of small targets in aerial images, so as to better complete unmanned-aerial-vehicle target detection tasks such as security monitoring, search and rescue, and crowd monitoring. To this end, the invention provides an aerial image small target detection method based on feature fusion and upsampling: a channel standardization module and an upsampling layer are constructed to perform channel standardization and upsampling on the features; the features are then group-normalized and spliced into a fusion feature; the fusion feature is downsampled several times to generate a feature pyramid; and the head network classifies and localizes targets and outputs the detection results.
The object of the invention is achieved by at least one of the following technical solutions.
An aerial image small target detection method based on feature fusion and up-sampling comprises the following steps:
s1, extracting a feature set of an input image by using a backbone network;
s2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in the step S1;
s3, constructing an up-sampling layer based on learning, and carrying out resolution up-sampling on the standardized features to obtain a feature set with uniform resolution;
s4, carrying out group normalization on the features with uniform resolutions according to channel grouping;
s5, splicing the feature sets after the group normalization to generate fusion features;
s6, downsampling the fusion features for a plurality of times, and constructing a feature pyramid for detection;
s7, detecting network classification and positioning targets by using the head, and finally outputting detection results.
Further, in step S1, the backbone network is a residual convolution network comprising five stages; each stage is formed by connecting several similar residual modules in series, and the output feature maps of the residual modules within a stage have the same resolution; a 2x downsampling is applied between every two adjacent stages, so the length and width of the feature map are halved after each downsampling; the finally extracted feature set is the set formed by the last feature maps of the second to fifth stages of the backbone network.
Further, in step S2, the channel standardization module is implemented by a convolution layer; its input is a feature map from the feature set output by the backbone network, and its output is a feature map with a standardized channel dimension; the resolution of the output feature map is the same as that of the input feature map; the channel dimension number of the output feature map is a fixed value.
Further, in step S3, the learning-based upsampling layer is formed by cascading a plurality of upsampling modules; a different number of upsampling modules is applied to feature maps of different resolutions, so that the finally output feature maps all share the same resolution; each upsampling module is formed by connecting a channel expansion layer and a pixel rearrangement layer in series; the resolution of the upsampled feature map output by an upsampling module is 2 times that of its input feature map; the channel dimension number of the feature map output by the channel expansion layer is 4 times that of its input feature map; the channel number of the feature map output by the pixel rearrangement layer is 1/4 of that of its input feature map, and the resolution of its output feature map is 2 times that of its input feature map.
Further, the formula of the pixel rearrangement layer is as follows:
$$\mathcal{PS}(L)_{x,\,y,\,c} = L_{\lfloor x/r \rfloor,\; \lfloor y/r \rfloor,\; c \cdot r^{2} + r \cdot \mathrm{mod}(y,\,r) + \mathrm{mod}(x,\,r)}$$

wherein PS denotes the pixel rearrangement layer, L denotes the input feature map of the pixel rearrangement layer, x and y denote the horizontal and vertical coordinates of the output feature map, c denotes the channel coordinate of the output feature map, r denotes the upsampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
Further, in step S4, the group normalization by channel grouping comprises the following steps:
S4.1, let I = (i_N, i_C, i_H, i_W) be a 4D tensor indexed in the order (N, C, H, W), representing a feature map of uniform resolution output in step S3; the mean μ and standard deviation σ over the pixels of the feature map I are calculated according to the following formulas:

$$\mu = \frac{1}{m}\sum_{k \in S} I_k$$

$$\sigma = \sqrt{\frac{1}{m}\sum_{k \in S} (I_k - \mu)^2 + \epsilon}$$

wherein ε denotes the machine epsilon, i.e. the smallest distinguishable gap between adjacent floating-point numbers in the computer; S denotes the pixel set of the feature map I after grouping by channel; k denotes one pixel in the pixel set S; m denotes the size of the pixel set S; the pixel set S is defined as:

$$S = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integer multiple of 16; C/G denotes the number of channels per group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;
S4.2, normalize the feature map I according to the following formula:

$$\hat{I}_k = \frac{1}{\sigma}\left(I_k - \mu\right)$$

wherein Î denotes the normalized feature map, and σ and μ are the standard deviation and mean calculated in step S4.1;
S4.3, fit a linear transformation after normalization to compensate for the possible loss of feature expression capacity; the specific transformation formula is as follows:

$$O_k = \gamma \hat{I}_k + \beta$$

wherein O denotes the feature map output by the channel-grouped group normalization; γ and β denote the fitted scaling and offset parameters, respectively; the parameter γ is initialized to 1 and β is initialized to 0.
Further, in step S5, splicing the feature set to generate the fusion feature refers to a tensor concatenation operation; the concatenation operation splices the feature maps along the channel dimension to obtain a fused feature tensor.
Further, in step S6, downsampling the fusion feature several times to construct a feature pyramid means passing the feature map through a plurality of serially connected downsampling layers to generate a series of low-resolution feature maps; the feature pyramid refers to the set formed by the low-resolution feature maps output by the downsampling layers; the resolution of each output low-resolution feature map is 1/2 of the resolution of the feature map input to the downsampling layer.
Further, in step S7, the head detection network takes as input, in turn, the feature maps of the feature pyramid output in step S6 and outputs the categories and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
Further, the calculation steps of the target classification full convolution network are as follows:
S7.1.1, inputting the feature maps of the feature pyramid output in step S6 into a plurality of serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map;
S7.1.2, inputting the feature map output by the buffer convolution layers into a classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in the order (N, C, H, W), representing the classification result output by the classification prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; the channel number of x is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
Further, the calculation steps of the target regression full convolution network are as follows:
S7.2.1, inputting the feature maps of the feature pyramid output in step S6 into a plurality of serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map;
S7.2.2, inputting the feature map output by the buffer convolution layers into a regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in the order (N, C, H, W), representing the regression result output by the regression prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; the channel number of y is 4 × A, where A is the number of preset anchors.
S7.3, combining the results x and y output by the classification full convolution network and the regression full convolution network to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in the order (N, C, H, W); wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the channel number of z is (4 + Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and contains the categories and position coordinates of the targets.
Compared with the prior art, the invention has the beneficial effects that:
the invention improves the feature fusion flow and the feature up-sampling method, can obviously improve the characterization capability of the feature map, improves the small target detection precision, and only slightly increases the calculation cost.
Drawings
FIG. 1 is a flow chart of a method for detecting small targets in aerial images based on feature fusion and upsampling;
FIG. 2 is a schematic diagram of a network structure for feature fusion in an embodiment of the present invention;
fig. 3 is a schematic diagram of a pixel rearrangement layer according to an embodiment of the present invention.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the various embodiments of the disclosure defined by the claims and their equivalents. It includes various specific details to aid understanding, but these are to be considered merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the various embodiments of the present invention described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to written meanings, but are used only by the inventors to enable a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
Examples:
an aerial image small target detection method based on feature fusion and up-sampling, as shown in fig. 1, comprises the following steps:
s1, extracting a feature set of an input image by using a backbone network;
the main network is a residual convolution network, and the residual convolution network comprises five stages, each stage is formed by connecting a plurality of similar residual modules in series, and the resolutions of the output feature graphs of the residual modules are the same; 2 times of downsampling exists between every two adjacent stages, and the length and width of the feature map after downsampling are reduced by two times; the final extracted feature set is a set formed by the last feature map of the second to fifth stages of the backbone network.
S2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in the step S1;
the channel standardization module is realized by a convolution layer; the input of the channel standardization module is a feature diagram in a feature set output by the backbone network, and the output of the channel standardization module is a feature diagram of channel dimension standardization; the resolution of the feature map output by the channel normalization module is the same as the resolution of the input feature map; the channel dimension number of the output characteristic diagram of the channel normalization module is a fixed value.
In this embodiment, the convolution kernel size of the convolution layer in the channel standardization module is 1, the padding is 1, and the stride is 1; the number of channel dimensions of the feature map output by the channel standardization module is fixed at 256.
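A minimal sketch of the channel standardization module under these settings, assuming PyTorch. The sketch uses padding 0 so that the output resolution equals the input resolution, as the method requires, and the listed input channel counts assume a ResNet-50 backbone.

```python
import torch


def channel_standardization_module(in_channels, out_channels=256):
    """1x1 convolution that standardizes the channel dimension of a backbone feature map."""
    return torch.nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)


# One module per backbone stage output (ResNet-50 channel counts assumed).
channel_std = torch.nn.ModuleList(
    [channel_standardization_module(c) for c in (256, 512, 1024, 2048)]
)
```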
S3, constructing an up-sampling layer based on learning, and carrying out resolution up-sampling on the standardized features to obtain a feature set with uniform resolution;
the up-sampling layer based on learning is formed by cascading a plurality of up-sampling modules; the up-sampling layer based on learning has different numbers of up-sampling modules for the feature graphs with different resolutions, and the resolution of the feature graphs finally output is the same; the up-sampling module is formed by connecting a channel expansion layer and a pixel rearrangement layer in series; the resolution of the up-sampling feature map output by the up-sampling module is 2 times of that of the input feature map; the channel dimension number of the feature map output by the channel expansion layer is 4 times of the channel dimension number of the input feature map; the channel number of the feature image output by the pixel rearrangement layer is 1/4 of the channel number of the input feature image, and the resolution of the output feature image is 2 times of the resolution of the input feature image.
In this embodiment, the channel expansion layer is implemented by one convolution layer with kernel size 1, padding 1 and stride 1, and the channel dimension number of its output feature map is 1024; the channel dimension number of the output feature map of the pixel rearrangement layer is 256.
as shown in fig. 3, the formula of the pixel rearrangement layer is as follows:
$$\mathcal{PS}(L)_{x,\,y,\,c} = L_{\lfloor x/r \rfloor,\; \lfloor y/r \rfloor,\; c \cdot r^{2} + r \cdot \mathrm{mod}(y,\,r) + \mathrm{mod}(x,\,r)}$$

wherein PS denotes the pixel rearrangement layer, L denotes the input feature map of the pixel rearrangement layer, x and y denote the horizontal and vertical coordinates of the output feature map, c denotes the channel coordinate of the output feature map, r denotes the upsampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
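A minimal PyTorch sketch of one upsampling module under these settings; torch.nn.PixelShuffle implements the pixel rearrangement described by the formula above. The padding of 0 in the channel expansion convolution and the cascade depth per stage (e.g. three modules for the stage-five feature map) are assumptions of this example.

```python
import torch


class UpsampleModule(torch.nn.Module):
    """Channel expansion (channels x4) followed by pixel rearrangement (resolution x2)."""

    def __init__(self, channels=256):
        super().__init__()
        self.expand = torch.nn.Conv2d(channels, channels * 4, kernel_size=1)  # 256 -> 1024 channels
        self.rearrange = torch.nn.PixelShuffle(upscale_factor=2)              # 1024 -> 256, H and W doubled

    def forward(self, x):
        return self.rearrange(self.expand(x))


def make_upsampler(num_modules, channels=256):
    """Cascade of upsampling modules; a feature map k levels above the target
    resolution needs k modules, so all outputs share one resolution."""
    return torch.nn.Sequential(*[UpsampleModule(channels) for _ in range(num_modules)])


# Example: unify C2-C5 (after channel standardization) to the resolution of C2.
upsamplers = torch.nn.ModuleList([make_upsampler(k) for k in (0, 1, 2, 3)])
```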
S4, carrying out group normalization on the features with uniform resolutions according to channel grouping;
the group normalization by channel grouping comprises the steps of:
S4.1, let I = (i_N, i_C, i_H, i_W) be a 4D tensor indexed in the order (N, C, H, W), representing a feature map of uniform resolution output in step S3; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes of the feature map, respectively; the mean μ and standard deviation σ over the pixels of the feature map I are calculated according to the following formulas:
$$\mu = \frac{1}{m}\sum_{k \in S} I_k$$

$$\sigma = \sqrt{\frac{1}{m}\sum_{k \in S} (I_k - \mu)^2 + \epsilon}$$

wherein ε denotes the machine epsilon, i.e. the smallest distinguishable gap between adjacent floating-point numbers in the computer, which is 2.220446049250313e-16 in the Python language; S denotes the pixel set of the feature map I after grouping by channel; k denotes one pixel in the pixel set S; m denotes the size of the pixel set S; the pixel set S is defined as:

$$S = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integer multiple of 16 and is 32 by default; C/G denotes the number of channels per group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;
S4.2, normalize the feature map I according to the following formula:

$$\hat{I}_k = \frac{1}{\sigma}\left(I_k - \mu\right)$$

wherein Î denotes the normalized feature map, and σ and μ are the standard deviation and mean calculated in step S4.1;
S4.3, fit a linear transformation after normalization to compensate for the possible loss of feature expression capacity; the specific transformation formula is as follows:

$$O_k = \gamma \hat{I}_k + \beta$$

wherein O denotes the feature map output by the channel-grouped group normalization; γ and β denote the fitted scaling and offset parameters, respectively; the parameter γ is initialized to 1 and β is initialized to 0.
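A minimal PyTorch sketch of the channel-grouped group normalization under these settings: the hand-written function follows steps S4.1 to S4.3 with G = 32 and ε equal to the machine epsilon, and the built-in torch.nn.GroupNorm layer performs the same computation with learnable γ and β (initialized to 1 and 0) and its own default ε.

```python
import sys
import torch


def group_normalize(feature, num_groups=32, eps=sys.float_info.epsilon):
    """Group normalization by channel grouping, following steps S4.1-S4.3."""
    n, c, h, w = feature.shape
    gamma = torch.ones(1, c, 1, 1)    # fitted scaling parameter, initialized to 1
    beta = torch.zeros(1, c, 1, 1)    # fitted offset parameter, initialized to 0
    # S4.1: group the channels and compute the statistics inside each group.
    grouped = feature.reshape(n, num_groups, c // num_groups, h, w)
    mu = grouped.mean(dim=(2, 3, 4), keepdim=True)
    sigma = torch.sqrt(((grouped - mu) ** 2).mean(dim=(2, 3, 4), keepdim=True) + eps)
    # S4.2: normalize; S4.3: apply the fitted linear transformation.
    normalized = ((grouped - mu) / sigma).reshape(n, c, h, w)
    return gamma * normalized + beta


# Equivalent built-in layer with learnable gamma/beta (uses its own default eps).
gn = torch.nn.GroupNorm(num_groups=32, num_channels=256)
out = gn(torch.randn(2, 256, 64, 64))
```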
S5, as shown in FIG. 2, the feature sets after group normalization are spliced to generate fusion features;
the feature set is spliced, and the generation of fusion features refers to the splicing operation of tensors; and the tensor splicing operation splices the feature graphs along the dimension direction to obtain a fusion feature tensor.
S6, downsampling the fusion features for a plurality of times, and constructing a feature pyramid for detection;
the feature pyramid is constructed by carrying out multiple downsampling on the fusion features, namely a series of low-resolution feature graphs are generated by the feature graphs through a plurality of downsampling layers which are connected in series; the feature map pyramid refers to a set formed by low-resolution feature maps output by a downsampling layer; the resolution of the output low-resolution feature map is 1/2 of the resolution of the feature map input by the downsampling layer;
in this embodiment, the downsampling layer is implemented by a convolution layer; the convolution kernel of the downsampling layer is 3 in size, the filling is 1, and the step length is 2.
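A minimal PyTorch sketch of the fusion and feature pyramid construction under these settings: the group-normalized, uniform-resolution feature maps are concatenated along the channel dimension and passed through serially connected stride-2 convolutions (kernel 3, padding 1). The number of pyramid levels and the 256 output channels per downsampling layer are assumptions of this example; the embodiment fixes only the kernel size, padding and stride.

```python
import torch


class FeaturePyramid(torch.nn.Module):
    """Concatenates the unified-resolution features and downsamples repeatedly."""

    def __init__(self, num_inputs=4, channels=256, num_levels=5):
        super().__init__()
        fused_channels = num_inputs * channels
        self.downsample_layers = torch.nn.ModuleList([
            torch.nn.Conv2d(fused_channels if i == 0 else channels, channels,
                            kernel_size=3, padding=1, stride=2)
            for i in range(num_levels)
        ])

    def forward(self, unified_features):
        fused = torch.cat(unified_features, dim=1)  # tensor splicing along the channel axis
        pyramid, x = [], fused
        for layer in self.downsample_layers:
            x = layer(x)            # resolution halves at every downsampling layer
            pyramid.append(x)
        return pyramid


levels = FeaturePyramid()([torch.randn(1, 256, 128, 128) for _ in range(4)])
```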
S7, detecting network classification and positioning targets by using the head, and finally outputting detection results;
The head detection network takes as input, in turn, the feature maps of the feature pyramid output in step S6 and outputs the categories and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
the calculation steps of the target classification full convolution network are as follows:
S7.1.1, in this embodiment, the feature maps of the feature pyramid output in step S6 are input into 4 serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map; the convolution kernel size of the buffer convolution layers is 3, the padding is 1, the stride is 1, and the number of output channels is 256;
S7.1.2, inputting the feature map output by the buffer convolution layers into a classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in the order (N, C, H, W), representing the classification result output by the classification prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; in this embodiment, the convolution kernel size of the classification prediction layer is 3, the padding is 1, the stride is 1, and the number of output channels is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
The calculation steps of the target regression full convolution network are as follows:
S7.2.1, in this embodiment, the feature maps of the feature pyramid output in step S6 are input into 4 serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map; the convolution kernel size of the buffer convolution layers is 3, the padding is 1, the stride is 1, and the number of output channels is 256;
S7.2.2, inputting the feature map output by the buffer convolution layers into a regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in the order (N, C, H, W), representing the regression result output by the regression prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; in this embodiment, the convolution kernel size of the regression prediction layer is 3, the padding is 1, the stride is 1, and the number of output channels is 4 × A, where A is the number of preset anchors.
S7.3, combining the results x and y output by the classification full convolution network and the regression full convolution network to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in the order (N, C, H, W); wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the channel number of z is (4 + Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and contains the categories and position coordinates of the targets.
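A minimal PyTorch sketch of the head detection network under these settings: four buffer convolutions of 256 channels followed by a prediction convolution in each branch, and the channel-wise combination of step S7.3. The ReLU activations between buffer convolutions and the example class and anchor counts are assumptions of this sketch; the embodiment does not name an activation.

```python
import torch


def make_branch(out_channels, channels=256, num_buffers=4):
    """Buffer convolutions (3x3, padding 1, stride 1) followed by a prediction layer."""
    layers = []
    for _ in range(num_buffers):
        layers += [torch.nn.Conv2d(channels, channels, 3, padding=1),
                   torch.nn.ReLU(inplace=True)]
    layers.append(torch.nn.Conv2d(channels, out_channels, 3, padding=1))
    return torch.nn.Sequential(*layers)


class DetectionHead(torch.nn.Module):
    def __init__(self, num_classes=10, num_anchors=9):
        super().__init__()
        self.cls_branch = make_branch(num_classes * num_anchors)  # x: Cls * A channels
        self.reg_branch = make_branch(4 * num_anchors)            # y: 4 * A channels

    def forward(self, pyramid):
        outputs = []
        for feature in pyramid:
            x = self.cls_branch(feature)
            y = self.reg_branch(feature)
            # S7.3: combine along the channel axis -> (4 + Cls) * A channels per location.
            outputs.append(torch.cat([y, x], dim=1))
        return outputs


z = DetectionHead()([torch.randn(1, 256, s, s) for s in (64, 32, 16, 8, 4)])
```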
The above embodiment is only an example that clearly illustrates the present invention and does not limit the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the present invention.

Claims (8)

1. An aerial image small target detection method based on feature fusion and up-sampling, characterized by comprising the following steps:
S1, extracting a feature set of an input image by using a backbone network;
S2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in step S1;
S3, constructing a learning-based up-sampling layer, and up-sampling the standardized features in resolution to obtain a feature set with uniform resolution;
S4, performing group normalization on the features with uniform resolution by channel grouping; the group normalization by channel grouping comprises the following steps:
S4.1, let I = (i_N, i_C, i_H, i_W) be a 4D tensor indexed in the order (N, C, H, W), representing a feature map of uniform resolution output in step S3; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes of the feature map, respectively; the mean μ and standard deviation σ over the pixels of the feature map I are calculated according to the following formulas:

$$\mu = \frac{1}{m}\sum_{k \in S} I_k$$

$$\sigma = \sqrt{\frac{1}{m}\sum_{k \in S} (I_k - \mu)^2 + \epsilon}$$

wherein ε denotes the machine epsilon, i.e. the smallest distinguishable gap between adjacent floating-point numbers in the computer; S denotes the pixel set of the feature map I after grouping by channel; k denotes one pixel in the pixel set S; m denotes the size of the pixel set S; the pixel set S is defined as:

$$S = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integer multiple of 16; C/G denotes the number of channels per group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;
S4.2, normalizing the feature map I according to the following formula:

$$\hat{I}_k = \frac{1}{\sigma}\left(I_k - \mu\right)$$

wherein Î denotes the normalized feature map, and σ and μ are the standard deviation and mean calculated in step S4.1;
S4.3, fitting a linear transformation after normalization to compensate for the possible loss of feature expression capacity; the specific transformation formula is as follows:

$$O_k = \gamma \hat{I}_k + \beta$$

wherein O denotes the feature map output by the channel-grouped group normalization; γ and β denote the fitted scaling and offset parameters, respectively; the parameter γ is initialized to 1 and the parameter β is initialized to 0;
S5, splicing the feature set after the group normalization to generate a fusion feature;
S6, downsampling the fusion feature a plurality of times, and constructing a feature pyramid for detection;
S7, classifying and localizing targets with the head detection network, and finally outputting the detection results.
2. The aerial image small target detection method based on feature fusion and up-sampling according to claim 1, wherein in step S1, the backbone network is a residual convolution network comprising five stages; each stage is formed by connecting several similar residual modules in series, and the output feature maps of the residual modules within a stage have the same resolution; a 2x downsampling is applied between every two adjacent stages, so the length and width of the feature map are halved after each downsampling; the finally extracted feature set is the set formed by the last feature maps of the second to fifth stages of the backbone network.
3. The aerial image small target detection method based on feature fusion and upsampling according to claim 1, wherein in step S2, the channel standardization module is implemented by a convolution layer; its input is a feature map from the feature set output by the backbone network, and its output is a feature map with a standardized channel dimension; the resolution of the output feature map is the same as that of the input feature map; the channel dimension number of the output feature map is a fixed value.
4. The aerial image small target detection method based on feature fusion and upsampling according to claim 1, wherein in step S3, the learning-based upsampling layer is formed by cascading a plurality of upsampling modules; a different number of upsampling modules is applied to feature maps of different resolutions, so that the finally output feature maps all share the same resolution; each upsampling module is formed by connecting a channel expansion layer and a pixel rearrangement layer in series; the resolution of the upsampled feature map output by an upsampling module is 2 times that of its input feature map; the channel dimension number of the feature map output by the channel expansion layer is 4 times that of its input feature map; the channel number of the feature map output by the pixel rearrangement layer is 1/4 of that of its input feature map, and the resolution of its output feature map is 2 times that of its input feature map.
5. The method for detecting the small target of the aerial image based on feature fusion and upsampling according to claim 4, wherein the formula of the pixel rearrangement layer is as follows:
$$\mathcal{PS}(L)_{x,\,y,\,c} = L_{\lfloor x/r \rfloor,\; \lfloor y/r \rfloor,\; c \cdot r^{2} + r \cdot \mathrm{mod}(y,\,r) + \mathrm{mod}(x,\,r)}$$

wherein PS denotes the pixel rearrangement layer, L denotes the input feature map of the pixel rearrangement layer, x and y denote the horizontal and vertical coordinates of the output feature map, c denotes the channel coordinate of the output feature map, r denotes the upsampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
6. The aerial image small target detection method based on feature fusion and up-sampling according to claim 1, wherein in step S5, splicing the feature set to generate the fusion feature refers to a tensor concatenation operation; the concatenation operation splices the feature maps along the channel dimension to obtain a fused feature tensor.
7. The aerial image small target detection method based on feature fusion and up-sampling according to claim 1, wherein in step S6, constructing a feature pyramid by downsampling the fusion feature a plurality of times means passing the feature map through a plurality of serially connected downsampling layers to generate a series of low-resolution feature maps; the feature pyramid refers to the set formed by the low-resolution feature maps output by the downsampling layers; the resolution of each output low-resolution feature map is 1/2 of the resolution of the feature map input to the downsampling layer.
8. The aerial image small target detection method based on feature fusion and up-sampling according to claim 1, wherein in step S7, the head detection network takes as input, in turn, the feature maps of the feature pyramid output in step S6, and outputs the categories and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
the calculation steps of the target classification full convolution network are as follows:
S7.1.1, inputting the feature maps of the feature pyramid output in step S6 into a plurality of serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map;
S7.1.2, inputting the feature map output by the buffer convolution layers into a classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in the order (N, C, H, W), representing the classification result output by the classification prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; the channel number of x is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
the calculation steps of the target regression full convolution network are as follows:
S7.2.1, inputting the feature maps of the feature pyramid output in step S6 into a plurality of serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map;
S7.2.2, inputting the feature map output by the buffer convolution layers into a regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in the order (N, C, H, W), representing the regression result output by the regression prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; the channel number of y is 4 × A, where A is the number of preset anchors;
S7.3, combining the results x and y output by the classification full convolution network and the regression full convolution network to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in the order (N, C, H, W); wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the channel number of z is (4 + Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and contains the categories and position coordinates of the targets.
CN202010247656.9A 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling Active CN111461217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010247656.9A CN111461217B (en) 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling

Publications (2)

Publication Number Publication Date
CN111461217A CN111461217A (en) 2020-07-28
CN111461217B true CN111461217B (en) 2023-05-23

Family

ID=71682431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247656.9A Active CN111461217B (en) 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling

Country Status (1)

Country Link
CN (1) CN111461217B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070658B (en) * 2020-08-25 2024-04-16 西安理工大学 Deep learning-based Chinese character font style migration method
CN111967538B (en) * 2020-09-25 2024-03-15 北京康夫子健康技术有限公司 Feature fusion method, device and equipment applied to small target detection and storage medium
CN112580721B (en) * 2020-12-19 2023-10-24 北京联合大学 Target key point detection method based on multi-resolution feature fusion
CN112633156B (en) * 2020-12-22 2024-05-31 浙江大华技术股份有限公司 Vehicle detection method, image processing device, and computer-readable storage medium
CN112990317B (en) * 2021-03-18 2022-08-30 中国科学院长春光学精密机械与物理研究所 Weak and small target detection method
CN113111877A (en) * 2021-04-28 2021-07-13 奇瑞汽车股份有限公司 Characteristic pyramid and characteristic image extraction method thereof
CN113312995B (en) * 2021-05-18 2023-02-14 华南理工大学 Anchor-free vehicle-mounted pedestrian detection method based on central axis
CN114120077B (en) * 2022-01-27 2022-05-03 山东融瓴科技集团有限公司 Prevention and control risk early warning method based on big data of unmanned aerial vehicle aerial photography

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN110097129A (en) * 2019-05-05 2019-08-06 西安电子科技大学 Remote sensing target detection method based on profile wave grouping feature pyramid convolution
CN110929649A (en) * 2019-11-24 2020-03-27 华南理工大学 Network and difficult sample mining method for small target detection

Also Published As

Publication number Publication date
CN111461217A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111461217B (en) Aerial image small target detection method based on feature fusion and up-sampling
CN111524135A (en) Image enhancement-based method and system for detecting defects of small hardware fittings of power transmission line
CN112884064A (en) Target detection and identification method based on neural network
CN111951212A (en) Method for identifying defects of contact network image of railway
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN112201078B (en) Automatic parking space detection method based on graph neural network
Wang et al. Spatial attention for multi-scale feature refinement for object detection
KR102157610B1 (en) System and method for automatically detecting structural damage by generating super resolution digital images
CN116256586B (en) Overheat detection method and device for power equipment, electronic equipment and storage medium
CN111582074A (en) Monitoring video leaf occlusion detection method based on scene depth information perception
CN113139906B (en) Training method and device for generator and storage medium
CN112802048B (en) Method and device for generating layer generation countermeasure network with asymmetric structure
CN110503609A (en) A kind of image rain removing method based on mixing sensor model
CN113240586A (en) Bolt image super-resolution processing method capable of adaptively adjusting amplification factor
CN113111740A (en) Characteristic weaving method for remote sensing image target detection
CN115100409B (en) Video portrait segmentation algorithm based on twin network
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
KR102239133B1 (en) Apparatus and method of defect classification using image transformation based on machine-learning
CN115860139A (en) Deep learning-based multi-scale ship target detection method
CN115909081A (en) Optical remote sensing image ground object classification method based on edge-guided multi-scale feature fusion
CN111898671B (en) Target identification method and system based on fusion of laser imager and color camera codes
CN115393743A (en) Vehicle detection method based on double-branch encoding and decoding network, unmanned aerial vehicle and medium
CN111047571B (en) Image salient target detection method with self-adaptive selection training process
KR20230085299A (en) System and method for detecting damage of structure by generating multi-scale resolution image
CN114565764A (en) Port panorama sensing system based on ship instance segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant