CN111461217A - Aerial image small target detection method based on feature fusion and up-sampling - Google Patents

Aerial image small target detection method based on feature fusion and up-sampling

Info

Publication number
CN111461217A
Authority
CN
China
Prior art keywords
feature
sampling
output
layer
feature map
Prior art date
Legal status
Granted
Application number
CN202010247656.9A
Other languages
Chinese (zh)
Other versions
CN111461217B (en)
Inventor
林沪
刘琼
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010247656.9A
Publication of CN111461217A
Application granted
Publication of CN111461217B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an aerial image small target detection method based on feature fusion and up-sampling. The method comprises the following steps: extracting a feature set of an input image using a backbone network; constructing a channel standardization module and standardizing the channel dimension of the features; constructing a learning-based up-sampling layer and up-sampling the features to obtain a feature set with uniform resolution; performing group normalization on the features grouped by channel; concatenating the feature set to generate fusion features; down-sampling the fusion features multiple times to construct a feature pyramid for detection; and classifying and locating targets with a head detection network. The feature fusion and feature up-sampling method of the invention, used in both the training and testing stages of target detection, significantly improves the detection precision of small targets in aerial images while adding only a slight computational overhead.

Description

Aerial image small target detection method based on feature fusion and up-sampling
Technical Field
The invention relates to the field of aerial image target detection, in particular to an aerial image small target detection method based on feature fusion and up-sampling.
Background
Compared with a surveillance camera with a fixed position and field of view, a camera mounted on an unmanned aerial vehicle has natural advantages: convenient deployment, strong maneuverability, and a wide field of view. These advantages make it promising for many applications such as security monitoring, search and rescue, and crowd monitoring. In many drone applications, target detection in aerial images is a key component that is critical to building fully autonomous systems, and is therefore an urgent need in industry.
Although convolutional neural networks have achieved remarkable results in general target detection, their performance in unmanned aerial vehicle aerial scenes is not satisfactory. The main reason is that, compared with images of ordinary scenes, targets in aerial images have a smaller relative scale and a lower absolute resolution. As a result, the corresponding response regions in the extracted convolutional feature maps are small, which leads to a higher miss rate. More specifically, the length and width of the feature map extracted by a convolutional neural network are usually only 1/4 or 1/8 of those of the input image, which further weakens the feature map's ability to characterize small-scale targets. Therefore, how to strengthen the feature expression of small-scale targets becomes a key point of system design.
Most existing convolutional neural network methods adopt an FPN feature fusion network to improve the feature expression of small-scale targets. The specific process is as follows: a backbone network extracts a feature set from the input image; the high-level, low-resolution feature maps are up-sampled by bilinear interpolation and fused with the adjacent lower-level feature maps in sequence; detection is then performed on the fused feature set. However, the existing FPN feature fusion network cannot sufficiently fuse the information of feature maps with different resolutions, and bilinear interpolation is not an efficient up-sampling method. These two drawbacks limit the effectiveness of FPN in detecting small-sized targets.
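For concreteness, the following is a minimal PyTorch sketch of the conventional FPN fusion step described above: the higher-level, lower-resolution feature map is up-sampled by bilinear interpolation and fused with the adjacent lower-level map by element-wise addition. The function name and tensor shapes are illustrative only and not taken from any specific FPN implementation.

```python
import torch
import torch.nn.functional as F

def fpn_topdown_fuse(high_level_feat, low_level_feat):
    """Conventional FPN fusion step: bilinearly up-sample the higher-level,
    lower-resolution feature map to the spatial size of the adjacent
    lower-level feature map, then fuse by element-wise addition."""
    upsampled = F.interpolate(
        high_level_feat,
        size=low_level_feat.shape[-2:],   # match H, W of the lower-level map
        mode="bilinear",
        align_corners=False,
    )
    return low_level_feat + upsampled     # fused map used for detection

# toy example: C5 (low resolution) fused into C4 (higher resolution)
c5 = torch.randn(1, 256, 16, 16)
c4 = torch.randn(1, 256, 32, 32)
p4 = fpn_topdown_fuse(c5, c4)             # -> shape (1, 256, 32, 32)
```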
In summary, the key to improving small target detection from the aerial viewpoint is to improve the feature fusion strategy and the up-sampling method. The invention provides an aerial image small target detection method based on feature fusion and up-sampling, which comprises the following steps: extracting a feature set of the input image using a backbone network; constructing a channel standardization module and standardizing the channel dimension of the features; constructing a learning-based up-sampling layer and up-sampling the features to obtain a feature set with uniform resolution; performing group normalization on the features grouped by channel; concatenating the feature set to generate fusion features; down-sampling the fusion features multiple times to construct a feature pyramid for detection; and classifying and locating targets with a head detection network, finally outputting the detection results.
The present invention relates to the following prior art documents:
prior art document 1: he Kaim, et al, "Deep residual learning for imaging recognition," Proceedings of the IEEE conference on computer vision and dpattern recognition.2016.
Prior document 2: wu Y, He K.group nomenclature [ C ]// Proceedings of the European Conference on Computer Vision (ECCV).2018:3-19.
Prior document 3: L in T Y, Goyal P, Girshick R, et al. focal local for dense object detection [ C ]// Proceedings of the IEEE international conference on computer vision.2017: 2980-.
Prior art document 1 provides a feature extraction network composed mainly of residual modules based on residual connections, which reduces the training difficulty of deep networks and learns deeper features with stronger representation capability. Prior art document 2 provides a feature normalization method (group normalization) that alleviates the poor performance and convergence difficulty that batch normalization exhibits when the batch size is small. Prior art document 3 trains a high-performance one-stage dense object detector based on an FPN network and the Focal Loss loss function.
Disclosure of Invention
The invention aims to improve the detection precision of small targets in aerial images, so as to better accomplish tasks such as security monitoring, search and rescue, and crowd monitoring based on unmanned aerial vehicle target detection. To achieve this purpose, the invention provides an aerial image small target detection method based on feature fusion and up-sampling, in which a channel standardization module and an up-sampling layer are constructed to perform channel standardization and up-sampling on the features; the features are then group-normalized and concatenated into fusion features; the fusion features are down-sampled multiple times to generate a feature pyramid; and a head network classifies and locates the targets and outputs the detection results.
The purpose of the invention is realized by at least one of the following technical solutions.
An aerial image small target detection method based on feature fusion and up-sampling comprises the following steps:
s1, extracting a feature set of the input image by using a backbone network;
s2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in the step S1;
s3, constructing an up-sampling layer based on learning, and performing resolution up-sampling on the normalized features to obtain a feature set with uniform resolution;
s4, carrying out group normalization of grouping the characteristics with uniform resolution according to channels;
s5, splicing the feature sets after group normalization to generate fusion features;
s6, downsampling the fusion features for multiple times, and constructing a feature pyramid for detection;
and S7, classifying and locating the targets using the head detection network, and finally outputting the detection results.
Further, in step S1, the backbone network is a residual convolution network comprising five stages, each stage formed by connecting several similar residual modules in series, the output feature maps of which have the same resolution; 2-fold down-sampling is applied between adjacent stages, so the length and width of the feature map are each halved after down-sampling; the finally extracted feature set is the set consisting of the last feature map of each of stages two through five of the backbone network.
Further, in step S2, the channel normalization module is implemented by a convolutional layer; the input of the channel standardization module is a feature map in a feature set output by the backbone network, and the output of the channel standardization module is a feature map with standardized channel dimensions; the resolution of the feature map output by the channel normalization module is the same as that of the input feature map; the channel dimension number of the output feature map of the channel normalization module is a fixed value.
Further, in step S3, the learning-based upsampling layer is formed by cascading several upsampling modules; for the feature maps with different resolutions input by the learning-based upsampling layer, the number of cascaded upsampling modules is different, and the resolution of the finally output feature maps is the same; the up-sampling module is formed by connecting a layer of channel expansion layer and a layer of pixel rearrangement layer in series; the resolution of the up-sampling feature map output by the up-sampling module is 2 times of that of the input feature map; the channel dimension number of the feature diagram output by the channel expansion layer is 4 times of the channel dimension number of the input feature diagram; the number of channels of the feature map output by the pixel rearrangement layer is 1/4 of the number of channels of the input feature map, and the resolution of the output feature map is 2 times of the resolution of the input feature map.
Further, the formula of the pixel rearrangement layer is as follows:
PS(L)[x, y, c] = L[⌊x/r⌋, ⌊y/r⌋, C · r · mod(y, r) + C · mod(x, r) + c]

wherein PS denotes the pixel rearrangement operation, L denotes the input feature map of the pixel rearrangement layer, x and y denote the abscissa and ordinate of the output feature map, c denotes the channel coordinate of the output feature map, C denotes the number of channels of the output feature map, r denotes the up-sampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
Further, in step S4, the group normalization by channel includes the steps of:
S4.1, let I = (i_N, i_C, i_H, i_W) denote a feature map of the resolution-unified set output in step S3, expressed as a 4D tensor indexed in (N, C, H, W) order; the mean μ and standard deviation σ of the pixels in each channel group of the feature map I are calculated according to the following formulas:

μ = (1/m) · Σ_{k∈S} I_k

σ = sqrt( (1/m) · Σ_{k∈S} (I_k − μ)² + ε )

wherein ε denotes a small constant on the order of the gap between adjacent floating-point numbers in a computer; S denotes a pixel set formed by grouping the feature map I by channel, k denotes one pixel in the pixel set S, and m denotes the size of the pixel set S; the pixel set S is defined as:

S = { k | k_N = i_N, ⌊k_C / (C/G)⌋ = ⌊i_C / (C/G)⌋ }

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integral multiple of 16; C/G denotes the number of channels in each group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;

S4.2, the feature map I is normalized according to the following formula:

Î = (I − μ) / σ

wherein Î denotes the normalized feature map, and μ and σ are the mean and standard deviation calculated in step S4.1;

S4.3, a linear transformation is fitted after normalization to compensate for the possible loss of feature expression capacity; the transformation formula is as follows:

O = γ · Î + β

wherein O denotes the feature map output by the group normalization grouped by channel, and γ and β denote the fitted scale and offset parameters, respectively, with γ initialized to 1 and β initialized to 0.
Further, in step S5, concatenating the feature set to generate the fusion features refers to a tensor concatenation operation; the feature maps are concatenated along the channel dimension to obtain the fusion feature tensor.
Further, in step S6, down-sampling the fusion features multiple times to construct the feature pyramid means that the fusion feature map passes through several down-sampling layers connected in series, generating a series of low-resolution feature maps; the feature pyramid is the set formed by the low-resolution feature maps output by the down-sampling layers; the resolution of each output low-resolution feature map is 1/2 of the resolution of the feature map input to the down-sampling layer.
Further, in step S7, the feature maps of the feature pyramid output in step S6 are sequentially input into the head detection network, which outputs the category and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
further, the calculation steps of the target classification full convolution network are as follows:
S7.1.1, inputting the feature maps of the feature pyramid output in step S6 into several buffer convolution layers connected in series; the resolution and channel dimension of the feature map output by each buffer convolution layer are the same as those of the input feature map;
S7.1.2, inputting the feature map output by the buffer convolution layers into the classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in (N, C, H, W) order, representing the classification result output by the classification prediction layer; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; the number of channels of x is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
further, the target regression full convolution network is calculated by the following steps:
S7.2.1, inputting the feature maps of the feature pyramid output in step S6 into several buffer convolution layers connected in series; the resolution and channel dimension of the feature map output by each buffer convolution layer are the same as those of the input feature map;
S7.2.2, inputting the feature map output by the buffer convolution layers into the regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in (N, C, H, W) order, representing the regression result output by the regression prediction layer; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; the number of channels C of y is 4 × A, where A is the number of preset anchors.
S7.3, the results x and y output by the classification full convolution network and the regression full convolution network are combined to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in (N, C, H, W) order; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the size of C is (4+Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and includes the category and position coordinates of the targets.
Compared with the prior art, the invention has the beneficial effects that:
the invention improves the characteristic fusion process and the characteristic up-sampling method, can obviously improve the representation capability of the characteristic diagram, improves the small target detection precision, and only slightly increases the calculation overhead.
Drawings
FIG. 1 is a flow chart of a method for detecting small targets in aerial images based on feature fusion and upsampling;
FIG. 2 is a schematic diagram of a feature fusion network according to an embodiment of the present invention;
FIG. 3 is a diagram of a pixel rearrangement layer according to an embodiment of the present invention.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of various embodiments of the disclosure as defined by the claims and their equivalents. It includes various specific details to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the various embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to the written meaning, but are used only by the inventors to enable a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following description of the various embodiments of the present disclosure is provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
Embodiment:
a method for detecting small targets in aerial images based on feature fusion and up-sampling is disclosed, as shown in FIG. 1, and comprises the following steps:
s1, extracting a feature set of the input image by using a backbone network;
The backbone network is a residual convolution network comprising five stages, each stage formed by connecting several similar residual modules in series, the output feature maps of which have the same resolution; 2-fold down-sampling is applied between adjacent stages, so the length and width of the feature map are each halved after down-sampling; the finally extracted feature set is the set consisting of the last feature map of each of stages two through five of the backbone network.
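As an illustration of step S1, the sketch below uses torchvision's ResNet-50 as a stand-in for the residual backbone and collects the last feature map of stages two through five; the class name and the choice of ResNet-50 are assumptions for illustration only.

```python
import torch
import torchvision

class ResNetBackbone(torch.nn.Module):
    """Residual backbone whose stage-2 to stage-5 outputs (layer1-layer4 in
    torchvision naming) form the extracted feature set."""
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet50()
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stage2, self.stage3 = net.layer1, net.layer2
        self.stage4, self.stage5 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)
        c2 = self.stage2(x)   # stride 4 w.r.t. the input image
        c3 = self.stage3(c2)  # stride 8
        c4 = self.stage4(c3)  # stride 16
        c5 = self.stage5(c4)  # stride 32
        return [c2, c3, c4, c5]   # feature set used by the later steps

features = ResNetBackbone()(torch.randn(1, 3, 512, 512))
```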
S2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in the step S1;
the channel standardization module is realized by a convolution layer; the input of the channel standardization module is a feature map in a feature set output by the backbone network, and the output of the channel standardization module is a feature map with standardized channel dimensions; the resolution of the feature map output by the channel normalization module is the same as that of the input feature map; the channel dimension number of the output feature map of the channel normalization module is a fixed value.
In this embodiment, the convolution layer in the channel standardization module has a kernel size of 1, a padding of 1, and a stride of 1; the channel dimension of the feature map output by the channel standardization module is fixed at 256.
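A minimal sketch of the channel standardization module follows, assuming ResNet-50 stage widths for the input channel counts and dummy feature maps; note that padding is set to 0 in this sketch so that the 1×1 convolution preserves the spatial resolution stated above.

```python
import torch
import torch.nn as nn

# Channel standardization: one 1x1 convolution per backbone feature map,
# projecting every input to a fixed channel dimension (256 here).
# Padding is 0 so that the 1x1 kernel preserves the spatial resolution.
channel_norms = nn.ModuleList(
    nn.Conv2d(c_in, 256, kernel_size=1, stride=1, padding=0)
    for c_in in (256, 512, 1024, 2048)       # typical ResNet-50 stage widths
)

# dummy feature set from stages 2-5 (strides 4, 8, 16, 32 of a 512x512 input)
feats = [torch.randn(1, c, s, s) for c, s in zip((256, 512, 1024, 2048), (128, 64, 32, 16))]
standardized = [m(f) for m, f in zip(channel_norms, feats)]   # all now 256 channels
```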
S3, constructing an up-sampling layer based on learning, and performing resolution up-sampling on the normalized features to obtain a feature set with uniform resolution;
the learning-based up-sampling layer is formed by cascading a plurality of up-sampling modules; for the feature maps with different resolutions input by the learning-based upsampling layer, the number of cascaded upsampling modules is different, and the resolution of the finally output feature maps is the same; the up-sampling module is formed by connecting a layer of channel expansion layer and a layer of pixel rearrangement layer in series; the resolution of the up-sampling feature map output by the up-sampling module is 2 times of that of the input feature map; the channel dimension number of the feature diagram output by the channel expansion layer is 4 times of the channel dimension number of the input feature diagram; the number of channels of the feature map output by the pixel rearrangement layer is 1/4 of the number of channels of the input feature map, and the resolution of the output feature map is 2 times of the resolution of the input feature map.
In this embodiment, the channel expansion layer is implemented by a convolution layer with a kernel size of 1, a padding of 1, and a stride of 1, and the channel dimension of its output feature map is 1024; the channel dimension of the feature map output by the pixel rearrangement layer is 256;
As shown in FIG. 3, the formula of the pixel rearrangement layer is as follows:

PS(L)[x, y, c] = L[⌊x/r⌋, ⌊y/r⌋, C · r · mod(y, r) + C · mod(x, r) + c]

wherein PS denotes the pixel rearrangement operation, L denotes the input feature map of the pixel rearrangement layer, x and y denote the abscissa and ordinate of the output feature map, c denotes the channel coordinate of the output feature map, C denotes the number of channels of the output feature map, r denotes the up-sampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
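The up-sampling module of step S3 can be sketched as follows, assuming 256-channel inputs as in this embodiment; PyTorch's nn.PixelShuffle is used as the pixel rearrangement layer, and the module and function names are illustrative.

```python
import torch
import torch.nn as nn

class UpsampleModule(nn.Module):
    """Channel expansion (x4 channels) followed by pixel rearrangement
    (x2 resolution), as one up-sampling module."""
    def __init__(self, channels=256):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * 4, kernel_size=1)  # 256 -> 1024 channels
        self.shuffle = nn.PixelShuffle(upscale_factor=2)                # 1024 -> 256 ch, HxW -> 2Hx2W

    def forward(self, x):
        return self.shuffle(self.expand(x))

def build_upsample_layer(num_modules, channels=256):
    # Feature maps from deeper stages need more cascaded modules to reach
    # the common (highest) resolution of the feature set.
    return nn.Sequential(*[UpsampleModule(channels) for _ in range(num_modules)])

x = torch.randn(1, 256, 16, 16)
print(build_upsample_layer(3)(x).shape)   # torch.Size([1, 256, 128, 128])
```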
S4, carrying out group normalization of grouping the characteristics with uniform resolution according to channels;
the group normalization grouped by channel includes the steps of:
S4.1, let I = (i_N, i_C, i_H, i_W) denote a feature map of the resolution-unified set output in step S3, expressed as a 4D tensor indexed in (N, C, H, W) order; where N is the batch axis, C is the channel axis, and H and W are the feature map length and width axes, respectively; the mean μ and standard deviation σ of the pixels in each channel group of the feature map I are calculated according to the following formulas:

μ = (1/m) · Σ_{k∈S} I_k

σ = sqrt( (1/m) · Σ_{k∈S} (I_k − μ)² + ε )

wherein ε denotes a small constant on the order of the gap between adjacent floating-point numbers in a computer (in the Python language its size is 2.220446049250313e-16); S denotes a pixel set formed by grouping the feature map I by channel, k denotes one pixel in the pixel set S, and m denotes the size of the pixel set S; the pixel set S is defined as:

S = { k | k_N = i_N, ⌊k_C / (C/G)⌋ = ⌊i_C / (C/G)⌋ }

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integral multiple of 16 and defaults to 32; C/G denotes the number of channels in each group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;

S4.2, the feature map I is normalized according to the following formula:

Î = (I − μ) / σ

wherein Î denotes the normalized feature map, and μ and σ are the mean and standard deviation calculated in step S4.1;

S4.3, a linear transformation is fitted after normalization to compensate for the possible loss of feature expression capacity; the transformation formula is as follows:

O = γ · Î + β

wherein O denotes the feature map output by the group normalization grouped by channel, and γ and β denote the fitted scale and offset parameters, respectively, with γ initialized to 1 and β initialized to 0.
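A minimal sketch of the group normalization of step S4 follows; PyTorch's nn.GroupNorm performs the same per-group mean/std normalization and learnable affine transform (γ initialized to 1, β initialized to 0), and the manual computation below mirrors steps S4.1 to S4.3 for comparison. The tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

# Group normalization over channel groups (G = 32), followed by the learnable
# affine transform O = gamma * I_hat + beta (gamma init 1, beta init 0).
gn = nn.GroupNorm(num_groups=32, num_channels=256, eps=2.220446049250313e-16)

I = torch.randn(2, 256, 128, 128)   # a resolution-unified feature map
O = gn(I)

# Equivalent manual computation, mirroring steps S4.1-S4.3:
x = I.reshape(2, 32, 256 // 32, 128, 128)            # split channels into G groups
mu = x.mean(dim=(2, 3, 4), keepdim=True)              # per-group mean (S4.1)
sigma = torch.sqrt(x.var(dim=(2, 3, 4), unbiased=False, keepdim=True) + gn.eps)
I_hat = ((x - mu) / sigma).reshape_as(I)               # normalization (S4.2)
O_manual = gn.weight.view(1, -1, 1, 1) * I_hat + gn.bias.view(1, -1, 1, 1)  # affine (S4.3)
```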
S5, as shown in FIG. 2, splicing the feature set after group normalization to generate fusion features;
the splicing of the feature sets to generate the fused features refers to the splicing operation of tensors; and splicing the characteristic graphs along the dimension direction by the splicing operation of the tensor to obtain a fusion characteristic tensor.
S6, downsampling the fusion features for multiple times, and constructing a feature pyramid for detection;
the step of carrying out multiple downsampling on the fusion features to construct a feature pyramid refers to that a feature graph is subjected to a plurality of downsampling layers connected in series to generate a series of low-resolution feature graphs; the feature map pyramid is a set formed by low-resolution feature maps output by a down-sampling layer; the resolution of the output low resolution feature map is 1/2 of the resolution of the feature map of the downsampled layer input;
in this embodiment, the downsampling layer is implemented by a convolution layer; the convolution kernel size of the downsampling layer is 3, the padding is 1, and the step length is 2.
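A minimal sketch of step S6 follows, assuming the fusion feature has 1024 channels (four concatenated 256-channel maps); each serial stride-2 convolution halves the resolution, and the set of outputs forms the feature pyramid. The class name, channel count, and number of levels are assumptions.

```python
import torch
import torch.nn as nn

class FeaturePyramid(nn.Module):
    """Serial stride-2, 3x3 convolutions; each layer halves the resolution,
    and the set of their outputs forms the feature pyramid for detection."""
    def __init__(self, channels=1024, num_levels=4):
        super().__init__()
        self.downs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels)
        )

    def forward(self, fused):
        pyramid, x = [], fused
        for down in self.downs:
            x = down(x)             # 1/2 the resolution of its input
            pyramid.append(x)
        return pyramid

levels = FeaturePyramid()(torch.randn(1, 1024, 128, 128))
print([tuple(p.shape) for p in levels])   # spatial size halves at every level
```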
S7, classifying and positioning the target by using the head detection network, and finally outputting the detection result;
the head detection network sequentially inputs the feature map of the feature pyramid output in the step S6, and outputs the category and the position coordinates of the target; the head detection network comprises a classification full convolution network and a regression full convolution network;
the calculation steps of the target classification full convolution network are as follows:
S7.1.1, in this embodiment, the feature maps of the feature pyramid output in step S6 are input into 4 buffer convolution layers connected in series; the resolution and channel dimension of the feature map output by each buffer convolution layer are the same as those of the input feature map; the buffer convolution layers have a kernel size of 3, a padding of 1, a stride of 1, and 256 output channels;
S7.1.2, inputting the feature map output by the buffer convolution layers into the classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in (N, C, H, W) order, representing the classification result output by the classification prediction layer; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; in this embodiment, the classification prediction layer has a kernel size of 3, a padding of 1, and a stride of 1, and its number of output channels is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
the calculation steps of the target regression full convolution network are as follows:
S7.2.1, in this embodiment, the feature maps of the feature pyramid output in step S6 are input into 4 buffer convolution layers connected in series; the resolution and channel dimension of the feature map output by each buffer convolution layer are the same as those of the input feature map; the buffer convolution layers have a kernel size of 3, a padding of 1, a stride of 1, and 256 output channels;
S7.2.2, inputting the feature map output by the buffer convolution layers into the regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in (N, C, H, W) order, representing the regression result output by the regression prediction layer; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; in this embodiment, the regression prediction layer has a kernel size of 3, a padding of 1, and a stride of 1, and its number of output channels is 4 × A, where A is the number of preset anchors.
S7.3, the results x and y output by the classification full convolution network and the regression full convolution network are combined to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in (N, C, H, W) order; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the size of C is (4+Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and includes the category and position coordinates of the targets.
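A minimal sketch of the head detection network of step S7 follows, assuming each pyramid level carries 256 channels and using illustrative values for Cls and A; each branch stacks 4 buffer convolutions (3×3, 256 channels) before a 3×3 prediction convolution, and the classification and regression outputs are combined into a single (4+Cls)×A-channel tensor z. The ReLU activations between the buffer convolutions are an assumption not specified in the text.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Classification and regression full-convolution branches applied to each
    pyramid level: 4 buffer convs (3x3, 256 ch) then a 3x3 prediction conv."""
    def __init__(self, in_channels=256, num_classes=10, num_anchors=9):
        super().__init__()
        def buffer_stack(c_in):
            layers, c = [], c_in
            for _ in range(4):
                layers += [nn.Conv2d(c, 256, kernel_size=3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
                c = 256
            return nn.Sequential(*layers)
        self.cls_tower = buffer_stack(in_channels)
        self.reg_tower = buffer_stack(in_channels)
        self.cls_pred = nn.Conv2d(256, num_classes * num_anchors, kernel_size=3, padding=1)
        self.reg_pred = nn.Conv2d(256, 4 * num_anchors, kernel_size=3, padding=1)

    def forward(self, feat):
        x = self.cls_pred(self.cls_tower(feat))   # classification result: (N, Cls*A, H, W)
        y = self.reg_pred(self.reg_tower(feat))   # regression result:     (N, 4*A, H, W)
        return torch.cat([y, x], dim=1)           # combined z: (N, (4+Cls)*A, H, W)

z = DetectionHead()(torch.randn(1, 256, 64, 64))
print(z.shape)   # torch.Size([1, 126, 64, 64]) for Cls=10, A=9
```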
The above embodiment is merely an example given to clearly illustrate the present invention and is not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (9)

1. An aerial image small target detection method based on feature fusion and up-sampling is characterized by comprising the following steps:
s1, extracting a feature set of the input image by using a backbone network;
s2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in the step S1;
s3, constructing an up-sampling layer based on learning, and performing resolution up-sampling on the normalized features to obtain a feature set with uniform resolution;
s4, carrying out group normalization of grouping the characteristics with uniform resolution according to channels;
s5, splicing the feature sets after group normalization to generate fusion features;
s6, downsampling the fusion features for multiple times, and constructing a feature pyramid for detection;
and S7, classifying and locating the targets using the head detection network, and finally outputting the detection results.
2. The method for detecting small targets in aerial images based on feature fusion and up-sampling according to claim 1, wherein in step S1, the backbone network is a residual convolution network comprising five stages, each stage formed by connecting several similar residual modules in series, the output feature maps of which have the same resolution; 2-fold down-sampling is applied between adjacent stages, so the length and width of the feature map are each halved after down-sampling; the finally extracted feature set is the set consisting of the last feature map of each of stages two through five of the backbone network.
3. The method for detecting the small target in the aerial image based on the feature fusion and the up-sampling as claimed in claim 1, wherein in step S2, the channel normalization module is implemented by a convolutional layer; the input of the channel standardization module is a feature map in a feature set output by the backbone network, and the output of the channel standardization module is a feature map with standardized channel dimensions; the resolution of the feature map output by the channel normalization module is the same as that of the input feature map; the channel dimension number of the output feature map of the channel normalization module is a fixed value.
4. The method for detecting the small target in the aerial image based on the feature fusion and the up-sampling as claimed in claim 1, wherein in step S3, the learning-based up-sampling layer is formed by cascading a plurality of up-sampling modules; for the feature maps with different resolutions input by the learning-based upsampling layer, the number of cascaded upsampling modules is different, and the resolution of the finally output feature maps is the same; the up-sampling module is formed by connecting a layer of channel expansion layer and a layer of pixel rearrangement layer in series; the resolution of the up-sampling feature map output by the up-sampling module is 2 times of that of the input feature map; the channel dimension number of the feature diagram output by the channel expansion layer is 4 times of the channel dimension number of the input feature diagram; the number of channels of the feature map output by the pixel rearrangement layer is 1/4 of the number of channels of the input feature map, and the resolution of the output feature map is 2 times of the resolution of the input feature map.
5. The method for detecting the small target of the aerial image based on the feature fusion and the up-sampling according to claim 4, wherein the formula of the pixel rearrangement layer is as follows:
PS(L)[x, y, c] = L[⌊x/r⌋, ⌊y/r⌋, C · r · mod(y, r) + C · mod(x, r) + c]

wherein PS denotes the pixel rearrangement operation, L denotes the input feature map of the pixel rearrangement layer, x and y denote the abscissa and ordinate of the output feature map, c denotes the channel coordinate of the output feature map, C denotes the number of channels of the output feature map, r denotes the up-sampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
6. The method for detecting small targets in aerial images based on feature fusion and upsampling as claimed in claim 1, wherein in step S4, the group normalization by channel comprises the following steps:
S4.1, let I = (i_N, i_C, i_H, i_W) denote a feature map of the resolution-unified set output in step S3, expressed as a 4D tensor indexed in (N, C, H, W) order; where N is the batch axis, C is the channel axis, and H and W are the feature map length and width axes, respectively; the mean μ and standard deviation σ of the pixels in each channel group of the feature map I are calculated according to the following formulas:

μ = (1/m) · Σ_{k∈S} I_k

σ = sqrt( (1/m) · Σ_{k∈S} (I_k − μ)² + ε )

wherein ε denotes a small constant on the order of the gap between adjacent floating-point numbers in a computer; S denotes a pixel set formed by grouping the feature map I by channel, k denotes one pixel in the pixel set S, and m denotes the size of the pixel set S; the pixel set S is defined as:

S = { k | k_N = i_N, ⌊k_C / (C/G)⌋ = ⌊i_C / (C/G)⌋ }

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integral multiple of 16; C/G denotes the number of channels in each group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;

S4.2, the feature map I is normalized according to the following formula:

Î = (I − μ) / σ

wherein Î denotes the normalized feature map, and μ and σ are the mean and standard deviation calculated in step S4.1;

S4.3, a linear transformation is fitted after normalization to compensate for the possible loss of feature expression capacity; the transformation formula is as follows:

O = γ · Î + β

wherein O denotes the feature map output by the group normalization grouped by channel, and γ and β denote the fitted scale and offset parameters, respectively, with γ initialized to 1 and β initialized to 0.
7. The method for detecting small targets in aerial images based on feature fusion and up-sampling according to claim 1, wherein in step S5, concatenating the feature set to generate the fusion features refers to a tensor concatenation operation; the feature maps are concatenated along the channel dimension to obtain the fusion feature tensor.
8. The method for detecting small targets in aerial images based on feature fusion and up-sampling according to claim 1, wherein in step S6, down-sampling the fusion features multiple times to construct the feature pyramid means that the fusion feature map passes through several down-sampling layers connected in series, generating a series of low-resolution feature maps; the feature pyramid is the set formed by the low-resolution feature maps output by the down-sampling layers; the resolution of each output low-resolution feature map is 1/2 of the resolution of the feature map input to the down-sampling layer.
9. The method for detecting small targets in aerial images based on feature fusion and up-sampling according to claim 1, wherein in step S7, the feature maps of the feature pyramid output in step S6 are sequentially input into the head detection network, which outputs the category and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
the calculation steps of the target classification full convolution network are as follows:
S7.1.1, inputting the feature maps of the feature pyramid output in step S6 into several buffer convolution layers connected in series; the resolution and channel dimension of the feature map output by each buffer convolution layer are the same as those of the input feature map;
S7.1.2, inputting the feature map output by the buffer convolution layers into the classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in (N, C, H, W) order, representing the classification result output by the classification prediction layer; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; the number of channels of x is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
the calculation steps of the target regression full convolution network are as follows:
S7.2.1, inputting the feature maps of the feature pyramid output in step S6 into several buffer convolution layers connected in series; the resolution and channel dimension of the feature map output by each buffer convolution layer are the same as those of the input feature map;
S7.2.2, inputting the feature map output by the buffer convolution layers into the regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in (N, C, H, W) order, representing the regression result output by the regression prediction layer; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; the number of channels C of y is 4 × A, where A is the number of preset anchors;
S7.3, the results x and y output by the classification full convolution network and the regression full convolution network are combined to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in (N, C, H, W) order; where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the size of C is (4+Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and includes the category and position coordinates of the targets.
CN202010247656.9A 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling Active CN111461217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010247656.9A CN111461217B (en) 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010247656.9A CN111461217B (en) 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling

Publications (2)

Publication Number Publication Date
CN111461217A (en) 2020-07-28
CN111461217B CN111461217B (en) 2023-05-23

Family

ID=71682431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247656.9A Active CN111461217B (en) 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling

Country Status (1)

Country Link
CN (1) CN111461217B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070658A (en) * 2020-08-25 2020-12-11 西安理工大学 Chinese character font style migration method based on deep learning
CN112580721A (en) * 2020-12-19 2021-03-30 北京联合大学 Target key point detection method based on multi-resolution feature fusion
CN112633156A (en) * 2020-12-22 2021-04-09 浙江大华技术股份有限公司 Vehicle detection method, image processing apparatus, and computer-readable storage medium
CN112990317A (en) * 2021-03-18 2021-06-18 中国科学院长春光学精密机械与物理研究所 Weak and small target detection method
CN113111877A (en) * 2021-04-28 2021-07-13 奇瑞汽车股份有限公司 Characteristic pyramid and characteristic image extraction method thereof
CN113312995A (en) * 2021-05-18 2021-08-27 华南理工大学 Anchor-free vehicle-mounted pedestrian detection method based on central axis
CN114120077A (en) * 2022-01-27 2022-03-01 山东融瓴科技集团有限公司 Prevention and control risk early warning method based on big data of unmanned aerial vehicle aerial photography
CN111967538B (en) * 2020-09-25 2024-03-15 北京康夫子健康技术有限公司 Feature fusion method, device and equipment applied to small target detection and storage medium
CN117893990A (en) * 2024-03-18 2024-04-16 中国第一汽车股份有限公司 Road sign detection method, device and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN110097129A (en) * 2019-05-05 2019-08-06 西安电子科技大学 Remote sensing target detection method based on profile wave grouping feature pyramid convolution
CN110929649A (en) * 2019-11-24 2020-03-27 华南理工大学 Network and difficult sample mining method for small target detection

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN110097129A (en) * 2019-05-05 2019-08-06 西安电子科技大学 Remote sensing target detection method based on profile wave grouping feature pyramid convolution
CN110929649A (en) * 2019-11-24 2020-03-27 华南理工大学 Network and difficult sample mining method for small target detection

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070658A (en) * 2020-08-25 2020-12-11 西安理工大学 Chinese character font style migration method based on deep learning
CN112070658B (en) * 2020-08-25 2024-04-16 西安理工大学 Deep learning-based Chinese character font style migration method
CN111967538B (en) * 2020-09-25 2024-03-15 北京康夫子健康技术有限公司 Feature fusion method, device and equipment applied to small target detection and storage medium
CN112580721B (en) * 2020-12-19 2023-10-24 北京联合大学 Target key point detection method based on multi-resolution feature fusion
CN112580721A (en) * 2020-12-19 2021-03-30 北京联合大学 Target key point detection method based on multi-resolution feature fusion
CN112633156A (en) * 2020-12-22 2021-04-09 浙江大华技术股份有限公司 Vehicle detection method, image processing apparatus, and computer-readable storage medium
CN112633156B (en) * 2020-12-22 2024-05-31 浙江大华技术股份有限公司 Vehicle detection method, image processing device, and computer-readable storage medium
CN112990317A (en) * 2021-03-18 2021-06-18 中国科学院长春光学精密机械与物理研究所 Weak and small target detection method
CN113111877A (en) * 2021-04-28 2021-07-13 奇瑞汽车股份有限公司 Characteristic pyramid and characteristic image extraction method thereof
CN113312995B (en) * 2021-05-18 2023-02-14 华南理工大学 Anchor-free vehicle-mounted pedestrian detection method based on central axis
CN113312995A (en) * 2021-05-18 2021-08-27 华南理工大学 Anchor-free vehicle-mounted pedestrian detection method based on central axis
CN114120077A (en) * 2022-01-27 2022-03-01 山东融瓴科技集团有限公司 Prevention and control risk early warning method based on big data of unmanned aerial vehicle aerial photography
CN117893990A (en) * 2024-03-18 2024-04-16 中国第一汽车股份有限公司 Road sign detection method, device and computer equipment

Also Published As

Publication number Publication date
CN111461217B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN111461217A (en) Aerial image small target detection method based on feature fusion and up-sampling
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN110909642A (en) Remote sensing image target detection method based on multi-scale semantic feature fusion
CN108960261B (en) Salient object detection method based on attention mechanism
CN111524135A (en) Image enhancement-based method and system for detecting defects of small hardware fittings of power transmission line
CN108416292B (en) Unmanned aerial vehicle aerial image road extraction method based on deep learning
CN113723377B (en) Traffic sign detection method based on LD-SSD network
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN111860683B (en) Target detection method based on feature fusion
CN112183578B (en) Target detection method, medium and system
CN115035295B (en) Remote sensing image semantic segmentation method based on shared convolution kernel and boundary loss function
CN111582074A (en) Monitoring video leaf occlusion detection method based on scene depth information perception
CN111898608B (en) Natural scene multi-language character detection method based on boundary prediction
CN111767919B (en) Multilayer bidirectional feature extraction and fusion target detection method
CN113313118A (en) Self-adaptive variable-proportion target detection method based on multi-scale feature fusion
CN116563553B (en) Unmanned aerial vehicle image segmentation method and system based on deep learning
KR102239133B1 (en) Apparatus and method of defect classification using image transformation based on machine-learning
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN112464982A (en) Target detection model, method and application based on improved SSD algorithm
CN115100409B (en) Video portrait segmentation algorithm based on twin network
WO2020093210A1 (en) Scene segmentation method and system based on contenxtual information guidance
CN115909081A (en) Optical remote sensing image ground object classification method based on edge-guided multi-scale feature fusion
CN115860139A (en) Deep learning-based multi-scale ship target detection method
CN111898671B (en) Target identification method and system based on fusion of laser imager and color camera codes
CN114565764A (en) Port panorama sensing system based on ship instance segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant