CN111461217A - Aerial image small target detection method based on feature fusion and up-sampling - Google Patents
- Publication number: CN111461217A (application CN202010247656.9A)
- Authority: CN (China)
- Prior art keywords: feature, sampling, output, layer, feature map
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06F18/24 — Pattern recognition; classification techniques
- G06F18/254 — Pattern recognition; fusion techniques of classification results, e.g. of results related to same input data
- G06V2201/07 — Image or video recognition or understanding; target detection
- Y02T10/40 — Climate change mitigation technologies related to transportation; engine management systems
Abstract
The invention discloses an aerial image small target detection method based on feature fusion and upsampling. The method comprises the following steps: extracting a feature set of an input image with a backbone network; constructing a channel normalization module to normalize the channel dimension of the features; constructing a learning-based upsampling layer to upsample the features in resolution, obtaining a feature set of uniform resolution; performing group normalization with the features grouped by channel; concatenating the feature set to generate a fused feature; downsampling the fused feature several times to build a feature pyramid for detection; and classifying and locating targets with a head detection network. The feature fusion and feature upsampling methods of the invention, used in both the training and testing stages of target detection, significantly improve the detection accuracy of small targets in aerial images while only slightly increasing the computational overhead.
Description
Technical Field
The invention relates to the field of aerial image target detection, in particular to an aerial image small target detection method based on feature fusion and up-sampling.
Background
Compared with a surveillance camera with a fixed position and field of view, a camera mounted on an unmanned aerial vehicle has natural advantages: convenient deployment, strong maneuverability, and a wide field of view. These advantages make it attractive for many applications such as security monitoring, search and rescue, and crowd-flow monitoring. In many drone applications, target detection in aerial images is a key component, critical to building fully autonomous systems, and is therefore an urgent need in the industry.
Although convolutional neural networks have achieved remarkable results in general target detection, their performance in drone aerial scenes is not satisfactory. The main reason is that in a drone's aerial view, targets have smaller relative scale and absolute resolution than in ordinary scenes, so the corresponding response regions in the extracted convolutional feature maps are smaller, leading to a higher miss rate. More specifically, the feature map extracted by a convolutional neural network is typically reduced to 1/4 or 1/8 of the input image in length and width, further weakening its ability to represent small-scale targets. Therefore, strengthening the feature representation of small-scale targets becomes a key point of system design.
Most existing convolutional neural network methods adopt an FPN feature fusion network to improve the feature representation of small-scale targets. The specific process is as follows: extract the feature set of the input image with a backbone network; upsample the high-level low-resolution feature maps with bilinear interpolation and fuse each with its adjacent lower-level feature map in turn; and detect on the fused feature set. However, the existing FPN feature fusion network cannot sufficiently fuse the information of feature maps with different resolutions, and bilinear interpolation is not an efficient upsampling method. These two drawbacks leave FPN with limited effectiveness for detecting small targets.
In summary, the key to improving small target detection from the aerial viewpoint is to improve the feature fusion strategy and the upsampling method. The invention provides an aerial image small target detection method based on feature fusion and upsampling, comprising the following steps: extracting a feature set of an input image with a backbone network; constructing a channel normalization module to normalize the channel dimension of the features; constructing a learning-based upsampling layer to upsample the features in resolution, obtaining a feature set of uniform resolution; performing group normalization with the features grouped by channel; concatenating the feature set to generate a fused feature; downsampling the fused feature several times to build a feature pyramid for detection; and classifying and locating targets with a head detection network, finally outputting the detection result.
The present invention relates to the following prior art documents:
prior art document 1: he Kaim, et al, "Deep residual learning for imaging recognition," Proceedings of the IEEE conference on computer vision and dpattern recognition.2016.
Prior document 2: wu Y, He K.group nomenclature [ C ]// Proceedings of the European Conference on Computer Vision (ECCV).2018:3-19.
Prior document 3: L in T Y, Goyal P, Girshick R, et al. focal local for dense object detection [ C ]// Proceedings of the IEEE international conference on computer vision.2017: 2980-.
Prior art document 1 provides a feature extraction network composed mainly of residual modules based on residual connections, which reduces the training difficulty of deep networks and learns deeper features with stronger representation capability. Prior document 2 proposes a feature normalization method (group normalization) that avoids the poor results and convergence difficulties of ordinary batch normalization when the training batch is small. Prior document 3 trains a high-performance one-stage dense target detector based on an FPN network and the Focal Loss loss function.
Disclosure of Invention
The invention aims to improve the detection accuracy of small targets in aerial images, so as to better accomplish drone-based target detection tasks such as security monitoring, search and rescue, and crowd-flow monitoring. To this end, the invention provides an aerial image small target detection method based on feature fusion and upsampling: a channel normalization module and an upsampling layer are constructed to channel-normalize and upsample the features; the features are then group-normalized and concatenated into a fused feature; the fused feature is downsampled several times to generate a feature pyramid; and a head network classifies and locates targets and outputs the detection result.
The purpose of the invention is realized by at least one of the following technical solutions.
An aerial image small target detection method based on feature fusion and up-sampling comprises the following steps:
s1, extracting a feature set of the input image by using a backbone network;
s2, constructing a channel normalization module and normalizing the channel dimension of the features extracted in step S1;
s3, constructing an up-sampling layer based on learning, and performing resolution up-sampling on the normalized features to obtain a feature set with uniform resolution;
s4, carrying out group normalization of grouping the characteristics with uniform resolution according to channels;
s5, splicing the feature sets after group normalization to generate fusion features;
s6, downsampling the fusion features for multiple times, and constructing a feature pyramid for detection;
and S7, classifying and locating targets with a head detection network, finally outputting the detection result.
Further, in step S1, the backbone network is a residual convolutional network comprising five stages, each formed by several similar residual modules connected in series; the output feature maps of the residual modules within a stage have the same resolution. Between every two adjacent stages there is a 2x downsampling, which halves the length and width of the feature map. The finally extracted feature set is the set consisting of the last feature map of each of stages two to five of the backbone network.
Further, in step S2, the channel normalization module is implemented as a convolution layer; its input is a feature map from the feature set output by the backbone network, and its output is a feature map with normalized channel dimension; the output feature map has the same resolution as the input feature map, and its channel dimension is a fixed value.
Further, in step S3, the learning-based upsampling layer is formed by cascading several upsampling modules; feature maps of different input resolutions pass through different numbers of cascaded upsampling modules, so the finally output feature maps all have the same resolution. Each upsampling module is a channel expansion layer and a pixel rearrangement layer connected in series, and outputs an upsampled feature map at 2 times the input resolution. The channel expansion layer outputs a feature map with 4 times the channel dimension of its input; the pixel rearrangement layer outputs a feature map with 1/4 the channels and 2 times the resolution of its input.
Further, the formula of the pixel rearrangement layer is:

PS(L)(x, y, c) = L(⌊x/r⌋, ⌊y/r⌋, c·r² + r·mod(y, r) + mod(x, r))

where PS denotes the pixel rearrangement operation, L denotes the input feature map of the pixel rearrangement layer, x and y denote the abscissa and ordinate of the output feature map, c denotes the channel coordinate, r denotes the upsampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder.
Further, in step S4, the group normalization by channel includes the steps of:
s4.1, let I = (i_N, i_C, i_H, i_W) denote a feature map of the uniform-resolution set output in step S3, a 4D tensor indexed in (N, C, H, W) order. The mean μ and variance σ of the pixels of the feature map I are calculated according to the following formulas:

μ = (1/m) · Σ_{k∈S} I_k

σ = sqrt( (1/m) · Σ_{k∈S} (I_k − μ)² + ε )

where ε denotes the error between adjacent floating-point numbers in a computer, S denotes a pixel set formed by grouping the feature map I by channels, k denotes one pixel in the pixel set S, and m denotes the size of the pixel set S. The pixel set S is defined as:

S = { k | k_N = i_N, ⌊k_C / (C/G)⌋ = ⌊i_C / (C/G)⌋ }

where G denotes the number of groups, a predefined hyper-parameter whose value is an integral multiple of 16; C/G denotes the number of channels in each group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes; and k_N and k_C denote the coordinates of the pixel k on the N and C axes;
s4.2, normalize the feature map I according to:

Î = (I − μ) / σ

where Î denotes the normalized feature map, and μ and σ are the mean and standard deviation calculated in step S4.1;

s4.3, after normalization, fit a linear transformation to compensate for the possible loss of feature expression capacity:

O = γ · Î + β

where O denotes the output feature map of the group normalization grouped by channel, and γ and β denote the fitted scale and offset parameters, respectively; γ is initialized to 1 and β to 0.
Further, in step S5, concatenating the feature sets to generate the fused feature refers to the tensor concatenation operation: the feature maps are concatenated along the channel dimension to obtain the fused feature tensor.
Further, in step S6, downsampling the fused feature several times to construct the feature pyramid means passing the feature map through several downsampling layers connected in series to generate a series of low-resolution feature maps; the feature pyramid is the set formed by the low-resolution feature maps output by the downsampling layers; each downsampling layer outputs a feature map at 1/2 the resolution of its input.
Further, in step S7, the head detection network takes as input, in turn, each feature map of the feature pyramid output in step S6, and outputs the categories and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network.
further, the calculation steps of the target classification full convolution network are as follows:
s7.1.1, inputting the feature maps of the feature pyramid output in step S6 into several buffer convolution layers connected in series; the buffer convolution layers preserve the resolution and channel dimension of the input feature map;
s7.1.2, inputting the feature map output by the buffer convolution layers into the classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) denote the classification result output by the classification prediction layer, a 4D tensor indexed in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length, and width (N, H, W) of x are the same as those of the input feature map; the number of channels of x is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
further, the target regression full convolution network is calculated by the following steps:
s7.2.1, inputting the feature maps of the feature pyramid output in step S6 into the buffer convolution layers connected in series; the buffer convolution layers preserve the resolution and channel dimension of the input feature map;
s7.2.2, inputting the feature map output by the buffer convolution layers into the regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) denote the regression result output by the regression prediction layer, a 4D tensor indexed in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length, and width (N, H, W) of y are the same as those of the input feature map; the number of channels C of y is 4 × A, where A is the number of preset anchors.
S7.3, the results x and y output by the classification and regression full convolution networks are combined into a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the size of C is (4 + Cls) × A, where Cls is the number of target categories and A is the number of preset anchors. The 4D tensor z is the target detection result output by the network, containing the category and position coordinates of each target.
Compared with the prior art, the invention has the beneficial effects that:
the invention improves the characteristic fusion process and the characteristic up-sampling method, can obviously improve the representation capability of the characteristic diagram, improves the small target detection precision, and only slightly increases the calculation overhead.
Drawings
FIG. 1 is a flow chart of a method for detecting small targets in aerial images based on feature fusion and upsampling;
FIG. 2 is a schematic diagram of a feature fusion network according to an embodiment of the present invention;
FIG. 3 is a diagram of a pixel rearrangement layer according to an embodiment of the present invention.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of various embodiments of the disclosure as defined by the claims and their equivalents. It includes various specific details to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the various embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to the written meaning, but are used only by the inventors to enable a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following description of the various embodiments of the present disclosure is provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
Embodiment:
a method for detecting small targets in aerial images based on feature fusion and up-sampling is disclosed, as shown in FIG. 1, and comprises the following steps:
s1, extracting a feature set of the input image by using a backbone network;
the main network is a residual convolution network which comprises five stages, each stage is formed by connecting a plurality of similar residual modules in series, and the resolution ratios of the output characteristic graphs of the residual modules are the same; 2 times of down sampling exists between every two adjacent stages, and the length and the width of the feature map after down sampling are reduced by two times respectively; and the finally extracted feature set is a set consisting of the last feature map of the two to five stages of the backbone network.
S2, constructing a channel normalization module and normalizing the channel dimension of the features extracted in step S1;
The channel normalization module is implemented as a convolution layer; its input is a feature map from the feature set output by the backbone network, and its output is a feature map with normalized channel dimension; the output feature map has the same resolution as the input feature map, and its channel dimension is a fixed value.
In this embodiment, the convolution layer in the channel normalization module has kernel size 1, padding 0, and stride 1 (a 1x1 convolution with zero padding preserves the input resolution, as required above); the channel dimension of the feature map output by the channel normalization module is the fixed value 256.
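A minimal sketch of this channel normalization module: one 1x1 convolution per backbone feature map, mapping its channel count to the fixed value 256 while leaving resolution unchanged. The input channel counts and resolutions below are illustrative assumptions:

```python
import torch
import torch.nn as nn

def make_channel_norm(in_channels, out_channels=256):
    """Channel normalization module: a 1x1 convolution (stride 1, padding 0)
    that maps any input channel count to a fixed channel dimension."""
    return nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)

# Illustrative backbone outputs: (channels, side length) pairs, not fixed by the patent.
feats = [torch.randn(1, c, s, s) for c, s in [(64, 80), (128, 40), (256, 20), (512, 10)]]
normed = [make_channel_norm(f.shape[1])(f) for f in feats]
print([tuple(f.shape) for f in normed])
# [(1, 256, 80, 80), (1, 256, 40, 40), (1, 256, 20, 20), (1, 256, 10, 10)]
```

After this module every feature map has 256 channels, so the later upsampling and concatenation steps operate on a uniform channel dimension.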
S3, constructing an up-sampling layer based on learning, and performing resolution up-sampling on the normalized features to obtain a feature set with uniform resolution;
the learning-based up-sampling layer is formed by cascading a plurality of up-sampling modules; for the feature maps with different resolutions input by the learning-based upsampling layer, the number of cascaded upsampling modules is different, and the resolution of the finally output feature maps is the same; the up-sampling module is formed by connecting a layer of channel expansion layer and a layer of pixel rearrangement layer in series; the resolution of the up-sampling feature map output by the up-sampling module is 2 times of that of the input feature map; the channel dimension number of the feature diagram output by the channel expansion layer is 4 times of the channel dimension number of the input feature diagram; the number of channels of the feature map output by the pixel rearrangement layer is 1/4 of the number of channels of the input feature map, and the resolution of the output feature map is 2 times of the resolution of the input feature map.
In this embodiment, the channel expansion layer is implemented as a convolution layer with kernel size 1, padding 0, and stride 1 (zero padding so that resolution is preserved), outputting a feature map with 1024 channels; the pixel rearrangement layer outputs a feature map with 256 channels;
As shown in FIG. 3, the formula of the pixel rearrangement layer is:

PS(L)(x, y, c) = L(⌊x/r⌋, ⌊y/r⌋, c·r² + r·mod(y, r) + mod(x, r))

where PS denotes the pixel rearrangement operation, L denotes the input feature map of the pixel rearrangement layer, x and y denote the abscissa and ordinate of the output feature map, c denotes the channel coordinate, r denotes the upsampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder.
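The upsampling module and the cascaded upsampling layer can be sketched as follows, assuming PyTorch's `nn.PixelShuffle` as the pixel rearrangement layer (channel counts follow this embodiment: 256 in, 1024 after expansion, 256 after rearrangement):

```python
import torch
import torch.nn as nn

class UpsampleModule(nn.Module):
    """Channel expansion layer (1x1 conv, 4x channels) in series with a pixel
    rearrangement layer (r = 2): channels 256 -> 1024 -> 256, resolution x2."""
    def __init__(self, channels=256):
        super().__init__()
        self.expand = nn.Conv2d(channels, 4 * channels, kernel_size=1)
        self.rearrange = nn.PixelShuffle(2)  # pixel rearrangement, magnification r = 2

    def forward(self, x):
        return self.rearrange(self.expand(x))

def upsample_layer(x, num_modules):
    """Cascade of upsampling modules; lower-resolution feature maps pass through
    more modules so that every output reaches the same uniform resolution."""
    for _ in range(num_modules):
        x = UpsampleModule(x.shape[1])(x)
    return x

x = torch.randn(1, 256, 10, 10)
y = upsample_layer(x, 3)   # 3 cascaded modules: 10 -> 20 -> 40 -> 80
print(tuple(y.shape))      # (1, 256, 80, 80)
```

Because each module doubles the resolution and restores the 256-channel dimension, the map needing the largest magnification simply passes through the most modules.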
S4, carrying out group normalization of grouping the characteristics with uniform resolution according to channels;
the group normalization grouped by channel includes the steps of:
s4.1, let I = (i_N, i_C, i_H, i_W) denote a feature map of the uniform-resolution set output in step S3, a 4D tensor indexed in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the length and width axes of the feature map, respectively. The mean μ and variance σ of all pixels of the feature map I are calculated according to the following formulas:

μ = (1/m) · Σ_{k∈S} I_k

σ = sqrt( (1/m) · Σ_{k∈S} (I_k − μ)² + ε )

where ε denotes the error between adjacent floating-point numbers in a computer (in the Python language, ε = 2.220446049250313e-16), S denotes a pixel set formed by grouping the feature map I by channels, k denotes one pixel in the pixel set S, and m denotes the size of the pixel set S. The pixel set S is defined as:

S = { k | k_N = i_N, ⌊k_C / (C/G)⌋ = ⌊i_C / (C/G)⌋ }

where G denotes the number of groups, a predefined hyper-parameter whose value is an integral multiple of 16 (32 by default); C/G denotes the number of channels in each group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes; and k_N and k_C denote the coordinates of the pixel k on the N and C axes;
s4.2, normalize the feature map I according to:

Î = (I − μ) / σ

where Î denotes the normalized feature map, and μ and σ are the mean and standard deviation calculated in step S4.1;

s4.3, after normalization, fit a linear transformation to compensate for the possible loss of feature expression capacity:

O = γ · Î + β

where O denotes the output feature map of the group normalization grouped by channel, and γ and β denote the fitted scale and offset parameters, respectively; γ is initialized to 1 and β to 0.
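Steps S4.1 to S4.3 can be sketched as below; the result is checked against PyTorch's built-in group normalization (G = 32 and ε as in this embodiment; the input shape is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

EPS = 2.220446049250313e-16  # machine epsilon, as given in the text

def group_norm(I, G=32, gamma=None, beta=None, eps=EPS):
    """Group normalization grouped by channel (steps S4.1-S4.3):
    per-sample, per-group mean and std, then a fitted affine transform."""
    N, C, H, W = I.shape
    g = I.reshape(N, G, C // G, H, W)                 # C/G channels per group
    mu = g.mean(dim=(2, 3, 4), keepdim=True)          # S4.1: group mean
    sigma = (((g - mu) ** 2).mean(dim=(2, 3, 4), keepdim=True) + eps).sqrt()
    I_hat = ((g - mu) / sigma).reshape(N, C, H, W)    # S4.2: normalize
    if gamma is None:
        gamma = torch.ones(1, C, 1, 1)    # scale, initialized to 1
    if beta is None:
        beta = torch.zeros(1, C, 1, 1)    # offset, initialized to 0
    return gamma * I_hat + beta           # S4.3: linear transformation

I = torch.randn(2, 64, 8, 8)
O = group_norm(I)
print(torch.allclose(O, F.group_norm(I, 32, eps=EPS), atol=1e-5))
```

With γ = 1 and β = 0 the manual computation matches `F.group_norm`, confirming that the grouping and normalization follow the formulas above.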
S5, as shown in FIG. 2, splicing the feature set after group normalization to generate fusion features;
the splicing of the feature sets to generate the fused features refers to the splicing operation of tensors; and splicing the characteristic graphs along the dimension direction by the splicing operation of the tensor to obtain a fusion characteristic tensor.
S6, downsampling the fusion features for multiple times, and constructing a feature pyramid for detection;
the step of carrying out multiple downsampling on the fusion features to construct a feature pyramid refers to that a feature graph is subjected to a plurality of downsampling layers connected in series to generate a series of low-resolution feature graphs; the feature map pyramid is a set formed by low-resolution feature maps output by a down-sampling layer; the resolution of the output low resolution feature map is 1/2 of the resolution of the feature map of the downsampled layer input;
in this embodiment, the downsampling layer is implemented by a convolution layer; the convolution kernel size of the downsampling layer is 3, the padding is 1, and the step length is 2.
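Steps S5 and S6 together can be sketched as follows: tensor concatenation along the channel dimension, then a chain of stride-2 convolutions (kernel 3, padding 1, stride 2 as above). The number of input maps and pyramid levels are illustrative assumptions:

```python
import torch
import torch.nn as nn

# S5: concatenate the group-normalized, uniform-resolution feature maps
# along the channel dimension to form the fused feature tensor.
feats = [torch.randn(1, 256, 80, 80) for _ in range(4)]
fused = torch.cat(feats, dim=1)   # shape (1, 1024, 80, 80)

# S6: a chain of downsampling layers (3x3 conv, padding 1, stride 2),
# each halving the resolution, builds the feature pyramid.
pyramid = [fused]
x = fused
for _ in range(3):
    down = nn.Conv2d(x.shape[1], x.shape[1], 3, stride=2, padding=1)
    x = down(x)
    pyramid.append(x)
print([tuple(p.shape[-2:]) for p in pyramid])
# [(80, 80), (40, 40), (20, 20), (10, 10)]
```

Each pyramid level halves the spatial size of the previous one, giving the multi-scale feature set that the head detection network consumes.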
S7, classifying and positioning the target by using the head detection network, and finally outputting the detection result;
the head detection network sequentially inputs the feature map of the feature pyramid output in the step S6, and outputs the category and the position coordinates of the target; the head detection network comprises a classification full convolution network and a regression full convolution network;
the calculation steps of the target classification full convolution network are as follows:
s7.1.1, in this embodiment, the feature map of the feature pyramid output in step S6 is input into 4 buffer convolution layers connected in series; the buffer convolution layers preserve the resolution and channel dimension of the input feature map; each buffer convolution layer has kernel size 3, padding 1, stride 1, and 256 output channels;
s7.1.2, inputting the feature map output by the buffer convolution layers into the classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) denote the classification result output by the classification prediction layer, a 4D tensor indexed in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length, and width (N, H, W) of x are the same as those of the input feature map. In this embodiment, the classification prediction layer has kernel size 3, padding 1, stride 1, and Cls × A output channels, where Cls is the number of target categories and A is the number of preset anchors;
the calculation steps of the target regression full convolution network are as follows:
s7.2.1, in this embodiment, the feature map of the feature pyramid output in step S6 is input into 4 buffer convolution layers connected in series; the buffer convolution layers preserve the resolution and channel dimension of the input feature map; each buffer convolution layer has kernel size 3, padding 1, stride 1, and 256 output channels;
s7.2.2, inputting the feature map output by the buffer convolution layers into the regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) denote the regression result output by the regression prediction layer, a 4D tensor indexed in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length, and width (N, H, W) of y are the same as those of the input feature map. In this embodiment, the regression prediction layer has kernel size 3, padding 1, stride 1, and 4 × A output channels, where A is the number of preset anchors.
S7.3, the results x and y output by the classification and regression full convolution networks are combined into a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the size of C is (4 + Cls) × A, where Cls is the number of target categories and A is the number of preset anchors. The 4D tensor z is the target detection result output by the network, containing the category and position coordinates of each target.
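The head detection network of steps S7.1-S7.3 can be sketched as below; the values Cls = 10 and A = 9 are illustrative assumptions (the text leaves both as configuration values):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Classification and regression full convolution branches: each has 4
    buffer convolutions (3x3, padding 1, 256 channels) and one 3x3 prediction
    convolution. The outputs are merged into a tensor z with (4 + Cls) * A channels."""
    def __init__(self, in_channels=256, num_classes=10, num_anchors=9):
        super().__init__()
        def branch(out_channels):
            layers, c = [], in_channels
            for _ in range(4):  # buffer convolution layers
                layers += [nn.Conv2d(c, 256, 3, padding=1), nn.ReLU()]
                c = 256
            layers.append(nn.Conv2d(256, out_channels, 3, padding=1))
            return nn.Sequential(*layers)
        self.cls_net = branch(num_classes * num_anchors)  # x: Cls * A channels
        self.reg_net = branch(4 * num_anchors)            # y: 4 * A channels

    def forward(self, feat):
        x = self.cls_net(feat)            # classification scores
        y = self.reg_net(feat)            # box regression results
        return torch.cat([y, x], dim=1)   # z: (4 + Cls) * A channels

z = DetectionHead()(torch.randn(1, 256, 20, 20))
print(tuple(z.shape))   # (1, 126, 20, 20), since (4 + 10) * 9 = 126
```

Applying this head to every level of the feature pyramid yields one tensor z per level, containing the per-anchor category scores and box coordinates.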
The above embodiments are merely examples given to illustrate the present invention clearly and are not intended to limit its scope. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (9)
1. An aerial image small target detection method based on feature fusion and up-sampling is characterized by comprising the following steps:
s1, extracting a feature set of the input image by using a backbone network;
s2, constructing a channel normalization module, and normalizing the channel dimension of the features extracted in the step S1;
s3, constructing an up-sampling layer based on learning, and performing resolution up-sampling on the normalized features to obtain a feature set with uniform resolution;
s4, performing group normalization on the features with uniform resolution, grouping them by channel;
s5, splicing the feature sets after group normalization to generate fusion features;
s6, down-sampling the fusion features multiple times to construct a feature pyramid for detection;
and S7, classifying and localizing targets with the detection head network, and finally outputting the detection result.
2. The method for detecting small targets in aerial images based on feature fusion and up-sampling as claimed in claim 1, wherein in step S1, the backbone network is a residual convolutional network comprising five stages, each stage formed by connecting a plurality of similar residual modules in series, the feature maps output by the residual modules within a stage having the same resolution; 2× down-sampling is applied between every two adjacent stages, so that the length and width of the feature map are each halved after down-sampling; the finally extracted feature set is the set consisting of the last feature map of each of stages two through five of the backbone network.
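To make the resolution bookkeeping of claim 2 concrete, the following sketch computes the side lengths of the stage outputs; the 512×512 input size and the assumption that every stage (including the first) halves the resolution are illustrative choices, not part of the claim:

```python
# Illustrative sketch for claim 2: a five-stage residual backbone in which the
# spatial resolution is halved from stage to stage. The 512x512 input and the
# assumption that stage one also down-samples are hypothetical.
def stage_resolutions(input_hw: int = 512, num_stages: int = 5) -> list:
    """Side length of the feature map output by each backbone stage."""
    sizes = []
    hw = input_hw
    for _ in range(num_stages):
        hw //= 2                 # 2x down-sampling between adjacent stages
        sizes.append(hw)
    return sizes

all_stages = stage_resolutions(512)   # [256, 128, 64, 32, 16]
feature_set = all_stages[1:]          # last maps of stages two through five
```

Under these assumptions the extracted feature set spans strides 4 through 32 of the input image.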
3. The method for detecting small targets in aerial images based on feature fusion and up-sampling as claimed in claim 1, wherein in step S2, the channel normalization module is implemented by a convolutional layer; the input of the channel normalization module is a feature map in the feature set output by the backbone network, and its output is a feature map with normalized channel dimension; the resolution of the feature map output by the channel normalization module is the same as that of the input feature map; the number of channel dimensions of the output feature map is a fixed value.
4. The method for detecting small targets in aerial images based on feature fusion and up-sampling as claimed in claim 1, wherein in step S3, the learning-based up-sampling layer is formed by cascading a plurality of up-sampling modules; feature maps of different resolutions input to the learning-based up-sampling layer pass through different numbers of cascaded up-sampling modules, so that the finally output feature maps all have the same resolution; each up-sampling module is formed by connecting a channel expansion layer and a pixel rearrangement layer in series; the resolution of the up-sampled feature map output by the up-sampling module is 2 times that of the input feature map; the number of channels of the feature map output by the channel expansion layer is 4 times that of the input feature map; the number of channels of the feature map output by the pixel rearrangement layer is 1/4 that of its input, and the resolution of its output feature map is 2 times that of its input.
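A minimal NumPy sketch of the up-sampling module in claim 4 follows; the learned channel expansion layer is replaced by a simple channel-tiling stand-in, so only the shape behavior (4× channels in, then 1/4 channels and 2× resolution out) is illustrated:

```python
import numpy as np

def expand_channels(x: np.ndarray) -> np.ndarray:
    """Stand-in for the learned channel expansion: (N, C, H, W) -> (N, 4C, H, W)."""
    return np.tile(x, (1, 4, 1, 1))

def pixel_rearrange(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Pixel rearrangement (sub-pixel) layer: (N, C*r*r, H, W) -> (N, C, H*r, W*r)."""
    n, crr, h, w = x.shape
    c = crr // (r * r)
    x = x.reshape(n, c, r, r, h, w)
    x = x.transpose(0, 1, 4, 2, 5, 3)   # -> (N, C, H, r, W, r)
    return x.reshape(n, c, h * r, w * r)

x = np.random.rand(1, 256, 16, 16)
y = pixel_rearrange(expand_channels(x))   # one up-sampling module
# y.shape == (1, 256, 32, 32): channel count preserved, resolution doubled
```

Cascading one such module per octave of resolution difference brings all pyramid levels to a common resolution, as the claim describes.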
5. The method for detecting small targets in aerial images based on feature fusion and up-sampling according to claim 4, wherein the formula of the pixel rearrangement layer is as follows:

PS(L)_{x, y, c} = L_{⌊x/r⌋, ⌊y/r⌋, c·r² + r·mod(y, r) + mod(x, r)}

where PS represents the pixel rearrangement layer, L represents the input feature map of the pixel rearrangement layer, x and y represent the abscissa and ordinate of the output feature map, c represents the channel coordinate of the output feature map, r represents the up-sampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
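Read per output pixel, the rearrangement is just an index mapping; the small helper below returns, for an output coordinate, the input coordinate it copies from. The channel ordering c·r² + r·mod(y, r) + mod(x, r) is one common sub-pixel convention, assumed here consistently with the floor and mod operations named in the claim:

```python
# Per-pixel form of the pixel rearrangement in claim 5. The channel ordering
# c*r*r + r*mod(y, r) + mod(x, r) is an assumed (but common) convention.
def ps_index(x: int, y: int, c: int, r: int = 2) -> tuple:
    """Input (x', y', c') read by output pixel (x, y, c) of the rearrangement."""
    return (x // r, y // r, c * r * r + r * (y % r) + (x % r))

# Example: with r = 2, output pixel (5, 3) in output channel 0
# reads input position (2, 1) in input channel 3.
```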
6. The method for detecting small targets in aerial images based on feature fusion and upsampling as claimed in claim 1, wherein in step S4, the group normalization by channel comprises the following steps:
s4.1, let I = (I_N, I_C, I_H, I_W) denote, as a 4D tensor indexed in (N, C, H, W) order, a feature map of uniform resolution output in step S3; where N is the batch axis, C is the channel axis, and H and W are the feature map length and width axes, respectively; the mean μ and standard deviation σ over the pixel set S of the feature map I are calculated according to the following equations:

μ = (1/m) · Σ_{k ∈ S} I_k,    σ = √( (1/m) · Σ_{k ∈ S} (I_k − μ)² + ε )
where ε denotes a small constant, on the order of the spacing between adjacent floating-point numbers, added for numerical stability; S denotes the pixel set formed by grouping the feature map I by channels, k denotes one pixel in the pixel set S, and m denotes the size of the pixel set S; the pixel set S is defined as:

S = { k | k_N = I_N, ⌊k_C / (C/G)⌋ = ⌊I_C / (C/G)⌋ }

where G represents the number of groups, a predefined hyper-parameter whose value is an integral multiple of 16; C/G represents the number of channels in each group; ⌊·⌋ denotes rounding down; I_C and I_N denote the coordinates of the feature map I on the C and N axes, respectively; and k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;
s4.2, the feature map I is normalized, with the calculation formula:

Î = (I − μ) / σ

where Î represents the normalized feature map, and μ and σ are the mean and standard deviation calculated in step S4.1;
s4.3, a linear transformation is fitted after normalization to compensate for the possible loss of feature expression capability; the specific transformation formula is:

O = γ · Î + β

where O represents the feature map output by the channel-wise group normalization, Î represents the normalized feature map from step S4.2, and γ and β represent the fitted scale and shift parameters, respectively; the γ parameter is initialized to 1 and β is initialized to 0.
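Steps S4.1 through S4.3 can be sketched in NumPy as follows; the group count G = 16 and ε = 1e-5 are illustrative values, not prescribed by the claim:

```python
import numpy as np

def group_norm(x: np.ndarray, G: int = 16, eps: float = 1e-5,
               gamma: np.ndarray = None, beta: np.ndarray = None) -> np.ndarray:
    """Channel-grouped normalization of an (N, C, H, W) tensor, per claim 6."""
    n, c, h, w = x.shape
    assert c % G == 0, "channel count must be divisible by the group count G"
    g = x.reshape(n, G, c // G, h, w)                 # split channels into G groups
    mu = g.mean(axis=(2, 3, 4), keepdims=True)        # S4.1: per-group mean
    sigma = np.sqrt(g.var(axis=(2, 3, 4), keepdims=True) + eps)  # S4.1: per-group std
    g = (g - mu) / sigma                              # S4.2: normalization
    out = g.reshape(n, c, h, w)
    gamma = np.ones((1, c, 1, 1)) if gamma is None else gamma    # initialized to 1
    beta = np.zeros((1, c, 1, 1)) if beta is None else beta      # initialized to 0
    return gamma * out + beta                         # S4.3: fitted linear transform

x = np.random.randn(2, 32, 8, 8)
y = group_norm(x, G=16)
# each channel group of y now has near-zero mean and near-unit variance
```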
7. The method for detecting small targets in aerial images based on feature fusion and up-sampling as claimed in claim 1, wherein in step S5, splicing the feature set to generate the fusion feature refers to a tensor concatenation operation; the concatenation joins the feature maps along the channel dimension to obtain the fused feature tensor.
8. The method for detecting small targets in aerial images based on feature fusion and up-sampling as claimed in claim 1, wherein in step S6, down-sampling the fusion feature multiple times to construct the feature pyramid means passing the feature map through a plurality of down-sampling layers connected in series to generate a series of low-resolution feature maps; the feature pyramid is the set formed by the low-resolution feature maps output by the down-sampling layers; the resolution of each output low-resolution feature map is 1/2 that of the feature map input to the corresponding down-sampling layer.
9. The method for detecting small targets in aerial images based on feature fusion and up-sampling as claimed in claim 1, wherein in step S7, the head detection network takes as input, in turn, the feature maps of the feature pyramid output in step S6 and outputs the categories and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
the calculation steps of the target classification full convolution network are as follows:
s7.1.1, the feature maps of the feature pyramid output in step S6 are input into a plurality of buffer convolution layers connected in series; the resolution and number of channels of the feature map output by the buffer convolution layers are the same as those of the input feature map;
s7.1.2, the feature map output by the buffer convolution layers is input into the classification prediction layer; the classification prediction layer consists of a single convolutional layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in (N, C, H, W) order, representing the classification result output by the classification prediction layer; where N is the batch axis, C is the channel axis, and H and W are the height and width axes, respectively; the batch, height and width (N, H, W) of x are the same as those of the input feature map; the number of channels of x is Cls × A, where Cls is the number of target classes and A is the number of preset anchors;
the calculation steps of the target regression full convolution network are as follows:
s7.2.1, the feature maps of the feature pyramid output in step S6 are input into the buffer convolution layers connected in series; the resolution and number of channels of the feature map output by the buffer convolution layers are the same as those of the input feature map;
s7.2.2, the feature map output by the buffer convolution layers is input into the regression prediction layer; the regression prediction layer consists of a single convolutional layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in (N, C, H, W) order, representing the regression result output by the regression prediction layer; where N is the batch axis, C is the channel axis, and H and W are the height and width axes, respectively; the batch, height and width (N, H, W) of y are the same as those of the input feature map; the number of channels C of y is 4 × A, where A is the number of preset anchors;
s7.3, the results x and y output by the classification and regression full convolution networks are concatenated into a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in (N, C, H, W) order; where N is the batch axis, C is the channel axis, and H and W are the height and width axes, respectively; the size of C is (4 + Cls) × A, where Cls is the number of target classes and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network, containing the category and position coordinates of each target.
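The channel bookkeeping of the two head branches and their concatenation (claim 9, step S7.3) can be checked with a few lines; the Cls = 10 classes and A = 9 anchors are illustrative values only:

```python
# Output-channel accounting for the detection head: the classification branch
# emits Cls*A channels, the regression branch 4*A, and concatenating them along
# the channel axis yields the (4 + Cls)*A channels of the combined tensor z.
def head_channels(num_classes: int, num_anchors: int) -> tuple:
    cls_ch = num_classes * num_anchors   # one score per class per anchor
    reg_ch = 4 * num_anchors             # four box offsets per anchor
    return cls_ch, reg_ch, cls_ch + reg_ch

cls_ch, reg_ch, z_ch = head_channels(num_classes=10, num_anchors=9)
# cls_ch = 90, reg_ch = 36, z_ch = 126 = (4 + 10) * 9
```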
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010247656.9A CN111461217B (en) | 2020-03-31 | 2020-03-31 | Aerial image small target detection method based on feature fusion and up-sampling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111461217A true CN111461217A (en) | 2020-07-28 |
CN111461217B CN111461217B (en) | 2023-05-23 |
Family
ID=71682431
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010247656.9A Active CN111461217B (en) | 2020-03-31 | 2020-03-31 | Aerial image small target detection method based on feature fusion and up-sampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111461217B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344821A (en) * | 2018-08-30 | 2019-02-15 | 西安电子科技大学 | Small target detecting method based on Fusion Features and deep learning |
CN109598290A (en) * | 2018-11-22 | 2019-04-09 | 上海交通大学 | A kind of image small target detecting method combined based on hierarchical detection |
CN110097129A (en) * | 2019-05-05 | 2019-08-06 | 西安电子科技大学 | Remote sensing target detection method based on profile wave grouping feature pyramid convolution |
CN110929649A (en) * | 2019-11-24 | 2020-03-27 | 华南理工大学 | Network and difficult sample mining method for small target detection |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112070658A (en) * | 2020-08-25 | 2020-12-11 | 西安理工大学 | Chinese character font style migration method based on deep learning |
CN112070658B (en) * | 2020-08-25 | 2024-04-16 | 西安理工大学 | Deep learning-based Chinese character font style migration method |
CN111967538B (en) * | 2020-09-25 | 2024-03-15 | 北京康夫子健康技术有限公司 | Feature fusion method, device and equipment applied to small target detection and storage medium |
CN112580721B (en) * | 2020-12-19 | 2023-10-24 | 北京联合大学 | Target key point detection method based on multi-resolution feature fusion |
CN112580721A (en) * | 2020-12-19 | 2021-03-30 | 北京联合大学 | Target key point detection method based on multi-resolution feature fusion |
CN112633156A (en) * | 2020-12-22 | 2021-04-09 | 浙江大华技术股份有限公司 | Vehicle detection method, image processing apparatus, and computer-readable storage medium |
CN112633156B (en) * | 2020-12-22 | 2024-05-31 | 浙江大华技术股份有限公司 | Vehicle detection method, image processing device, and computer-readable storage medium |
CN112990317A (en) * | 2021-03-18 | 2021-06-18 | 中国科学院长春光学精密机械与物理研究所 | Weak and small target detection method |
CN113111877A (en) * | 2021-04-28 | 2021-07-13 | 奇瑞汽车股份有限公司 | Characteristic pyramid and characteristic image extraction method thereof |
CN113312995B (en) * | 2021-05-18 | 2023-02-14 | 华南理工大学 | Anchor-free vehicle-mounted pedestrian detection method based on central axis |
CN113312995A (en) * | 2021-05-18 | 2021-08-27 | 华南理工大学 | Anchor-free vehicle-mounted pedestrian detection method based on central axis |
CN114120077A (en) * | 2022-01-27 | 2022-03-01 | 山东融瓴科技集团有限公司 | Prevention and control risk early warning method based on big data of unmanned aerial vehicle aerial photography |
CN117893990A (en) * | 2024-03-18 | 2024-04-16 | 中国第一汽车股份有限公司 | Road sign detection method, device and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111461217B (en) | 2023-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111461217A (en) | Aerial image small target detection method based on feature fusion and up-sampling | |
CN113052210B (en) | Rapid low-light target detection method based on convolutional neural network | |
CN110909642A (en) | Remote sensing image target detection method based on multi-scale semantic feature fusion | |
CN108960261B (en) | Salient object detection method based on attention mechanism | |
CN111524135A (en) | Image enhancement-based method and system for detecting defects of small hardware fittings of power transmission line | |
CN108416292B (en) | Unmanned aerial vehicle aerial image road extraction method based on deep learning | |
CN113723377B (en) | Traffic sign detection method based on LD-SSD network | |
CN114359851A (en) | Unmanned target detection method, device, equipment and medium | |
CN111860683B (en) | Target detection method based on feature fusion | |
CN112183578B (en) | Target detection method, medium and system | |
CN115035295B (en) | Remote sensing image semantic segmentation method based on shared convolution kernel and boundary loss function | |
CN111582074A (en) | Monitoring video leaf occlusion detection method based on scene depth information perception | |
CN111898608B (en) | Natural scene multi-language character detection method based on boundary prediction | |
CN111767919B (en) | Multilayer bidirectional feature extraction and fusion target detection method | |
CN113313118A (en) | Self-adaptive variable-proportion target detection method based on multi-scale feature fusion | |
CN116563553B (en) | Unmanned aerial vehicle image segmentation method and system based on deep learning | |
KR102239133B1 (en) | Apparatus and method of defect classification using image transformation based on machine-learning | |
CN112418229A (en) | Unmanned ship marine scene image real-time segmentation method based on deep learning | |
CN112464982A (en) | Target detection model, method and application based on improved SSD algorithm | |
CN115100409B (en) | Video portrait segmentation algorithm based on twin network | |
WO2020093210A1 (en) | Scene segmentation method and system based on contenxtual information guidance | |
CN115909081A (en) | Optical remote sensing image ground object classification method based on edge-guided multi-scale feature fusion | |
CN115860139A (en) | Deep learning-based multi-scale ship target detection method | |
CN111898671B (en) | Target identification method and system based on fusion of laser imager and color camera codes | |
CN114565764A (en) | Port panorama sensing system based on ship instance segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||