CN111461217B - Aerial image small target detection method based on feature fusion and up-sampling - Google Patents

Aerial image small target detection method based on feature fusion and up-sampling

Info

Publication number
CN111461217B
Authority
CN
China
Prior art keywords
feature
channel
output
layer
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010247656.9A
Other languages
Chinese (zh)
Other versions
CN111461217A (en
Inventor
林沪
刘琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010247656.9A priority Critical patent/CN111461217B/en
Publication of CN111461217A publication Critical patent/CN111461217A/en
Application granted granted Critical
Publication of CN111461217B publication Critical patent/CN111461217B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an aerial image small target detection method based on feature fusion and up-sampling. The method comprises the following steps: extracting a feature set of an input image by using a backbone network; constructing a channel standardization module and standardizing the channel dimension of the features; constructing a learning-based upsampling layer and upsampling the features in resolution to obtain a feature set with uniform resolution; performing group normalization on the features by channel grouping; splicing the feature set to generate a fusion feature; downsampling the fusion feature several times and constructing a feature pyramid for detection; and classifying and localizing targets with the head detection network. The invention provides a feature fusion and feature up-sampling method for the training and testing stages of target detection, which can significantly improve the detection precision of small targets in aerial images while only slightly increasing the computational cost.

Description

Aerial image small target detection method based on feature fusion and up-sampling
Technical Field
The invention relates to the field of aerial image target detection, in particular to an aerial image small target detection method based on feature fusion and up-sampling.
Background
Compared with a surveillance camera with a fixed position and field of view, the camera mounted on an unmanned aerial vehicle has natural advantages such as convenient deployment, strong maneuverability and a wide field of view. These advantages can serve many applications, such as security monitoring, search and rescue, and crowd monitoring. In many unmanned aerial vehicle applications, target detection in aerial images is a critical component; it is essential for the development of fully autonomous systems and is therefore an urgent need in the industry.
Although convolutional neural networks have achieved remarkable results in general target detection, their performance in unmanned aerial vehicle aerial photography scenarios is not satisfactory. The main reason is that, compared with ordinary scenes, the relative scale and absolute resolution of targets in aerial images are smaller. As a result, the corresponding feature response regions in the extracted convolutional feature maps are smaller, which leads to a higher miss rate. More specifically, the length and width of the feature maps extracted by a convolutional neural network are often only 1/4 or 1/8 of those of the input image, which further weakens the ability of the feature maps to characterize small-scale targets. Therefore, how to strengthen the feature expression of small-scale targets becomes a key point of system design.
Most existing convolutional neural network methods adopt the FPN feature fusion network to improve the feature expression of small-scale targets. The typical flow is as follows: extract a feature set of the input image with a backbone network; upsample the high-level, low-resolution feature maps by bilinear interpolation and fuse them with the adjacent lower-level feature maps in turn; and detect on the fused feature set. However, the existing FPN feature fusion network cannot sufficiently fuse the information of feature maps with different resolutions, and bilinear interpolation is not an efficient upsampling method. These two defects limit the effectiveness of FPN in the detection of small-sized targets.
In summary, the key to improving small target detection under aerial viewing angles is to improve the feature fusion strategy and the upsampling method. The invention provides an aerial image small target detection method based on feature fusion and upsampling, which comprises the following steps: extracting a feature set of an input image by using a backbone network; constructing a channel standardization module and standardizing the channel dimension of the features; constructing a learning-based upsampling layer and upsampling the features in resolution to obtain a feature set with uniform resolution; performing group normalization on the features by channel grouping; splicing the feature set to generate a fusion feature; downsampling the fusion feature several times and constructing a feature pyramid for detection; and classifying and localizing targets with the head detection network, finally outputting the detection results.
The present invention relates to the following prior art documents:
Prior document 1: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Prior document 2: Wu, Y., He, K. "Group Normalization." Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19.
Prior document 3: Lin, T. Y., Goyal, P., Girshick, R., et al. "Focal Loss for Dense Object Detection." Proceedings of the IEEE International Conference on Computer Vision. 2017: 2980-2988.
Prior document 1 proposes a feature extraction network whose basic building block is a residual module based on residual connections; it reduces the training difficulty of deep networks and allows deeper features with stronger characterization capability to be learned. Prior document 2 proposes a feature normalization method that alleviates the problems of the original batch normalization when the network is trained with small batches, namely degraded effectiveness and difficulty in converging to a good solution. Prior document 3 trains a high-performance one-stage dense object detector based on an FPN network and the Focal Loss function. The present invention extracts the feature set of the input image using prior document 1, normalizes the feature maps with the channel-grouped group normalization of prior document 2, improves the feature fusion network on the basis of prior document 3, and trains the network with the loss function of prior document 3.
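For reference, the following is a minimal PyTorch sketch of the Focal Loss of prior document 3 as it would be applied to the per-anchor classification branch. It assumes sigmoid-based binary classification per anchor and the default parameters α = 0.25 and γ = 2 from that paper; it is an illustrative sketch, not code from the patent.

```python
import torch


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal Loss, FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), per anchor."""
    ce = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


# Example: 8 images, 100 anchors, binary labels given as floats.
loss = focal_loss(torch.randn(8, 100), torch.randint(0, 2, (8, 100)).float())
```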
Disclosure of Invention
The invention aims to improve the detection precision of small targets in aerial images, so as to better complete unmanned-aerial-vehicle target detection tasks such as security monitoring, search and rescue, and crowd monitoring. To this end, the invention provides an aerial image small target detection method based on feature fusion and upsampling: a channel standardization module and an upsampling layer are constructed to perform channel standardization and upsampling on the features; the features are then group-normalized and spliced into a fusion feature; the fusion feature is downsampled several times to generate a feature pyramid; and the head network classifies and localizes targets and outputs the detection results.
The object of the invention is achieved by at least one of the following technical solutions.
An aerial image small target detection method based on feature fusion and up-sampling comprises the following steps:
s1, extracting a feature set of an input image by using a backbone network;
s2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in the step S1;
s3, constructing an up-sampling layer based on learning, and carrying out resolution up-sampling on the standardized features to obtain a feature set with uniform resolution;
s4, carrying out group normalization on the features with uniform resolutions according to channel grouping;
s5, splicing the feature sets after the group normalization to generate fusion features;
s6, downsampling the fusion features for a plurality of times, and constructing a feature pyramid for detection;
s7, detecting network classification and positioning targets by using the head, and finally outputting detection results.
Further, in step S1, the backbone network is a residual convolution network comprising five stages; each stage is formed by connecting several similar residual modules in series, and the output feature maps of the residual modules within a stage have the same resolution; a 2x downsampling is applied between every two adjacent stages, so the length and width of the feature map are halved after each downsampling; the finally extracted feature set is the set formed by the last feature maps of the second to fifth stages of the backbone network.
Further, in step S2, the channel standardization module is implemented by a convolution layer; its input is a feature map from the feature set output by the backbone network, and its output is a feature map with a standardized channel dimension; the resolution of the output feature map is the same as that of the input feature map; the channel dimension number of the output feature map is a fixed value.
Further, in step S3, the learning-based upsampling layer is formed by cascading a plurality of upsampling modules; a different number of upsampling modules is applied to feature maps of different resolutions, so that the finally output feature maps all share the same resolution; each upsampling module is formed by connecting a channel expansion layer and a pixel rearrangement layer in series; the resolution of the upsampled feature map output by an upsampling module is 2 times that of its input feature map; the channel dimension number of the feature map output by the channel expansion layer is 4 times that of its input feature map; the channel number of the feature map output by the pixel rearrangement layer is 1/4 of that of its input feature map, and the resolution of its output feature map is 2 times that of its input feature map.
Further, the formula of the pixel rearrangement layer is as follows:
$$\mathcal{PS}(L)_{x,\,y,\,c} = L_{\lfloor x/r \rfloor,\; \lfloor y/r \rfloor,\; c \cdot r^{2} + r \cdot \mathrm{mod}(y,\,r) + \mathrm{mod}(x,\,r)}$$

wherein PS denotes the pixel rearrangement layer, L denotes the input feature map of the pixel rearrangement layer, x and y denote the horizontal and vertical coordinates of the output feature map, c denotes the channel coordinate of the output feature map, r denotes the upsampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
Further, in step S4, the group normalization by channel grouping comprises the following steps:
S4.1, let I = (i_N, i_C, i_H, i_W) be a 4D tensor indexed in the order (N, C, H, W), representing a feature map of uniform resolution output in step S3; the mean μ and standard deviation σ over the pixels of the feature map I are calculated according to the following formulas:

$$\mu = \frac{1}{m}\sum_{k \in S} I_k$$

$$\sigma = \sqrt{\frac{1}{m}\sum_{k \in S} (I_k - \mu)^2 + \epsilon}$$

wherein ε denotes the machine epsilon, i.e. the smallest distinguishable gap between adjacent floating-point numbers in the computer; S denotes the pixel set of the feature map I after grouping by channel; k denotes one pixel in the pixel set S; m denotes the size of the pixel set S; the pixel set S is defined as:

$$S = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integer multiple of 16; C/G denotes the number of channels per group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;
S4.2, normalize the feature map I according to the following formula:

$$\hat{I}_k = \frac{1}{\sigma}\left(I_k - \mu\right)$$

wherein Î denotes the normalized feature map, and σ and μ are the standard deviation and mean calculated in step S4.1;
S4.3, fit a linear transformation after normalization to compensate for the possible loss of feature expression capacity; the specific transformation formula is as follows:

$$O_k = \gamma \hat{I}_k + \beta$$

wherein O denotes the feature map output by the channel-grouped group normalization; γ and β denote the fitted scaling and offset parameters, respectively; the parameter γ is initialized to 1 and β is initialized to 0.
Further, in step S5, splicing the feature set to generate the fusion feature refers to a tensor concatenation operation; the concatenation operation splices the feature maps along the channel dimension to obtain a fused feature tensor.
Further, in step S6, downsampling the fusion feature several times to construct a feature pyramid means passing the feature map through a plurality of serially connected downsampling layers to generate a series of low-resolution feature maps; the feature pyramid refers to the set formed by the low-resolution feature maps output by the downsampling layers; the resolution of each output low-resolution feature map is 1/2 of the resolution of the feature map input to the downsampling layer.
Further, in step S7, the head detection network takes as input, in turn, the feature maps of the feature pyramid output in step S6 and outputs the categories and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
Further, the calculation steps of the target classification full convolution network are as follows:
S7.1.1, inputting the feature maps of the feature pyramid output in step S6 into a plurality of serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map;
S7.1.2, inputting the feature map output by the buffer convolution layers into a classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in the order (N, C, H, W), representing the classification result output by the classification prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; the channel number of x is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
Further, the calculation steps of the target regression full convolution network are as follows:
S7.2.1, inputting the feature maps of the feature pyramid output in step S6 into a plurality of serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map;
S7.2.2, inputting the feature map output by the buffer convolution layers into a regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in the order (N, C, H, W), representing the regression result output by the regression prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; the channel number of y is 4 × A, where A is the number of preset anchors.
S7.3, combining the results x and y output by the classification full convolution network and the regression full convolution network to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in the order (N, C, H, W); wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the channel number of z is (4 + Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and contains the categories and position coordinates of the targets.
Compared with the prior art, the invention has the beneficial effects that:
the invention improves the feature fusion flow and the feature up-sampling method, can obviously improve the characterization capability of the feature map, improves the small target detection precision, and only slightly increases the calculation cost.
Drawings
FIG. 1 is a flow chart of a method for detecting small targets in aerial images based on feature fusion and upsampling;
FIG. 2 is a schematic diagram of a network structure for feature fusion in an embodiment of the present invention;
fig. 3 is a schematic diagram of a pixel rearrangement layer according to an embodiment of the present invention.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the various embodiments of the disclosure defined by the claims and their equivalents. It includes various specific details to aid understanding, but these are to be considered merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the various embodiments of the present invention described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to written meanings, but are used only by the inventors to enable a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
Examples:
an aerial image small target detection method based on feature fusion and up-sampling, as shown in fig. 1, comprises the following steps:
s1, extracting a feature set of an input image by using a backbone network;
the main network is a residual convolution network, and the residual convolution network comprises five stages, each stage is formed by connecting a plurality of similar residual modules in series, and the resolutions of the output feature graphs of the residual modules are the same; 2 times of downsampling exists between every two adjacent stages, and the length and width of the feature map after downsampling are reduced by two times; the final extracted feature set is a set formed by the last feature map of the second to fifth stages of the backbone network.
S2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in the step S1;
the channel standardization module is realized by a convolution layer; the input of the channel standardization module is a feature diagram in a feature set output by the backbone network, and the output of the channel standardization module is a feature diagram of channel dimension standardization; the resolution of the feature map output by the channel normalization module is the same as the resolution of the input feature map; the channel dimension number of the output characteristic diagram of the channel normalization module is a fixed value.
In this embodiment, the convolution kernel size of the convolution layer in the channel standardization module is 1, the padding is 1, and the stride is 1; the number of channel dimensions of the feature map output by the channel standardization module is fixed at 256.
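A minimal sketch of the channel standardization module under these settings, assuming PyTorch. The sketch uses padding 0 so that the output resolution equals the input resolution, as the method requires, and the listed input channel counts assume a ResNet-50 backbone.

```python
import torch


def channel_standardization_module(in_channels, out_channels=256):
    """1x1 convolution that standardizes the channel dimension of a backbone feature map."""
    return torch.nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)


# One module per backbone stage output (ResNet-50 channel counts assumed).
channel_std = torch.nn.ModuleList(
    [channel_standardization_module(c) for c in (256, 512, 1024, 2048)]
)
```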
S3, constructing an up-sampling layer based on learning, and carrying out resolution up-sampling on the standardized features to obtain a feature set with uniform resolution;
the up-sampling layer based on learning is formed by cascading a plurality of up-sampling modules; the up-sampling layer based on learning has different numbers of up-sampling modules for the feature graphs with different resolutions, and the resolution of the feature graphs finally output is the same; the up-sampling module is formed by connecting a channel expansion layer and a pixel rearrangement layer in series; the resolution of the up-sampling feature map output by the up-sampling module is 2 times of that of the input feature map; the channel dimension number of the feature map output by the channel expansion layer is 4 times of the channel dimension number of the input feature map; the channel number of the feature image output by the pixel rearrangement layer is 1/4 of the channel number of the input feature image, and the resolution of the output feature image is 2 times of the resolution of the input feature image.
In this embodiment, the channel expansion layer is implemented by one convolution layer with kernel size 1, padding 1 and stride 1, and the channel dimension number of its output feature map is 1024; the channel dimension number of the output feature map of the pixel rearrangement layer is 256.
as shown in fig. 3, the formula of the pixel rearrangement layer is as follows:
$$\mathcal{PS}(L)_{x,\,y,\,c} = L_{\lfloor x/r \rfloor,\; \lfloor y/r \rfloor,\; c \cdot r^{2} + r \cdot \mathrm{mod}(y,\,r) + \mathrm{mod}(x,\,r)}$$

wherein PS denotes the pixel rearrangement layer, L denotes the input feature map of the pixel rearrangement layer, x and y denote the horizontal and vertical coordinates of the output feature map, c denotes the channel coordinate of the output feature map, r denotes the upsampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
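A minimal PyTorch sketch of one upsampling module under these settings; torch.nn.PixelShuffle implements the pixel rearrangement described by the formula above. The padding of 0 in the channel expansion convolution and the cascade depth per stage (e.g. three modules for the stage-five feature map) are assumptions of this example.

```python
import torch


class UpsampleModule(torch.nn.Module):
    """Channel expansion (channels x4) followed by pixel rearrangement (resolution x2)."""

    def __init__(self, channels=256):
        super().__init__()
        self.expand = torch.nn.Conv2d(channels, channels * 4, kernel_size=1)  # 256 -> 1024 channels
        self.rearrange = torch.nn.PixelShuffle(upscale_factor=2)              # 1024 -> 256, H and W doubled

    def forward(self, x):
        return self.rearrange(self.expand(x))


def make_upsampler(num_modules, channels=256):
    """Cascade of upsampling modules; a feature map k levels above the target
    resolution needs k modules, so all outputs share one resolution."""
    return torch.nn.Sequential(*[UpsampleModule(channels) for _ in range(num_modules)])


# Example: unify C2-C5 (after channel standardization) to the resolution of C2.
upsamplers = torch.nn.ModuleList([make_upsampler(k) for k in (0, 1, 2, 3)])
```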
S4, carrying out group normalization on the features with uniform resolutions according to channel grouping;
the group normalization by channel grouping comprises the steps of:
S4.1, let I = (i_N, i_C, i_H, i_W) be a 4D tensor indexed in the order (N, C, H, W), representing a feature map of uniform resolution output in step S3; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes of the feature map, respectively; the mean μ and standard deviation σ over the pixels of the feature map I are calculated according to the following formulas:
$$\mu = \frac{1}{m}\sum_{k \in S} I_k$$

$$\sigma = \sqrt{\frac{1}{m}\sum_{k \in S} (I_k - \mu)^2 + \epsilon}$$

wherein ε denotes the machine epsilon, i.e. the smallest distinguishable gap between adjacent floating-point numbers in the computer, which is 2.220446049250313e-16 in the Python language; S denotes the pixel set of the feature map I after grouping by channel; k denotes one pixel in the pixel set S; m denotes the size of the pixel set S; the pixel set S is defined as:

$$S = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integer multiple of 16 and is 32 by default; C/G denotes the number of channels per group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;
S4.2, normalize the feature map I according to the following formula:

$$\hat{I}_k = \frac{1}{\sigma}\left(I_k - \mu\right)$$

wherein Î denotes the normalized feature map, and σ and μ are the standard deviation and mean calculated in step S4.1;
S4.3, fit a linear transformation after normalization to compensate for the possible loss of feature expression capacity; the specific transformation formula is as follows:

$$O_k = \gamma \hat{I}_k + \beta$$

wherein O denotes the feature map output by the channel-grouped group normalization; γ and β denote the fitted scaling and offset parameters, respectively; the parameter γ is initialized to 1 and β is initialized to 0.
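A minimal PyTorch sketch of the channel-grouped group normalization under these settings: the hand-written function follows steps S4.1 to S4.3 with G = 32 and ε equal to the machine epsilon, and the built-in torch.nn.GroupNorm layer performs the same computation with learnable γ and β (initialized to 1 and 0) and its own default ε.

```python
import sys
import torch


def group_normalize(feature, num_groups=32, eps=sys.float_info.epsilon):
    """Group normalization by channel grouping, following steps S4.1-S4.3."""
    n, c, h, w = feature.shape
    gamma = torch.ones(1, c, 1, 1)    # fitted scaling parameter, initialized to 1
    beta = torch.zeros(1, c, 1, 1)    # fitted offset parameter, initialized to 0
    # S4.1: group the channels and compute the statistics inside each group.
    grouped = feature.reshape(n, num_groups, c // num_groups, h, w)
    mu = grouped.mean(dim=(2, 3, 4), keepdim=True)
    sigma = torch.sqrt(((grouped - mu) ** 2).mean(dim=(2, 3, 4), keepdim=True) + eps)
    # S4.2: normalize; S4.3: apply the fitted linear transformation.
    normalized = ((grouped - mu) / sigma).reshape(n, c, h, w)
    return gamma * normalized + beta


# Equivalent built-in layer with learnable gamma/beta (uses its own default eps).
gn = torch.nn.GroupNorm(num_groups=32, num_channels=256)
out = gn(torch.randn(2, 256, 64, 64))
```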
S5, as shown in FIG. 2, the feature sets after group normalization are spliced to generate fusion features;
the feature set is spliced, and the generation of fusion features refers to the splicing operation of tensors; and the tensor splicing operation splices the feature graphs along the dimension direction to obtain a fusion feature tensor.
S6, downsampling the fusion features for a plurality of times, and constructing a feature pyramid for detection;
the feature pyramid is constructed by carrying out multiple downsampling on the fusion features, namely a series of low-resolution feature graphs are generated by the feature graphs through a plurality of downsampling layers which are connected in series; the feature map pyramid refers to a set formed by low-resolution feature maps output by a downsampling layer; the resolution of the output low-resolution feature map is 1/2 of the resolution of the feature map input by the downsampling layer;
in this embodiment, the downsampling layer is implemented by a convolution layer; the convolution kernel of the downsampling layer is 3 in size, the filling is 1, and the step length is 2.
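A minimal PyTorch sketch of the fusion and feature pyramid construction under these settings: the group-normalized, uniform-resolution feature maps are concatenated along the channel dimension and passed through serially connected stride-2 convolutions (kernel 3, padding 1). The number of pyramid levels and the 256 output channels per downsampling layer are assumptions of this example; the embodiment fixes only the kernel size, padding and stride.

```python
import torch


class FeaturePyramid(torch.nn.Module):
    """Concatenates the unified-resolution features and downsamples repeatedly."""

    def __init__(self, num_inputs=4, channels=256, num_levels=5):
        super().__init__()
        fused_channels = num_inputs * channels
        self.downsample_layers = torch.nn.ModuleList([
            torch.nn.Conv2d(fused_channels if i == 0 else channels, channels,
                            kernel_size=3, padding=1, stride=2)
            for i in range(num_levels)
        ])

    def forward(self, unified_features):
        fused = torch.cat(unified_features, dim=1)  # tensor splicing along the channel axis
        pyramid, x = [], fused
        for layer in self.downsample_layers:
            x = layer(x)            # resolution halves at every downsampling layer
            pyramid.append(x)
        return pyramid


levels = FeaturePyramid()([torch.randn(1, 256, 128, 128) for _ in range(4)])
```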
S7, detecting network classification and positioning targets by using the head, and finally outputting detection results;
The head detection network takes as input, in turn, the feature maps of the feature pyramid output in step S6 and outputs the categories and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
the calculation steps of the target classification full convolution network are as follows:
S7.1.1, in this embodiment, the feature maps of the feature pyramid output in step S6 are input into 4 serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map; the convolution kernel size of the buffer convolution layers is 3, the padding is 1, the stride is 1, and the number of output channels is 256;
S7.1.2, inputting the feature map output by the buffer convolution layers into a classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in the order (N, C, H, W), representing the classification result output by the classification prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; in this embodiment, the convolution kernel size of the classification prediction layer is 3, the padding is 1, the stride is 1, and the number of output channels is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
The calculation steps of the target regression full convolution network are as follows:
S7.2.1, in this embodiment, the feature maps of the feature pyramid output in step S6 are input into 4 serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map; the convolution kernel size of the buffer convolution layers is 3, the padding is 1, the stride is 1, and the number of output channels is 256;
S7.2.2, inputting the feature map output by the buffer convolution layers into a regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in the order (N, C, H, W), representing the regression result output by the regression prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; in this embodiment, the convolution kernel size of the regression prediction layer is 3, the padding is 1, the stride is 1, and the number of output channels is 4 × A, where A is the number of preset anchors.
S7.3, combining the results x and y output by the classification full convolution network and the regression full convolution network to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in the order (N, C, H, W); wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the channel number of z is (4 + Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and contains the categories and position coordinates of the targets.
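A minimal PyTorch sketch of the head detection network under these settings: four buffer convolutions of 256 channels followed by a prediction convolution in each branch, and the channel-wise combination of step S7.3. The ReLU activations between buffer convolutions and the example class and anchor counts are assumptions of this sketch; the embodiment does not name an activation.

```python
import torch


def make_branch(out_channels, channels=256, num_buffers=4):
    """Buffer convolutions (3x3, padding 1, stride 1) followed by a prediction layer."""
    layers = []
    for _ in range(num_buffers):
        layers += [torch.nn.Conv2d(channels, channels, 3, padding=1),
                   torch.nn.ReLU(inplace=True)]
    layers.append(torch.nn.Conv2d(channels, out_channels, 3, padding=1))
    return torch.nn.Sequential(*layers)


class DetectionHead(torch.nn.Module):
    def __init__(self, num_classes=10, num_anchors=9):
        super().__init__()
        self.cls_branch = make_branch(num_classes * num_anchors)  # x: Cls * A channels
        self.reg_branch = make_branch(4 * num_anchors)            # y: 4 * A channels

    def forward(self, pyramid):
        outputs = []
        for feature in pyramid:
            x = self.cls_branch(feature)
            y = self.reg_branch(feature)
            # S7.3: combine along the channel axis -> (4 + Cls) * A channels per location.
            outputs.append(torch.cat([y, x], dim=1))
        return outputs


z = DetectionHead()([torch.randn(1, 256, s, s) for s in (64, 32, 16, 8, 4)])
```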
The above embodiment is only an example that clearly illustrates the present invention and does not limit the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the present invention.

Claims (8)

1. An aerial image small target detection method based on feature fusion and up-sampling, characterized by comprising the following steps:
S1, extracting a feature set of an input image by using a backbone network;
S2, constructing a channel standardization module, and standardizing the channel dimension of the features extracted in step S1;
S3, constructing a learning-based up-sampling layer, and up-sampling the standardized features in resolution to obtain a feature set with uniform resolution;
S4, performing group normalization on the features with uniform resolution by channel grouping; the group normalization by channel grouping comprises the following steps:
S4.1, let I = (i_N, i_C, i_H, i_W) be a 4D tensor indexed in the order (N, C, H, W), representing a feature map of uniform resolution output in step S3; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes of the feature map, respectively; the mean μ and standard deviation σ over the pixels of the feature map I are calculated according to the following formulas:

$$\mu = \frac{1}{m}\sum_{k \in S} I_k$$

$$\sigma = \sqrt{\frac{1}{m}\sum_{k \in S} (I_k - \mu)^2 + \epsilon}$$

wherein ε denotes the machine epsilon, i.e. the smallest distinguishable gap between adjacent floating-point numbers in the computer; S denotes the pixel set of the feature map I after grouping by channel; k denotes one pixel in the pixel set S; m denotes the size of the pixel set S; the pixel set S is defined as:

$$S = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}$$

wherein G denotes the number of groups, a predefined hyper-parameter whose value is an integer multiple of 16; C/G denotes the number of channels per group; ⌊·⌋ denotes rounding down; i_N and i_C denote the coordinates of the feature map I on the N and C axes, respectively; k_N and k_C denote the coordinates of the pixel k on the N and C axes, respectively;
S4.2, normalizing the feature map I according to the following formula:

$$\hat{I}_k = \frac{1}{\sigma}\left(I_k - \mu\right)$$

wherein Î denotes the normalized feature map, and σ and μ are the standard deviation and mean calculated in step S4.1;
S4.3, fitting a linear transformation after normalization to compensate for the possible loss of feature expression capacity; the specific transformation formula is as follows:

$$O_k = \gamma \hat{I}_k + \beta$$

wherein O denotes the feature map output by the channel-grouped group normalization; γ and β denote the fitted scaling and offset parameters, respectively; the parameter γ is initialized to 1 and the parameter β is initialized to 0;
S5, splicing the feature set after the group normalization to generate a fusion feature;
S6, downsampling the fusion feature a plurality of times, and constructing a feature pyramid for detection;
S7, classifying and localizing targets with the head detection network, and finally outputting the detection results.
2. The aerial image small target detection method based on feature fusion and up-sampling according to claim 1, wherein in step S1, the backbone network is a residual convolution network comprising five stages; each stage is formed by connecting several similar residual modules in series, and the output feature maps of the residual modules within a stage have the same resolution; a 2x downsampling is applied between every two adjacent stages, so the length and width of the feature map are halved after each downsampling; the finally extracted feature set is the set formed by the last feature maps of the second to fifth stages of the backbone network.
3. The aerial image small target detection method based on feature fusion and upsampling according to claim 1, wherein in step S2, the channel standardization module is implemented by a convolution layer; its input is a feature map from the feature set output by the backbone network, and its output is a feature map with a standardized channel dimension; the resolution of the output feature map is the same as that of the input feature map; the channel dimension number of the output feature map is a fixed value.
4. The aerial image small target detection method based on feature fusion and upsampling according to claim 1, wherein in step S3, the learning-based upsampling layer is formed by cascading a plurality of upsampling modules; a different number of upsampling modules is applied to feature maps of different resolutions, so that the finally output feature maps all share the same resolution; each upsampling module is formed by connecting a channel expansion layer and a pixel rearrangement layer in series; the resolution of the upsampled feature map output by an upsampling module is 2 times that of its input feature map; the channel dimension number of the feature map output by the channel expansion layer is 4 times that of its input feature map; the channel number of the feature map output by the pixel rearrangement layer is 1/4 of that of its input feature map, and the resolution of its output feature map is 2 times that of its input feature map.
5. The method for detecting the small target of the aerial image based on feature fusion and upsampling according to claim 4, wherein the formula of the pixel rearrangement layer is as follows:
$$\mathcal{PS}(L)_{x,\,y,\,c} = L_{\lfloor x/r \rfloor,\; \lfloor y/r \rfloor,\; c \cdot r^{2} + r \cdot \mathrm{mod}(y,\,r) + \mathrm{mod}(x,\,r)}$$

wherein PS denotes the pixel rearrangement layer, L denotes the input feature map of the pixel rearrangement layer, x and y denote the horizontal and vertical coordinates of the output feature map, c denotes the channel coordinate of the output feature map, r denotes the upsampling magnification, ⌊·⌋ denotes rounding down, and mod denotes the remainder operation.
6. The aerial image small target detection method based on feature fusion and up-sampling according to claim 1, wherein in step S5, splicing the feature set to generate the fusion feature refers to a tensor concatenation operation; the concatenation operation splices the feature maps along the channel dimension to obtain a fused feature tensor.
7. The aerial image small target detection method based on feature fusion and up-sampling according to claim 1, wherein in step S6, constructing a feature pyramid by downsampling the fusion feature a plurality of times means passing the feature map through a plurality of serially connected downsampling layers to generate a series of low-resolution feature maps; the feature pyramid refers to the set formed by the low-resolution feature maps output by the downsampling layers; the resolution of each output low-resolution feature map is 1/2 of the resolution of the feature map input to the downsampling layer.
8. The aerial image small target detection method based on feature fusion and up-sampling according to claim 1, wherein in step S7, the head detection network takes as input, in turn, the feature maps of the feature pyramid output in step S6, and outputs the categories and position coordinates of the targets; the head detection network comprises a classification full convolution network and a regression full convolution network;
the calculation steps of the target classification full convolution network are as follows:
S7.1.1, inputting the feature maps of the feature pyramid output in step S6 into a plurality of serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map;
S7.1.2, inputting the feature map output by the buffer convolution layers into a classification prediction layer; the classification prediction layer consists of one convolution layer; let x = (x_N, x_C, x_H, x_W) be a 4D tensor indexed in the order (N, C, H, W), representing the classification result output by the classification prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of x are the same as those of the input feature map; the channel number of x is Cls × A, where Cls is the number of target categories and A is the number of preset anchors;
the calculation steps of the target regression full convolution network are as follows:
S7.2.1, inputting the feature maps of the feature pyramid output in step S6 into a plurality of serially connected buffer convolution layers; the resolution and channel dimension number of the feature map output by a buffer convolution layer are the same as those of its input feature map;
S7.2.2, inputting the feature map output by the buffer convolution layers into a regression prediction layer; the regression prediction layer consists of one convolution layer; let y = (y_N, y_C, y_H, y_W) be a 4D tensor indexed in the order (N, C, H, W), representing the regression result output by the regression prediction layer; wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the batch, length and width (N, H, W) of y are the same as those of the input feature map; the channel number of y is 4 × A, where A is the number of preset anchors;
S7.3, combining the results x and y output by the classification full convolution network and the regression full convolution network to obtain a 4D tensor z = (z_N, z_C, z_H, z_W) indexed in the order (N, C, H, W); wherein N is the batch axis, C is the channel axis, and H and W are the length and width axes, respectively; the channel number of z is (4 + Cls) × A, where Cls is the number of target categories and A is the number of preset anchors; the 4D tensor z is the target detection result output by the network and contains the categories and position coordinates of the targets.
CN202010247656.9A 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling Active CN111461217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010247656.9A CN111461217B (en) 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling

Publications (2)

Publication Number Publication Date
CN111461217A CN111461217A (en) 2020-07-28
CN111461217B true CN111461217B (en) 2023-05-23

Family

ID=71682431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247656.9A Active CN111461217B (en) 2020-03-31 2020-03-31 Aerial image small target detection method based on feature fusion and up-sampling

Country Status (1)

Country Link
CN (1) CN111461217B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070658B (en) * 2020-08-25 2024-04-16 西安理工大学 Deep learning-based Chinese character font style migration method
CN111967538B (en) * 2020-09-25 2024-03-15 北京康夫子健康技术有限公司 Feature fusion method, device and equipment applied to small target detection and storage medium
CN112580721B (en) * 2020-12-19 2023-10-24 北京联合大学 Target key point detection method based on multi-resolution feature fusion
CN112633156B (en) * 2020-12-22 2024-05-31 浙江大华技术股份有限公司 Vehicle detection method, image processing device, and computer-readable storage medium
CN112990317B (en) * 2021-03-18 2022-08-30 中国科学院长春光学精密机械与物理研究所 Weak and small target detection method
CN113111877A (en) * 2021-04-28 2021-07-13 奇瑞汽车股份有限公司 Characteristic pyramid and characteristic image extraction method thereof
CN113312995B (en) * 2021-05-18 2023-02-14 华南理工大学 Anchor-free vehicle-mounted pedestrian detection method based on central axis
CN114120077B (en) * 2022-01-27 2022-05-03 山东融瓴科技集团有限公司 Prevention and control risk early warning method based on big data of unmanned aerial vehicle aerial photography

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN110097129A (en) * 2019-05-05 2019-08-06 西安电子科技大学 Remote sensing target detection method based on profile wave grouping feature pyramid convolution
CN110929649A (en) * 2019-11-24 2020-03-27 华南理工大学 Network and difficult sample mining method for small target detection

Also Published As

Publication number Publication date
CN111461217A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111461217B (en) Aerial image small target detection method based on feature fusion and up-sampling
CN111524135A (en) Image enhancement-based method and system for detecting defects of small hardware fittings of power transmission line
CN112884064A (en) Target detection and identification method based on neural network
CN111951212A (en) Method for identifying defects of contact network image of railway
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN112201078B (en) Automatic parking space detection method based on graph neural network
Wang et al. Spatial attention for multi-scale feature refinement for object detection
KR102157610B1 (en) System and method for automatically detecting structural damage by generating super resolution digital images
CN116256586B (en) Overheat detection method and device for power equipment, electronic equipment and storage medium
CN111582074A (en) Monitoring video leaf occlusion detection method based on scene depth information perception
CN113139906B (en) Training method and device for generator and storage medium
CN112802048B (en) Method and device for generating layer generation countermeasure network with asymmetric structure
CN110503609A (en) A kind of image rain removing method based on mixing sensor model
CN113240586A (en) Bolt image super-resolution processing method capable of adaptively adjusting amplification factor
CN113111740A (en) Characteristic weaving method for remote sensing image target detection
CN115100409B (en) Video portrait segmentation algorithm based on twin network
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
KR102239133B1 (en) Apparatus and method of defect classification using image transformation based on machine-learning
CN115860139A (en) Deep learning-based multi-scale ship target detection method
CN115909081A (en) Optical remote sensing image ground object classification method based on edge-guided multi-scale feature fusion
CN111898671B (en) Target identification method and system based on fusion of laser imager and color camera codes
CN115393743A (en) Vehicle detection method based on double-branch encoding and decoding network, unmanned aerial vehicle and medium
CN111047571B (en) Image salient target detection method with self-adaptive selection training process
KR20230085299A (en) System and method for detecting damage of structure by generating multi-scale resolution image
CN114565764A (en) Port panorama sensing system based on ship instance segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant