CN113191296A - Method for detecting five parameters of target in any orientation based on YOLOV5 - Google Patents


Info

Publication number
CN113191296A
CN113191296A (application CN202110521035.XA)
Authority
CN
China
Prior art keywords
feature
target
yolov5
parameters
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110521035.XA
Other languages
Chinese (zh)
Inventor
席智中
孙玉绘
王金根
张明义
范希辉
张罗政
朱静
陈代梅
许蒙恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA Army Academy of Artillery and Air Defense
Original Assignee
PLA Army Academy of Artillery and Air Defense
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA Army Academy of Artillery and Air Defense filed Critical PLA Army Academy of Artillery and Air Defense
Priority to CN202110521035.XA priority Critical patent/CN113191296A/en
Publication of CN113191296A publication Critical patent/CN113191296A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Astronomy & Astrophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting five parameters of an arbitrarily oriented target based on YOLOV5. The method first extracts features of a remote sensing image with the YOLOV5 feature extraction network and produces feature outputs at three scales. The five parameters of the target rotated box are first regressed directly from the output feature maps; the coordinates obtained after decoding these five parameters are then used to reconstruct the feature maps, from which more accurate coordinates are regressed. Training minimizes a Smooth L1 loss function so that the model converges faster and better. Considering different task requirements and hardware bottlenecks, lightweight accelerated models representing different speeds and accuracies are designed: the detection accuracy of the largest model reaches SOTA, while the model with the smallest network depth achieves near-real-time detection at relatively high accuracy, is convenient to deploy on mobile platforms such as unmanned aerial vehicles and Raspberry Pi boards, and has a very broad application prospect.

Description

Method for detecting five parameters of target in any orientation based on YOLOV5
Technical Field
The invention relates to the technical fields of target detection, image processing, algorithms and neural network applications, and in particular to a five-parameter detection method for arbitrarily oriented targets based on YOLOV5.
Background
With the improvement of hardware equipment and the continuous maturing of remote sensing technology, the quality and resolution of remote sensing images captured by satellites, radars and unmanned aerial vehicles have reached the level of natural images. However, objects in remote sensing images have distinct characteristics: all targets are seen from a top-down viewing angle; target scales vary widely; and objects such as vehicles, airplanes and ships are arranged in arbitrary directions. Detecting such rotated targets with a generic horizontal-box detector has three defects: the size and aspect ratio of the box cannot reflect the true shape of the target object, as in fig. 2a; object and background pixels are not effectively separated, as in fig. 2b; and densely packed objects are difficult to separate from each other, as in fig. 2c. Using a rectangular box in an arbitrary direction to detect and locate the target better reflects the position information of the object, as shown in figures 2d, 2e and 2f, and is of great significance in geography, agriculture and the military. Rotated-box detection methods originated from deep-learning-based detection of scene text in arbitrary directions; representative algorithms are as follows:
1. Traditional algorithms represented by SWT, Selective Search and EdgeBox
Before the advent of deep learning, rotated target detection and inclined scene text detection mainly relied on traditional algorithms such as SWT, MSER, ER, Selective Search and EdgeBox. The basic idea is: first binarize the picture (for example with adaptive binarization, optionally preceded by simple Gaussian filtering if there is noise), then obtain the target region through morphological operations such as erosion and dilation, then use a contour-finding function to obtain the points on the contour, and finally take the minimum enclosing rectangle. The SWT algorithm, for example, extracts edges and gradients with the Canny operator and then searches for the edge in the opposite direction along the gradient direction. The Edge Boxes algorithm uses edge information to determine the number of contours inside a candidate box and the number of contours overlapping the box border, scores each box accordingly, and selects proposal information (consisting of size, aspect ratio and position) in order of score; subsequent work runs the detection algorithm inside each proposal. The Selective Search algorithm first divides the picture into many small regions with a simple region segmentation algorithm and then repeatedly merges adjacent regions according to pixel similarity and region size (small regions are merged first, which prevents large regions from continuously absorbing small regions and breaking the hierarchical relationship), similar in spirit to clustering. After the approximate target region is obtained, its minimum enclosing rectangle is drawn (for scene text, a rectangle at an arbitrary angle).
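As an illustration of this traditional pipeline (not part of the patent), a minimal OpenCV sketch might look as follows; the file name and the thresholds are placeholders:

```python
import cv2

# Minimal sketch of the traditional rotated-box pipeline described above:
# binarize -> morphology -> contours -> minimum enclosing (rotated) rectangle.
img = cv2.imread("remote_sensing_tile.png")           # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)               # simple noise filtering
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Erosion/dilation to clean up the target region.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for cnt in contours:
    if cv2.contourArea(cnt) < 100:                     # skip tiny blobs (arbitrary threshold)
        continue
    (cx, cy), (w, h), angle = cv2.minAreaRect(cnt)     # five-parameter rotated box
    corners = cv2.boxPoints(((cx, cy), (w, h), angle))
    print(cx, cy, w, h, angle, corners)
```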
2. RRPN inclined text detection method
The RRPN algorithm appeared in 2018 and is mainly used for inclined text detection. It is based on the region-proposal method of Faster R-CNN and represents a rotated rectangle with a five-parameter method: center point, width and height, and rotation angle. Anchor boxes with angles are generated in advance during detection, combined with RRoI (Rotation Region-of-Interest) pooling to learn rotated regions of interest. During training, a prediction box whose IoU (intersection over union) with a GT (ground-truth) box is greater than 0.7 and whose angle difference from the GT box is less than π/12 is taken as a positive sample; a prediction box whose IoU with the GT box is less than 0.3, or whose IoU is greater than 0.7 but whose angle difference exceeds π/12, is taken as a negative sample. Smooth L1 is used as the regression loss and cross-entropy as the classification loss. In addition, the method proposes a triangle-decomposition approach for computing the overlapping area of inclined rectangles and achieves good results.
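A minimal sketch of the RRPN-style label assignment rule described above is shown below for illustration; rotated_iou is an assumed helper for the inclined-rectangle overlap, not a function defined in the patent:

```python
import math

def assign_label(pred_box, gt_box, rotated_iou):
    """pred_box and gt_box are (x, y, w, h, theta) tuples with theta in radians.
    rotated_iou is an assumed helper computing IoU of two rotated boxes."""
    iou = rotated_iou(pred_box, gt_box)
    angle_diff = abs(pred_box[4] - gt_box[4])
    if iou > 0.7 and angle_diff < math.pi / 12:
        return 1          # positive sample
    if iou < 0.3 or (iou > 0.7 and angle_diff > math.pi / 12):
        return 0          # negative sample
    return -1             # ignored during training
```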
3. ROI Transformer
The core idea of this method is to introduce an RoI Transformer module that converts the horizontal anchor boxes output by the RPN stage into rotated anchor boxes, thereby avoiding the huge amount of computation caused by introducing a large number of rotated anchors. The RoI Transformer module has two parts. The first part, the RRoI Learner, is responsible for learning RRoIs (rotated regions of interest) from HRoIs (horizontal regions of interest): the feature map is fed into a five-dimensional fully connected layer that produces an offset (x, y, w, h, θ). The second part, RRoI Warping, extracts rotation-invariant depth features from the input feature map and the RRoIs, further regresses a refined offset, and decodes it to obtain the output rotated box. In the ideal case, each HRoI is the circumscribed rectangle of its RRoI. By introducing the RoI Transformer, the method greatly reduces computational cost and achieves good results.
4. Gliding Vertex
This method was published in 2020. It represents an object by learning the offsets of its four vertices along the sides of the non-rotated (horizontal) bounding rectangle, thereby describing a quadrilateral. The network structure is also based on Faster R-CNN, with classification and regression performed separately in the final fully connected layer. The final position regression uses a nine-parameter method: besides the horizontal box coordinates (x, y, w, h) and the four vertex offsets (α1, α2, α3, α4), an obliquity factor r (computed as the ratio of the area of the quadrilateral to the area of its circumscribed horizontal rectangle) is introduced to decide whether the box is horizontal or rotated. For horizontal targets α is set to 1, and when r is greater than 0.95 the box is judged to be a horizontal rectangle.
5. P-RSDet
This method, titled "Object Detection for Remote Sensing Image Based on Polar Coordinates", was published in 2020. It introduces polar coordinates into rotated target detection for the first time and is characterized by fewer parameters and higher speed. Its rotated-box representation follows CornerNet: it regresses the pole point (x, y), a polar radius ρ and two polar angles (θ1, θ2). The feature extraction network offers different backbones such as ResNet101, DLA34 and Hourglass, representing different scales and speeds. In the detection head, the extreme points are regressed with a Gaussian heat map similar to CenterNet, outputting a probability map of extreme-point locations, and the classification loss is the Focal Loss. In the regression loss, the center-point coordinates use Smooth L1, while for the extreme-point coordinates the authors introduce the Polar Ring Area Loss, expanded specifically as:
L_pr(ρ, θ) = Smooth L1( |[ρ² - (ρ*)²](θ - θ*)|, 0 )
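A small PyTorch sketch of this Polar Ring Area Loss, written directly from the formula above (the tensor names are assumptions):

```python
import torch
import torch.nn.functional as F

def polar_ring_area_loss(rho_pred, theta_pred, rho_gt, theta_gt):
    """Polar Ring Area Loss as written above:
    Smooth L1 of |(rho^2 - rho*^2) * (theta - theta*)| against 0."""
    ring_area = torch.abs((rho_pred ** 2 - rho_gt ** 2) * (theta_pred - theta_gt))
    return F.smooth_l1_loss(ring_area, torch.zeros_like(ring_area))
```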
The first class of methods, i.e. the traditional detection methods, requires hand-crafted feature extraction operators designed for each target type, has poor robustness, can only extract shallow features, and has weak semantic expression ability. The SWT algorithm, for example, extracts edges and gradients with the Canny operator and then searches for the opposite edge along the gradient direction; yet even when all edges are extracted accurately, computing the target width during the edge search remains problematic. The Edge Boxes algorithm, moreover, is not a learning-based algorithm and has no training process: if a detector is trained for pedestrians the highest-scoring proposal (region of interest) will certainly be a pedestrian, if it is trained for cars the highest-scoring proposal will be a car, and so on, so it cannot generalize across categories. The second class is a straightforward extension of the horizontal Faster R-CNN and needs a large number of anchors (anchor boxes) designed to cover all possible scales, aspect ratios and angles of the target, which is computationally expensive. In the third class, the feature extraction network is less effective; the FPN then outputs five levels of feature maps, which increases the computation, and each HRoI is followed by a five-dimensional fully connected layer with the same number of channels, whose parameter count greatly affects inference speed. In the fourth method, the nine-parameter regression relies on the horizontal detection box generated in the first stage: if the first-stage regression is inaccurate, the four offsets predicted in the second stage cannot be accurate either. The fifth method, unlike the first four, opens up a new line of thought for rotated target detection; however, because it is anchor-free (no anchors are generated at prediction time and regression is performed directly, which saves a large amount of time), the gain in speed inevitably comes with a loss in accuracy.
Therefore, designing an anchor-based rotated target detection model that is both fast and accurate and can reach SOTA is of great significance for detecting rotated targets in remote sensing images.
Disclosure of Invention
The invention aims to make up for the defects of the prior art and provides a five-parameter detection method for arbitrarily oriented targets based on YOLOV5. First, the YOLOV5 feature extraction network extracts features of the remote sensing image; an FPN + PAN structure then produces feature outputs at three scales, and classification and regression are performed directly on the output feature maps to obtain the position and category information of targets in the image. In the second detection stage, the target position information obtained in the first stage is used to reconstruct the features, yielding a finer feature map from which more accurate coordinates are regressed. Training minimizes a Smooth L1 loss function so that the angle loss of the model converges faster and better. In addition, four models, from large to small and lightweight, are designed with different numbers of convolution layers, representing different amounts of computation, accuracy and detection speed, so that a network depth can be chosen according to the task. Compared with the prior art, the method reaches SOTA in both detection accuracy and speed.
The invention is realized by the following technical scheme:
a five-parameter detection method of any orientation target based on YOLOV5 comprises the following specific steps:
(1) inputting the obtained remote sensing image into the YOLOV5 feature extraction network for feature extraction to obtain three feature maps of different scales;
(2) classifying and regressing on the feature maps obtained in step (1), and performing a feature reconstruction operation on the regression result to obtain a more refined feature map;
(3) classifying and regressing again using the refined feature map obtained in step (2), outputting the result and computing the loss (an illustrative sketch of these three steps is given below).
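The following structural sketch illustrates these three steps; all module names (backbone, head, reconstruct_features, rotated_nms) are assumed placeholders rather than the patent's actual components:

```python
def detect(image, backbone, head, reconstruct_features, rotated_nms):
    """Illustrative three-step pipeline corresponding to steps (1)-(3)."""
    # (1) YOLOV5-style backbone + FPN/PAN neck -> three feature maps of different scales
    feats = backbone(image)                       # e.g. [P3, P4, P5]

    # (2) first classification/regression, decode five-parameter boxes,
    #     then rebuild refined feature maps from the decoded box positions
    cls1, boxes1 = head(feats)
    feats_refined = [reconstruct_features(f, b, c) for f, b, c in zip(feats, boxes1, cls1)]

    # (3) second classification/regression on the refined maps, rotated NMS on the output
    cls2, boxes2 = head(feats_refined)
    return rotated_nms(boxes2, cls2)
```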
Before feature extraction with the YOLOV5 network in step (1), data enhancement operations such as random flipping, stretching, color gamut transformation and random image graying (for infrared image detection) are applied to the remote sensing image; the image is then uniformly scaled to a standard size, subjected to the Focus slicing operation, and finally fed into the YOLOV5 feature extraction network for feature extraction.
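For illustration, the Focus slicing operation mentioned above is commonly implemented as a space-to-depth rearrangement followed by a convolution; the following is a minimal sketch under that assumption, not the patent's implementation:

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice every 2x2 pixel neighbourhood into the channel dimension,
    turning (B, C, H, W) into (B, 4C, H/2, W/2), then apply a convolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        sliced = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(sliced)
```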
The YOLOV5 feature extraction network is composed of CSP (Cross Stage Partial Network), CBL (convolution + batch normalization + LeakyReLU) and SPP modules. The CSP module is the main structure for feature extraction: each CSP module splits the feature map of the base layer into two parts, passes one part through several residual modules, and then merges it with the other part through a cross-stage hierarchy. This avoids the excessive inference computation caused by duplicated gradient information during network optimization, reducing the amount of computation while maintaining accuracy. CBL is a conventional feature extraction operation. The SPP module applies max pooling at four different scales to the same feature map and concatenates the four pooled maps, so that target information at different scale levels is retained. After feature extraction, the feature maps of different layers are fed into the FPN and PAN modules. FPN is a top-down structure that propagates and fuses the semantic information of high-level feature maps downward by upsampling to obtain feature maps for prediction, while PAN is a bottom-up feature pyramid. FPN conveys strong semantic features from top to bottom and PAN conveys strong localization features from bottom to top, achieving feature fusion from different backbone layers to different detection layers. Finally, three feature maps of different scales are output.
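A minimal sketch of such an SPP block is shown below; the kernel sizes are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Pool the same feature map at several scales and concatenate the results."""
    def __init__(self, channels, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )
        self.conv = nn.Conv2d(channels * (len(kernel_sizes) + 1), channels, kernel_size=1)

    def forward(self, x):
        # identity branch plus pooled branches, all with the same spatial size
        return self.conv(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```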
The specific content of step (2) is as follows: classification and regression convolutions are applied to each of the feature maps of different scales, and the obtained regression parameters are decoded into the target's five-parameter rotated box. The position information of the decoded bounding box is then used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied point-wise onto the original feature map as a mask to obtain a refined feature map. The specific procedure is: the class score maps generated from the three feature maps of different scales are used as masks, and each feature point keeps only its highest-scoring prediction box; the feature vectors corresponding to the five coordinates of the prediction box, namely the center point and the four vertices, are then sampled from the feature map, with bilinear interpolation of the coordinate positions providing accurate feature vectors; after all feature points have been traversed, the feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied point-wise onto the original feature map as a mask, completing the refinement of the feature map.
The specific content of step (3) is as follows: the feature map reconstructed in step (2) is classified and regressed again, the regression parameters are decoded into the target's five-parameter rotated box, non-maximum suppression is applied to the generated rotated boxes using the class score map as the confidence, the rotated boxes are output, and the loss is computed.
The five-parameter model represents a rectangle in an arbitrary direction with five parameters x, y, w, h, θ, defined as: the target center point coordinates x, y, the target width and height w, h, and the rotation angle θ. Taking the lowest vertex of the target box as the starting point and the ray extending from it along the positive x-axis as the reference line, and moving counterclockwise, the first side of the target box encountered is defined as the width w and the other side as the length h; the angle between the width w and the reference line is the target offset angle θ, with range [-90°, 0°].
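As an illustration of this definition (not part of the patent), the conversion from the five parameters to the four corner points can be sketched as follows; the NumPy helper below is an assumed utility:

```python
import numpy as np

def five_param_to_corners(x, y, w, h, theta_deg):
    """Convert (x, y, w, h, theta), theta in degrees in [-90, 0],
    into the four corner points of the rotated rectangle."""
    theta = np.deg2rad(theta_deg)
    # half-extents along the box's own width/height axes
    dx = np.array([np.cos(theta), np.sin(theta)]) * w / 2.0
    dy = np.array([-np.sin(theta), np.cos(theta)]) * h / 2.0
    center = np.array([x, y])
    return np.stack([center + dx + dy,
                     center - dx + dy,
                     center - dx - dy,
                     center + dx - dy])
```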
The loss in step (3) is calculated as follows. The loss function is:
L = (1/N) Σ_n [ t'_n Σ_j L_reg(v'_nj, v_nj) + L_cls(p_n, t_n) ]
where N denotes the number of anchor boxes; t'_n takes the value 0 or 1 (1 for foreground, 0 for background, and no regression is performed for the background); v'_nj denotes the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target category and p_n is the per-class probability distribution computed by sigmoid; L_reg is the Smooth L1 loss and L_cls uses the focal loss. To prevent the loss surge caused by the angular jump, the regression loss is adjusted as follows:
L_reg(θ', θ) = min{ Smooth L1(|θ' - θ|), Smooth L1(90° - |θ' - θ|) }
the invention has the advantages that: the method uses the specific CSPNet module of YOLOV5 to increase the speed and precision of feature extraction, and the structure of combining FPN and PAN further increases the fusion capability of features with different scales; adding a feature reconstruction module into a five-parameter angle regression model to realize feature alignment, and introducing a minimization Smmolh L1 loss function to reduce loss mutation caused by inaccurate angle regression; considering different task requirements and hardware bottlenecks, designing lightweight acceleration models representing different speeds and accuracies; the detection precision of the model with the largest scale reaches SOTA, and the model with the smallest network depth can realize the effect of near real-time detection on higher precision, is convenient to carry on mobile terminals such as unmanned aerial vehicles and raspberry groups, and has very wide application prospect.
Drawings
Fig. 1 is a schematic flow chart of the five-parameter detection method for arbitrarily oriented targets based on YOLOV5.
FIG. 2 is a schematic diagram showing the comparison between horizontal frame and rotating frame detection in remote sensing image target detection (FIG. 2a represents a diagram in which the size and the aspect ratio cannot reflect the real shape of a target object; FIG. 2b represents a diagram in which an object and background pixels are not effectively separated; FIG. 2c represents a diagram in which dense objects are difficult to separate; and FIGS. 2d, 2e and 2f represent diagrams in which rectangular frames in any directions are used for detecting and positioning targets).
FIG. 3 is a graphical representation of the fluctuation of the loss when using the minimized Smooth L1 loss versus the ordinary Smooth L1 loss. It can be seen that the variation of the minimized Smooth L1 loss is smallest and it is also more likely to descend to the optimum point.
FIG. 4 is a schematic diagram of a feature reconstruction process using a five parameter regression based method.
Fig. 5 shows feature reconstruction using a bilinear interpolation method. (FIG. 5a is the original image, FIG. 5b is the bilinear interpolation calculation method, FIG. 5c is the deviation caused by feature misalignment, FIG. 5d is the more accurate diagram of the bounding box obtained after bilinear interpolation)
FIG. 6 is a graph comparing the test results of four models of different scales on the DOTA and UCAS-AOD datasets.
FIG. 7 is a comparison of the test results of the present invention on the DOTA and UCAS-AOD datasets with other detection methods. (FIG. 7a represents a comparison of the results on the DOTA dataset, with the abbreviations of the category names as follows: Pl: Plane, Bd: Baseball diamond, Br: Bridge, Gtf: Ground track field, Sv: Small vehicle, Lv: Large vehicle, Sh: Ship, Tc: Tennis court, Bc: Basketball court, St: Storage tank, Sbf: Soccer-ball field, Ra: Roundabout, Ha: Harbor, Sp: Swimming pool, He: Helicopter; FIG. 7b represents a comparison of the results on the UCAS-AOD dataset.)
Fig. 8 is a schematic diagram of a rectangular frame with an arbitrary orientation represented by five parameters (fig. 8a is a schematic diagram of one orientation of the rectangular frame, and fig. 8b is a schematic diagram of another orientation of the rectangular frame).
FIG. 9 is a schematic diagram of the model misconvergence that may occur when the angle loss is calculated without the minimized Smooth L1 loss.
Detailed Description
The invention is mainly verified on mainstream datasets. The test computer has an Intel Core i9-10900K CPU (3.7 GHz) running Ubuntu 18.04, 16 GB of memory, and an NVIDIA 2080 Ti GPU with 12 GB of video memory. All steps and conclusions were verified on Python 3.8 and the deep learning framework PyTorch 1.7.0. FIG. 6 compares the test results of four models of different scales on the DOTA and UCAS-AOD datasets; it can be seen that the heaviest model has the highest accuracy while still detecting quickly, and the lightest model attains near-real-time detection speed while keeping relatively high accuracy. FIG. 7 compares the test results of the present invention on the DOTA and UCAS-AOD datasets with other detection methods (FIG. 7a on the DOTA dataset, FIG. 7b on the UCAS-AOD dataset); under the same training conditions, the present invention achieves higher accuracy and speed. The method of the present invention is further illustrated with reference to the accompanying drawings and specific examples.
Fig. 1 shows the schematic flow diagram of the rotated target detection model based on YOLOV5; the specific embodiment is as follows:
for convenience of description, the following terms are first defined:
defining 1 five-parameter model
As shown in fig. 4 and figs. 8a and 8b, the five-parameter model represents a rectangle in an arbitrary direction with five parameters (x, y, w, h, θ), defined as: the target center point coordinates (x, y), the target width and height (w, h), and the rotation angle (θ). Taking the lowest vertex of the target box as the starting point and the ray extending from it along the positive x-axis as the reference line, and moving counterclockwise, the first side of the target box encountered is defined as the width w and the other side as the length h. The angle between the width w and the reference line is the target offset angle θ, with range [-90°, 0°].
Definition 2 feature reconstruction
As shown in fig. 5, in the first detection stage the position information of the decoded bounding box is used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied point-wise onto the original feature map as a mask to obtain a refined feature map. The specific procedure is: the class score maps generated from the three feature maps of different scales are used as masks, and each feature point keeps only its highest-scoring prediction box; the feature vectors corresponding to the five coordinates of the prediction box, namely the center point and the four vertices, are then sampled from the feature map, with bilinear interpolation of the coordinate positions providing accurate feature vectors; the five feature vectors are added and replace the current feature vector, and after all feature points have been traversed the whole feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied point-wise onto the original feature map as a mask, completing the reconstruction of the feature map. The pseudo code is provided as a figure in the original application; an illustrative sketch is given below.
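A minimal PyTorch sketch of this reconstruction step, written from the description above; the tensor layouts and the use of a sigmoid as the normalization are assumptions rather than the patent's pseudo code:

```python
import torch
import torch.nn.functional as F

def reconstruct_features(feat, corners, scores):
    """Illustrative feature reconstruction for one scale.

    feat:    (1, C, H, W) feature map
    corners: (H, W, 5, 2) xy pixel positions (center + 4 vertices) of the
             highest-scoring decoded box at every feature point
    scores:  (1, 1, H, W) class score map used as a mask
    """
    _, C, H, W = feat.shape
    # normalize pixel coordinates to [-1, 1] for grid_sample (bilinear interpolation)
    grid = corners.clone()
    grid[..., 0] = grid[..., 0] / (W - 1) * 2 - 1
    grid[..., 1] = grid[..., 1] / (H - 1) * 2 - 1
    grid = grid.view(1, H, W * 5, 2)

    sampled = F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
    sampled = sampled.view(1, C, H, W, 5).sum(dim=-1)       # add the five feature vectors

    # normalize the rebuilt map and multiply it point-wise onto the original map
    rebuilt = torch.sigmoid(sampled) * scores
    return feat * rebuilt
```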
definitions 3 minimization of Smooth L1 loss
As shown in fig. 9, when the rotation angle θ of the real box (solid box) approaches -90 degrees and the rotation angle θ' of the prediction box (dashed box) approaches 0 degrees, the angle loss is as large as nearly 90 degrees, even though the prediction box is already very close to the real box. Optimizing with the ordinary Smooth L1 loss, the model therefore has difficulty training along the fastest route. With the minimized Smooth L1 loss, the model only needs to regress a small deviation (a small clockwise rotation) to reach the real box.
when the real frame θ approaches-90 degrees and the prediction frame θ' approaches 0 degrees, the angle loss is as large as approaching 90 degrees. However, at this time, the prediction frame approaches the real frame, and only a small clockwise rotation angle is required to return to the real frame. The loss calculation in this case should be:
L_θ = Smooth L1(90° - |θ' - θ|)
the total loss is then:
L = (1/N) Σ_n [ t'_n Σ_j L_reg(v'_nj, v_nj) + L_cls(p_n, t_n) ], where the angle component of L_reg is taken as min{ Smooth L1(|θ' - θ|), Smooth L1(90° - |θ' - θ|) }.
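A small PyTorch sketch of the minimized Smooth L1 angle loss described above; this is one assumed way to implement it (angles in degrees), not the patent's code:

```python
import torch
import torch.nn.functional as F

def min_smooth_l1_angle_loss(theta_pred, theta_gt):
    """Take the smaller of the direct angle error and the complementary error,
    so a prediction near 0 deg for a target near -90 deg is not over-penalized."""
    diff = torch.abs(theta_pred - theta_gt)                      # angles in degrees
    direct = F.smooth_l1_loss(diff, torch.zeros_like(diff), reduction="none")
    wrapped = F.smooth_l1_loss(90.0 - diff, torch.zeros_like(diff), reduction="none")
    return torch.minimum(direct, wrapped).mean()
```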
referring to fig. 1, the rotating target detection process based on two different methods is implemented by the following steps:
step 1, inputting an image to perform feature extraction to obtain a feature map
After data enhancement operations such as random flipping, stretching and color gamut transformation are applied to the input image (these operations are used only during training, not during detection), the image is uniformly scaled to a standard size (for example 608 × 608) and the Focus slicing operation is applied; the result is then fed into the YOLOV5 feature extraction network. The YOLOV5 feature extraction network is composed of CSP (Cross Stage Partial Network), CBL (convolution + batch normalization + LeakyReLU) and SPP modules. The CSP module is the main structure for feature extraction: each CSP module splits the feature map of the base layer into two parts, passes one part through several residual modules, and then merges it with the other part through a cross-stage hierarchy, avoiding the excessive inference computation caused by duplicated gradient information during network optimization and reducing the amount of computation while maintaining accuracy. By changing the number of residual components in the CSP module, the depth and scale of the whole model can be controlled, realizing different detection accuracies and speeds. CBL is a conventional feature extraction operation. The SPP module applies max pooling at four different scales to the same feature map and concatenates the four pooled maps, so that target information at different scale levels is retained. After feature extraction, the feature maps of different layers are fed into the FPN and PAN modules. FPN is a top-down structure that propagates and fuses the semantic information of high-level feature maps downward by upsampling to obtain feature maps for prediction, while PAN is a bottom-up feature pyramid. FPN conveys strong semantic features from top to bottom and PAN conveys strong localization features from bottom to top, achieving feature fusion from different backbone layers to different detection layers. Finally, three feature maps of different scales are output.
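As an illustration of the FPN + PAN fusion just described, the following minimal sketch assumes three backbone maps that already share a common channel count; it is not the patent's network definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNPAN(nn.Module):
    """Top-down (FPN) then bottom-up (PAN) fusion of three backbone feature maps."""
    def __init__(self, channels=256):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in range(3))
        self.down = nn.ModuleList(nn.Conv2d(channels, channels, 3, stride=2, padding=1)
                                  for _ in range(2))

    def forward(self, c3, c4, c5):
        # FPN: propagate semantics top-down by upsampling and adding
        p5 = self.reduce[2](c5)
        p4 = self.reduce[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.reduce[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # PAN: propagate localization bottom-up by strided convolution and adding
        n3 = p3
        n4 = p4 + self.down[0](n3)
        n5 = p5 + self.down[1](n4)
        return n3, n4, n5
```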
Step 2, classifying and regressing the characteristic diagram in the step 1, and performing characteristic reconstruction on a regression result
Classification and regression convolutions are applied to the feature maps from step 1, and the obtained regression parameters are decoded into the target's five-parameter rotated box (a rotated rectangular box defined by the five parameters x, y, w, h, θ). The position information of the decoded bounding box is then used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied point-wise onto the original feature map as a mask to obtain a refined feature map. The specific procedure is: the class score maps generated from the three feature maps of different scales are used as masks, and each feature point keeps only its highest-scoring prediction box; the feature vectors corresponding to the five coordinates of the prediction box, namely the center point and the four vertices, are then sampled from the feature map, with bilinear interpolation of the coordinate positions providing accurate feature vectors; the five feature vectors are added and replace the current feature vector, and after all feature points have been traversed the whole feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied point-wise onto the original feature map as a mask, completing the refinement of the feature map. This reconstruction process may be repeated two or more times.
Step 3, classifying and regressing again using the reconstructed feature map, outputting the result and calculating the loss
Classification and regression are applied to the feature map reconstructed in step 2, and the regression parameters are decoded into the target rotated boxes. The generated rotated boxes are output after an NMS (non-maximum suppression) operation that uses the class score map as the confidence (an illustrative sketch of this rotated NMS is given at the end of this section). During training, the loss is computed from both classification-and-regression passes of steps 2 and 3, so as to train the model. As shown in fig. 3, the loss function is:
L = (1/N) Σ_n [ t'_n Σ_j L_reg(v'_nj, v_nj) + L_cls(p_n, t_n) ]
where N denotes the number of anchor boxes; t'_n takes the value 0 or 1 (1 for foreground, 0 for background; no regression for the background); v'_nj denotes the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target category and p_n is the per-class probability distribution computed by sigmoid; L_reg is the Smooth L1 loss and L_cls uses the focal loss. To prevent the loss surge caused by the angular jump, the regression loss is adjusted to:
L_reg(θ', θ) = min{ Smooth L1(|θ' - θ|), Smooth L1(90° - |θ' - θ|) }
so that the model is trained in the fastest and best way.
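For illustration, the rotated NMS step mentioned in Step 3 can be sketched as a greedy loop as follows; the rotated-box IoU helper iou_fn is an assumed placeholder, not a function defined in the patent:

```python
import torch

def rotated_nms(boxes, scores, iou_fn, iou_thresh=0.5):
    """Greedy NMS over five-parameter boxes (x, y, w, h, theta).

    boxes:  (N, 5) tensor; scores: (N,) tensor of class confidences;
    iou_fn: assumed helper returning a tensor of IoUs between one rotated box and many.
    """
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        ious = iou_fn(boxes[i], boxes[order[1:]])
        order = order[1:][ious <= iou_thresh]
    return keep
```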

Claims (6)

1. A five-parameter detection method for arbitrarily oriented targets based on YOLOV5, characterized in that the method comprises the following specific steps:
(1) inputting the obtained remote sensing image into the YOLOV5 feature extraction network for feature extraction to obtain three feature maps of different scales;
(2) classifying and regressing on the feature maps obtained in step (1), and performing feature reconstruction on the regression result to obtain a refined feature map;
(3) classifying and regressing again using the refined feature map obtained in step (2), outputting the result and calculating the loss.
2. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 1, wherein: before feature extraction with the YOLOV5 network in step (1), the remote sensing image is subjected to random flipping, stretching, color gamut transformation and random graying data enhancement operations, then uniformly scaled to a standard size and subjected to the Focus slicing operation, and finally input into the YOLOV5 feature extraction network for feature extraction.
3. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 1, wherein the specific content of step (2) is as follows: classification and regression convolutions are respectively applied to the obtained feature maps of different scales, and the obtained regression parameters are decoded into the target's five-parameter rotated box; the position information of the decoded bounding box is then used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied point-wise onto the original feature map as a mask to obtain a refined feature map; the specific procedure is: the class score maps generated from the three feature maps of different scales are used as masks, each feature point keeping only its highest-scoring prediction box; the feature vectors corresponding to the five coordinates of the prediction box, namely the center point and the four vertices, are then obtained from the feature map, with bilinear interpolation of the coordinate positions providing accurate feature vectors; after all feature points have been traversed, the feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied point-wise onto the original feature map as a mask, completing the refinement of the feature map.
4. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 3, wherein the specific content of step (3) is as follows: the feature map refined in step (2) is classified and regressed again, the regression parameters are decoded into the target's five-parameter rotated box, non-maximum suppression is applied to the generated rotated boxes with the class score map as the confidence, the result is output, and the loss is calculated.
5. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 4, wherein the five-parameter model represents a rectangle in an arbitrary direction with five parameters x, y, w, h, θ, defined as: the target center point coordinates x, y, the target width and height w, h, and the rotation angle θ; taking the lowest vertex of the target box as the starting point and the ray extending from it along the positive x-axis as the reference line, and moving counterclockwise, the first side of the target box encountered is defined as the width w and the other side as the length h; the angle between the width w and the reference line is the target offset angle θ, with range [-90°, 0°].
6. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 4, wherein the loss in step (3) is calculated as follows, the loss function being:
L = (1/N) Σ_n [ t'_n Σ_j L_reg(v'_nj, v_nj) + L_cls(p_n, t_n) ]
where N denotes the number of anchor boxes; t'_n takes the value 0 or 1, with 1 for foreground and 0 for background, and no regression is performed for the background; v'_nj denotes the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target category and p_n is the per-class probability distribution computed by sigmoid; L_reg is the Smooth L1 loss and L_cls uses the focal loss; to prevent the loss surge caused by the angular jump, the regression loss is adjusted to:
L_reg(θ', θ) = min{ Smooth L1(|θ' - θ|), Smooth L1(90° - |θ' - θ|) }
CN202110521035.XA 2021-05-13 2021-05-13 Method for detecting five parameters of target in any orientation based on YOLOV5 Pending CN113191296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110521035.XA CN113191296A (en) 2021-05-13 2021-05-13 Method for detecting five parameters of target in any orientation based on YOLOV5


Publications (1)

Publication Number Publication Date
CN113191296A true CN113191296A (en) 2021-07-30

Family

ID=76981387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110521035.XA Pending CN113191296A (en) 2021-05-13 2021-05-13 Method for detecting five parameters of target in any orientation based on YOLOV5

Country Status (1)

Country Link
CN (1) CN113191296A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function
CN112395975A (en) * 2020-11-17 2021-02-23 南京泓图人工智能技术研究院有限公司 Remote sensing image target detection method based on rotating area generation network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
WEN QIAN et al.: "Learning Modulated Loss for Rotated Object Detection", arXiv, 23 December 2019 (2019-12-23), pages 1 - 11 *
XUE YANG et al.: "R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object", arXiv, 2 December 2019 (2019-12-02), pages 1 - 13 *
刘思远 et al.: "Remote sensing image target detection method based on deep convolutional neural networks" (基于深度卷积神经网络的遥感图像目标检测方法), Industrial Control Computer (工业控制计算机), no. 05, 25 May 2020 (2020-05-25), pages 75 - 77 *
崔丽群 et al.: "A salient object detection method with improved background suppression" (一种背景抑制改进的显著性目标检测方法), Computer Engineering & Science (计算机工程与科学), pages 1435 - 1443 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966587B (en) * 2021-03-02 2022-12-20 北京百度网讯科技有限公司 Training method of target detection model, target detection method and related equipment
CN112966587A (en) * 2021-03-02 2021-06-15 北京百度网讯科技有限公司 Training method of target detection model, target detection method and related equipment
CN113673616B (en) * 2021-08-26 2023-09-29 南通大学 Light-weight small target detection method coupling attention and context
CN113673616A (en) * 2021-08-26 2021-11-19 南通大学 Attention and context coupled lightweight small target detection method
CN113449702B (en) * 2021-08-31 2021-12-03 天津联图科技有限公司 Target detection method and device for remote sensing image, storage medium and electronic equipment
CN113449702A (en) * 2021-08-31 2021-09-28 天津联图科技有限公司 Target detection method and device for remote sensing image, storage medium and electronic equipment
CN113837058B (en) * 2021-09-17 2022-09-30 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network
CN113837058A (en) * 2021-09-17 2021-12-24 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network
CN114926552A (en) * 2022-06-17 2022-08-19 中国人民解放军陆军炮兵防空兵学院 Method and system for calculating Gaussian coordinates of pixel points based on unmanned aerial vehicle image
CN116052110A (en) * 2023-03-28 2023-05-02 四川公路桥梁建设集团有限公司 Intelligent positioning method and system for pavement marking defects
CN117094343A (en) * 2023-10-19 2023-11-21 成都新西旺自动化科技有限公司 QR code decoding system and method
CN117094343B (en) * 2023-10-19 2023-12-29 成都新西旺自动化科技有限公司 QR code decoding system and method
CN117994594A (en) * 2024-04-03 2024-05-07 武汉纺织大学 Power operation risk identification method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210730