CN113191296A - Method for detecting five parameters of target in any orientation based on YOLOV5 - Google Patents


Info

Publication number
CN113191296A
CN113191296A (application CN202110521035.XA)
Authority
CN
China
Prior art keywords
feature
target
yolov5
parameters
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110521035.XA
Other languages
Chinese (zh)
Inventor
席智中
孙玉绘
王金根
张明义
范希辉
张罗政
朱静
陈代梅
许蒙恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA Army Academy of Artillery and Air Defense
Original Assignee
PLA Army Academy of Artillery and Air Defense
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA Army Academy of Artillery and Air Defense filed Critical PLA Army Academy of Artillery and Air Defense
Priority to CN202110521035.XA priority Critical patent/CN113191296A/en
Publication of CN113191296A publication Critical patent/CN113191296A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Astronomy & Astrophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting five parameters of an arbitrarily oriented target based on YOLOV5. The method first extracts features of a remote sensing image with the YOLOV5 feature extraction network and produces feature outputs at three scales. The five parameters of the target rotated box are first regressed directly from the output feature maps; the coordinates obtained after decoding these five parameters are then used to reconstruct the feature maps, from which more accurate coordinates are regressed. Training minimizes a Smooth L1 loss function so that the model converges faster and better. Considering different task requirements and hardware bottlenecks, lightweight accelerated models representing different speeds and accuracies are designed: the detection accuracy of the largest model reaches SOTA, while the model with the smallest network depth achieves near-real-time detection at relatively high accuracy, is convenient to deploy on mobile platforms such as unmanned aerial vehicles and Raspberry Pi boards, and has a very broad application prospect.

Description

Method for detecting five parameters of target in any orientation based on YOLOV5
Technical Field
The invention relates to the technical fields of target detection, image processing, algorithms and neural network applications, and in particular to a five-parameter detection method for arbitrarily oriented targets based on YOLOV5.
Background
With the improvement of hardware equipment and the continuous maturing of remote sensing technology, the quality and resolution of remote sensing images captured by satellites, radars and unmanned aerial vehicles have reached the level of natural images. However, objects in remote sensing images have distinct characteristics: all targets are seen from a top-down viewing angle; target scales vary widely; and objects such as vehicles, airplanes and ships are arranged in arbitrary directions. Detecting such rotated targets with a generic horizontal-box detector has three defects: the size and aspect ratio of the box cannot reflect the true shape of the target object, as in fig. 2a; object and background pixels are not effectively separated, as in fig. 2b; and densely packed objects are difficult to separate from each other, as in fig. 2c. Using a rectangular box in an arbitrary direction to detect and locate the target better reflects the position information of the object, as shown in figures 2d, 2e and 2f, and is of great significance in geography, agriculture and the military. Rotated-box detection methods originated from deep-learning-based detection of scene text in arbitrary directions; representative algorithms are as follows:
1. Traditional algorithms represented by SWT, Selective Search and EdgeBox
Before the advent of deep learning, rotated target detection and inclined scene text detection mainly relied on traditional algorithms such as SWT, MSER, ER, Selective Search and EdgeBox. The basic idea is: first binarize the picture (for example with adaptive binarization, optionally preceded by simple Gaussian filtering if there is noise), then obtain the target region through morphological operations such as erosion and dilation, then use a contour-finding function to obtain the points on the contour, and finally take the minimum enclosing rectangle. The SWT algorithm, for example, extracts edges and gradients with the Canny operator and then searches for the edge in the opposite direction along the gradient direction. The Edge Boxes algorithm uses edge information to determine the number of contours inside a candidate box and the number of contours overlapping the box border, scores each box accordingly, and selects proposal information (consisting of size, aspect ratio and position) in order of score; subsequent work runs the detection algorithm inside each proposal. The Selective Search algorithm first divides the picture into many small regions with a simple region segmentation algorithm and then repeatedly merges adjacent regions according to pixel similarity and region size (small regions are merged first, which prevents large regions from continuously absorbing small regions and breaking the hierarchical relationship), similar in spirit to clustering. After the approximate target region is obtained, its minimum enclosing rectangle is drawn (for scene text, a rectangle at an arbitrary angle).
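As an illustration of this traditional pipeline (not part of the patent), a minimal OpenCV sketch might look as follows; the file name and the thresholds are placeholders:

```python
import cv2

# Minimal sketch of the traditional rotated-box pipeline described above:
# binarize -> morphology -> contours -> minimum enclosing (rotated) rectangle.
img = cv2.imread("remote_sensing_tile.png")           # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)               # simple noise filtering
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Erosion/dilation to clean up the target region.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for cnt in contours:
    if cv2.contourArea(cnt) < 100:                     # skip tiny blobs (arbitrary threshold)
        continue
    (cx, cy), (w, h), angle = cv2.minAreaRect(cnt)     # five-parameter rotated box
    corners = cv2.boxPoints(((cx, cy), (w, h), angle))
    print(cx, cy, w, h, angle, corners)
```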
2. RRPN inclined text detection method
The RRPN algorithm appeared in 2018 and is mainly used for inclined text detection. It is based on the region-proposal method of Faster R-CNN and represents a rotated rectangle with a five-parameter method: center point, width and height, and rotation angle. Anchor boxes with angles are generated in advance during detection, combined with RRoI (Rotation Region-of-Interest) pooling to learn rotated regions of interest. During training, a prediction box whose IoU (intersection over union) with a GT (ground-truth) box is greater than 0.7 and whose angle difference from the GT box is less than π/12 is taken as a positive sample; a prediction box whose IoU with the GT box is less than 0.3, or whose IoU is greater than 0.7 but whose angle difference exceeds π/12, is taken as a negative sample. Smooth L1 is used as the regression loss and cross-entropy as the classification loss. In addition, the method proposes a triangle-decomposition approach for computing the overlapping area of inclined rectangles and achieves good results.
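A minimal sketch of the RRPN-style label assignment rule described above is shown below for illustration; rotated_iou is an assumed helper for the inclined-rectangle overlap, not a function defined in the patent:

```python
import math

def assign_label(pred_box, gt_box, rotated_iou):
    """pred_box and gt_box are (x, y, w, h, theta) tuples with theta in radians.
    rotated_iou is an assumed helper computing IoU of two rotated boxes."""
    iou = rotated_iou(pred_box, gt_box)
    angle_diff = abs(pred_box[4] - gt_box[4])
    if iou > 0.7 and angle_diff < math.pi / 12:
        return 1          # positive sample
    if iou < 0.3 or (iou > 0.7 and angle_diff > math.pi / 12):
        return 0          # negative sample
    return -1             # ignored during training
```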
3. ROI Transformer
The core idea of this method is to introduce an RoI Transformer module that converts the horizontal anchor boxes output by the RPN stage into rotated anchor boxes, thereby avoiding the huge amount of computation caused by introducing a large number of rotated anchors. The RoI Transformer module has two parts. The first part, the RRoI Learner, is responsible for learning RRoIs (rotated regions of interest) from HRoIs (horizontal regions of interest): the feature map is fed into a five-dimensional fully connected layer that produces an offset (x, y, w, h, θ). The second part, RRoI Warping, extracts rotation-invariant depth features from the input feature map and the RRoIs, further regresses a refined offset, and decodes it to obtain the output rotated box. In the ideal case, each HRoI is the circumscribed rectangle of its RRoI. By introducing the RoI Transformer, the method greatly reduces computational cost and achieves good results.
4. Gliding Vertex
This method was published in 2020. It represents an object by learning the offsets of its four vertices along the sides of the non-rotated (horizontal) bounding rectangle, thereby describing a quadrilateral. The network structure is also based on Faster R-CNN, with classification and regression performed separately in the final fully connected layer. The final position regression uses a nine-parameter method: besides the horizontal box coordinates (x, y, w, h) and the four vertex offsets (α1, α2, α3, α4), an obliquity factor r (computed as the ratio of the area of the quadrilateral to the area of its circumscribed horizontal rectangle) is introduced to decide whether the box is horizontal or rotated. For horizontal targets α is set to 1, and when r is greater than 0.95 the box is judged to be a horizontal rectangle.
5. P-RSDet
This method, titled "Object Detection for Remote Sensing Image Based on Polar Coordinates", was published in 2020. It introduces polar coordinates into rotated target detection for the first time and is characterized by fewer parameters and higher speed. Its rotated-box representation follows CornerNet: it regresses the pole point (x, y), a polar radius ρ and two polar angles (θ1, θ2). The feature extraction network offers different backbones such as ResNet101, DLA34 and Hourglass, representing different scales and speeds. In the detection head, the extreme points are regressed with a Gaussian heat map similar to CenterNet, outputting a probability map of extreme-point locations, and the classification loss is the Focal Loss. In the regression loss, the center-point coordinates use Smooth L1, while for the extreme-point coordinates the authors introduce the Polar Ring Area Loss, expanded specifically as:
L_pr(ρ, θ) = Smooth L1( |[ρ² - (ρ*)²](θ - θ*)|, 0 )
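A small PyTorch sketch of this Polar Ring Area Loss, written directly from the formula above (the tensor names are assumptions):

```python
import torch
import torch.nn.functional as F

def polar_ring_area_loss(rho_pred, theta_pred, rho_gt, theta_gt):
    """Polar Ring Area Loss as written above:
    Smooth L1 of |(rho^2 - rho*^2) * (theta - theta*)| against 0."""
    ring_area = torch.abs((rho_pred ** 2 - rho_gt ** 2) * (theta_pred - theta_gt))
    return F.smooth_l1_loss(ring_area, torch.zeros_like(ring_area))
```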
The first class of methods, i.e. the traditional detection methods, requires hand-crafted feature extraction operators designed for each target type, has poor robustness, can only extract shallow features, and has weak semantic expression ability. The SWT algorithm, for example, extracts edges and gradients with the Canny operator and then searches for the opposite edge along the gradient direction; yet even when all edges are extracted accurately, computing the target width during the edge search remains problematic. The Edge Boxes algorithm, moreover, is not a learning-based algorithm and has no training process: if a detector is trained for pedestrians the highest-scoring proposal (region of interest) will certainly be a pedestrian, if it is trained for cars the highest-scoring proposal will be a car, and so on, so it cannot generalize across categories. The second class is a straightforward extension of the horizontal Faster R-CNN and needs a large number of anchors (anchor boxes) designed to cover all possible scales, aspect ratios and angles of the target, which is computationally expensive. In the third class, the feature extraction network is less effective; the FPN then outputs five levels of feature maps, which increases the computation, and each HRoI is followed by a five-dimensional fully connected layer with the same number of channels, whose parameter count greatly affects inference speed. In the fourth method, the nine-parameter regression relies on the horizontal detection box generated in the first stage: if the first-stage regression is inaccurate, the four offsets predicted in the second stage cannot be accurate either. The fifth method, unlike the first four, opens up a new line of thought for rotated target detection; however, because it is anchor-free (no anchors are generated at prediction time and regression is performed directly, which saves a large amount of time), the gain in speed inevitably comes with a loss in accuracy.
Therefore, designing an anchor-based rotated target detection model that is both fast and accurate and can reach SOTA is of great significance for detecting rotated targets in remote sensing images.
Disclosure of Invention
The invention aims to make up for the defects of the prior art and provides a five-parameter detection method for arbitrarily oriented targets based on YOLOV5. First, the YOLOV5 feature extraction network extracts features of the remote sensing image; an FPN + PAN structure then produces feature outputs at three scales, and classification and regression are performed directly on the output feature maps to obtain the position and category information of targets in the image. In the second detection stage, the target position information obtained in the first stage is used to reconstruct the features, yielding a finer feature map from which more accurate coordinates are regressed. Training minimizes a Smooth L1 loss function so that the angle loss of the model converges faster and better. In addition, four models, from large to small and lightweight, are designed with different numbers of convolution layers, representing different amounts of computation, accuracy and detection speed, so that a network depth can be chosen according to the task. Compared with the prior art, the method reaches SOTA in both detection accuracy and speed.
The invention is realized by the following technical scheme:
a five-parameter detection method of any orientation target based on YOLOV5 comprises the following specific steps:
(1) inputting the obtained remote sensing image into the YOLOV5 feature extraction network for feature extraction to obtain three feature maps of different scales;
(2) classifying and regressing on the feature maps obtained in step (1), and performing a feature reconstruction operation on the regression result to obtain a more refined feature map;
(3) classifying and regressing again using the refined feature map obtained in step (2), outputting the result and computing the loss (an illustrative sketch of these three steps is given below).
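The following structural sketch illustrates these three steps; all module names (backbone, head, reconstruct_features, rotated_nms) are assumed placeholders rather than the patent's actual components:

```python
def detect(image, backbone, head, reconstruct_features, rotated_nms):
    """Illustrative three-step pipeline corresponding to steps (1)-(3)."""
    # (1) YOLOV5-style backbone + FPN/PAN neck -> three feature maps of different scales
    feats = backbone(image)                       # e.g. [P3, P4, P5]

    # (2) first classification/regression, decode five-parameter boxes,
    #     then rebuild refined feature maps from the decoded box positions
    cls1, boxes1 = head(feats)
    feats_refined = [reconstruct_features(f, b, c) for f, b, c in zip(feats, boxes1, cls1)]

    # (3) second classification/regression on the refined maps, rotated NMS on the output
    cls2, boxes2 = head(feats_refined)
    return rotated_nms(boxes2, cls2)
```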
Before feature extraction with the YOLOV5 network in step (1), data enhancement operations such as random flipping, stretching, color gamut transformation and random image graying (for infrared image detection) are applied to the remote sensing image; the image is then uniformly scaled to a standard size, subjected to the Focus slicing operation, and finally fed into the YOLOV5 feature extraction network for feature extraction.
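For illustration, the Focus slicing operation mentioned above is commonly implemented as a space-to-depth rearrangement followed by a convolution; the following is a minimal sketch under that assumption, not the patent's implementation:

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice every 2x2 pixel neighbourhood into the channel dimension,
    turning (B, C, H, W) into (B, 4C, H/2, W/2), then apply a convolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        sliced = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(sliced)
```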
The YOLOV5 feature extraction network is composed of CSP (Cross Stage Partial Network), CBL (convolution + batch normalization + LeakyReLU) and SPP modules. The CSP module is the main structure for feature extraction: each CSP module splits the feature map of the base layer into two parts, passes one part through several residual modules, and then merges it with the other part through a cross-stage hierarchy. This avoids the excessive inference computation caused by duplicated gradient information during network optimization, reducing the amount of computation while maintaining accuracy. CBL is a conventional feature extraction operation. The SPP module applies max pooling at four different scales to the same feature map and concatenates the four pooled maps, so that target information at different scale levels is retained. After feature extraction, the feature maps of different layers are fed into the FPN and PAN modules. FPN is a top-down structure that propagates and fuses the semantic information of high-level feature maps downward by upsampling to obtain feature maps for prediction, while PAN is a bottom-up feature pyramid. FPN conveys strong semantic features from top to bottom and PAN conveys strong localization features from bottom to top, achieving feature fusion from different backbone layers to different detection layers. Finally, three feature maps of different scales are output.
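A minimal sketch of such an SPP block is shown below; the kernel sizes are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Pool the same feature map at several scales and concatenate the results."""
    def __init__(self, channels, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )
        self.conv = nn.Conv2d(channels * (len(kernel_sizes) + 1), channels, kernel_size=1)

    def forward(self, x):
        # identity branch plus pooled branches, all with the same spatial size
        return self.conv(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```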
The specific content of step (2) is as follows: classification and regression convolutions are applied to each of the feature maps of different scales, and the obtained regression parameters are decoded into the target's five-parameter rotated box. The position information of the decoded bounding box is then used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied point-wise onto the original feature map as a mask to obtain a refined feature map. The specific procedure is: the class score maps generated from the three feature maps of different scales are used as masks, and each feature point keeps only its highest-scoring prediction box; the feature vectors corresponding to the five coordinates of the prediction box, namely the center point and the four vertices, are then sampled from the feature map, with bilinear interpolation of the coordinate positions providing accurate feature vectors; after all feature points have been traversed, the feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied point-wise onto the original feature map as a mask, completing the refinement of the feature map.
The specific content of step (3) is as follows: the feature map reconstructed in step (2) is classified and regressed again, the regression parameters are decoded into the target's five-parameter rotated box, non-maximum suppression is applied to the generated rotated boxes using the class score map as the confidence, the rotated boxes are output, and the loss is computed.
The five-parameter model represents a rectangle in an arbitrary direction with five parameters x, y, w, h, θ, defined as: the target center point coordinates x, y, the target width and height w, h, and the rotation angle θ. Taking the lowest vertex of the target box as the starting point and the ray extending from it along the positive x-axis as the reference line, and moving counterclockwise, the first side of the target box encountered is defined as the width w and the other side as the length h; the angle between the width w and the reference line is the target offset angle θ, with range [-90°, 0°].
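As an illustration of this definition (not part of the patent), the conversion from the five parameters to the four corner points can be sketched as follows; the NumPy helper below is an assumed utility:

```python
import numpy as np

def five_param_to_corners(x, y, w, h, theta_deg):
    """Convert (x, y, w, h, theta), theta in degrees in [-90, 0],
    into the four corner points of the rotated rectangle."""
    theta = np.deg2rad(theta_deg)
    # half-extents along the box's own width/height axes
    dx = np.array([np.cos(theta), np.sin(theta)]) * w / 2.0
    dy = np.array([-np.sin(theta), np.cos(theta)]) * h / 2.0
    center = np.array([x, y])
    return np.stack([center + dx + dy,
                     center - dx + dy,
                     center - dx - dy,
                     center + dx - dy])
```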
The loss in step (3) is calculated as follows. The loss function is:
L = (1/N) Σ_n [ t'_n Σ_j L_reg(v'_nj, v_nj) + L_cls(p_n, t_n) ]
where N denotes the number of anchor boxes; t'_n takes the value 0 or 1 (1 for foreground, 0 for background, and no regression is performed for the background); v'_nj denotes the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target category and p_n is the per-class probability distribution computed by sigmoid; L_reg is the Smooth L1 loss and L_cls uses the focal loss. To prevent the loss surge caused by the angular jump, the regression loss is adjusted as follows:
L_reg(θ', θ) = min{ Smooth L1(|θ' - θ|), Smooth L1(90° - |θ' - θ|) }
the invention has the advantages that: the method uses the specific CSPNet module of YOLOV5 to increase the speed and precision of feature extraction, and the structure of combining FPN and PAN further increases the fusion capability of features with different scales; adding a feature reconstruction module into a five-parameter angle regression model to realize feature alignment, and introducing a minimization Smmolh L1 loss function to reduce loss mutation caused by inaccurate angle regression; considering different task requirements and hardware bottlenecks, designing lightweight acceleration models representing different speeds and accuracies; the detection precision of the model with the largest scale reaches SOTA, and the model with the smallest network depth can realize the effect of near real-time detection on higher precision, is convenient to carry on mobile terminals such as unmanned aerial vehicles and raspberry groups, and has very wide application prospect.
Drawings
Fig. 1 is a schematic flow chart of the five-parameter detection method for arbitrarily oriented targets based on YOLOV5.
FIG. 2 is a schematic diagram showing the comparison between horizontal frame and rotating frame detection in remote sensing image target detection (FIG. 2a represents a diagram in which the size and the aspect ratio cannot reflect the real shape of a target object; FIG. 2b represents a diagram in which an object and background pixels are not effectively separated; FIG. 2c represents a diagram in which dense objects are difficult to separate; and FIGS. 2d, 2e and 2f represent diagrams in which rectangular frames in any directions are used for detecting and positioning targets).
FIG. 3 is a graphical representation of the fluctuation of the loss when using the minimized Smooth L1 loss versus the ordinary Smooth L1 loss. It can be seen that the variation of the minimized Smooth L1 loss is smallest and it is also more likely to descend to the optimum point.
FIG. 4 is a schematic diagram of a feature reconstruction process using a five parameter regression based method.
Fig. 5 shows feature reconstruction using a bilinear interpolation method. (FIG. 5a is the original image, FIG. 5b is the bilinear interpolation calculation method, FIG. 5c is the deviation caused by feature misalignment, FIG. 5d is the more accurate diagram of the bounding box obtained after bilinear interpolation)
FIG. 6 is a graph comparing the test results of four models of different scales on the DOTA and UCAS-AOD datasets.
FIG. 7 is a comparison of the test results of the present invention on the DOTA and UCAS-AOD datasets with other detection methods. (FIG. 7a represents a comparison of the results on the DOTA dataset, with the abbreviations of the category names as follows: Pl: Plane, Bd: Baseball diamond, Br: Bridge, Gtf: Ground track field, Sv: Small vehicle, Lv: Large vehicle, Sh: Ship, Tc: Tennis court, Bc: Basketball court, St: Storage tank, Sbf: Soccer-ball field, Ra: Roundabout, Ha: Harbor, Sp: Swimming pool, He: Helicopter; FIG. 7b represents a comparison of the results on the UCAS-AOD dataset.)
Fig. 8 is a schematic diagram of a rectangular frame with an arbitrary orientation represented by five parameters (fig. 8a is a schematic diagram of one orientation of the rectangular frame, and fig. 8b is a schematic diagram of another orientation of the rectangular frame).
FIG. 9 is a schematic diagram of the model misconvergence that may occur when the angle loss is calculated without the minimized Smooth L1 loss.
Detailed Description
The invention is mainly verified on mainstream datasets. The test computer has an Intel Core i9-10900K CPU (3.7 GHz) running Ubuntu 18.04, 16 GB of memory, and an NVIDIA 2080 Ti GPU with 12 GB of video memory. All steps and conclusions were verified on Python 3.8 and the deep learning framework PyTorch 1.7.0. FIG. 6 compares the test results of four models of different scales on the DOTA and UCAS-AOD datasets; it can be seen that the heaviest model has the highest accuracy while still detecting quickly, and the lightest model attains near-real-time detection speed while keeping relatively high accuracy. FIG. 7 compares the test results of the present invention on the DOTA and UCAS-AOD datasets with other detection methods (FIG. 7a on the DOTA dataset, FIG. 7b on the UCAS-AOD dataset); under the same training conditions, the present invention achieves higher accuracy and speed. The method of the present invention is further illustrated with reference to the accompanying drawings and specific examples.
Fig. 1 shows the schematic flow diagram of the rotated target detection model based on YOLOV5; the specific embodiment is as follows:
for convenience of description, the following terms are first defined:
defining 1 five-parameter model
As shown in fig. 4 and figs. 8a and 8b, the five-parameter model represents a rectangle in an arbitrary direction with five parameters (x, y, w, h, θ), defined as: the target center point coordinates (x, y), the target width and height (w, h), and the rotation angle (θ). Taking the lowest vertex of the target box as the starting point and the ray extending from it along the positive x-axis as the reference line, and moving counterclockwise, the first side of the target box encountered is defined as the width w and the other side as the length h. The angle between the width w and the reference line is the target offset angle θ, with range [-90°, 0°].
Definition 2 feature reconstruction
As shown in fig. 5, in the first detection stage the position information of the decoded bounding box is used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied point-wise onto the original feature map as a mask to obtain a refined feature map. The specific procedure is: the class score maps generated from the three feature maps of different scales are used as masks, and each feature point keeps only its highest-scoring prediction box; the feature vectors corresponding to the five coordinates of the prediction box, namely the center point and the four vertices, are then sampled from the feature map, with bilinear interpolation of the coordinate positions providing accurate feature vectors; the five feature vectors are added and replace the current feature vector, and after all feature points have been traversed the whole feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied point-wise onto the original feature map as a mask, completing the reconstruction of the feature map. The pseudo code is provided as a figure in the original application; an illustrative sketch is given below.
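A minimal PyTorch sketch of this reconstruction step, written from the description above; the tensor layouts and the use of a sigmoid as the normalization are assumptions rather than the patent's pseudo code:

```python
import torch
import torch.nn.functional as F

def reconstruct_features(feat, corners, scores):
    """Illustrative feature reconstruction for one scale.

    feat:    (1, C, H, W) feature map
    corners: (H, W, 5, 2) xy pixel positions (center + 4 vertices) of the
             highest-scoring decoded box at every feature point
    scores:  (1, 1, H, W) class score map used as a mask
    """
    _, C, H, W = feat.shape
    # normalize pixel coordinates to [-1, 1] for grid_sample (bilinear interpolation)
    grid = corners.clone()
    grid[..., 0] = grid[..., 0] / (W - 1) * 2 - 1
    grid[..., 1] = grid[..., 1] / (H - 1) * 2 - 1
    grid = grid.view(1, H, W * 5, 2)

    sampled = F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
    sampled = sampled.view(1, C, H, W, 5).sum(dim=-1)       # add the five feature vectors

    # normalize the rebuilt map and multiply it point-wise onto the original map
    rebuilt = torch.sigmoid(sampled) * scores
    return feat * rebuilt
```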
definitions 3 minimization of Smooth L1 loss
As shown in fig. 9, when the rotation angle θ of the real box (solid box) approaches -90 degrees and the rotation angle θ' of the prediction box (dashed box) approaches 0 degrees, the angle loss is as large as nearly 90 degrees, even though the prediction box is already very close to the real box. Optimizing with the ordinary Smooth L1 loss, the model therefore has difficulty training along the fastest route. With the minimized Smooth L1 loss, the model only needs to regress a small deviation (a small clockwise rotation) to reach the real box.
when the real frame θ approaches-90 degrees and the prediction frame θ' approaches 0 degrees, the angle loss is as large as approaching 90 degrees. However, at this time, the prediction frame approaches the real frame, and only a small clockwise rotation angle is required to return to the real frame. The loss calculation in this case should be:
L_θ = Smooth L1(90° - |θ' - θ|)
the total loss is then:
L = (1/N) Σ_n [ t'_n Σ_j L_reg(v'_nj, v_nj) + L_cls(p_n, t_n) ], where the angle component of L_reg is taken as min{ Smooth L1(|θ' - θ|), Smooth L1(90° - |θ' - θ|) }.
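A small PyTorch sketch of the minimized Smooth L1 angle loss described above; this is one assumed way to implement it (angles in degrees), not the patent's code:

```python
import torch
import torch.nn.functional as F

def min_smooth_l1_angle_loss(theta_pred, theta_gt):
    """Take the smaller of the direct angle error and the complementary error,
    so a prediction near 0 deg for a target near -90 deg is not over-penalized."""
    diff = torch.abs(theta_pred - theta_gt)                      # angles in degrees
    direct = F.smooth_l1_loss(diff, torch.zeros_like(diff), reduction="none")
    wrapped = F.smooth_l1_loss(90.0 - diff, torch.zeros_like(diff), reduction="none")
    return torch.minimum(direct, wrapped).mean()
```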
referring to fig. 1, the rotating target detection process based on two different methods is implemented by the following steps:
step 1, inputting an image to perform feature extraction to obtain a feature map
After data enhancement operations such as random flipping, stretching and color gamut transformation are applied to the input image (these operations are used only during training, not during detection), the image is uniformly scaled to a standard size (for example 608 × 608) and the Focus slicing operation is applied; the result is then fed into the YOLOV5 feature extraction network. The YOLOV5 feature extraction network is composed of CSP (Cross Stage Partial Network), CBL (convolution + batch normalization + LeakyReLU) and SPP modules. The CSP module is the main structure for feature extraction: each CSP module splits the feature map of the base layer into two parts, passes one part through several residual modules, and then merges it with the other part through a cross-stage hierarchy, avoiding the excessive inference computation caused by duplicated gradient information during network optimization and reducing the amount of computation while maintaining accuracy. By changing the number of residual components in the CSP module, the depth and scale of the whole model can be controlled, realizing different detection accuracies and speeds. CBL is a conventional feature extraction operation. The SPP module applies max pooling at four different scales to the same feature map and concatenates the four pooled maps, so that target information at different scale levels is retained. After feature extraction, the feature maps of different layers are fed into the FPN and PAN modules. FPN is a top-down structure that propagates and fuses the semantic information of high-level feature maps downward by upsampling to obtain feature maps for prediction, while PAN is a bottom-up feature pyramid. FPN conveys strong semantic features from top to bottom and PAN conveys strong localization features from bottom to top, achieving feature fusion from different backbone layers to different detection layers. Finally, three feature maps of different scales are output.
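As an illustration of the FPN + PAN fusion just described, the following minimal sketch assumes three backbone maps that already share a common channel count; it is not the patent's network definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNPAN(nn.Module):
    """Top-down (FPN) then bottom-up (PAN) fusion of three backbone feature maps."""
    def __init__(self, channels=256):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in range(3))
        self.down = nn.ModuleList(nn.Conv2d(channels, channels, 3, stride=2, padding=1)
                                  for _ in range(2))

    def forward(self, c3, c4, c5):
        # FPN: propagate semantics top-down by upsampling and adding
        p5 = self.reduce[2](c5)
        p4 = self.reduce[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.reduce[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # PAN: propagate localization bottom-up by strided convolution and adding
        n3 = p3
        n4 = p4 + self.down[0](n3)
        n5 = p5 + self.down[1](n4)
        return n3, n4, n5
```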
Step 2, classifying and regressing the characteristic diagram in the step 1, and performing characteristic reconstruction on a regression result
Classification and regression convolutions are applied to the feature maps from step 1, and the obtained regression parameters are decoded into the target's five-parameter rotated box (a rotated rectangular box defined by the five parameters x, y, w, h, θ). The position information of the decoded bounding box is then used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied point-wise onto the original feature map as a mask to obtain a refined feature map. The specific procedure is: the class score maps generated from the three feature maps of different scales are used as masks, and each feature point keeps only its highest-scoring prediction box; the feature vectors corresponding to the five coordinates of the prediction box, namely the center point and the four vertices, are then sampled from the feature map, with bilinear interpolation of the coordinate positions providing accurate feature vectors; the five feature vectors are added and replace the current feature vector, and after all feature points have been traversed the whole feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied point-wise onto the original feature map as a mask, completing the refinement of the feature map. This reconstruction process may be repeated two or more times.
Step 3, classifying and regressing again using the reconstructed feature map, outputting the result and calculating the loss
Classification and regression are applied to the feature map reconstructed in step 2, and the regression parameters are decoded into the target rotated boxes. The generated rotated boxes are output after an NMS (non-maximum suppression) operation that uses the class score map as the confidence (an illustrative sketch of this rotated NMS is given at the end of this section). During training, the loss is computed from both classification-and-regression passes of steps 2 and 3, so as to train the model. As shown in fig. 3, the loss function is:
L = (1/N) Σ_n [ t'_n Σ_j L_reg(v'_nj, v_nj) + L_cls(p_n, t_n) ]
where N denotes the number of anchor boxes; t'_n takes the value 0 or 1 (1 for foreground, 0 for background; no regression for the background); v'_nj denotes the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target category and p_n is the per-class probability distribution computed by sigmoid; L_reg is the Smooth L1 loss and L_cls uses the focal loss. To prevent the loss surge caused by the angular jump, the regression loss is adjusted to:
L_reg(θ', θ) = min{ Smooth L1(|θ' - θ|), Smooth L1(90° - |θ' - θ|) }
so that the model is trained in the fastest and best way.
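For illustration, the rotated NMS step mentioned in Step 3 can be sketched as a greedy loop as follows; the rotated-box IoU helper iou_fn is an assumed placeholder, not a function defined in the patent:

```python
import torch

def rotated_nms(boxes, scores, iou_fn, iou_thresh=0.5):
    """Greedy NMS over five-parameter boxes (x, y, w, h, theta).

    boxes:  (N, 5) tensor; scores: (N,) tensor of class confidences;
    iou_fn: assumed helper returning a tensor of IoUs between one rotated box and many.
    """
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        ious = iou_fn(boxes[i], boxes[order[1:]])
        order = order[1:][ious <= iou_thresh]
    return keep
```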

Claims (6)

1. A five-parameter detection method for arbitrarily oriented targets based on YOLOV5, characterized in that the method comprises the following specific steps:
(1) inputting the obtained remote sensing image into the YOLOV5 feature extraction network for feature extraction to obtain three feature maps of different scales;
(2) classifying and regressing on the feature maps obtained in step (1), and performing feature reconstruction on the regression result to obtain a refined feature map;
(3) classifying and regressing again using the refined feature map obtained in step (2), outputting the result and calculating the loss.
2. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 1, wherein: before feature extraction with the YOLOV5 network in step (1), the remote sensing image is subjected to random flipping, stretching, color gamut transformation and random graying data enhancement operations, then uniformly scaled to a standard size and subjected to the Focus slicing operation, and finally input into the YOLOV5 feature extraction network for feature extraction.
3. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 1, wherein the specific content of step (2) is as follows: classification and regression convolutions are respectively applied to the obtained feature maps of different scales, and the obtained regression parameters are decoded into the target's five-parameter rotated box; the position information of the decoded bounding box is then used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied point-wise onto the original feature map as a mask to obtain a refined feature map; the specific procedure is: the class score maps generated from the three feature maps of different scales are used as masks, each feature point keeping only its highest-scoring prediction box; the feature vectors corresponding to the five coordinates of the prediction box, namely the center point and the four vertices, are then obtained from the feature map, with bilinear interpolation of the coordinate positions providing accurate feature vectors; after all feature points have been traversed, the feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied point-wise onto the original feature map as a mask, completing the refinement of the feature map.
4. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 3, wherein the specific content of step (3) is as follows: the feature map refined in step (2) is classified and regressed again, the regression parameters are decoded into the target's five-parameter rotated box, non-maximum suppression is applied to the generated rotated boxes with the class score map as the confidence, the result is output, and the loss is calculated.
5. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 4, wherein the five-parameter model represents a rectangle in an arbitrary direction with five parameters x, y, w, h, θ, defined as: the target center point coordinates x, y, the target width and height w, h, and the rotation angle θ; taking the lowest vertex of the target box as the starting point and the ray extending from it along the positive x-axis as the reference line, and moving counterclockwise, the first side of the target box encountered is defined as the width w and the other side as the length h; the angle between the width w and the reference line is the target offset angle θ, with range [-90°, 0°].
6. The method for detecting five parameters of an arbitrarily oriented target based on YOLOV5 as claimed in claim 4, wherein the loss in step (3) is calculated as follows, the loss function being:
L = (1/N) Σ_n [ t'_n Σ_j L_reg(v'_nj, v_nj) + L_cls(p_n, t_n) ]
where N denotes the number of anchor boxes; t'_n takes the value 0 or 1, with 1 for foreground and 0 for background, and no regression is performed for the background; v'_nj denotes the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target category and p_n is the per-class probability distribution computed by sigmoid; L_reg is the Smooth L1 loss and L_cls uses the focal loss; to prevent the loss surge caused by the angular jump, the regression loss is adjusted to:
L_reg(θ', θ) = min{ Smooth L1(|θ' - θ|), Smooth L1(90° - |θ' - θ|) }
CN202110521035.XA 2021-05-13 2021-05-13 Method for detecting five parameters of target in any orientation based on YOLOV5 Pending CN113191296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110521035.XA CN113191296A (en) 2021-05-13 2021-05-13 Method for detecting five parameters of target in any orientation based on YOLOV5


Publications (1)

Publication Number Publication Date
CN113191296A true CN113191296A (en) 2021-07-30

Family

ID=76981387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110521035.XA Pending CN113191296A (en) 2021-05-13 2021-05-13 Method for detecting five parameters of target in any orientation based on YOLOV5

Country Status (1)

Country Link
CN (1) CN113191296A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function
CN112395975A (en) * 2020-11-17 2021-02-23 南京泓图人工智能技术研究院有限公司 Remote sensing image target detection method based on rotating area generation network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
WEN QIAN et al.: "Learning Modulated Loss for Rotated Object Detection", arXiv, 23 December 2019 (2019-12-23), pages 1 - 11 *
XUE YANG et al.: "R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object", arXiv, 2 December 2019 (2019-12-02), pages 1 - 13 *
刘思远 et al.: "Remote sensing image target detection method based on deep convolutional neural networks" (基于深度卷积神经网络的遥感图像目标检测方法), Industrial Control Computer (工业控制计算机), no. 05, 25 May 2020 (2020-05-25), pages 75 - 77 *
崔丽群 et al.: "A salient object detection method with improved background suppression" (一种背景抑制改进的显著性目标检测方法), Computer Engineering & Science (计算机工程与科学), pages 1435 - 1443 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966587B (en) * 2021-03-02 2022-12-20 北京百度网讯科技有限公司 Training method of target detection model, target detection method and related equipment
CN112966587A (en) * 2021-03-02 2021-06-15 北京百度网讯科技有限公司 Training method of target detection model, target detection method and related equipment
CN113673616B (en) * 2021-08-26 2023-09-29 南通大学 Light-weight small target detection method coupling attention and context
CN113673616A (en) * 2021-08-26 2021-11-19 南通大学 Attention and context coupled lightweight small target detection method
CN113449702B (en) * 2021-08-31 2021-12-03 天津联图科技有限公司 Target detection method and device for remote sensing image, storage medium and electronic equipment
CN113449702A (en) * 2021-08-31 2021-09-28 天津联图科技有限公司 Target detection method and device for remote sensing image, storage medium and electronic equipment
CN113837058B (en) * 2021-09-17 2022-09-30 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network
CN113837058A (en) * 2021-09-17 2021-12-24 南通大学 Lightweight rainwater grate detection method coupled with context aggregation network
CN114926552A (en) * 2022-06-17 2022-08-19 中国人民解放军陆军炮兵防空兵学院 Method and system for calculating Gaussian coordinates of pixel points based on unmanned aerial vehicle image
CN116052110A (en) * 2023-03-28 2023-05-02 四川公路桥梁建设集团有限公司 Intelligent positioning method and system for pavement marking defects
CN117094343A (en) * 2023-10-19 2023-11-21 成都新西旺自动化科技有限公司 QR code decoding system and method
CN117094343B (en) * 2023-10-19 2023-12-29 成都新西旺自动化科技有限公司 QR code decoding system and method
CN117994594A (en) * 2024-04-03 2024-05-07 武汉纺织大学 Power operation risk identification method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210730