CN113191296A - Method for detecting five parameters of target in any orientation based on YOLOV5 - Google Patents
- Publication number
- CN113191296A CN113191296A CN202110521035.XA CN202110521035A CN113191296A CN 113191296 A CN113191296 A CN 113191296A CN 202110521035 A CN202110521035 A CN 202110521035A CN 113191296 A CN113191296 A CN 113191296A
- Authority
- CN
- China
- Prior art keywords
- feature
- target
- yolov5
- parameters
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 52
- 238000001514 detection method Methods 0.000 claims abstract description 42
- 238000000605 extraction Methods 0.000 claims abstract description 29
- 238000010586 diagram Methods 0.000 claims abstract description 24
- 239000013598 vector Substances 0.000 claims description 18
- 230000008569 process Effects 0.000 claims description 10
- 238000013507 mapping Methods 0.000 claims description 9
- 238000010606 normalization Methods 0.000 claims description 6
- 238000009826 distribution Methods 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 3
- 230000005764 inhibitory process Effects 0.000 claims 2
- 238000012549 training Methods 0.000 abstract description 8
- 230000000694 effects Effects 0.000 abstract description 5
- 238000011897 real-time detection Methods 0.000 abstract description 3
- 230000001133 acceleration Effects 0.000 abstract description 2
- 238000004422 calculation algorithm Methods 0.000 description 12
- 238000012360 testing method Methods 0.000 description 9
- 238000004364 calculation method Methods 0.000 description 8
- 230000006870 function Effects 0.000 description 5
- 238000013459 approach Methods 0.000 description 4
- 238000013135 deep learning Methods 0.000 description 3
- 230000004927 fusion Effects 0.000 description 3
- 238000005457 optimization Methods 0.000 description 3
- 230000007547 defect Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 230000004807 localization Effects 0.000 description 2
- 238000011176 pooling Methods 0.000 description 2
- 230000001629 suppression Effects 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000007797 corrosion Effects 0.000 description 1
- 238000005260 corrosion Methods 0.000 description 1
- 229910003460 diamond Inorganic materials 0.000 description 1
- 239000010432 diamond Substances 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000000877 morphologic effect Effects 0.000 description 1
- 230000035772 mutation Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000010845 search algorithm Methods 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 230000009182 swimming Effects 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Biophysics (AREA)
- Remote Sensing (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Astronomy & Astrophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for detecting the five parameters of an arbitrarily oriented target based on YOLOV5. The features of a remote sensing image are first extracted with the YOLOV5 feature extraction network, producing feature outputs at three scales. The five parameters of the target rotating frame are regressed directly from the output feature map; the feature map is then reconstructed using the coordinates obtained after decoding the five parameters, and more accurate coordinates are regressed. Training minimizes a Smooth L1 loss function so that the model converges faster and better. Considering different task requirements and hardware bottlenecks, lightweight accelerated models representing different speeds and precisions are designed: the detection precision of the largest model reaches SOTA, while the model with the smallest network depth achieves near real-time detection at high precision and is convenient to deploy on mobile platforms such as unmanned aerial vehicles and the Raspberry Pi, giving the method a very wide application prospect.
Description
Technical Field
The invention relates to the technical fields of target detection, image processing, algorithms and neural network applications, and in particular to a five-parameter detection method for arbitrarily oriented targets based on YOLOV5.
Background
With the improvement of hardware equipment and the continuous maturing of remote sensing technology, the quality and resolution of remote sensing images shot by satellites, radars and unmanned aerial vehicles have reached the level of natural images. However, objects in remote sensing images have distinct characteristics: the targets are all seen from a top-down view angle; the target scale varies greatly; and special objects such as vehicles, airplanes and ships are arranged in arbitrary directions. Detecting rotated targets with a generic horizontal-box detector has three defects: the size and aspect ratio cannot reflect the true shape of the target object, as in fig. 2a; object and background pixels are not effectively separated, as in fig. 2b; and dense objects are difficult to separate from each other, as in fig. 2c. Detecting and locating targets with rectangular frames in arbitrary directions better reflects the position information of the object, as shown in figs. 2d, 2e and 2f, and is of great significance in geography, agriculture and the military. Rotated-frame detection methods originated in deep-learning-based scene text detection in arbitrary directions; representative algorithms are as follows:
1. Traditional algorithms represented by SWT, Selective Search and EdgeBox
Before the birth of deep learning methods, traditional algorithms such as SWT, MSER, ER, Selective Search and EdgeBox were mainly used for rotated target detection and inclined scene text detection. The basic idea is: first binarize the picture (e.g. adaptive binarization, optionally with simple Gaussian filtering if there is noise), then obtain the target region through morphological operations such as erosion and dilation, then find the points on the contour with a contour-search function, and finally take the minimum circumscribed rectangle. The SWT algorithm extracts edges and gradients with the Canny operator and then searches for the edge in the opposite direction along the gradient. The Edge Boxes algorithm counts the number of contours fully inside a candidate frame and the number overlapping its edge using edge information, scores the frame accordingly, and determines the proposal information (size, aspect ratio and position) by ranking the scores; subsequent work runs the actual detection algorithm inside the proposals. The Selective Search algorithm first divides the picture into many small regions with a simple region-division algorithm, and then continuously merges adjacent regions by pixel similarity and region size (small regions are merged first, which prevents large regions from continuously absorbing small ones and breaking the hierarchical relationship), similar to a clustering idea. After the approximate target region is obtained, the minimum enclosing rectangle (e.g. a rectangle of arbitrary angle for scene text) is drawn.
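The binarize-then-morphology pipeline described above can be sketched with plain NumPy; the threshold value and the naive square structuring element are simplifying assumptions (a real pipeline would use adaptive binarization and a contour/min-area-rectangle routine such as OpenCV's):

```python
import numpy as np

def binarize(img, thresh=128):
    """Simple global thresholding (a stand-in for adaptive binarization)."""
    return (img >= thresh).astype(np.uint8)

def dilate(mask, k=3):
    """Naive morphological dilation with a k x k square structuring element."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant")
    out = np.zeros_like(mask)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def erode(mask, k=3):
    """Erosion expressed as dilation of the complement (morphological duality)."""
    return 1 - dilate(1 - mask, k)
```

A dilation followed by an erosion (a morphological closing) fills small gaps in the target region before the contour and minimum circumscribed rectangle are extracted.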
2. RRPN inclined text detection method
The RRPN algorithm appeared in 2018 and is mainly used for oblique text detection. It is based on the region-extraction method of Faster R-CNN and represents a rotated rectangle with a five-parameter method: center point, width, height and rotation angle. Anchor frames with angles are generated in advance during detection, combined with RRoI (Rotation Region-of-Interest) pooling to learn rotated regions of interest. During training, a prediction frame whose IoU (intersection-over-union) with a GT (ground-truth) frame exceeds 0.7 and whose angle difference is less than π/12 is taken as a positive sample; a frame whose IoU with the GT frame is below 0.3, or whose IoU exceeds 0.7 but whose angle difference exceeds π/12, is taken as a negative sample. Smooth L1 is adopted as the regression loss and cross-entropy as the classification loss. In addition, the method proposes a triangle-decomposition method for calculating the overlapping area of oblique rectangles, achieving good results.
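The RRPN sample-assignment rule above reduces to a small decision function. A sketch under the assumption that the rotated IoU and absolute angle difference have already been computed (function and return-value names are illustrative):

```python
import math

def assign_rrpn_label(iou, angle_diff):
    """Label an anchor's prediction following the RRPN rule.

    iou        -- rotated IoU between the prediction and the GT frame
                  (computing rotated IoU itself is omitted here)
    angle_diff -- absolute angle difference to the GT frame, in radians
    Returns 1 (positive), 0 (negative) or -1 (ignored in the loss).
    """
    if iou > 0.7 and angle_diff < math.pi / 12:
        return 1   # high overlap and nearly aligned -> positive sample
    if iou < 0.3:
        return 0   # low overlap -> negative sample
    if iou > 0.7 and angle_diff >= math.pi / 12:
        return 0   # overlaps well but wrongly oriented -> negative sample
    return -1      # everything else is excluded from training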
3、ROI Transformer
The core idea of this method is to introduce an RoI Transformer module that converts the horizontal anchor frames output in the RPN stage into rotated anchor frames, avoiding the huge computation caused by introducing large numbers of rotated anchors. The RoI Transformer module has two parts. The first, the RRoI Learner, learns RRoIs (rotated regions of interest) from HRoIs (horizontal regions of interest): an offset (x, y, w, h, θ) is generated by feeding the feature map into a five-dimensional fully connected layer. The second, RRoI Warping, extracts rotation-invariant depth features from the feature map and the RRoIs, regresses a refined offset, and decodes it into the output rotated frame. In the ideal case each HRoI is the circumscribed rectangle of its RRoI. By introducing the RoI Transformer, the method greatly reduces computational cost and achieves good results.
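Decoding an RRoI from an HRoI plus the five-dimensional offset might look like the sketch below; the exact delta parameterization is an assumption, following the usual box-delta convention (center shift scaled by box size, log-space size deltas):

```python
import math

def decode_rroi(hroi, offset):
    """Decode a rotated RoI from a horizontal RoI plus a learned offset.

    hroi   -- (x, y, w, h): horizontal RoI, angle implicitly 0
    offset -- (dx, dy, dw, dh, dt): offsets from the fully connected layer
    """
    x, y, w, h = hroi
    dx, dy, dw, dh, dt = offset
    return (x + dx * w,         # centre shift scaled by RoI width
            y + dy * h,         # centre shift scaled by RoI height
            w * math.exp(dw),   # log-space width delta
            h * math.exp(dh),   # log-space height delta
            dt)                 # predicted rotation angle (radians)
```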
4、Gliding Vertex
This method was published at CVPR 2020. It represents an object by learning the offsets of the object's four points on its non-rotated (horizontal) rectangle, thereby localizing a quadrilateral. The network structure is again based on Faster R-CNN, with classification and regression done separately in the final fully connected layer. Position regression uses a nine-parameter method: besides the horizontal box coordinates (x, y, w, h) and the four point offsets (α1, α2, α3, α4), a rotation factor r (calculated as the ratio of the quadrilateral's area to the area of its circumscribed horizontal rectangle) is introduced to decide whether the box is horizontal or rotated; when r exceeds 0.95 the target is judged horizontal and α is set to 1.
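The rotation factor r described above is a simple area ratio. A sketch (function names are illustrative; the shoelace formula computes the quadrilateral's area):

```python
def polygon_area(pts):
    """Shoelace formula for the area of a simple polygon given as (x, y) pairs."""
    n = len(pts)
    s = 0.0
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def rotation_factor(quad, hbox):
    """Gliding-Vertex style factor r = area(quad) / area(circumscribed horizontal box)."""
    x, y, w, h = hbox          # (centre x, centre y, width, height)
    return polygon_area(quad) / (w * h)

def is_horizontal(quad, hbox, thresh=0.95):
    """Treat the object as a horizontal rectangle when r exceeds the threshold."""
    return rotation_factor(quad, hbox) > thresh
```

A quadrilateral that nearly fills its horizontal box (r close to 1) is effectively axis-aligned; a 45-degree "diamond" only covers half its box (r = 0.5) and is kept as a rotated target.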
5、P-RSDet
This method, titled "Object Detection for Remote Sensing Image Based on Polar Coordinates", was published in CVPR 2020. It introduces polar coordinates into rotated target detection for the first time and features fewer parameters and higher speed. Its rotated-box representation follows CornerNet, regressing the pole (x, y) and the polar parameters (ρ, θ1, θ2). The feature extraction network offers different structures such as ResNet101, DLA34 and Hourglass, representing different scales and speeds. In the detection head, extreme-point regression adopts a Gaussian heat-map scheme similar to CenterNet's, outputting a probability map of extreme-point positions; the category loss adopts Focal Loss. In the regression loss, the center-point coordinates use Smooth L1 loss, while for the extreme-point coordinates the authors introduce the Polar Ring Area Loss, expanded as:
L_pr(ρ, θ) = Smooth L1(|(ρ² − (ρ*)²)(θ − θ*)|, 0).
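This loss penalizes the area of the polar ring sector between the predicted (ρ, θ) and the ground truth (ρ*, θ*). A scalar sketch (the Smooth L1 threshold β = 1 is an assumption):

```python
def smooth_l1(x, beta=1.0):
    """Standard Smooth L1 (Huber) penalty on a scalar residual."""
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta

def polar_ring_area_loss(rho, theta, rho_gt, theta_gt):
    """L_pr = SmoothL1(|(rho^2 - rho*^2)(theta - theta*)|, 0): the residual is
    proportional to the ring-sector area between prediction and ground truth."""
    return smooth_l1((rho ** 2 - rho_gt ** 2) * (theta - theta_gt))
```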
The first class of methods, the traditional detection methods, needs hand-crafted feature extraction operators designed per target, has poor robustness, can only extract shallow features, and has weak semantic expression. The SWT algorithm, for example, extracts edges and gradients with the Canny operator and then searches for the opposite edge along the gradient direction; yet even when all edges are extracted accurately, computing the target width along the searched edge remains a problem. The Edge Boxes algorithm, moreover, is not a "learning"-based algorithm and has no training process, so its proposals are class-agnostic: it cannot learn that for people the highest-scoring proposal (region of interest) should be the person, for cars the car, and so on, and thus cannot generalize across categories. The second method is a straightforward improvement over horizontal Faster R-CNN: it requires designing a large number of anchors (anchor boxes) to cover all possible scales, aspect ratios and angles of the target, which is computationally expensive. The third method has a less effective feature extraction network; its FPN outputs five layers of feature maps, increasing computation, and each HRoI is followed by a five-dimensional fully connected layer with the same channel count, whose parameter count greatly reduces inference speed. In the nine-parameter regression of the fourth method, accuracy relies on the horizontal detection box generated in the first stage: if the first-stage regression is inaccurate, the four offsets predicted in the second stage cannot be accurate either. The fifth method, unlike the first four, directly opens a new line of thought for rotated target detection.
However, since this method is anchor-free, the increase in speed inevitably comes with reduced accuracy (no anchors are generated during prediction and regression is performed directly, which saves a large amount of time).
Therefore, designing an anchor-based rotated target detection model that is fast, highly accurate and able to reach SOTA is of great significance for detecting rotated targets in remote sensing images.
Disclosure of Invention
The invention aims to remedy the defects of the prior art by providing a five-parameter detection method for arbitrarily oriented targets based on YOLOV5. First, remote sensing image features are extracted with the YOLOV5 feature extraction network, then feature outputs at three scales are produced with the FPN + PAN structure, and classification and regression are performed directly on the output feature maps to obtain the position and category of targets in the image. In the second detection stage, the target position information obtained in the first stage is used to reconstruct the features, yielding a finer feature map from which more accurate coordinates are regressed. Training minimizes a Smooth L1 loss so that the angle loss of the model converges faster and better. In addition, four models from large to lightweight are designed with different numbers of convolution layers, representing different computation loads, precisions and detection speeds, so different network depths can be chosen for different tasks. Compared with the prior art, the method reaches SOTA in both detection precision and speed.
The invention is realized by the following technical scheme:
a five-parameter detection method of any orientation target based on YOLOV5 comprises the following specific steps:
(1) inputting the obtained remote sensing image into a Yolov5 feature extraction network for feature extraction to obtain three feature graphs with different scales;
(2) classifying and regressing the characteristic diagram obtained in the step (1), and performing characteristic reconstruction operation on a regression result to obtain a more detailed characteristic diagram;
(3) and (4) classifying and regressing again by using the fine characteristic diagram obtained in the step (2), and outputting and calculating loss.
Before feature extraction with the YOLOV5 feature extraction network in step (1), data enhancement operations such as random flipping, stretching, color gamut transformation and random image graying (for infrared image detection) are performed on the remote sensing image; the image is then uniformly scaled to a standard size, the Focus slicing operation is applied, and the result is input to the YOLOV5 feature extraction network for feature extraction.
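The Focus slicing operation is a space-to-depth rearrangement: every second pixel is sampled at the four even/odd offsets and the slices are stacked along the channel axis, halving the spatial size while quadrupling the channels. A NumPy sketch (channel-first layout assumed):

```python
import numpy as np

def focus_slice(img):
    """YOLOv5-style Focus: (C, H, W) -> (4C, H/2, W/2) by stacking the four
    stride-2 pixel slices along the channel axis."""
    c, h, w = img.shape
    assert h % 2 == 0 and w % 2 == 0
    return np.concatenate([img[:, 0::2, 0::2],   # even rows, even cols
                           img[:, 1::2, 0::2],   # odd rows,  even cols
                           img[:, 0::2, 1::2],   # even rows, odd cols
                           img[:, 1::2, 1::2]],  # odd rows,  odd cols
                          axis=0)
```

No information is lost: all pixels survive, only rearranged, so the first convolution sees the full image at half the spatial cost.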
The YOLOV5 feature extraction network is composed of CSP (Cross Stage Partial Network), CBL (convolution + batch normalization + LeakyReLU) and SPP modules. The CSP module is the main feature extraction structure: each CSP module divides the feature map of the base layer into two parts, convolves one part through several residual modules, and then merges it with the other part through a cross-stage hierarchical structure. This avoids the excessive inference computation caused by repeated gradient information during network optimization, reducing computation while preserving accuracy. CBL is a conventional feature extraction operation. SPP applies max pooling four times at different scales to the same feature map and superimposes the four pooled maps, retaining target information at different scale levels. After feature extraction, the feature maps of different layers are input to the FPN and PAN modules. FPN is a top-down structure that propagates and fuses the semantic information of high-level feature maps downward via upsampling to obtain the feature maps for prediction; PAN is a bottom-up feature pyramid. FPN conveys strong semantic features top-down while PAN conveys strong localization features bottom-up, fusing features from different backbone layers into the different detection layers. Finally, three feature maps of different scales are output.
The specific content of step (2) is: classification and regression convolutions are applied separately to the obtained feature maps of different scales, and the regression parameters are decoded into the five-parameter rotated frame of a target. The corresponding feature points are then re-encoded with the position information of the decoded bounding box, the whole feature map is reconstructed, and the new feature map is normalized and multiplied element-wise onto the original feature map as a mask to obtain a refined feature map. The concrete procedure is: the class score map generated from the three feature maps of different scales is used as a mask, and each feature point retains only its highest-scoring prediction box; the corresponding feature vectors are then fetched from the feature map at five coordinates of the prediction frame, namely the center point and the four vertices, using bilinear interpolation of the coordinate positions to obtain accurate feature vectors; after all feature points are traversed, the feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied element-wise onto the original feature map as a mask, completing the refinement of the feature map.
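The bilinear interpolation used to fetch accurate feature vectors at fractional box coordinates can be sketched as follows (single-channel map for simplicity; real feature maps carry a channel dimension):

```python
import math
import numpy as np

def bilinear_sample(fmap, x, y):
    """Sample a 2-D feature map at a fractional position (x, y) by bilinear
    interpolation of the four surrounding feature points."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    fx, fy = x - x0, y - y0
    top = fmap[y0, x0] * (1 - fx) + fmap[y0, x1] * fx   # blend along x, top row
    bot = fmap[y1, x0] * (1 - fx) + fmap[y1, x1] * fx   # blend along x, bottom row
    return top * (1 - fy) + bot * fy                    # blend along y
```

This is what removes the feature misalignment illustrated in fig. 5: nearest-neighbour rounding of the vertex coordinates would sample the wrong cells, while the weighted blend lands on the true sub-pixel position.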
The specific content of step (3) is: the feature map reconstructed in step (2) is classified and regressed again, the regression parameters are decoded into the five-parameter rotated frame of the target, non-maximum suppression is performed on the generated rotated frames with the class score map as confidence, the rotated frames are output, and the loss is calculated.
The five-parameter model represents a rectangle of arbitrary direction with five parameters (x, y, w, h, θ), defined as: target center point coordinates x, y; target width and height w, h; and rotation angle θ. Taking the lowest vertex of the target frame as the starting point and the ray extending from it along the positive x axis as the reference line, moving counterclockwise, the first edge of the target frame encountered is defined as the width w and the other edge as the length h; the angle between the width w and the reference line is the target offset angle θ, with range [-90, 0).
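Decoding the five parameters back into the four vertices of the rotated rectangle is a rotation of the axis-aligned corner offsets about the center; a sketch (degrees assumed for θ, per the [-90, 0) range above):

```python
import math

def five_param_to_corners(x, y, w, h, theta_deg):
    """Decode (x, y, w, h, theta) into the four vertices of the rotated
    rectangle. theta is in degrees in [-90, 0), measured between the width
    edge and the positive x axis."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    corners = []
    # corner offsets in the box's own frame: width along x, height along y
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((x + dx * c - dy * s,   # rotate offset, then translate
                        y + dx * s + dy * c))
    return corners
```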
The loss in step (3) is calculated as follows. The loss function is

L = (λ_reg / N) Σ_n t'_n Σ_j L_reg(v'_nj, v_nj) + (λ_cls / N) Σ_n L_cls(p_n, t_n)

where N is the number of anchor frames; t'_n takes the value 0 or 1 (1 for foreground, 0 for background, and background boxes contribute no regression); v'_nj is the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target category and p_n the per-category probability distribution computed by sigmoid; L_reg is the Smooth L1 loss and L_cls uses focal loss. To prevent the loss increase caused by the angle jump, the angle term of the regression loss is adjusted to take the minimum of the two equivalent parameterizations:

L_θ = min(Smooth L1(θ' − θ), Smooth L1(90° − |θ' − θ|)),

with the predicted width and height exchanged in the second case.
the invention has the advantages that: the method uses the specific CSPNet module of YOLOV5 to increase the speed and precision of feature extraction, and the structure of combining FPN and PAN further increases the fusion capability of features with different scales; adding a feature reconstruction module into a five-parameter angle regression model to realize feature alignment, and introducing a minimization Smmolh L1 loss function to reduce loss mutation caused by inaccurate angle regression; considering different task requirements and hardware bottlenecks, designing lightweight acceleration models representing different speeds and accuracies; the detection precision of the model with the largest scale reaches SOTA, and the model with the smallest network depth can realize the effect of near real-time detection on higher precision, is convenient to carry on mobile terminals such as unmanned aerial vehicles and raspberry groups, and has very wide application prospect.
Drawings
Fig. 1 is a schematic flow chart of the YOLOV5-based arbitrary-orientation target five-parameter detection method.
FIG. 2 is a schematic diagram showing the comparison between horizontal frame and rotating frame detection in remote sensing image target detection (FIG. 2a represents a diagram in which the size and the aspect ratio cannot reflect the real shape of a target object; FIG. 2b represents a diagram in which an object and background pixels are not effectively separated; FIG. 2c represents a diagram in which dense objects are difficult to separate; and FIGS. 2d, 2e and 2f represent diagrams in which rectangular frames in any directions are used for detecting and positioning targets).
FIG. 3 compares the loss fluctuation when using the minimum Smooth L1 loss versus the ordinary Smooth L1 loss. It can be seen that the minimum Smooth L1 loss varies least and is also more likely to descend to the optimum point.
FIG. 4 is a schematic diagram of a feature reconstruction process using a five parameter regression based method.
Fig. 5 shows feature reconstruction using a bilinear interpolation method. (FIG. 5a is the original image, FIG. 5b is the bilinear interpolation calculation method, FIG. 5c is the deviation caused by feature misalignment, FIG. 5d is the more accurate diagram of the bounding box obtained after bilinear interpolation)
FIG. 6 is a graph comparing the results of testing four different scale models on a DOTA dataset against a UCAS-AOD dataset.
FIG. 7 compares the test results of the present invention with other detection methods on the DOTA and UCAS-AOD datasets. (FIG. 7a compares results on the DOTA dataset, with class abbreviations: Pl: Plane, Bd: Baseball diamond, Br: Bridge, Gtf: Ground track field, Sv: Small vehicle, Lv: Large vehicle, Sh: Ship, Tc: Tennis court, Bc: Basketball court, St: Storage tank, Sbf: Soccer-ball field, Ra: Roundabout, Ha: Harbor, Sp: Swimming pool, He: Helicopter; FIG. 7b compares results on the UCAS-AOD dataset.)
Fig. 8 is a schematic diagram of a rectangular frame with an arbitrary orientation represented by five parameters (fig. 8a is a schematic diagram of one orientation of the rectangular frame, and fig. 8b is a schematic diagram of another orientation of the rectangular frame).
FIG. 9 is a schematic diagram of the model non-convergence that may occur when the angle loss is calculated without the minimum Smooth L1 loss.
Detailed Description
The invention is verified mainly on mainstream datasets. The test computer has an Intel Core i9-10900K CPU (3.7 GHz) running Ubuntu 18.04, 16 GB of memory, and an Nvidia 2080 Ti GPU with 12 GB of video memory. All steps and conclusions were verified correct with Python 3.8 and the deep learning framework PyTorch 1.7.0. FIG. 6 compares the test results of four models of different scales on the DOTA and UCAS-AOD datasets: the heaviest model has the highest precision with fast detection, and the lightest model reaches near real-time detection speed on the basis of high precision. FIG. 7 compares the test results of the present invention with other detection methods on the DOTA and UCAS-AOD datasets (FIG. 7a on the DOTA dataset; FIG. 7b on the UCAS-AOD dataset). Under the same training conditions, the invention achieves higher precision and speed. The method of the present invention is further illustrated with reference to the accompanying drawings and specific examples.
Fig. 1 shows a schematic flow diagram of a rotary target detection model based on YOLOV5, and the specific embodiment is as follows:
for convenience of description, the following terms are first defined:
defining 1 five-parameter model
As shown in figs. 8a and 8b, the five-parameter model represents a rectangle of arbitrary direction with five parameters (x, y, w, h, θ): target center point coordinates (x, y), target width and height (w, h), and rotation angle (θ). Taking the lowest vertex of the target frame as the starting point and the ray extending from it along the positive x axis as the reference line, moving counterclockwise, the first edge of the target frame encountered is defined as the width w and the other edge as the length h. The angle between the width w and the reference line is the target offset angle θ, with range [-90, 0).
Definition 2 feature reconstruction
As shown in fig. 5, in the first stage of detection, the position information of the decoded bounding box is used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied element-wise onto the original feature map as a mask to obtain a refined feature map. The concrete procedure is: the class score map generated from the three feature maps of different scales is used as a mask, and each feature point retains only its highest-scoring prediction box; the corresponding feature vectors are then fetched from the feature map at five coordinates of the prediction frame, namely the center point and the four vertices, using bilinear interpolation of the coordinate positions to obtain accurate feature vectors; the five feature vectors are added to replace the current feature vector, and the whole feature map is reconstructed once all feature points are traversed; finally, the reconstructed feature map is normalized and multiplied element-wise onto the original feature map as a mask, completing the reconstruction of the feature map.
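A minimal sketch of this reconstruction loop follows. It makes several simplifying assumptions not in the original: a single-channel feature map, one prediction per feature point, and nearest-neighbour sampling in place of the bilinear interpolation described above:

```python
import numpy as np

def refine_feature_map(fmap, boxes, scores):
    """Sketch of the feature reconstruction step.

    fmap   -- (H, W) feature map (real maps carry a channel dimension)
    boxes  -- (H, W, 5, 2) centre + four vertex coordinates per feature point
    scores -- (H, W) class-score map of the retained predictions (the mask)
    """
    h, w = fmap.shape
    recon = np.zeros_like(fmap)
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for x, y in boxes[i, j]:
                xi = min(max(int(round(x)), 0), w - 1)   # nearest neighbour in
                yi = min(max(int(round(y)), 0), h - 1)   # place of bilinear
                acc += fmap[yi, xi] * scores[i, j]
            recon[i, j] = acc            # five sampled vectors, summed
    recon /= recon.max() + 1e-6          # normalization step
    return fmap * recon                  # mask-multiply onto the original
```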
Definition 3: minimization of the Smooth L1 loss
As shown in fig. 9, when the rotation angle θ of the real frame (solid frame) approaches -90 degrees and the rotation angle θ' of the prediction frame (dashed frame) approaches 0 degrees, the angle loss grows to nearly 90 degrees, even though the prediction frame is by then very close to the real frame and only needs a small clockwise rotation to coincide with it. Optimizing with the plain Smooth L1 loss, the model therefore cannot train along the fastest route. In this boundary case the loss should instead be computed with the width and height exchanged and the angle measured across the -90/0 jump:

L_boundary = smooth_L1(x' - x) + smooth_L1(y' - y) + smooth_L1(w' - h) + smooth_L1(h' - w) + smooth_L1(90° - |θ' - θ|)

The total regression loss is then:

L_reg = min(L_direct, L_boundary)

where L_direct is the ordinary Smooth L1 loss over the five parameters.
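The boundary-case behaviour can be sketched in NumPy as follows, following the modulated-loss idea. All names are illustrative; angles are kept in degrees here for readability, whereas a real implementation would normalize the parameters before computing the loss:

```python
import numpy as np

def smooth_l1(x):
    """Elementwise Smooth L1: quadratic below 1, linear above."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def modulated_angle_loss(pred, target):
    """pred/target: (x, y, w, h, theta_deg) with theta in [-90, 0].
    Returns the smaller of the direct loss and the boundary-case loss in
    which w and h are swapped and the angle difference is measured across
    the -90/0 jump."""
    px, py, pw, ph, pt = pred
    tx, ty, tw, th, tt = target
    direct = sum(smooth_l1(a - b) for a, b in zip(pred, target))
    boundary = (smooth_l1(px - tx) + smooth_l1(py - ty)
                + smooth_l1(pw - th) + smooth_l1(ph - tw)
                + smooth_l1(90.0 - abs(pt - tt)))
    return min(direct, boundary)
```

For a prediction near 0 degrees matched against a target near -90 degrees with width and height swapped, the boundary branch is selected and the loss stays small, which is exactly the behaviour fig. 9 calls for.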
Referring to fig. 1, the rotating target detection process is implemented by the following steps:
Step 1, inputting the image into the YOLOV5 feature extraction network to obtain three feature maps of different scales
After data enhancement operations such as random flipping, stretching and color gamut transformation are performed on the input image (these operations are applied only during training, not during detection), the image is uniformly scaled to a standard size (for example, 608 × 608), the Focus slicing operation is performed, and the result is input to the YOLOV5 feature extraction network. The YOLOV5 feature extraction network is composed of several CSP (Cross Stage Partial Network), CBL (convolution + batch normalization + LeakyReLU) and SPP modules. The CSP module is the main structure for feature extraction: each CSP module divides the feature map of the base layer into two parts, one of which is convolved by several residual modules and then merged with the other through a cross-stage hierarchical structure. This avoids the excessive inference computation caused by repeated gradient information during network optimization, reducing the amount of computation while preserving accuracy. By changing the number of residual components in the CSP modules, the depth and scale of the whole model can be controlled, trading detection accuracy against speed. CBL is a conventional feature extraction operation. The SPP module applies maximum pooling at several different kernel scales to the same feature map, and the four resulting branches are concatenated to retain target information at different scale levels. After feature extraction, the feature maps of different layers are input to the FPN and PAN modules. The FPN is a top-down structure that propagates and fuses the semantic information of high-level feature maps downward by upsampling to obtain the feature maps used for prediction, and the PAN is a bottom-up feature pyramid.
The FPN conveys strong semantic features from top to bottom, while the PAN conveys strong localization features from bottom to top, achieving feature fusion from different backbone layers to different detection layers. Finally, three feature maps of different scales are output.
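The Focus slicing and SPP pooling described above can be illustrated with plain NumPy stand-ins (not the actual YOLOv5 code, which implements these modules with convolutions on GPU tensors):

```python
import numpy as np

def focus_slice(img):
    """Focus: sample every other pixel into 4 sub-images and stack them on
    the channel axis, halving H and W. img: (C, H, W) with even H, W."""
    return np.concatenate([img[:, ::2, ::2], img[:, 1::2, ::2],
                           img[:, ::2, 1::2], img[:, 1::2, 1::2]], axis=0)

def spp(feat, kernels=(5, 9, 13)):
    """SPP: max-pool the same map at several kernel scales (stride 1,
    'same' padding) and concatenate the pooled maps with the input,
    giving four branches in total for the default kernels."""
    feat = np.asarray(feat, dtype=float)
    C, H, W = feat.shape
    outs = [feat]
    for k in kernels:
        p = k // 2
        padded = np.pad(feat, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
        pooled = np.empty_like(feat)
        for y in range(H):
            for x in range(W):
                pooled[:, y, x] = padded[:, y:y + k, x:x + k].max(axis=(1, 2))
        outs.append(pooled)
    return np.concatenate(outs, axis=0)
```

Note how Focus quadruples the channel count while halving the spatial resolution, so no pixel information is discarded before the first convolution.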
Step 2, classifying and regressing the feature maps from step 1, and performing feature reconstruction on the regression results
The feature maps from step 1 are passed through classification and regression convolutions, and the obtained regression parameters are decoded into a five-parameter-model rotated box for the target (a rotated rectangular box defined by the five parameters x, y, w, h, θ). The decoded bounding-box position information is then used to re-encode the corresponding feature points and reconstruct the whole feature map; the new feature map is normalized and multiplied pointwise, as a mask, onto the original feature map to obtain a refined feature map. The specific implementation is as follows: the class score map generated from the three feature maps of different scales is used as a mask, and each feature point retains only the prediction box with the highest score. The corresponding feature vectors are then obtained on the feature map from five coordinates of the prediction box, namely the center point and the four vertices, with bilinear interpolation of the coordinate positions used to obtain accurate feature vectors. The five feature vectors are summed and replace the current feature vector; after all feature points have been traversed, the whole feature map is reconstructed. Finally, the reconstructed feature map is normalized and multiplied pointwise, as a mask, onto the original feature map, completing the refinement of the feature map. This reconstruction process may be repeated two or more times.
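The decoding step above presumably maps predicted offsets onto rotated anchors. One common rotated-anchor parameterization, assumed here for illustration rather than quoted from the patent, is:

```python
import numpy as np

def decode_box(anchor, deltas):
    """Decode predicted offsets into a five-parameter rotated box.
    anchor: (ax, ay, aw, ah, atheta); deltas: (dx, dy, dw, dh, dt).
    Assumed parameterization: centers shift relative to anchor size,
    sizes scale exponentially, and the angle shifts additively (degrees)."""
    ax, ay, aw, ah, at = anchor
    dx, dy, dw, dh, dt = deltas
    return (ax + dx * aw, ay + dy * ah,
            aw * np.exp(dw), ah * np.exp(dh), at + dt)
```

With all-zero deltas the decoded box is the anchor itself, which makes the encoding easy to verify.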
Step 3, classifying and regressing again using the reconstructed feature map, producing the output and computing the loss
The feature map reconstructed in step 2 is subjected to classification and regression, and the regression parameters are decoded into the target rotated box. The generated rotated boxes are output after a non-maximum suppression (NMS) operation, with the class score map used as the confidence. In the training stage, the losses of both rounds of classification and regression (steps 2 and 3) are computed to train the model. As shown in fig. 3, the loss function is

L = (λ1/N) Σ_n t'_n Σ_j L_reg(v'_nj, v_nj) + (λ2/N) Σ_n L_cls(p_n, t_n)

where N is the number of anchor boxes; t'_n is 0 or 1 (1 for foreground, 0 for background; background boxes are not regressed); v'_nj is the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target class and p_n the class probability computed by sigmoid; L_reg is the Smooth L1 loss and L_cls is the focal loss. To prevent the loss surge caused by the angle jump, the regression loss is adjusted to the boundary-aware form of definition 3, i.e. the minimum of the direct loss and the boundary-case loss, so that the model trains along the fastest and best route.
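The combined loss structure described above can be sketched in NumPy as follows (focal classification loss plus a regression loss masked to foreground anchors; λ1, λ2 and all names are illustrative):

```python
import numpy as np

def focal_loss(p, t, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted probability p against label t in {0, 1}."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(t == 1, p, 1 - p)       # probability of the true class
    a = np.where(t == 1, alpha, 1 - alpha)
    return -a * (1 - pt) ** gamma * np.log(pt)

def total_loss(reg_losses, fg_mask, cls_probs, cls_labels, lam1=1.0, lam2=1.0):
    """Combine the two heads: regression only on foreground anchors
    (t'_n = 1), classification on all anchors, each averaged over N."""
    n = len(fg_mask)
    reg = lam1 / n * np.sum(np.asarray(reg_losses) * np.asarray(fg_mask))
    cls = lam2 / n * np.sum(focal_loss(np.asarray(cls_probs), np.asarray(cls_labels)))
    return reg + cls
```

Here `reg_losses` would be per-anchor modulated regression losses; background anchors contribute only through the classification term, matching the t'_n mask in the formula.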
Claims (6)
1. An arbitrary-orientation target five-parameter detection method based on YOLOV5, characterized in that the method comprises the following steps:
(1) inputting the obtained remote sensing image into the YOLOV5 feature extraction network for feature extraction to obtain three feature maps of different scales;
(2) classifying and regressing the feature maps obtained in step (1), and performing feature reconstruction on the regression results to obtain a refined feature map;
(3) classifying and regressing using the refined feature map obtained in step (2), producing the output and computing the loss.
2. The YOLOV5-based arbitrary-orientation target five-parameter detection method according to claim 1, characterized in that: before feature extraction with the YOLOV5 feature extraction network in step (1), the remote sensing image is subjected to the data enhancement operations of random flipping, stretching, color gamut transformation and random graying, then uniformly scaled to a standard size and subjected to the Focus slicing operation, and finally input to the YOLOV5 feature extraction network for feature extraction.
3. The YOLOV5-based arbitrary-orientation target five-parameter detection method according to claim 1, characterized in that the specific content of step (2) is as follows: classification and regression convolutions are applied separately to the obtained feature maps of different scales, and the obtained regression parameters are decoded into a five-parameter-model rotated box for the target; the decoded bounding-box position information is then used to re-encode the corresponding feature points and reconstruct the whole feature map, and the new feature map is normalized and multiplied pointwise, as a mask, onto the original feature map to obtain a refined feature map; the specific implementation is as follows: the class score map generated from the three feature maps of different scales is used as a mask, and each feature point retains only the prediction box with the highest score; the corresponding feature vectors are then obtained on the feature map from five coordinates of the prediction box, namely the center point and the four vertices, with bilinear interpolation of the coordinate positions used to obtain accurate feature vectors; after all feature points have been traversed, the feature map is reconstructed; finally, the reconstructed feature map is normalized and multiplied pointwise, as a mask, onto the original feature map, completing the refinement of the feature map.
4. The YOLOV5-based arbitrary-orientation target five-parameter detection method according to claim 3, characterized in that the specific content of step (3) is as follows: the feature map refined in step (2) is again classified and regressed, the regression parameters are decoded into a five-parameter-model rotated box for the target, a non-maximum suppression operation is performed with the class score map used as the confidence of the generated rotated boxes, the result is output, and the loss is computed.
5. The YOLOV5-based arbitrary-orientation target five-parameter detection method according to claim 4, characterized in that: the five-parameter model represents a rectangle in an arbitrary direction with five parameters x, y, w, h, θ, defined as: the target center point coordinates x, y, the target width and height w, h, and the rotation angle θ; taking the lowest vertex of the target frame as the starting point and the ray extending from it along the positive x-axis direction as the reference line, and moving counterclockwise, the first edge of the target frame encountered is defined as the width w and the other edge as the height h; the included angle between the width w and the reference line is the target offset angle θ, with range [-90, 0].
6. The YOLOV5-based arbitrary-orientation target five-parameter detection method according to claim 4, characterized in that the loss in step (3) is computed as follows: the loss function is L = (λ1/N) Σ_n t'_n Σ_j L_reg(v'_nj, v_nj) + (λ2/N) Σ_n L_cls(p_n, t_n), where N is the number of anchor boxes; t'_n is 0 or 1 (1 for foreground, 0 for background; background boxes are not regressed); v'_nj is the predicted offset vector and v_nj the target vector of the real box; t_n is the actual target class and p_n the class probability computed by sigmoid; L_reg is the Smooth L1 loss and L_cls is the focal loss; to prevent the loss surge caused by the angle jump, the regression loss is adjusted to take the minimum of the direct loss and the boundary-case loss in which the width and height are exchanged and the angle is measured across the -90/0 jump.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110521035.XA CN113191296A (en) | 2021-05-13 | 2021-05-13 | Method for detecting five parameters of target in any orientation based on YOLOV5 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113191296A true CN113191296A (en) | 2021-07-30 |
Family
ID=76981387
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110521035.XA Pending CN113191296A (en) | 2021-05-13 | 2021-05-13 | Method for detecting five parameters of target in any orientation based on YOLOV5 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113191296A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019192397A1 (en) * | 2018-04-04 | 2019-10-10 | 华中科技大学 | End-to-end recognition method for scene text in any shape |
CN111091105A (en) * | 2019-12-23 | 2020-05-01 | 郑州轻工业大学 | Remote sensing image target detection method based on new frame regression loss function |
CN112395975A (en) * | 2020-11-17 | 2021-02-23 | 南京泓图人工智能技术研究院有限公司 | Remote sensing image target detection method based on rotating area generation network |
Non-Patent Citations (4)
Title |
---|
WEN QIAN et al.: "Learning Modulated Loss for Rotated Object Detection", arXiv, 23 December 2019 (2019-12-23), pages 1-11 * |
XUE YANG et al.: "R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object", arXiv, 2 December 2019 (2019-12-02), pages 1-13 * |
LIU Siyuan et al.: "Remote sensing image target detection method based on deep convolutional neural networks", Industrial Control Computer, no. 05, 25 May 2020 (2020-05-25), pages 75-77 * |
CUI Liqun et al.: "A salient target detection method with improved background suppression", Computer Engineering & Science, pages 1435-1443 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112966587B (en) * | 2021-03-02 | 2022-12-20 | 北京百度网讯科技有限公司 | Training method of target detection model, target detection method and related equipment |
CN112966587A (en) * | 2021-03-02 | 2021-06-15 | 北京百度网讯科技有限公司 | Training method of target detection model, target detection method and related equipment |
CN113673616B (en) * | 2021-08-26 | 2023-09-29 | 南通大学 | Light-weight small target detection method coupling attention and context |
CN113673616A (en) * | 2021-08-26 | 2021-11-19 | 南通大学 | Attention and context coupled lightweight small target detection method |
CN113449702B (en) * | 2021-08-31 | 2021-12-03 | 天津联图科技有限公司 | Target detection method and device for remote sensing image, storage medium and electronic equipment |
CN113449702A (en) * | 2021-08-31 | 2021-09-28 | 天津联图科技有限公司 | Target detection method and device for remote sensing image, storage medium and electronic equipment |
CN113837058B (en) * | 2021-09-17 | 2022-09-30 | 南通大学 | Lightweight rainwater grate detection method coupled with context aggregation network |
CN113837058A (en) * | 2021-09-17 | 2021-12-24 | 南通大学 | Lightweight rainwater grate detection method coupled with context aggregation network |
CN114926552A (en) * | 2022-06-17 | 2022-08-19 | 中国人民解放军陆军炮兵防空兵学院 | Method and system for calculating Gaussian coordinates of pixel points based on unmanned aerial vehicle image |
CN116052110A (en) * | 2023-03-28 | 2023-05-02 | 四川公路桥梁建设集团有限公司 | Intelligent positioning method and system for pavement marking defects |
CN117094343A (en) * | 2023-10-19 | 2023-11-21 | 成都新西旺自动化科技有限公司 | QR code decoding system and method |
CN117094343B (en) * | 2023-10-19 | 2023-12-29 | 成都新西旺自动化科技有限公司 | QR code decoding system and method |
CN117994594A (en) * | 2024-04-03 | 2024-05-07 | 武汉纺织大学 | Power operation risk identification method based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113191296A (en) | Method for detecting five parameters of target in any orientation based on YOLOV5 | |
CN111862126B (en) | Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm | |
WO2023116631A1 (en) | Training method and training apparatus for rotating-ship target detection model, and storage medium | |
CN111985376A (en) | Remote sensing image ship contour extraction method based on deep learning | |
CN113177503A (en) | Arbitrary orientation target twelve parameter detection method based on YOLOV5 | |
CN111079739B (en) | Multi-scale attention feature detection method | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN111783523A (en) | Remote sensing image rotating target detection method | |
Chen et al. | Geospatial transformer is what you need for aircraft detection in SAR Imagery | |
Zeng et al. | Recognition and extraction of high-resolution satellite remote sensing image buildings based on deep learning | |
Lu et al. | A cnn-transformer hybrid model based on cswin transformer for uav image object detection | |
Zhu et al. | AOPDet: Automatic organized points detector for precisely localizing objects in aerial imagery | |
Fan et al. | A novel sonar target detection and classification algorithm | |
Chen et al. | Oriented object detection by searching corner points in remote sensing imagery | |
Chen et al. | Shape similarity intersection-over-union loss hybrid model for detection of synthetic aperture radar small ship objects in complex scenes | |
Zhang et al. | FANet: An arbitrary direction remote sensing object detection network based on feature fusion and angle classification | |
Chen et al. | Coupled global–local object detection for large vhr aerial images | |
Cao et al. | Detection method based on image enhancement and an improved faster R-CNN for failed satellite components | |
CN115830480A (en) | Small sample aerial image rotating target detection method | |
Hrustic et al. | Deep learning based traffic signs boundary estimation | |
CN115035429A (en) | Aerial photography target detection method based on composite backbone network and multiple measuring heads | |
Li et al. | SDCDet: Robust Remote Sensing Object Detection Based on Instance Segmentation Direction Correction | |
Zhou et al. | LEDet: localization estimation detector with data augmentation for ship detection based on unmanned surface vehicle | |
Liu | TS2Anet: Ship detection network based on transformer | |
Liu et al. | Arbitrary-oriented ship detection based on rotation region locating networks in large scale remote sensing images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210730 |