CN112418212B - YOLOv3 algorithm based on EIoU improvement


Info

Publication number
CN112418212B
Authority
CN
China
Prior art keywords
loss
iou
convolution
frame
eiou
Prior art date
Legal status
Active
Application number
CN202010892321.2A
Other languages
Chinese (zh)
Other versions
CN112418212A (en)
Inventor
王兰美
褚安亮
梁涛
廖桂生
王桂宝
孙长征
陈正涛
Current Assignee
Xidian University
Shaanxi University of Technology
Original Assignee
Xidian University
Shaanxi University of Technology
Priority date
Filing date
Publication date
Application filed by Xidian University and Shaanxi University of Technology
Priority to CN202010892321.2A
Publication of CN112418212A
Application granted
Publication of CN112418212B
Legal status: Active

Classifications

    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI] (image preprocessing for image or video recognition or understanding)
    • G06N3/045: Combinations of networks (neural network architectures)
    • G06N3/08: Learning methods (neural networks)
    • G06V2201/07: Target detection (indexing scheme relating to image or video recognition or understanding)
    • Y02T10/40: Engine management systems (climate change mitigation technologies related to transportation)

Abstract

The invention provides an EIoU-based improved YOLOv3 algorithm, which mainly addresses the inaccuracy of the IoU-based loss L_IoU with respect to overlap rate, scale, and aspect ratio in the existing algorithm, which degrades detection performance. First, a general-purpose data set of the current target detection field is downloaded; second, the existing YOLOv3 network model is reconstructed, trained with the prepared data set, and its performance is tested; then the EIoU-based loss function L_EIoU is embedded into the YOLOv3 algorithm model for training and performance evaluation; finally, the result is compared against the classical YOLOv3 algorithm and the test results are analyzed. Compared with the classical YOLOv3 algorithm, the EIoU-improved YOLOv3 algorithm improves the average precision, is better suited to the situation in which several objects overlap in the same region, introduces no additional computation, and does not affect real-time performance relative to the original model. The module can also be embedded into other classical algorithm models for comparison tests, and has good applicability and robustness.

Description

YOLOv3 algorithm based on EIoU improvement
Technical Field
The invention belongs to the field of image recognition, and particularly relates to a YOLOv3 target detection algorithm based on an improved loss function EIoU, which shows good detection performance on a general standard data set.
Background
Target detection comprises traditional target detection techniques and deep-learning-based target detection techniques. In recent years, with the development of technology and the popularization of intelligent applications, traditional target detection techniques have long been unable to meet demand; deep-learning-based target detection has emerged, developed rapidly, and become the mainstream approach in the current target detection field.
Target detection techniques based on deep learning can be broadly divided into two types: two-stage and one-stage methods. The two-stage methods mainly refer to algorithms based on candidate regions, such as R-CNN, Fast R-CNN and Faster R-CNN, whose detection steps are as follows: first, a number of candidate regions are generated on the picture, and then candidate boxes are classified and regressed over these regions by a convolutional neural network. These methods have high precision, but their detection speed is low and cannot meet real-time requirements. The one-stage methods use a convolutional neural network to directly predict the categories and positions of different targets; they are end-to-end methods and mainly include the SSD and YOLO series.
The most common indicator in target detection is the intersection-over-union (Intersection over Union, IoU), which reflects how well the predicted detection box matches the real detection box. However, when IoU = 0, IoU used as a loss function cannot reflect the distance between the two boxes, i.e., their overlap. Meanwhile, since the loss is then 0, there is no gradient feedback and learning cannot proceed, and IoU cannot accurately reflect the degree of coincidence of the two boxes; for this reason the generalized intersection-over-union (Generalized Intersection over Union, GIoU) was proposed. GIoU attends not only to the overlapping region but also to the non-overlapping regions, so it better reflects the overlap of the two boxes, but its training process still diverges easily, which led to the distance intersection-over-union (Distance Intersection over Union, DIoU). DIoU considers the distance between the target and the anchor box, the overlap rate and the scale, which makes the regression of the target box more stable, but it does not consider the aspect ratio of the anchor box; the complete intersection-over-union (Complete Intersection over Union, CIoU) was then proposed on the basis of DIoU. Building on this line of IoU research and comprehensively considering overlap rate, scale and aspect ratio, the edge-based intersection-over-union (Edge Intersection over Union, EIoU) is proposed here and embedded into the existing classical algorithm YOLOv3, where it performs very well. The module is better suited to the situation in which several objects overlap in the same region; in addition, it introduces no additional computation, does not affect real-time performance relative to the original model, can be embedded into other classical algorithm models, and has wide applicability.
Disclosure of Invention
The invention provides an EIoU-based improved YOLOv3 algorithm: by embedding the improved IoU loss function EIoU, the detection performance of the YOLOv3 algorithm is improved.
Step one: and downloading a COCO data set of the current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect and detect the performance of the method. Download address:http:// cocodataset.org/#home
COCO (Microsoft Common Objects in Context) is a data set for image recognition provided by a Microsoft team. The COCO data set provides 80 object categories. The annotation type of the pictures in the data set used in the invention is the object detection type, expressed as the category information p_i of the target of interest in the picture, the center position coordinates (x, y) of the target, and the width w and height h of the target, visualized by rectangular boxes.
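As a reading aid for this annotation format, the sketch below converts one COCO-style bounding box, stored in the official JSON as (x_min, y_min, width, height), into the center-based form (x, y, w, h) described above. The file name instances_train2017.json is only an assumed local path, and the snippet is illustrative rather than part of the claimed method:

    import json

    # Assumed local copy of the standard COCO annotation file (path is illustrative).
    with open("instances_train2017.json") as f:
        coco = json.load(f)

    categories = {c["id"]: c["name"] for c in coco["categories"]}  # the 80 object categories

    ann = coco["annotations"][0]                 # one example annotation
    x_min, y_min, w, h = ann["bbox"]             # COCO stores top-left corner + size
    x, y = x_min + w / 2.0, y_min + h / 2.0      # center coordinates (x, y) used in the text
    p_i = categories[ann["category_id"]]         # category information p_i
    print(p_i, x, y, w, h)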
Step two: reconstructing the YOLOv3 network system, training the YOLOv3 network based on the data set selected in the step one, outputting a weight file Q, detecting the performance of the data, and making comparison data.
The main network structure of the YOLOv3 algorithm consists of 52 convolutional layers, which are divided into three phases, i.e., three different scale outputs. The 1-26 layer convolution is stage 1, the 27-43 layer convolution is stage 2, and the 44-52 layer convolution is stage 3. The specific network structure and training procedure is as follows, where "x" represents the product:
firstly, randomly initializing weights by a network to enable the initialized values to be subjected to Gaussian normal distribution, then inputting a picture with 416 multiplied by 3 pixels, and obtaining 416 multiplied by 32 feature map output by a 1 st layer convolution layer, wherein the convolution kernel size is 3 multiplied by 3, the step size is 1 and the number is 32; and entering a layer 2 convolution layer, wherein the convolution kernel is 3×3 in size, 2 in step size and 64 in number, so that the 208×208×64 feature map output is obtained, and the like. According to different convolution kernels of each layer in the network diagram, respectively entering three different stages to sequentially obtain a 52×52×256 feature diagram, a 26×26×512 feature diagram and a 13×13×1024 feature diagram, and then entering feature interaction layers 1,2 and 3 to continuously carry out convolution operations as follows:
the feature interaction layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and the convolution operations of 3×3×256 and 1×1×255 are carried out, so that a feature map of 52×52×255 is obtained.
The feature interaction layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×256, 3×3×512, 1×1×256, 3×3×512 and 1×1×256, the step sizes are all 1, a feature map of 26×26×256 is obtained, and the convolution operations of 3×3×512 and 1×1×255 are carried out, so that a feature map of 26×26×255 is obtained.
The feature interaction layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are 1×1×512, 3×3×1024, 1×1×512, 3×3×1024 and 1×1×512 in sequence, the step sizes are 1, a feature map of 13×13×512 is obtained, and the convolution operations of 3×3×1024 and 1×1×255 are performed, so that a feature map of 13×13×255 is obtained.
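To make the layer listing above concrete, a minimal PyTorch-style sketch of feature interaction layer 1 follows. It reproduces only the kernel sizes, channel counts and strides stated above (1×1×128, 3×3×256, 1×1×128, 3×3×256, 1×1×128, then 3×3×256 and 1×1×255 on a 52×52×256 input); the batch-normalization/LeakyReLU pairing follows the usual Darknet convention and, like the module and variable names, is an assumption rather than a detail taken from the patent:

    import torch
    import torch.nn as nn

    def conv(in_ch, out_ch, k):
        # stride 1 with "same" padding so the 52x52 spatial size is preserved
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )

    class FeatureInteraction1(nn.Module):
        """Five-step convolution module followed by the 3x3x256 and 1x1x255 output convolutions."""
        def __init__(self, in_ch=256):
            super().__init__()
            self.block = nn.Sequential(
                conv(in_ch, 128, 1), conv(128, 256, 3), conv(256, 128, 1),
                conv(128, 256, 3), conv(256, 128, 1),      # -> 52 x 52 x 128
            )
            self.head = nn.Sequential(
                conv(128, 256, 3),
                nn.Conv2d(256, 255, 1),                    # -> 52 x 52 x 255 prediction map
            )

        def forward(self, x):
            return self.head(self.block(x))

    x = torch.randn(1, 256, 52, 52)            # stage-1 feature map (channels-first layout)
    print(FeatureInteraction1()(x).shape)      # torch.Size([1, 255, 52, 52])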
Taking the 52×52×255 feature map 1 as an example: the first dimension 52 represents the number of horizontal pixels in the picture, the second dimension 52 represents the number of vertical pixels in the picture, and the third dimension 255 represents the number of target features of interest, containing information for 3 scales. The information for each scale comprises 85 points, namely: the center position coordinates (x, y) of the target of interest, the width w and height h of the target, the category information p_i, and the confidence C, where the category information p_i comprises 80 categories; hence 3 × (1 + 1 + 1 + 1 + 80 + 1) = 255. The meaning of each dimension of feature map 2 and feature map 3 is the same as that of feature map 1.
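The 255 = 3 × (1 + 1 + 1 + 1 + 80 + 1) layout can be illustrated with the short sketch below, which splits one 52×52×255 output into its 3 per-scale predictions of 85 values each. The ordering of the 85 entries follows the description in the preceding paragraph (x, y, w, h, the 80 class values p_i, then the confidence C) and is otherwise an assumption about the implementation:

    import numpy as np

    feature_map_1 = np.random.randn(52, 52, 255)   # stand-in for the real network output

    preds = feature_map_1.reshape(52, 52, 3, 85)   # 3 scales x 85 information points per cell

    cell = preds[10, 20, 0]        # one of the 3 predictions at grid cell (10, 20)
    x, y, w, h = cell[0:4]         # center coordinates and box size
    p = cell[4:84]                 # category information p_i for the 80 categories
    C = cell[84]                   # confidence C
    print(x, y, w, h, p.shape, C)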
The predicted box information of the target of interest is obtained through the above network model, the predicted box is compared with the real box, and the loss errors are calculated, including the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P. The calculation formulas are as follows:
1. IoU loss L_IoU:
L_IoU represents the target position loss value:
L_IoU = 1 - IoU
where the calculation of IoU is illustrated in FIG. 5.
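FIG. 5 itself is not reproduced in this text; for reference, a minimal sketch of the standard IoU computation for two axis-aligned boxes in (x_1, y_1, x_2, y_2) corner form, and of the resulting loss L_IoU = 1 - IoU, is given below (an illustrative helper, not code from the patent):

    def iou(pred, true):
        """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(pred[0], true[0]), max(pred[1], true[1])
        ix2, iy2 = min(pred[2], true[2]), min(pred[3], true[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
        area_true = (true[2] - true[0]) * (true[3] - true[1])
        union = area_pred + area_true - inter
        return inter / union if union > 0 else 0.0

    def l_iou(pred, true):
        """IoU loss L_IoU = 1 - IoU."""
        return 1.0 - iou(pred, true)

    print(l_iou((0, 0, 4, 4), (2, 2, 6, 6)))   # 2x2 overlap, union 28 -> L_IoU = 1 - 4/28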
2. Confidence loss L_C:
The confidence loss uses a binary cross-entropy function:
L_C = obj_loss + noobj_loss
where N represents the total number of bounding boxes predicted by the network; I_i^obj indicates whether an object is present in the i-th predicted bounding box (I_i^obj = 1 if an object is present, I_i^obj = 0 otherwise); C_i denotes the confidence of the i-th bounding box in which the object is located, and Ĉ_i denotes the confidence of the i-th bounding box predicted by the network.
3. Category loss L_P:
where p_i denotes the probability of each class in the i-th bounding box in which the object is located, and p̂_i denotes the probability of each class in the i-th bounding box predicted by the network.
The final loss function L is:
L = L_IoU + L_C + L_P
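To show how the three terms could fit together, a compact NumPy sketch is given below. The binary cross-entropy forms of L_C and L_P follow the standard YOLOv3-style formulation suggested by the definitions above (N boxes, indicator I_i^obj, targets C_i and p_i, predictions Ĉ_i and p̂_i); the exact summation and weighting, and the restriction of L_IoU to object-containing boxes, are assumptions of this sketch, since the component formulas are not reproduced in this text:

    import numpy as np

    def bce(target, pred, eps=1e-7):
        """Element-wise binary cross entropy."""
        pred = np.clip(pred, eps, 1.0 - eps)
        return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

    def total_loss(iou_vals, obj_mask, conf_true, conf_pred, cls_true, cls_pred):
        """L = L_IoU + L_C + L_P over N predicted boxes.

        iou_vals : (N,)    IoU of each predicted box with its matched real box
        obj_mask : (N,)    I_i^obj, 1 if an object is present in box i, else 0
        conf_*   : (N,)    confidence targets C_i and predictions C_hat_i
        cls_*    : (N, 80) class targets p_i and predictions p_hat_i
        """
        l_iou = np.sum(obj_mask * (1.0 - iou_vals))                  # IoU (position) loss
        conf_bce = bce(conf_true, conf_pred)
        l_c = np.sum(obj_mask * conf_bce) + np.sum((1.0 - obj_mask) * conf_bce)  # obj_loss + noobj_loss
        l_p = np.sum(obj_mask[:, None] * bce(cls_true, cls_pred))    # category loss
        return l_iou + l_c + l_p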
according to the invention, an iteration threshold value epoch is set according to the precision requirement, when the iteration number is smaller than the epoch, the weight is updated by utilizing an Adam optimization algorithm until the loss value is lower than the set threshold value or the iteration number is larger than the epoch, the training process is ended, and a weight file Q is output 1 ,Q 1 The method comprises the steps of including weight coefficients and offset of parameters of each network layer in the training process, and then performing performance detection on training results.
Step three: loss L for current IoU-based IoU The defect that gradient return cannot be carried out under the condition that the pre-selected frame is completely wrapped by the target frame is overcome, and an improved version of loss function L based on EIoU representation is provided EIoU And embedding the performance of the test result into an algorithm model, and training and detecting the performance of the test result.
The formula is as follows:
L_EIoU = 1 - IoU + R
where R is the penalty factor, in which (x'_1, y'_1), (x'_1, y'_2), (x'_2, y'_1), (x'_2, y'_2) represent the four vertex coordinates of the predicted box; (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) represent the four vertex coordinates of the real box; l and w represent the length and width of the minimum closure region that can contain both the predicted box and the real box, with l² = (max(x_2, x'_2) - min(x_1, x'_1))² and w² = (max(y_2, y'_2) - min(y_1, y'_1))²; IoU is the intersection-over-union between the predicted box and the real box, and L_EIoU represents the loss value.
It can be seen from the above formula that L_EIoU continually pushes the predicted box toward the real box, and still moves the predicted box toward the real box even when the two boxes do not intersect or one contains the other. It also takes into account the case in which the real box completely encloses predicted boxes of the same area but different aspect ratios, so L_EIoU is not limited by the aspect ratio.
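The closed-form expression of the penalty factor R is not reproduced in this text (it is given by the formula referenced above and illustrated in FIG. 7 of the patent), so the sketch below computes only the quantities that the definition is stated over, namely the IoU of the two boxes and the squared dimensions l² and w² of their minimum closure region, and leaves R as a caller-supplied function. It reuses the iou helper from the earlier sketch, and all names are illustrative:

    def enclosing_dims_squared(pred, true):
        """l^2 and w^2 of the minimum closure region containing both boxes (corner format)."""
        l2 = (max(true[2], pred[2]) - min(true[0], pred[0])) ** 2   # (max(x2, x'2) - min(x1, x'1))^2
        w2 = (max(true[3], pred[3]) - min(true[1], pred[1])) ** 2   # (max(y2, y'2) - min(y1, y'1))^2
        return l2, w2

    def l_eiou(pred, true, penalty_r):
        """L_EIoU = 1 - IoU + R, with the penalty R supplied by the caller."""
        return 1.0 - iou(pred, true) + penalty_r(pred, true)

    # Worked example of FIG. 7: with IoU = 0.3 and R = 0.064, L_EIoU = 1 - 0.3 + 0.064 = 0.764.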
This loss function module is embedded into the YOLOv3 model in place of the IoU loss function, training is carried out again with the training process kept consistent with the one described above, a weight file Q_2 is output, and the training result is evaluated.
Step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
During testing, the detection accuracy at IoU = 0.5 is adopted as the performance metric of the algorithm: if the intersection-over-union between the rectangular box predicted by the algorithm for a picture and the real rectangular box of that picture is greater than 0.5, the algorithm is considered to have detected that picture successfully.
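Under this criterion, the per-image success check reduces to a single comparison against the stated 0.5 threshold, sketched below with the iou helper from the earlier sketch:

    IOU_THRESHOLD = 0.5

    def detection_successful(pred_box, true_box):
        """A picture counts as successfully detected if IoU(predicted, real) > 0.5."""
        return iou(pred_box, true_box) > IOU_THRESHOLD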
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the drawings needed in the embodiments or the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a partial sample graph in a training set;
FIG. 3 is a diagram of the structure of a YOLOv3 network model;
FIG. 4 is a schematic diagram of a network training process;
fig. 5 is a IoU calculation diagram;
FIG. 6 is a graph comparing loss values of IoU;
FIG. 7 is an EIoU loss value calculation graph;
FIG. 8 is a graph of partial detection results of the original YOLOv3 model;
FIG. 9 is a graph comparing the partial detection results of the original YOLOv3 and the improved YOLOv3 models;
Table 1 shows the overall performance of the original YOLOv3 and the improved YOLOv3 models on the validation data set;
Detailed Description
To make the above and other objects, features and advantages of the present invention more apparent, the following specific examples of the present invention are given together with the accompanying drawings, which are described in detail as follows:
referring to fig. 1, the implementation steps of the present invention are as follows:
step one: and downloading a COCO data set of the current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect and detect the performance of the method. Download address:http:// cocodataset.org/#home
COCO, collectively Microsoft Common Objects in Context, is a data set available for image recognition by Microsoft team. The COCO dataset provided 80 object categories. The labeling type of the picture in the data set used by the invention is an object detection type which is expressed as category information p labeled with the object of interest in the picture i And the central position coordinates (x, y) of the target, the width w and the height h of the target are visualized by rectangular frames.
FIG. 2 shows partial samples from the training set of the COCO data set, illustrating the universality of the target detection objects: different images are trained under different angles and in different scenes.
Step two: reconstructing the YOLOv3 network system, training the YOLOv3 network based on the data set selected in the step one, outputting a weight file Q, detecting the performance of the data, and making comparison data.
Referring to fig. 3 and 4: the main network structure of the Yolov3 algorithm consists of 52 convolution layers and is divided into three stages, namely three different-scale outputs. The 1-26 layer convolution is stage 1, the 27-43 layer convolution is stage 2, and the 44-52 layer convolution is stage 3.
The specific network structure and training procedure is as follows, where "x" represents the product:
firstly, randomly initializing weights by a network to enable the initialized values to be subjected to Gaussian normal distribution, then inputting a picture with 416 multiplied by 3 pixels, and obtaining 416 multiplied by 32 feature map output by a 1 st layer convolution layer, wherein the convolution kernel size is 3 multiplied by 3, the step size is 1 and the number is 32; and entering a layer 2 convolution layer, wherein the convolution kernel is 3×3 in size, 2 in step size and 64 in number, so that the 208×208×64 feature map output is obtained, and the like. According to different convolution kernels of each layer in the network diagram, respectively entering three different stages to sequentially obtain a 52×52×256 feature diagram, a 26×26×512 feature diagram and a 13×13×1024 feature diagram, and then entering feature interaction layers 1,2 and 3 to continuously carry out convolution operations as follows:
the feature interaction layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and the convolution operations of 3×3×256 and 1×1×255 are carried out, so that a feature map of 52×52×255 is obtained.
The feature interaction layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×256, 3×3×512, 1×1×256, 3×3×512 and 1×1×256, the step sizes are all 1, a feature map of 26×26×256 is obtained, and the convolution operations of 3×3×512 and 1×1×255 are carried out, so that a feature map of 26×26×255 is obtained.
The feature interaction layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are 1×1×512, 3×3×1024, 1×1×512, 3×3×1024 and 1×1×512 in sequence, the step sizes are 1, a feature map of 13×13×512 is obtained, and the convolution operations of 3×3×1024 and 1×1×255 are performed, so that a feature map of 13×13×255 is obtained.
Taking the 52×52×255 feature map 1 as an example: the first dimension 52 represents the number of horizontal pixels in the picture, the second dimension 52 represents the number of vertical pixels in the picture, and the third dimension 255 represents the number of target features of interest, containing information for 3 scales. The information for each scale comprises 85 points, namely: the center position coordinates (x, y) of the target of interest, the width w and height h of the target, the category information p_i, and the confidence C, where the category information p_i comprises 80 categories; hence 3 × (1 + 1 + 1 + 1 + 80 + 1) = 255. The meaning of each dimension of feature map 2 and feature map 3 is the same as that of feature map 1.
The predicted box information of the target of interest is obtained through the above network model, the predicted box is compared with the real box, and the loss errors are calculated, including the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P. The calculation formulas are as follows:
1. IoU loss L_IoU:
The IoU loss L_IoU represents the target position loss value:
L_IoU = 1 - IoU
where the calculation of IoU is illustrated in FIG. 5.
2. Confidence loss L_C:
The confidence loss uses a binary cross-entropy function:
L_C = obj_loss + noobj_loss
where N represents the total number of bounding boxes predicted by the network; I_i^obj indicates whether an object is present in the i-th predicted bounding box (I_i^obj = 1 if an object is present, I_i^obj = 0 otherwise); C_i denotes the confidence of the i-th bounding box in which the object is located, and Ĉ_i denotes the confidence of the i-th bounding box predicted by the network.
3. Category loss L_P:
where p_i denotes the probability of each class in the i-th bounding box in which the object is located, and p̂_i denotes the probability of each class in the i-th bounding box predicted by the network.
The final loss function L is:
L = L_IoU + L_C + L_P
according to the invention, an iteration threshold value epoch=100 is set according to the precision requirement, when the iteration number is smaller than epoch, the weight is updated by utilizing an Adam optimization algorithm until the loss value is lower than the set threshold value or the iteration number is larger than epoch, the training process is ended, and a weight file Q is output 1 ,Q 1 The method comprises the steps of including weight coefficients and offset of parameters of each network layer in the training process, and then performing performance detection on training results.
In summary, the specific training process can be summarized in a simplified manner as follows:
(1) The network randomly initializes the weight value, and the initialized value is subjected to Gaussian normal distribution.
(2) And outputting three feature images with different scales by the input picture data through the network model in the step two, and obtaining the prediction frame information by utilizing the feature images.
(3) Comparing the predicted box with the real box; the loss errors calculated at this stage mainly comprise the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P.
(4) When the iteration count is smaller than epoch = 100, the Adam optimization algorithm is used to update the weights; once the loss value falls below the set threshold or the iteration count exceeds epoch, the training process ends, a weight file is output, and performance testing is then carried out on the training result. The main test index of the method is mAP (mean Average Precision), the mean of the average precision: first the average precision AP (Average Precision) is computed within each category, and then the AP values of all categories are averaged to obtain the mAP, as in the sketch below.
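The sketch below is a deliberately simplified illustration of this metric, not the full COCO evaluation protocol: for one category, detections are assumed to be sorted by confidence and flagged as true positives when their IoU with a ground-truth box exceeds 0.5, AP is the area under the resulting precision-recall curve, and mAP is the mean of the per-category AP values. Taking the number of ground-truth boxes equal to the number of true positives is an additional assumption of this sketch:

    import numpy as np

    def average_precision(tp_flags):
        """AP of one category from detections sorted by confidence (True = TP at IoU > 0.5)."""
        flags = np.asarray(tp_flags, dtype=bool)
        tp = np.cumsum(flags)
        fp = np.cumsum(~flags)
        n_gt = max(int(flags.sum()), 1)          # simplification: every ground truth assumed matched
        recall = tp / n_gt
        precision = tp / np.maximum(tp + fp, 1)
        ap, prev_r = 0.0, 0.0
        for p, r in zip(precision, recall):      # area under the precision-recall curve
            ap += p * (r - prev_r)
            prev_r = r
        return ap

    def mean_average_precision(per_class_tp_flags):
        """mAP: mean of the per-category AP values."""
        return float(np.mean([average_precision(f) for f in per_class_tp_flags]))

    print(mean_average_precision([[True, True, False], [True, False, True]]))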
Step three: loss L for current IoU-based IoU The defect that gradient return cannot be carried out under the condition that the pre-selected frame is completely wrapped by the target frame is overcome, and an improved version of loss function L based on EIoU representation is provided EIoU And embedding the performance of the test result into an algorithm model, and training and detecting the performance of the test result.
Referring to FIG. 7, the loss value of the EIoU-based loss function L_EIoU is calculated by the following formula:
L_EIoU = 1 - IoU + R
where R is the penalty factor, in which (x'_1, y'_1), (x'_1, y'_2), (x'_2, y'_1), (x'_2, y'_2) represent the four vertex coordinates of the predicted box; (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) represent the four vertex coordinates of the real box; l and w represent the length and width of the minimum closure region that can contain both the predicted box and the real box, with l² = (max(x_2, x'_2) - min(x_1, x'_1))² and w² = (max(y_2, y'_2) - min(y_1, y'_1))²; IoU is the intersection-over-union between the predicted box and the real box, and L_EIoU represents the loss value.
It can be seen from the above formula that the EIoU-based loss function L_EIoU continually pushes the predicted box toward the real box, and still moves the predicted box toward the real box even when the two boxes do not intersect or one contains the other. It also takes into account the case in which the real box completely encloses predicted boxes of the same area but different aspect ratios, so L_EIoU is not limited by the aspect ratio; the value range of L_EIoU is consistent with that of the GIoU-based loss, namely [0, 2].
This loss function module is embedded into the YOLOv3 model in place of the IoU-based loss function L_IoU, training is carried out again with the training process kept consistent with the one described above, a weight file is output, and the training result is evaluated.
Step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
During testing, the detection accuracy at IoU = 0.5 is adopted as the performance metric of the algorithm: if the intersection-over-union between the rectangular box predicted by the algorithm for a picture and the real rectangular box of that picture is greater than 0.5, the algorithm is considered to have detected that picture successfully.
The invention is further described below in connection with simulation examples.
Simulation example:
the invention adopts the original YOLOv3 model as a comparison model, adopts the COCO data set as a training set and a testing set, and provides a partial detection effect graph.
Fig. 2 is a partial sample diagram in the training set, randomly selecting partial test data in the COCO dataset, and displaying the result, selecting pictures of different backgrounds, different categories, different target sizes, and different target densities, so as to show universality of the test result.
FIG. 4 is a schematic diagram of the network training flow. In the overall loss calculation part, the method of the invention uses the EIoU-based loss function L_EIoU in place of the IoU-based loss function L_IoU, while all other parts are kept the same, so that a controlled-variable comparison can be made to test the accuracy of the method.
FIG. 6 compares the calculation of the EIoU-based loss function L_EIoU of the method of the invention with the current calculation method, where the red box represents the predicted box and the black box represents the real box. It can be seen that when the predicted box is completely enclosed by the real box, for predicted boxes that occupy the same proportion of the real box but have different aspect ratios, the proposed L_EIoU distinguishes them well, whereas the existing calculation method cannot.
FIG. 7 shows the calculation of the EIoU-based loss function L_EIoU on a concrete example:
l² = (max(x_2, x'_2) - min(x_1, x'_1))² = 8² = 64
w² = (max(y_2, y'_2) - min(y_1, y'_1))² = 6² = 36
L_EIoU = 1 - IoU + R = 1 - 0.3 + 0.064 = 0.764
fig. 8 is a partial detection result diagram of the original YOLOv3 model, and detection diagrams with different backgrounds, different categories and different target sizes are selected to show universality of the original detection model, so that it can be seen that the basic category detection effect of the object in the picture is good.
Fig. 9 compares the partial detection results of the original YOLOv3 and the improved YOLOv3 models; the left column shows the detection results of the original YOLOv3 model and the right column shows those of the improved YOLOv3 model. It can be seen that in the detection results of the original YOLOv3 model, cases in which two or more objects overlap are not handled well: of the three elephants, two horses and two zebras in the figures, the original model detects only one of each. After the improvement proposed by the invention, good detection results are obtained for the target objects in these pictures, and all three elephants, both horses and both zebras are detected. In summary, the improved YOLOv3 model performs better than the original YOLOv3 model on these detection examples.
The overall performance of the original YOLOv3 and the improved YOLOv3 model on the validation data set is shown in the attached table 1, and it can be seen that the average accuracy mAP of the improved YOLOv3 model on the validation set is higher than that of the original YOLOv3 model.
The simulation experiment shows that the improved YOLOv3 model embedded with the EIoU module has quite excellent performance, is more suitable for the situation when a plurality of objects are overlapped in the same area, does not introduce more calculated amount, and has no influence on real-time performance compared with the original model. The module can be embedded into other classical algorithm models for comparison test, and has more applicability.

Claims (4)

1. An EIoU-based improved YOLOv3 method, comprising the steps of:
step one: downloading a COCO data set of a current target detection field general data set, ensuring that the COCO data set is consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method;
step two: reconstructing a YOLOv3 network system, training the YOLOv3 network based on the data set selected in the step one, outputting a weight file Q, detecting the performance of the weight file Q, and making comparison data;
step three: loss L for current IoU-based IoU The defect that gradient return cannot be carried out under the condition that the pre-selected frame is completely wrapped by the target frame is overcome, and an improved version of loss function L based on EIoU representation is provided EIoU Embedding the performance of the test object into a method model, and training and detecting the performance of the test object;
the loss value of the EIoU-based loss function L_EIoU is calculated by the following formula:
L_EIoU = 1 - IoU + R
where R is the penalty factor, in which (x'_1, y'_1), (x'_1, y'_2), (x'_2, y'_1), (x'_2, y'_2) represent the four vertex coordinates of the predicted box, (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) represent the four vertex coordinates of the real box, l and w represent the length and width of the minimum closure region that can contain both the predicted box and the real box, l² = (max(x_2, x'_2) - min(x_1, x'_1))², w² = (max(y_2, y'_2) - min(y_1, y'_1))², "×" represents the product, IoU is the intersection-over-union between the predicted box and the real box, and L_EIoU represents the loss value;
it can be seen from the above formula that L_EIoU continually pushes the predicted box toward the real box, and still moves the predicted box toward the real box even when the two boxes do not intersect or one contains the other; it also takes into account the case in which the real box completely encloses predicted boxes of the same area but different aspect ratios, so L_EIoU is not limited by the aspect ratio;
this loss function module is embedded into the YOLOv3 model in place of the IoU-based loss function L_IoU, training is carried out again with the training process kept consistent with the one described above, a weight file is output, and the training result is evaluated;
step four: the test results were analyzed against the classical YOLOv3 method.
2. The EIoU-based improved YOLOv3 method of claim 1, step one: downloading the COCO data set, the general-purpose data set of the current target detection field; COCO (Microsoft Common Objects in Context) is a data set for image recognition provided by a Microsoft team, and the COCO data set provides 80 object categories; the annotation type of the pictures in the data set used in the invention is the object detection type, expressed as the category information p_i of the target of interest in the picture, the center position coordinates (x, y) of the target, and the width w and height h of the target, visualized by a rectangular box; the data set is selected to be consistent with the general data set of the field so as to enable comparison, and the performance of the method is tested.
3. The EIoU-based improved YOLOv3 method according to claim 1, step two: reconstructing the YOLOv3 network, training the YOLOv3 network on the data set selected in step one, outputting a weight file Q, testing its performance, and producing comparison data, where the specific network model and training process are as follows:
the main network structure of the YOLOv3 method consists of 52 convolution layers, and is divided into three stages, namely three different-scale outputs; the 1-26-layer convolution is stage 1, the 27-43-layer convolution is stage 2, the 44-52-layer convolution is stage 3, the output of stage 1, namely the output receptive field of the 26 th convolution layer is small and is responsible for detecting small targets, the output of stage 2, namely the output receptive field of the 43 rd convolution layer is centered and is responsible for detecting medium-sized targets, the output of stage 3, namely the output receptive field of the 52 th convolution layer is large and is easy to detect large targets;
firstly, randomly initializing weights by a network to enable the initialized values to be subjected to Gaussian normal distribution, then inputting a picture with 416 multiplied by 3 pixels, and obtaining 416 multiplied by 32 feature map output by a 1 st layer convolution layer, wherein the convolution kernel size is 3 multiplied by 3, the step size is 1 and the number is 32; entering a layer 2 convolution layer, wherein the convolution kernel size is 3 multiplied by 3, the step length is 2, the number is 64, and the output of the feature map of 208 multiplied by 64 is obtained, so that the method is similar to the method; according to different convolution kernels of each layer in the network diagram, respectively entering three different stages to sequentially obtain a 52×52×256 feature diagram, a 26×26×512 feature diagram and a 13×13×1024 feature diagram, and then entering feature interaction layers 1,2 and 3 to continuously carry out convolution operations as follows:
the feature interaction layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and the convolution operations of 3×3×256 and 1×1×255 are carried out, so that a feature map of 52×52×255 is obtained;
the feature interaction layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are 1×1×256, 3×3×512, 1×1×256, 3×3×512 and 1×1×256 in sequence, the step sizes are 1, a feature map of 26×26×256 is obtained, and the convolution operations of 3×3×512 and 1×1×255 are carried out, so that a feature map of 26×26×255 is obtained;
the feature interaction layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×512, 3×3×1024, 1×1×512, 3×3×1024 and 1×1×512, the step sizes are all 1, a feature map of 13×13×512 is obtained, and the convolution operations of 3×3×1024 and 1×1×255 are carried out, so that a feature map of 13×13×255 is obtained;
taking the 52×52×255 feature map 1 as an example: the first dimension 52 represents the number of horizontal pixels in the picture, the second dimension 52 represents the number of vertical pixels in the picture, and the third dimension 255 represents the number of target features of interest, containing information for 3 scales; the information for each scale comprises 85 points, namely: the center position coordinates (x, y) of the target of interest, the width w and height h of the target, the category information p_i, and the confidence C, where the category information p_i comprises 80 categories, so 3 × (1 + 1 + 1 + 1 + 80 + 1) = 255; the meaning of each dimension of feature map 2 and feature map 3 is the same as that of feature map 1;
the predicted box information of the target of interest is obtained through the above network model, the predicted box is compared with the real box, and the loss errors are calculated, including the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P, with the calculation formulas as follows:
a. IoU loss L_IoU
L_IoU represents the target position loss value:
L_IoU = 1 - IoU;
b. confidence loss
the confidence loss uses a binary cross-entropy function:
L_C = obj_loss + noobj_loss
where N represents the total number of bounding boxes predicted by the network; I_i^obj indicates whether an object is present in the i-th predicted bounding box (I_i^obj = 1 if an object is present, I_i^obj = 0 otherwise); C_i represents the confidence of the i-th bounding box in which the object is located, and Ĉ_i represents the confidence of the i-th bounding box predicted by the network;
c. category loss
where p_i represents the probability of each class in the i-th bounding box in which the object is located, and p̂_i represents the probability of each class in the i-th bounding box predicted by the network;
the final loss function L is:
L = L_IoU + L_C + L_P
according to the invention, an iteration threshold is set to 100 according to the precision requirement; while the iteration count is smaller than 100, the weights are updated with the Adam optimization method; once the loss value falls below the set threshold or the iteration count exceeds 100, the training process ends and a weight file Q_1 is output; Q_1 contains the weight coefficients and biases of the parameters of every network layer obtained in the training process, after which performance testing is carried out on the training result;
in summary, the specific training process can be summarized in a simplified manner as follows:
(1) Randomly initializing a weight by a network to enable the initialized weight to be subjected to Gaussian normal distribution;
(2) The input picture data outputs three feature images with different scales through the network model in the second step of the invention, and the feature images are utilized to obtain prediction frame information;
(3) comparing the predicted box with the real box, where the loss errors calculated at this stage mainly comprise the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P;
(4) when the iteration count is smaller than 100, the weights are updated with the Adam optimization method; once the loss value falls below the set threshold or the iteration count exceeds 100, the training process ends, a weight file is output, and performance testing is then carried out on the training result; the main test index of the method of the invention is mAP (mean Average Precision), the mean of the average precision: first the average precision AP (Average Precision) is computed within each category, and then the AP values of all categories are averaged to obtain the mAP (mean Average Precision).
4. The EIoU-based improved YOLOv3 method according to claim 1, step four: comparing with the classical YOLOv3 method and analyzing the test results;
during testing, the detection accuracy at IoU = 0.5 is adopted as the performance metric of the method: if the intersection-over-union between the rectangular box predicted by the method for a picture and the real rectangular box of that picture is greater than 0.5, the method is considered to have detected that picture successfully.
CN202010892321.2A 2020-08-28 2020-08-28 YOLOv3 algorithm based on EIoU improvement Active CN112418212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010892321.2A CN112418212B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on EIoU improvement

Publications (2)

Publication Number Publication Date
CN112418212A CN112418212A (en) 2021-02-26
CN112418212B true CN112418212B (en) 2024-02-09

Family

ID=74855048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010892321.2A Active CN112418212B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on EIoU improvement

Country Status (1)

Country Link
CN (1) CN112418212B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378739A (en) * 2021-06-19 2021-09-10 湖南省气象台 Foundation cloud target detection method based on deep learning
CN114397877A (en) * 2021-06-25 2022-04-26 南京交通职业技术学院 Intelligent automobile automatic driving system
CN113807466B (en) * 2021-10-09 2023-12-22 中山大学 Logistics package autonomous detection method based on deep learning
CN113903009B (en) * 2021-12-10 2022-07-05 华东交通大学 Railway foreign matter detection method and system based on improved YOLOv3 network
CN114283275B (en) * 2022-03-04 2022-08-16 南昌工学院 Multi-graph target detection method based on optimized deep learning
CN115115887B (en) * 2022-07-07 2023-09-12 中国科学院合肥物质科学研究院 Crop pest detection method based on TSD-Faster RCNN and network thereof
CN116994151B (en) * 2023-06-02 2024-06-04 广州大学 Marine ship target identification method based on SAR image and YOLOv s network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11094070B2 (en) * 2019-04-23 2021-08-17 Jiangnan University Visual multi-object tracking based on multi-Bernoulli filter with YOLOv3 detection

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062413A (en) * 2019-11-08 2020-04-24 深兰科技(上海)有限公司 Road target detection method and device, electronic equipment and storage medium
CN111046787A (en) * 2019-12-10 2020-04-21 华侨大学 Pedestrian detection method based on improved YOLO v3 model
CN111310861A (en) * 2020-03-27 2020-06-19 西安电子科技大学 License plate recognition and positioning method based on deep neural network
CN111310773A (en) * 2020-03-27 2020-06-19 西安电子科技大学 Efficient license plate positioning method of convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression; Zhaohui Zheng et al.; arXiv; 2019-11-19; pp. 1-8 *
YOLOv3 object detection algorithm fusing GIoU and Focal loss; Zou Chengming et al.; Computer Engineering and Applications; 2020-06-28; pp. 214-222 *

Also Published As

Publication number Publication date
CN112418212A (en) 2021-02-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant