CN112418212B - YOLOv3 algorithm based on EIoU improvement


Info

Publication number
CN112418212B
Authority
CN
China
Prior art keywords
loss
iou
convolution
frame
eiou
Prior art date
Legal status
Active
Application number
CN202010892321.2A
Other languages
Chinese (zh)
Other versions
CN112418212A (en)
Inventor
王兰美
褚安亮
梁涛
廖桂生
王桂宝
孙长征
陈正涛
Current Assignee
Xidian University
Shaanxi University of Technology
Original Assignee
Xidian University
Shaanxi University of Technology
Priority date
Filing date
Publication date
Application filed by Xidian University and Shaanxi University of Technology
Priority to CN202010892321.2A
Publication of CN112418212A
Application granted
Publication of CN112418212B
Legal status: Active

Classifications

    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI] (image preprocessing for image or video recognition or understanding)
    • G06N3/045: Combinations of networks (neural network architectures)
    • G06N3/08: Learning methods (neural networks)
    • G06V2201/07: Target detection (indexing scheme relating to image or video recognition or understanding)
    • Y02T10/40: Engine management systems (climate change mitigation technologies related to transportation)

Abstract

The invention provides an EIoU-based improved YOLOv3 algorithm, which mainly addresses the inaccuracy of the IoU-based loss L_IoU with respect to overlap rate, scale, and aspect ratio in the existing algorithm, which degrades detection performance. First, a general-purpose data set of the current target detection field is downloaded; second, the existing YOLOv3 network model is reconstructed, trained with the prepared data set, and its performance is tested; then the EIoU-based loss function L_EIoU is embedded into the YOLOv3 algorithm model for training and performance evaluation; finally, the result is compared against the classical YOLOv3 algorithm and the test results are analyzed. Compared with the classical YOLOv3 algorithm, the EIoU-improved YOLOv3 algorithm improves the average precision, is better suited to the situation in which several objects overlap in the same region, introduces no additional computation, and does not affect real-time performance relative to the original model. The module can also be embedded into other classical algorithm models for comparison tests, and has good applicability and robustness.

Description

YOLOv3 algorithm based on EIoU improvement
Technical Field
The invention belongs to the field of image recognition, and particularly relates to a YOLOv3 target detection algorithm based on an improved loss function EIoU, which shows good detection performance on a general standard data set.
Background
Target detection comprises traditional target detection techniques and deep-learning-based target detection techniques. In recent years, with the development of technology and the popularization of intelligent applications, traditional target detection techniques have long been unable to meet demand; deep-learning-based target detection has emerged, developed rapidly, and become the mainstream approach in the current target detection field.
Target detection techniques based on deep learning can be broadly divided into two types: two-stage and one-stage methods. The two-stage methods mainly refer to algorithms based on candidate regions, such as R-CNN, Fast R-CNN and Faster R-CNN, whose detection steps are as follows: first, a number of candidate regions are generated on the picture, and then candidate boxes are classified and regressed over these regions by a convolutional neural network. These methods have high precision, but their detection speed is low and cannot meet real-time requirements. The one-stage methods use a convolutional neural network to directly predict the categories and positions of different targets; they are end-to-end methods and mainly include the SSD and YOLO series.
The most common indicator in target detection is the intersection-over-union (Intersection over Union, IoU), which reflects how well the predicted detection box matches the real detection box. However, when IoU = 0, IoU used as a loss function cannot reflect the distance between the two boxes, i.e., their overlap. Meanwhile, since the loss is then 0, there is no gradient feedback and learning cannot proceed, and IoU cannot accurately reflect the degree of coincidence of the two boxes; for this reason the generalized intersection-over-union (Generalized Intersection over Union, GIoU) was proposed. GIoU attends not only to the overlapping region but also to the non-overlapping regions, so it better reflects the overlap of the two boxes, but its training process still diverges easily, which led to the distance intersection-over-union (Distance Intersection over Union, DIoU). DIoU considers the distance between the target and the anchor box, the overlap rate and the scale, which makes the regression of the target box more stable, but it does not consider the aspect ratio of the anchor box; the complete intersection-over-union (Complete Intersection over Union, CIoU) was then proposed on the basis of DIoU. Building on this line of IoU research and comprehensively considering overlap rate, scale and aspect ratio, the edge-based intersection-over-union (Edge Intersection over Union, EIoU) is proposed here and embedded into the existing classical algorithm YOLOv3, where it performs very well. The module is better suited to the situation in which several objects overlap in the same region; in addition, it introduces no additional computation, does not affect real-time performance relative to the original model, can be embedded into other classical algorithm models, and has wide applicability.
Disclosure of Invention
The invention provides an EIoU-based improved YOLOv3 algorithm: by embedding the improved IoU loss function EIoU, the detection performance of the YOLOv3 algorithm is improved.
Step one: and downloading a COCO data set of the current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect and detect the performance of the method. Download address:http:// cocodataset.org/#home
COCO (Microsoft Common Objects in Context) is a data set for image recognition provided by a Microsoft team. The COCO data set provides 80 object categories. The annotation type of the pictures in the data set used in the invention is the object detection type, expressed as the category information p_i of the target of interest in the picture, the center position coordinates (x, y) of the target, and the width w and height h of the target, visualized by rectangular boxes.
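As a reading aid for this annotation format, the sketch below converts one COCO-style bounding box, stored in the official JSON as (x_min, y_min, width, height), into the center-based form (x, y, w, h) described above. The file name instances_train2017.json is only an assumed local path, and the snippet is illustrative rather than part of the claimed method:

    import json

    # Assumed local copy of the standard COCO annotation file (path is illustrative).
    with open("instances_train2017.json") as f:
        coco = json.load(f)

    categories = {c["id"]: c["name"] for c in coco["categories"]}  # the 80 object categories

    ann = coco["annotations"][0]                 # one example annotation
    x_min, y_min, w, h = ann["bbox"]             # COCO stores top-left corner + size
    x, y = x_min + w / 2.0, y_min + h / 2.0      # center coordinates (x, y) used in the text
    p_i = categories[ann["category_id"]]         # category information p_i
    print(p_i, x, y, w, h)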
Step two: reconstructing the YOLOv3 network system, training the YOLOv3 network based on the data set selected in the step one, outputting a weight file Q, detecting the performance of the data, and making comparison data.
The main network structure of the YOLOv3 algorithm consists of 52 convolutional layers, which are divided into three phases, i.e., three different scale outputs. The 1-26 layer convolution is stage 1, the 27-43 layer convolution is stage 2, and the 44-52 layer convolution is stage 3. The specific network structure and training procedure is as follows, where "x" represents the product:
firstly, randomly initializing weights by a network to enable the initialized values to be subjected to Gaussian normal distribution, then inputting a picture with 416 multiplied by 3 pixels, and obtaining 416 multiplied by 32 feature map output by a 1 st layer convolution layer, wherein the convolution kernel size is 3 multiplied by 3, the step size is 1 and the number is 32; and entering a layer 2 convolution layer, wherein the convolution kernel is 3×3 in size, 2 in step size and 64 in number, so that the 208×208×64 feature map output is obtained, and the like. According to different convolution kernels of each layer in the network diagram, respectively entering three different stages to sequentially obtain a 52×52×256 feature diagram, a 26×26×512 feature diagram and a 13×13×1024 feature diagram, and then entering feature interaction layers 1,2 and 3 to continuously carry out convolution operations as follows:
the feature interaction layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and the convolution operations of 3×3×256 and 1×1×255 are carried out, so that a feature map of 52×52×255 is obtained.
The feature interaction layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×256, 3×3×512, 1×1×256, 3×3×512 and 1×1×256, the step sizes are all 1, a feature map of 26×26×256 is obtained, and the convolution operations of 3×3×512 and 1×1×255 are carried out, so that a feature map of 26×26×255 is obtained.
The feature interaction layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are 1×1×512, 3×3×1024, 1×1×512, 3×3×1024 and 1×1×512 in sequence, the step sizes are 1, a feature map of 13×13×512 is obtained, and the convolution operations of 3×3×1024 and 1×1×255 are performed, so that a feature map of 13×13×255 is obtained.
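To make the layer listing above concrete, a minimal PyTorch-style sketch of feature interaction layer 1 follows. It reproduces only the kernel sizes, channel counts and strides stated above (1×1×128, 3×3×256, 1×1×128, 3×3×256, 1×1×128, then 3×3×256 and 1×1×255 on a 52×52×256 input); the batch-normalization/LeakyReLU pairing follows the usual Darknet convention and, like the module and variable names, is an assumption rather than a detail taken from the patent:

    import torch
    import torch.nn as nn

    def conv(in_ch, out_ch, k):
        # stride 1 with "same" padding so the 52x52 spatial size is preserved
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )

    class FeatureInteraction1(nn.Module):
        """Five-step convolution module followed by the 3x3x256 and 1x1x255 output convolutions."""
        def __init__(self, in_ch=256):
            super().__init__()
            self.block = nn.Sequential(
                conv(in_ch, 128, 1), conv(128, 256, 3), conv(256, 128, 1),
                conv(128, 256, 3), conv(256, 128, 1),      # -> 52 x 52 x 128
            )
            self.head = nn.Sequential(
                conv(128, 256, 3),
                nn.Conv2d(256, 255, 1),                    # -> 52 x 52 x 255 prediction map
            )

        def forward(self, x):
            return self.head(self.block(x))

    x = torch.randn(1, 256, 52, 52)            # stage-1 feature map (channels-first layout)
    print(FeatureInteraction1()(x).shape)      # torch.Size([1, 255, 52, 52])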
Taking the 52×52×255 feature map 1 as an example: the first dimension 52 represents the number of horizontal pixels in the picture, the second dimension 52 represents the number of vertical pixels in the picture, and the third dimension 255 represents the number of target features of interest, containing information for 3 scales. The information for each scale comprises 85 points, namely: the center position coordinates (x, y) of the target of interest, the width w and height h of the target, the category information p_i, and the confidence C, where the category information p_i comprises 80 categories; hence 3 × (1 + 1 + 1 + 1 + 80 + 1) = 255. The meaning of each dimension of feature map 2 and feature map 3 is the same as that of feature map 1.
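The 255 = 3 × (1 + 1 + 1 + 1 + 80 + 1) layout can be illustrated with the short sketch below, which splits one 52×52×255 output into its 3 per-scale predictions of 85 values each. The ordering of the 85 entries follows the description in the preceding paragraph (x, y, w, h, the 80 class values p_i, then the confidence C) and is otherwise an assumption about the implementation:

    import numpy as np

    feature_map_1 = np.random.randn(52, 52, 255)   # stand-in for the real network output

    preds = feature_map_1.reshape(52, 52, 3, 85)   # 3 scales x 85 information points per cell

    cell = preds[10, 20, 0]        # one of the 3 predictions at grid cell (10, 20)
    x, y, w, h = cell[0:4]         # center coordinates and box size
    p = cell[4:84]                 # category information p_i for the 80 categories
    C = cell[84]                   # confidence C
    print(x, y, w, h, p.shape, C)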
The predicted box information of the target of interest is obtained through the above network model, the predicted box is compared with the real box, and the loss errors are calculated, including the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P. The calculation formulas are as follows:
1. IoU loss L_IoU:
L_IoU represents the target position loss value:
L_IoU = 1 - IoU
where the calculation of IoU is illustrated in FIG. 5.
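FIG. 5 itself is not reproduced in this text; for reference, a minimal sketch of the standard IoU computation for two axis-aligned boxes in (x_1, y_1, x_2, y_2) corner form, and of the resulting loss L_IoU = 1 - IoU, is given below (an illustrative helper, not code from the patent):

    def iou(pred, true):
        """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(pred[0], true[0]), max(pred[1], true[1])
        ix2, iy2 = min(pred[2], true[2]), min(pred[3], true[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
        area_true = (true[2] - true[0]) * (true[3] - true[1])
        union = area_pred + area_true - inter
        return inter / union if union > 0 else 0.0

    def l_iou(pred, true):
        """IoU loss L_IoU = 1 - IoU."""
        return 1.0 - iou(pred, true)

    print(l_iou((0, 0, 4, 4), (2, 2, 6, 6)))   # 2x2 overlap, union 28 -> L_IoU = 1 - 4/28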
2. Confidence loss L_C:
The confidence loss uses a binary cross-entropy function:
L_C = obj_loss + noobj_loss
where N represents the total number of bounding boxes predicted by the network; I_i^obj indicates whether an object is present in the i-th predicted bounding box (I_i^obj = 1 if an object is present, I_i^obj = 0 otherwise); C_i denotes the confidence of the i-th bounding box in which the object is located, and Ĉ_i denotes the confidence of the i-th bounding box predicted by the network.
3. Category loss L_P:
where p_i denotes the probability of each class in the i-th bounding box in which the object is located, and p̂_i denotes the probability of each class in the i-th bounding box predicted by the network.
The final loss function L is:
L = L_IoU + L_C + L_P
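To show how the three terms could fit together, a compact NumPy sketch is given below. The binary cross-entropy forms of L_C and L_P follow the standard YOLOv3-style formulation suggested by the definitions above (N boxes, indicator I_i^obj, targets C_i and p_i, predictions Ĉ_i and p̂_i); the exact summation and weighting, and the restriction of L_IoU to object-containing boxes, are assumptions of this sketch, since the component formulas are not reproduced in this text:

    import numpy as np

    def bce(target, pred, eps=1e-7):
        """Element-wise binary cross entropy."""
        pred = np.clip(pred, eps, 1.0 - eps)
        return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

    def total_loss(iou_vals, obj_mask, conf_true, conf_pred, cls_true, cls_pred):
        """L = L_IoU + L_C + L_P over N predicted boxes.

        iou_vals : (N,)    IoU of each predicted box with its matched real box
        obj_mask : (N,)    I_i^obj, 1 if an object is present in box i, else 0
        conf_*   : (N,)    confidence targets C_i and predictions C_hat_i
        cls_*    : (N, 80) class targets p_i and predictions p_hat_i
        """
        l_iou = np.sum(obj_mask * (1.0 - iou_vals))                  # IoU (position) loss
        conf_bce = bce(conf_true, conf_pred)
        l_c = np.sum(obj_mask * conf_bce) + np.sum((1.0 - obj_mask) * conf_bce)  # obj_loss + noobj_loss
        l_p = np.sum(obj_mask[:, None] * bce(cls_true, cls_pred))    # category loss
        return l_iou + l_c + l_p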
according to the invention, an iteration threshold value epoch is set according to the precision requirement, when the iteration number is smaller than the epoch, the weight is updated by utilizing an Adam optimization algorithm until the loss value is lower than the set threshold value or the iteration number is larger than the epoch, the training process is ended, and a weight file Q is output 1 ,Q 1 The method comprises the steps of including weight coefficients and offset of parameters of each network layer in the training process, and then performing performance detection on training results.
Step three: loss L for current IoU-based IoU The defect that gradient return cannot be carried out under the condition that the pre-selected frame is completely wrapped by the target frame is overcome, and an improved version of loss function L based on EIoU representation is provided EIoU And embedding the performance of the test result into an algorithm model, and training and detecting the performance of the test result.
The formula is as follows:
L_EIoU = 1 - IoU + R
where R is the penalty factor, in which (x'_1, y'_1), (x'_1, y'_2), (x'_2, y'_1), (x'_2, y'_2) represent the four vertex coordinates of the predicted box; (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) represent the four vertex coordinates of the real box; l and w represent the length and width of the minimum closure region that can contain both the predicted box and the real box, with l² = (max(x_2, x'_2) - min(x_1, x'_1))² and w² = (max(y_2, y'_2) - min(y_1, y'_1))²; IoU is the intersection-over-union between the predicted box and the real box, and L_EIoU represents the loss value.
It can be seen from the above formula that L_EIoU continually pushes the predicted box toward the real box, and still moves the predicted box toward the real box even when the two boxes do not intersect or one contains the other. It also takes into account the case in which the real box completely encloses predicted boxes of the same area but different aspect ratios, so L_EIoU is not limited by the aspect ratio.
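The closed-form expression of the penalty factor R is not reproduced in this text (it is given by the formula referenced above and illustrated in FIG. 7 of the patent), so the sketch below computes only the quantities that the definition is stated over, namely the IoU of the two boxes and the squared dimensions l² and w² of their minimum closure region, and leaves R as a caller-supplied function. It reuses the iou helper from the earlier sketch, and all names are illustrative:

    def enclosing_dims_squared(pred, true):
        """l^2 and w^2 of the minimum closure region containing both boxes (corner format)."""
        l2 = (max(true[2], pred[2]) - min(true[0], pred[0])) ** 2   # (max(x2, x'2) - min(x1, x'1))^2
        w2 = (max(true[3], pred[3]) - min(true[1], pred[1])) ** 2   # (max(y2, y'2) - min(y1, y'1))^2
        return l2, w2

    def l_eiou(pred, true, penalty_r):
        """L_EIoU = 1 - IoU + R, with the penalty R supplied by the caller."""
        return 1.0 - iou(pred, true) + penalty_r(pred, true)

    # Worked example of FIG. 7: with IoU = 0.3 and R = 0.064, L_EIoU = 1 - 0.3 + 0.064 = 0.764.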
This loss function module is embedded into the YOLOv3 model in place of the IoU loss function, training is carried out again with the training process kept consistent with the one described above, a weight file Q_2 is output, and the training result is evaluated.
Step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
During testing, the detection accuracy at IoU = 0.5 is adopted as the performance metric of the algorithm: if the intersection-over-union between the rectangular box predicted by the algorithm for a picture and the real rectangular box of that picture is greater than 0.5, the algorithm is considered to have detected that picture successfully.
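Under this criterion, the per-image success check reduces to a single comparison against the stated 0.5 threshold, sketched below with the iou helper from the earlier sketch:

    IOU_THRESHOLD = 0.5

    def detection_successful(pred_box, true_box):
        """A picture counts as successfully detected if IoU(predicted, real) > 0.5."""
        return iou(pred_box, true_box) > IOU_THRESHOLD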
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the drawings needed in the embodiments or the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a partial sample graph in a training set;
FIG. 3 is a diagram of the structure of a YOLOv3 network model;
FIG. 4 is a schematic diagram of a network training process;
fig. 5 is a IoU calculation diagram;
FIG. 6 is a graph comparing loss values of IoU;
FIG. 7 is an EIoU loss value calculation graph;
FIG. 8 is a graph of partial detection results of the original YOLOv3 model;
FIG. 9 is a graph comparing the partial detection results of the original YOLOv3 and the improved YOLOv3 models;
Table 1 shows the overall performance of the original YOLOv3 and the improved YOLOv3 models on the validation data set;
Detailed Description
To make the above and other objects, features and advantages of the present invention more apparent, the following specific examples of the present invention are given together with the accompanying drawings, which are described in detail as follows:
referring to fig. 1, the implementation steps of the present invention are as follows:
step one: and downloading a COCO data set of the current target detection field general data set, ensuring to keep consistent with the field general data set so as to achieve a comparison effect and detect the performance of the method. Download address:http:// cocodataset.org/#home
COCO, collectively Microsoft Common Objects in Context, is a data set available for image recognition by Microsoft team. The COCO dataset provided 80 object categories. The labeling type of the picture in the data set used by the invention is an object detection type which is expressed as category information p labeled with the object of interest in the picture i And the central position coordinates (x, y) of the target, the width w and the height h of the target are visualized by rectangular frames.
FIG. 2 shows partial samples from the training set of the COCO data set, illustrating the universality of the target detection objects: different images are trained under different angles and in different scenes.
Step two: reconstructing the YOLOv3 network system, training the YOLOv3 network based on the data set selected in the step one, outputting a weight file Q, detecting the performance of the data, and making comparison data.
Referring to fig. 3 and 4: the main network structure of the Yolov3 algorithm consists of 52 convolution layers and is divided into three stages, namely three different-scale outputs. The 1-26 layer convolution is stage 1, the 27-43 layer convolution is stage 2, and the 44-52 layer convolution is stage 3.
The specific network structure and training procedure is as follows, where "x" represents the product:
firstly, randomly initializing weights by a network to enable the initialized values to be subjected to Gaussian normal distribution, then inputting a picture with 416 multiplied by 3 pixels, and obtaining 416 multiplied by 32 feature map output by a 1 st layer convolution layer, wherein the convolution kernel size is 3 multiplied by 3, the step size is 1 and the number is 32; and entering a layer 2 convolution layer, wherein the convolution kernel is 3×3 in size, 2 in step size and 64 in number, so that the 208×208×64 feature map output is obtained, and the like. According to different convolution kernels of each layer in the network diagram, respectively entering three different stages to sequentially obtain a 52×52×256 feature diagram, a 26×26×512 feature diagram and a 13×13×1024 feature diagram, and then entering feature interaction layers 1,2 and 3 to continuously carry out convolution operations as follows:
the feature interaction layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and the convolution operations of 3×3×256 and 1×1×255 are carried out, so that a feature map of 52×52×255 is obtained.
The feature interaction layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×256, 3×3×512, 1×1×256, 3×3×512 and 1×1×256, the step sizes are all 1, a feature map of 26×26×256 is obtained, and the convolution operations of 3×3×512 and 1×1×255 are carried out, so that a feature map of 26×26×255 is obtained.
The feature interaction layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are 1×1×512, 3×3×1024, 1×1×512, 3×3×1024 and 1×1×512 in sequence, the step sizes are 1, a feature map of 13×13×512 is obtained, and the convolution operations of 3×3×1024 and 1×1×255 are performed, so that a feature map of 13×13×255 is obtained.
Taking the 52×52×255 feature map 1 as an example: the first dimension 52 represents the number of horizontal pixels in the picture, the second dimension 52 represents the number of vertical pixels in the picture, and the third dimension 255 represents the number of target features of interest, containing information for 3 scales. The information for each scale comprises 85 points, namely: the center position coordinates (x, y) of the target of interest, the width w and height h of the target, the category information p_i, and the confidence C, where the category information p_i comprises 80 categories; hence 3 × (1 + 1 + 1 + 1 + 80 + 1) = 255. The meaning of each dimension of feature map 2 and feature map 3 is the same as that of feature map 1.
The predicted box information of the target of interest is obtained through the above network model, the predicted box is compared with the real box, and the loss errors are calculated, including the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P. The calculation formulas are as follows:
1. IoU loss L_IoU:
The IoU loss L_IoU represents the target position loss value:
L_IoU = 1 - IoU
where the calculation of IoU is illustrated in FIG. 5.
2. Confidence loss L_C:
The confidence loss uses a binary cross-entropy function:
L_C = obj_loss + noobj_loss
where N represents the total number of bounding boxes predicted by the network; I_i^obj indicates whether an object is present in the i-th predicted bounding box (I_i^obj = 1 if an object is present, I_i^obj = 0 otherwise); C_i denotes the confidence of the i-th bounding box in which the object is located, and Ĉ_i denotes the confidence of the i-th bounding box predicted by the network.
3. Category loss L_P:
where p_i denotes the probability of each class in the i-th bounding box in which the object is located, and p̂_i denotes the probability of each class in the i-th bounding box predicted by the network.
The final loss function L is:
L = L_IoU + L_C + L_P
according to the invention, an iteration threshold value epoch=100 is set according to the precision requirement, when the iteration number is smaller than epoch, the weight is updated by utilizing an Adam optimization algorithm until the loss value is lower than the set threshold value or the iteration number is larger than epoch, the training process is ended, and a weight file Q is output 1 ,Q 1 The method comprises the steps of including weight coefficients and offset of parameters of each network layer in the training process, and then performing performance detection on training results.
In summary, the specific training process can be summarized in a simplified manner as follows:
(1) The network randomly initializes the weight value, and the initialized value is subjected to Gaussian normal distribution.
(2) And outputting three feature images with different scales by the input picture data through the network model in the step two, and obtaining the prediction frame information by utilizing the feature images.
(3) Comparing the predicted box with the real box; the loss errors calculated at this stage mainly comprise the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P.
(4) When the iteration count is smaller than epoch = 100, the Adam optimization algorithm is used to update the weights; once the loss value falls below the set threshold or the iteration count exceeds epoch, the training process ends, a weight file is output, and performance testing is then carried out on the training result. The main test index of the method is mAP (mean Average Precision), the mean of the average precision: first the average precision AP (Average Precision) is computed within each category, and then the AP values of all categories are averaged to obtain the mAP, as in the sketch below.
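The sketch below is a deliberately simplified illustration of this metric, not the full COCO evaluation protocol: for one category, detections are assumed to be sorted by confidence and flagged as true positives when their IoU with a ground-truth box exceeds 0.5, AP is the area under the resulting precision-recall curve, and mAP is the mean of the per-category AP values. Taking the number of ground-truth boxes equal to the number of true positives is an additional assumption of this sketch:

    import numpy as np

    def average_precision(tp_flags):
        """AP of one category from detections sorted by confidence (True = TP at IoU > 0.5)."""
        flags = np.asarray(tp_flags, dtype=bool)
        tp = np.cumsum(flags)
        fp = np.cumsum(~flags)
        n_gt = max(int(flags.sum()), 1)          # simplification: every ground truth assumed matched
        recall = tp / n_gt
        precision = tp / np.maximum(tp + fp, 1)
        ap, prev_r = 0.0, 0.0
        for p, r in zip(precision, recall):      # area under the precision-recall curve
            ap += p * (r - prev_r)
            prev_r = r
        return ap

    def mean_average_precision(per_class_tp_flags):
        """mAP: mean of the per-category AP values."""
        return float(np.mean([average_precision(f) for f in per_class_tp_flags]))

    print(mean_average_precision([[True, True, False], [True, False, True]]))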
Step three: loss L for current IoU-based IoU The defect that gradient return cannot be carried out under the condition that the pre-selected frame is completely wrapped by the target frame is overcome, and an improved version of loss function L based on EIoU representation is provided EIoU And embedding the performance of the test result into an algorithm model, and training and detecting the performance of the test result.
Referring to FIG. 7, the loss value of the EIoU-based loss function L_EIoU is calculated by the following formula:
L_EIoU = 1 - IoU + R
where R is the penalty factor, in which (x'_1, y'_1), (x'_1, y'_2), (x'_2, y'_1), (x'_2, y'_2) represent the four vertex coordinates of the predicted box; (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) represent the four vertex coordinates of the real box; l and w represent the length and width of the minimum closure region that can contain both the predicted box and the real box, with l² = (max(x_2, x'_2) - min(x_1, x'_1))² and w² = (max(y_2, y'_2) - min(y_1, y'_1))²; IoU is the intersection-over-union between the predicted box and the real box, and L_EIoU represents the loss value.
It can be seen from the above formula that the EIoU-based loss function L_EIoU continually pushes the predicted box toward the real box, and still moves the predicted box toward the real box even when the two boxes do not intersect or one contains the other. It also takes into account the case in which the real box completely encloses predicted boxes of the same area but different aspect ratios, so L_EIoU is not limited by the aspect ratio; the value range of L_EIoU is consistent with that of the GIoU-based loss, namely [0, 2].
This loss function module is embedded into the YOLOv3 model in place of the IoU-based loss function L_IoU, training is carried out again with the training process kept consistent with the one described above, a weight file is output, and the training result is evaluated.
Step four: and comparing the classical YOLOv3 algorithm, and analyzing the test result.
During testing, the detection accuracy at IoU = 0.5 is adopted as the performance metric of the algorithm: if the intersection-over-union between the rectangular box predicted by the algorithm for a picture and the real rectangular box of that picture is greater than 0.5, the algorithm is considered to have detected that picture successfully.
The invention is further described below in connection with simulation examples.
Simulation example:
the invention adopts the original YOLOv3 model as a comparison model, adopts the COCO data set as a training set and a testing set, and provides a partial detection effect graph.
Fig. 2 is a partial sample diagram in the training set, randomly selecting partial test data in the COCO dataset, and displaying the result, selecting pictures of different backgrounds, different categories, different target sizes, and different target densities, so as to show universality of the test result.
FIG. 4 is a schematic diagram of the network training flow. In the overall loss calculation part, the method of the invention uses the EIoU-based loss function L_EIoU in place of the IoU-based loss function L_IoU, while all other parts are kept the same, so that a controlled-variable comparison can be made to test the accuracy of the method.
FIG. 6 compares the calculation of the EIoU-based loss function L_EIoU of the method of the invention with the current calculation method, where the red box represents the predicted box and the black box represents the real box. It can be seen that when the predicted box is completely enclosed by the real box, for predicted boxes that occupy the same proportion of the real box but have different aspect ratios, the proposed L_EIoU distinguishes them well, whereas the existing calculation method cannot.
FIG. 7 shows the calculation of the EIoU-based loss function L_EIoU on a concrete example:
l² = (max(x_2, x'_2) - min(x_1, x'_1))² = 8² = 64
w² = (max(y_2, y'_2) - min(y_1, y'_1))² = 6² = 36
L_EIoU = 1 - IoU + R = 1 - 0.3 + 0.064 = 0.764
fig. 8 is a partial detection result diagram of the original YOLOv3 model, and detection diagrams with different backgrounds, different categories and different target sizes are selected to show universality of the original detection model, so that it can be seen that the basic category detection effect of the object in the picture is good.
Fig. 9 compares the partial detection results of the original YOLOv3 and the improved YOLOv3 models; the left column shows the detection results of the original YOLOv3 model and the right column shows those of the improved YOLOv3 model. It can be seen that in the detection results of the original YOLOv3 model, cases in which two or more objects overlap are not handled well: of the three elephants, two horses and two zebras in the figures, the original model detects only one of each. After the improvement proposed by the invention, good detection results are obtained for the target objects in these pictures, and all three elephants, both horses and both zebras are detected. In summary, the improved YOLOv3 model performs better than the original YOLOv3 model on these detection examples.
The overall performance of the original YOLOv3 and the improved YOLOv3 model on the validation data set is shown in the attached table 1, and it can be seen that the average accuracy mAP of the improved YOLOv3 model on the validation set is higher than that of the original YOLOv3 model.
The simulation experiment shows that the improved YOLOv3 model embedded with the EIoU module has quite excellent performance, is more suitable for the situation when a plurality of objects are overlapped in the same area, does not introduce more calculated amount, and has no influence on real-time performance compared with the original model. The module can be embedded into other classical algorithm models for comparison test, and has more applicability.

Claims (4)

1. An EIoU-based improved YOLOv3 method, comprising the steps of:
step one: downloading a COCO data set of a current target detection field general data set, ensuring that the COCO data set is consistent with the field general data set so as to achieve a comparison effect, and detecting the performance of the method;
step two: reconstructing a YOLOv3 network system, training the YOLOv3 network based on the data set selected in the step one, outputting a weight file Q, detecting the performance of the weight file Q, and making comparison data;
step three: loss L for current IoU-based IoU The defect that gradient return cannot be carried out under the condition that the pre-selected frame is completely wrapped by the target frame is overcome, and an improved version of loss function L based on EIoU representation is provided EIoU Embedding the performance of the test object into a method model, and training and detecting the performance of the test object;
the loss value of the EIoU-based loss function L_EIoU is calculated by the following formula:
L_EIoU = 1 - IoU + R
where R is the penalty factor, in which (x'_1, y'_1), (x'_1, y'_2), (x'_2, y'_1), (x'_2, y'_2) represent the four vertex coordinates of the predicted box, (x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2) represent the four vertex coordinates of the real box, l and w represent the length and width of the minimum closure region that can contain both the predicted box and the real box, l² = (max(x_2, x'_2) - min(x_1, x'_1))², w² = (max(y_2, y'_2) - min(y_1, y'_1))², "×" represents the product, IoU is the intersection-over-union between the predicted box and the real box, and L_EIoU represents the loss value;
it can be seen from the above formula that L_EIoU continually pushes the predicted box toward the real box, and still moves the predicted box toward the real box even when the two boxes do not intersect or one contains the other; it also takes into account the case in which the real box completely encloses predicted boxes of the same area but different aspect ratios, so L_EIoU is not limited by the aspect ratio;
this loss function module is embedded into the YOLOv3 model in place of the IoU-based loss function L_IoU, training is carried out again with the training process kept consistent with the one described above, a weight file is output, and the training result is evaluated;
step four: the test results were analyzed against the classical YOLOv3 method.
2. The EIoU-based improved YOLOv3 method of claim 1, step one: downloading the COCO data set, the general-purpose data set of the current target detection field; COCO (Microsoft Common Objects in Context) is a data set for image recognition provided by a Microsoft team, and the COCO data set provides 80 object categories; the annotation type of the pictures in the data set used in the invention is the object detection type, expressed as the category information p_i of the target of interest in the picture, the center position coordinates (x, y) of the target, and the width w and height h of the target, visualized by a rectangular box; the data set is selected to be consistent with the general data set of the field so as to enable comparison, and the performance of the method is tested.
3. The EIoU-based improved YOLOv3 method according to claim 1, step two: reconstructing the YOLOv3 network, training the YOLOv3 network on the data set selected in step one, outputting a weight file Q, testing its performance, and producing comparison data, where the specific network model and training process are as follows:
the main network structure of the YOLOv3 method consists of 52 convolution layers, and is divided into three stages, namely three different-scale outputs; the 1-26-layer convolution is stage 1, the 27-43-layer convolution is stage 2, the 44-52-layer convolution is stage 3, the output of stage 1, namely the output receptive field of the 26 th convolution layer is small and is responsible for detecting small targets, the output of stage 2, namely the output receptive field of the 43 rd convolution layer is centered and is responsible for detecting medium-sized targets, the output of stage 3, namely the output receptive field of the 52 th convolution layer is large and is easy to detect large targets;
firstly, randomly initializing weights by a network to enable the initialized values to be subjected to Gaussian normal distribution, then inputting a picture with 416 multiplied by 3 pixels, and obtaining 416 multiplied by 32 feature map output by a 1 st layer convolution layer, wherein the convolution kernel size is 3 multiplied by 3, the step size is 1 and the number is 32; entering a layer 2 convolution layer, wherein the convolution kernel size is 3 multiplied by 3, the step length is 2, the number is 64, and the output of the feature map of 208 multiplied by 64 is obtained, so that the method is similar to the method; according to different convolution kernels of each layer in the network diagram, respectively entering three different stages to sequentially obtain a 52×52×256 feature diagram, a 26×26×512 feature diagram and a 13×13×1024 feature diagram, and then entering feature interaction layers 1,2 and 3 to continuously carry out convolution operations as follows:
the feature interaction layer 1 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×128, 3×3×256, 1×1×128, 3×3×256 and 1×1×128, the step sizes are all 1, a feature map of 52×52×128 is obtained, and the convolution operations of 3×3×256 and 1×1×255 are carried out, so that a feature map of 52×52×255 is obtained;
the feature interaction layer 2 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are 1×1×256, 3×3×512, 1×1×256, 3×3×512 and 1×1×256 in sequence, the step sizes are 1, a feature map of 26×26×256 is obtained, and the convolution operations of 3×3×512 and 1×1×255 are carried out, so that a feature map of 26×26×255 is obtained;
the feature interaction layer 3 is a convolution module and comprises 5 steps of convolution operations, the convolution kernel sizes and the number are sequentially 1×1×512, 3×3×1024, 1×1×512, 3×3×1024 and 1×1×512, the step sizes are all 1, a feature map of 13×13×512 is obtained, and the convolution operations of 3×3×1024 and 1×1×255 are carried out, so that a feature map of 13×13×255 is obtained;
taking the 52×52×255 feature map 1 as an example: the first dimension 52 represents the number of horizontal pixels in the picture, the second dimension 52 represents the number of vertical pixels in the picture, and the third dimension 255 represents the number of target features of interest, containing information for 3 scales; the information for each scale comprises 85 points, namely: the center position coordinates (x, y) of the target of interest, the width w and height h of the target, the category information p_i, and the confidence C, where the category information p_i comprises 80 categories, so 3 × (1 + 1 + 1 + 1 + 80 + 1) = 255; the meaning of each dimension of feature map 2 and feature map 3 is the same as that of feature map 1;
the predicted box information of the target of interest is obtained through the above network model, the predicted box is compared with the real box, and the loss errors are calculated, including the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P, with the calculation formulas as follows:
a. IoU loss L_IoU
L_IoU represents the target position loss value:
L_IoU = 1 - IoU;
b. confidence loss
the confidence loss uses a binary cross-entropy function:
L_C = obj_loss + noobj_loss
where N represents the total number of bounding boxes predicted by the network; I_i^obj indicates whether an object is present in the i-th predicted bounding box (I_i^obj = 1 if an object is present, I_i^obj = 0 otherwise); C_i represents the confidence of the i-th bounding box in which the object is located, and Ĉ_i represents the confidence of the i-th bounding box predicted by the network;
c. category loss
where p_i represents the probability of each class in the i-th bounding box in which the object is located, and p̂_i represents the probability of each class in the i-th bounding box predicted by the network;
the final loss function L is:
L = L_IoU + L_C + L_P
according to the invention, an iteration threshold is set to 100 according to the precision requirement; while the iteration count is smaller than 100, the weights are updated with the Adam optimization method; once the loss value falls below the set threshold or the iteration count exceeds 100, the training process ends and a weight file Q_1 is output; Q_1 contains the weight coefficients and biases of the parameters of every network layer obtained in the training process, after which performance testing is carried out on the training result;
in summary, the specific training process can be summarized in a simplified manner as follows:
(1) Randomly initializing a weight by a network to enable the initialized weight to be subjected to Gaussian normal distribution;
(2) The input picture data outputs three feature images with different scales through the network model in the second step of the invention, and the feature images are utilized to obtain prediction frame information;
(3) comparing the predicted box with the real box, where the loss errors calculated at this stage mainly comprise the IoU loss L_IoU, the confidence loss L_C, and the class loss L_P;
(4) when the iteration count is smaller than 100, the weights are updated with the Adam optimization method; once the loss value falls below the set threshold or the iteration count exceeds 100, the training process ends, a weight file is output, and performance testing is then carried out on the training result; the main test index of the method of the invention is mAP (mean Average Precision), the mean of the average precision: first the average precision AP (Average Precision) is computed within each category, and then the AP values of all categories are averaged to obtain the mAP (mean Average Precision).
4. The EIoU-based improved YOLOv3 method according to claim 1, step four: comparing with the classical YOLOv3 method and analyzing the test results;
during testing, the detection accuracy at IoU = 0.5 is adopted as the performance metric of the method: if the intersection-over-union between the rectangular box predicted by the method for a picture and the real rectangular box of that picture is greater than 0.5, the method is considered to have detected that picture successfully.
CN202010892321.2A 2020-08-28 2020-08-28 YOLOv3 algorithm based on EIoU improvement Active CN112418212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010892321.2A CN112418212B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on EIoU improvement

Publications (2)

Publication Number Publication Date
CN112418212A CN112418212A (en) 2021-02-26
CN112418212B true CN112418212B (en) 2024-02-09

Family

ID=74855048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010892321.2A Active CN112418212B (en) 2020-08-28 2020-08-28 YOLOv3 algorithm based on EIoU improvement

Country Status (1)

Country Link
CN (1) CN112418212B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378739A (en) * 2021-06-19 2021-09-10 湖南省气象台 Foundation cloud target detection method based on deep learning
CN114397877A (en) * 2021-06-25 2022-04-26 南京交通职业技术学院 Intelligent automobile automatic driving system
CN113807466B (en) * 2021-10-09 2023-12-22 中山大学 Logistics package autonomous detection method based on deep learning
CN113903009B (en) * 2021-12-10 2022-07-05 华东交通大学 Railway foreign matter detection method and system based on improved YOLOv3 network
CN114283275B (en) * 2022-03-04 2022-08-16 南昌工学院 Multi-graph target detection method based on optimized deep learning
CN115115887B (en) * 2022-07-07 2023-09-12 中国科学院合肥物质科学研究院 Crop pest detection method based on TSD-Faster RCNN and network thereof
CN116994151B (en) * 2023-06-02 2024-06-04 广州大学 Marine ship target identification method based on SAR image and YOLOv s network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11094070B2 (en) * 2019-04-23 2021-08-17 Jiangnan University Visual multi-object tracking based on multi-Bernoulli filter with YOLOv3 detection

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062413A (en) * 2019-11-08 2020-04-24 深兰科技(上海)有限公司 Road target detection method and device, electronic equipment and storage medium
CN111046787A (en) * 2019-12-10 2020-04-21 华侨大学 Pedestrian detection method based on improved YOLO v3 model
CN111310861A (en) * 2020-03-27 2020-06-19 西安电子科技大学 License plate recognition and positioning method based on deep neural network
CN111310773A (en) * 2020-03-27 2020-06-19 西安电子科技大学 Efficient license plate positioning method of convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression; Zhaohui Zheng et al.; arXiv; 2019-11-19; pp. 1-8 *
YOLOv3 object detection algorithm fusing GIoU and Focal loss; Zou Chengming et al.; Computer Engineering and Applications; 2020-06-28; pp. 214-222 *

Also Published As

Publication number Publication date
CN112418212A (en) 2021-02-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant