CN115205667A - Dense target detection method based on YOLOv5s - Google Patents

Dense target detection method based on YOLOv5s

Info

Publication number
CN115205667A
CN115205667A (application CN202210920891.7A)
Authority
CN
China
Prior art keywords
convolution
module
training
channel
fish
Prior art date
Legal status
Pending
Application number
CN202210920891.7A
Other languages
Chinese (zh)
Inventor
宋雪桦
顾寅武
张舜尧
王昌达
金华
袁昕
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202210920891.7A
Publication of CN115205667A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/05 Underwater scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a dense target detection method based on YOLOv5s. A spatial attention mechanism and a channel attention mechanism are added to different branches of the CSP module; RepVGG Block modules are used in the Backbone to improve recognition accuracy for targets of different scales and to increase inference speed; an SA attention module is added to strengthen the feature-extraction capability of the algorithm; CARAFE upsampling is used in the Neck to obtain a larger receptive field; and a Varifocal Loss function is introduced so that training on dense target samples focuses more on high-quality positive samples. The method is trained on a fish dataset, and the trained model weights are used for detection, which effectively reduces the consumption of manpower and material resources, improves detection accuracy, and better meets the requirements of dense target detection tasks.

Description

Dense target detection method based on YOLOv5s
Technical Field
The invention relates to the field of computer-vision target detection, and in particular to a dense target detection method based on YOLOv5s.
Background
Visual target detection aims to locate and identify objects in images. It is one of the classic tasks in computer vision, a prerequisite and foundation for many other vision tasks, and it has important theoretical and practical value in fields such as autonomous driving, video surveillance, aquaculture, and smart agriculture. With the rapid development of deep learning, target detection has made great progress. Traditional manual inspection is inaccurate, inefficient, and consumes time and labor. Classical machine-learning approaches classify and identify targets with support vector machines, but their detection accuracy is low and they are prone to missed and false detections. In recent years, as dense targets appear in many application scenarios, detection based on computer vision and deep learning has gradually become mainstream: target detection and recognition algorithms extract target features automatically through convolutional neural networks and achieve higher detection speed and accuracy than conventional methods.
Disclosure of Invention
Aiming at the above problems, a dense target detection model based on YOLOv5s is provided. The model can better meet the requirements of dense target detection tasks.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows: a dense target detection method based on YOLOv5s, comprising the following steps:
1) Place a detection device at the front end of a bait-casting boat to detect the number of fish in a school. The detection device comprises a camera device and an illuminating device; the camera device photographs the fish school for counting, and the illuminating device is kept on for underwater lighting;
2) Construct a fish dataset D2 and divide it into a training set D_train and a validation set D_test;
3) Construct the YOLOv5s network model, which comprises Input, Backbone, Neck, and Prediction. The Input stage includes Mosaic data augmentation, adaptive anchor-box computation, and adaptive image scaling; the Backbone includes the Focus, SPP, and C3 modules; the Neck includes the FPN, PAN, and C3 modules; the Prediction stage includes the bounding-box loss function and NMS;
4) Modify the backbone-network convolution modules, replacing them with RepVGG Block modules;
5) Modify the backbone-network structure, inserting an SA attention mechanism between the RepVGG module and the SPP module;
6) Modify the upsampling mode of the YOLOv5s neck network, changing nearest-neighbor upsampling to CARAFE upsampling;
7) Replace the Focal Loss function used for the class loss and confidence loss between target boxes and prediction boxes with the Varifocal Loss function;
8) Perform transfer training on the fish dataset D2 to obtain training weights w. GIoU_Loss is used as the loss function; training stops, yielding the weights w, when the model loss curve approaches 0 without obvious fluctuation; otherwise training continues;
9) Input images and detect fish schools: feed the captured fish-school images into the model with training weights w, and the model automatically identifies the number of fish according to the weights.
Further, the step 2) includes the following steps:
2.1) Select N public fish images to construct a dataset D1;
2.2) Label the fish in each image of dataset D1 with the labeling tool Labelimg to construct a fish dataset D2;
2.3) Divide the fish dataset D2 proportionally into a training set D_train and a validation set D_test.
Further, the step 4) includes the following steps:
4.1) Train the multi-branch model: during training, add a parallel 1×1 convolution branch and an identity-mapping branch to each 3×3 convolution layer;
4.2) Equivalently convert the multi-branch model into a single-path model: a 1×1 convolution can be regarded as a 3×3 convolution whose kernel is mostly zeros, and the identity mapping as a special 1×1 convolution; by the additivity of convolution, the three branches of each RepVGG Block module can be merged into a single 3×3 convolution;
4.3) Structural re-parameterization: transfer the weights of the multi-branch network into the single-path network through the actual data flow.
Further, the step 5) includes the following steps:
5.1) Feature grouping: suppose the input feature is X ∈ R^(C×H×W), where C, H, and W denote the number of channels, the height, and the width respectively. Feature grouping splits the input X into G groups along the channel dimension, so that each sub-feature gradually captures a specific semantic response during training;
5.2) A channel attention mechanism is used to capture channel correlation; the calculation is:
s = F_gp(X_k1) = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} X_k1(i, j)
X'_k1 = σ(W_1 · s + b_1) · X_k1
where s denotes the channel statistics obtained by global average pooling, X_k1 is one branch of the split along the channel dimension, X'_k1 is the final output of the channel attention, σ is the sigmoid activation function, and W_1 and b_1 are parameters of shape C/2G × 1 × 1.
5.3) A spatial attention mechanism is used to capture spatial correlation; the calculation is:
X'_k2 = σ(W_2 · GN(X_k2) + b_2) · X_k2
where X_k2 is the other branch of the split along the channel dimension, X'_k2 is the final output of the spatial attention, W_2 and b_2 are parameters of shape C/2G × 1 × 1, and GN denotes group normalization;
5.4) Aggregation: after the channel attention and the spatial attention have been computed, the two outputs are fused by Concat: X'_k = [X'_k1, X'_k2] ∈ R^(C/G×H×W), and a channel-shuffle operation is used for inter-group communication.
Further, the step 6) includes the following steps:
6.1) Feature-map channel compression: suppose the upsampling ratio is σ. For an input feature map of shape C × H × W, where C, H, and W denote the number of channels, the height, and the width respectively, a 1 × 1 convolution compresses the number of channels to C_m;
6.2) Content encoding and upsampling-kernel prediction: for the compressed feature map from step 6.1), a convolution layer of kernel size k_encoder × k_encoder is used to predict the upsampling kernels. Suppose the reassembly kernel size is k_up × k_up; the prediction layer has C_m input channels and σ²·k_up² output channels. The channel dimension is then unfolded into the spatial dimension, giving an upsampling kernel of shape σH × σW × k_up²;
6.3) Upsampling-kernel normalization: each k_up × k_up kernel obtained in step 6.2) is normalized channel-wise with softmax so that its weights sum to 1. Each position in the output feature map is mapped back to the input feature map, the k_up × k_up region centered on it is extracted, and its dot product with the predicted upsampling kernel at that point gives the output value; different channels at the same position share the same upsampling kernel.
Further, in the step 7), the Varifocal Loss function is:
VFL(p, q) = −q · (q · log(p) + (1 − q) · log(1 − p)) if q > 0
VFL(p, q) = −α · p^γ · log(1 − p) if q = 0
where p is the predicted IACS (IoU-aware classification score) and q is the target IoU score: for positive samples, q is the IoU between the prediction box and the ground-truth box; for negative samples, q is 0. α and γ are the focal down-weighting parameters for negative samples.
Further, in the step 8), the GIoU_Loss function is:
GIoU_Loss = 1 − GIoU = 1 − (IoU − |A_c − U| / |A_c|), with IoU = I / U
where IoU denotes the intersection-over-union of the two overlapping rectangular boxes; I denotes the area of the overlap of the two rectangles; U denotes the sum of the areas of the two rectangles, A_p + A_g, minus their intersection area I; and A_c is the area of the smallest enclosing rectangle of the two.
The invention provides a dense target detection method based on YOLOv5s, adopting a detection model that integrates the RepVGG module, an attention mechanism, and the CARAFE upsampling module. The method effectively improves overall performance in dense-target image detection tasks, greatly improves detection accuracy, and is of great significance to the development of autonomous driving, video surveillance, and the aquaculture industry.
Drawings
FIG. 1 is a flow chart of the dense target detection method based on YOLOv5s according to the invention.
FIG. 2 is a diagram of the YOLOv5s network structure.
FIG. 3 is a structure diagram of the backbone-network RepVGG Block module.
FIG. 4 is a structure diagram of the SA attention mechanism.
Detailed Description
The present invention is further described below with reference to the drawings and a specific embodiment. It should be noted that the technical solution and design principle of the invention are described in detail with reference to a preferred embodiment only; the scope of the invention is not limited thereto, and any obvious improvement, substitution, or modification made by those skilled in the art without departing from the spirit of the invention remains within its scope.
The flow of the dense target detection method based on YOLOv5s provided by the invention is shown in FIG. 1; the method comprises the following steps:
1) Place a detection device at the front end of a bait-casting boat to detect the number of fish in a school. The detection device comprises a camera device and an illuminating device; the camera device photographs the fish school for counting, and the illuminating device is kept on for underwater lighting;
2) Construct a fish dataset D2 and divide it into a training set D_train and a validation set D_test;
3) Construct the YOLOv5s network model; its structure is shown in FIG. 2. The model comprises Input, Backbone, Neck, and Prediction. The Input stage includes Mosaic data augmentation, adaptive anchor-box computation, and adaptive image scaling; the Backbone includes the Focus, SPP, and C3 modules; the Neck includes the FPN, PAN, and C3 modules; the Prediction stage includes the bounding-box loss function and NMS. The backbone C3 module is divided into two branches: one branch passes through a stack of Bottleneck modules and 3 standard convolution layers, the other passes through a single basic convolution module, and the two branches are finally concatenated (a minimal sketch of this module is given after this list);
4) Modify the backbone-network convolution modules, replacing them with RepVGG Block modules;
5) Modify the backbone-network structure, inserting an SA attention mechanism between the RepVGG module and the SPP module;
6) Modify the upsampling mode of the YOLOv5s neck network, changing nearest-neighbor upsampling to CARAFE upsampling;
7) Replace the Focal Loss function used for the class loss and confidence loss between target boxes and prediction boxes with the Varifocal Loss function;
8) Perform transfer training on the fish dataset D2 to obtain training weights w. GIoU_Loss is used as the loss function; training stops, yielding the weights w, when the model loss curve approaches 0 without obvious fluctuation; otherwise training continues;
9) Input images and detect fish: feed the captured fish images into the model with training weights w, and the model automatically identifies the number of fish according to the weights.
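To make step 3) concrete, the following is a minimal PyTorch sketch of the C3 module. PyTorch itself, the hidden-channel split c2 // 2, and the SiLU activation are assumptions drawn from the public YOLOv5 code, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Basic YOLOv5-style convolution block: Conv2d + BN + SiLU."""
    def __init__(self, c1, c2, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """1x1 then 3x3 convolution with a residual shortcut."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = Conv(c, c, 1)
        self.cv2 = Conv(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3(nn.Module):
    """One branch stacks n Bottlenecks; the other is a single basic
    convolution; the two branches are concatenated and fused."""
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = Conv(c1, c_, 1)
        self.cv2 = Conv(c1, c_, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*(Bottleneck(c_) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```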
As a preferred embodiment of the present invention, the step 2) comprises the steps of:
2.1) Select N public fish images to construct a dataset D1;
2.2) Label the fish in each image of dataset D1 with the labeling tool Labelimg to construct a fish dataset D2;
2.3) Divide the fish dataset D2 proportionally into a training set D_train and a validation set D_test.
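As an illustration of step 2.3), a minimal sketch of a proportional split follows; the 8:2 ratio, the .jpg extension, and the flat directory layout are assumptions, since the patent does not fix the proportion.

```python
import random
from pathlib import Path

def split_dataset(image_dir, train_ratio=0.8, seed=0):
    """Split the labelled fish dataset D2 into D_train and D_test."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)  # deterministic shuffle
    n_train = int(len(images) * train_ratio)
    return images[:n_train], images[n_train:]
```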
As a preferred embodiment of the present invention, the RepVGG Block convolution structure is shown in FIG. 3, and the step 4) includes the following steps:
4.1) Train the multi-branch model. During training, a parallel 1×1 convolution branch and an identity-mapping branch are added to each 3×3 convolution layer.
4.2) Equivalently convert the multi-branch model into a single-path model. A 1×1 convolution can be regarded as a 3×3 convolution whose kernel is mostly zeros, and the identity mapping as a special 1×1 convolution. By the additivity of convolution, the three branches of each RepVGG Block module can be merged into a single 3×3 convolution.
4.3) Structural re-parameterization. Transfer the weights of the multi-branch network into the single-path network through the actual data flow.
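The branch merging of step 4.2) can be checked numerically. The sketch below assumes stride-1 convolutions with bias and omits the batch-norm folding that a full RepVGG re-parameterization also performs:

```python
import torch
import torch.nn.functional as F

def fuse_repvgg_branches(w3, b3, w1, b1, channels):
    """Merge 3x3, 1x1 and identity branches into one 3x3 convolution."""
    # A 1x1 kernel is a 3x3 kernel that is zero except at the centre.
    w1_as_3x3 = F.pad(w1, [1, 1, 1, 1])
    # The identity mapping is a special 1x1 convolution (a unit kernel).
    w_id = torch.eye(channels).reshape(channels, channels, 1, 1)
    w_id_as_3x3 = F.pad(w_id, [1, 1, 1, 1])
    # Additivity of convolution: summing aligned kernels merges branches.
    return w3 + w1_as_3x3 + w_id_as_3x3, b3 + b1

c = 8
w3, b3 = torch.randn(c, c, 3, 3), torch.randn(c)
w1, b1 = torch.randn(c, c, 1, 1), torch.randn(c)
w, b = fuse_repvgg_branches(w3, b3, w1, b1, c)

x = torch.randn(1, c, 16, 16)
multi = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1) + x
single = F.conv2d(x, w, b, padding=1)
assert torch.allclose(multi, single, atol=1e-5)
```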
As a preferred embodiment of the present invention, the SA module structure is shown in fig. 4, and the step 5) includes the following steps:
5.1) Feature grouping: suppose the input feature is X ∈ R^(C×H×W), where C, H, and W denote the number of channels, the height, and the width respectively. Feature grouping splits the input X into G groups along the channel dimension, so that each sub-feature gradually captures a specific semantic response during training;
5.2) A channel attention mechanism is used. Channel correlation is captured by:
s = F_gp(X_k1) = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} X_k1(i, j)
X'_k1 = σ(W_1 · s + b_1) · X_k1
where s denotes the channel statistics obtained by global average pooling, X_k1 is one branch of the split along the channel dimension, X'_k1 is the final output of the channel attention, σ is the sigmoid activation function, and W_1 and b_1 are parameters of shape C/2G × 1 × 1.
5.3) A spatial attention mechanism is used. Spatial correlation is captured by:
X'_k2 = σ(W_2 · GN(X_k2) + b_2) · X_k2
where X_k2 is the other branch of the split along the channel dimension, X'_k2 is the final output of the spatial attention, W_2 and b_2 are parameters of shape C/2G × 1 × 1, and GN denotes group normalization.
5.4) Aggregation. After the two attention computations are complete, their outputs are first fused by a simple Concat: X'_k = [X'_k1, X'_k2] ∈ R^(C/G×H×W). Finally, a channel-shuffle operation is used for inter-group communication.
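A compact PyTorch sketch of steps 5.1) to 5.4) follows; the group count, the parameter initialization, and the final shuffle over 2 groups follow the public SA-Net implementation and are assumptions here:

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """SA module: per-group channel + spatial attention, then channel
    shuffle. Assumes channels is divisible by 4 * groups."""

    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)  # channels per branch, C/2G
        self.cweight = nn.Parameter(torch.zeros(1, c, 1, 1))  # W_1
        self.cbias = nn.Parameter(torch.ones(1, c, 1, 1))     # b_1
        self.sweight = nn.Parameter(torch.zeros(1, c, 1, 1))  # W_2
        self.sbias = nn.Parameter(torch.ones(1, c, 1, 1))     # b_2
        self.gn = nn.GroupNorm(c, c)
        self.sigmoid = nn.Sigmoid()

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.reshape(b, groups, c // groups, h, w).transpose(1, 2)
        return x.reshape(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        g = x.reshape(b * self.groups, c // self.groups, h, w)
        x_c, x_s = g.chunk(2, dim=1)  # branches X_k1, X_k2
        s = x_c.mean(dim=(2, 3), keepdim=True)  # global average pooling
        x_c = x_c * self.sigmoid(self.cweight * s + self.cbias)            # channel attention
        x_s = x_s * self.sigmoid(self.sweight * self.gn(x_s) + self.sbias)  # spatial attention
        out = torch.cat([x_c, x_s], dim=1).reshape(b, c, h, w)  # X'_k = [X'_k1, X'_k2]
        return self.channel_shuffle(out, 2)  # inter-group communication
```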
As a preferred embodiment of the present invention, the step 6) above includes the steps of:
6.1) Feature-map channel compression: suppose the upsampling ratio is σ. For an input feature map of shape C × H × W, where C, H, and W denote the number of channels, the height, and the width respectively, a 1 × 1 convolution compresses the number of channels to C_m, reducing the computation of the subsequent steps.
6.2) Content encoding and upsampling-kernel prediction: for the compressed feature map from step 6.1), a convolution layer of kernel size k_encoder × k_encoder is used to predict the upsampling kernels. Suppose the reassembly kernel size is k_up × k_up; the prediction layer has C_m input channels and σ²·k_up² output channels. The channel dimension is then unfolded into the spatial dimension, giving an upsampling kernel of shape σH × σW × k_up².
6.3) Upsampling-kernel normalization: each k_up × k_up kernel obtained in step 6.2) is normalized channel-wise with softmax so that its weights sum to 1. Each position in the output feature map is mapped back to the input feature map, the k_up × k_up region centered on it is extracted, and its dot product with the predicted upsampling kernel at that point gives the output value. Different channels at the same position share the same upsampling kernel.
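The three CARAFE steps above can be sketched as follows; the choices of c_mid and k_encoder, and the pixel_shuffle/unfold-based reassembly, are implementation assumptions rather than details prescribed by the patent:

```python
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Sketch of CARAFE: compress channels, predict per-position
    reassembly kernels, softmax-normalise them, then reassemble."""

    def __init__(self, c, scale=2, k_up=5, c_mid=64, k_encoder=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)  # step 6.1: C -> C_m
        self.encoder = nn.Conv2d(c_mid, (scale * k_up) ** 2,
                                 k_encoder, padding=k_encoder // 2)  # step 6.2

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = self.encoder(self.compress(x))        # B, s^2*k^2, H, W
        kernels = F.pixel_shuffle(kernels, self.scale)  # B, k^2, sH, sW
        kernels = F.softmax(kernels, dim=1)             # step 6.3: weights sum to 1
        # Gather each k_up x k_up neighbourhood of the input feature map.
        feats = F.unfold(x, self.k_up, padding=self.k_up // 2)  # B, c*k^2, H*W
        feats = feats.reshape(b, c * self.k_up ** 2, h, w)
        feats = F.interpolate(feats, scale_factor=self.scale, mode="nearest")
        feats = feats.reshape(b, c, self.k_up ** 2, self.scale * h, self.scale * w)
        # Dot product between each neighbourhood and its predicted kernel;
        # the same kernel is shared by all channels at a position.
        return (feats * kernels.unsqueeze(1)).sum(dim=2)
```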
As a preferred embodiment of the present invention, the Varifocal Loss function in step 7) is:
VFL(p, q) = −q · (q · log(p) + (1 − q) · log(1 − p)) if q > 0
VFL(p, q) = −α · p^γ · log(1 − p) if q = 0
where p is the predicted IACS (IoU-aware classification score) and q is the target IoU score: for positive samples, q is the IoU between the prediction box and the ground-truth box; for negative samples, q is 0. α and γ are the focal down-weighting parameters for negative samples.
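A direct sketch of this formula; the defaults α = 0.75 and γ = 2.0 follow the VarifocalNet paper and are assumptions here:

```python
import torch
import torch.nn.functional as F

def varifocal_loss(logits, q, alpha=0.75, gamma=2.0):
    """q: target IoU score (IoU with the gt box for positives, 0 for
    negatives). Positives are weighted by q itself; negatives are
    down-weighted by alpha * p^gamma."""
    p = logits.sigmoid()
    bce = F.binary_cross_entropy_with_logits(logits, q, reduction="none")
    pos = (q > 0).float()
    weight = pos * q + (1 - pos) * alpha * p.pow(gamma)
    return (weight * bce).sum()
```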
As a preferred embodiment of the present invention, the GIoU_Loss function in step 8) is:
GIoU_Loss = 1 − GIoU = 1 − (IoU − |A_c − U| / |A_c|), with IoU = I / U
where IoU denotes the intersection-over-union of the two overlapping rectangular boxes; I denotes the area of the overlap of the two rectangles; U denotes the sum of the areas of the two rectangles, A_p + A_g, minus their intersection area I; and A_c is the area of the smallest enclosing rectangle of the two.
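The GIoU_Loss formula translates directly into code; the (x1, y1, x2, y2) box format is an assumption:

```python
import torch

def giou_loss(box_p, box_g):
    """GIoU_Loss = 1 - GIoU, GIoU = IoU - |A_c - U| / |A_c|."""
    # Intersection I of the two rectangles.
    x1 = torch.max(box_p[..., 0], box_g[..., 0])
    y1 = torch.max(box_p[..., 1], box_g[..., 1])
    x2 = torch.min(box_p[..., 2], box_g[..., 2])
    y2 = torch.min(box_p[..., 3], box_g[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    # Union U = A_p + A_g - I.
    area_p = (box_p[..., 2] - box_p[..., 0]) * (box_p[..., 3] - box_p[..., 1])
    area_g = (box_g[..., 2] - box_g[..., 0]) * (box_g[..., 3] - box_g[..., 1])
    union = (area_p + area_g - inter).clamp(min=1e-9)
    iou = inter / union
    # A_c: area of the smallest enclosing rectangle.
    cw = torch.max(box_p[..., 2], box_g[..., 2]) - torch.min(box_p[..., 0], box_g[..., 0])
    ch = torch.max(box_p[..., 3], box_g[..., 3]) - torch.min(box_p[..., 1], box_g[..., 1])
    area_c = (cw * ch).clamp(min=1e-9)
    giou = iou - (area_c - union) / area_c
    return 1 - giou
```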

Claims (7)

1. A dense target detection method based on YOLOv5s, characterized by comprising the following steps:
1) Place a detection device at the front end of a bait-casting boat to detect the number of fish in a school. The detection device comprises a camera device and an illuminating device; the camera device photographs the fish school for counting, and the illuminating device is kept on for underwater lighting;
2) Construct a fish dataset D2 and divide it into a training set D_train and a validation set D_test;
3) Construct the YOLOv5s network model, which comprises Input, Backbone, Neck, and Prediction. The Input stage includes Mosaic data augmentation, adaptive anchor-box computation, and adaptive image scaling; the Backbone includes the Focus, SPP, and C3 modules; the Neck includes the FPN, PAN, and C3 modules; the Prediction stage includes the bounding-box loss function and NMS;
4) Modify the backbone-network convolution modules, replacing them with RepVGG Block modules;
5) Modify the backbone-network structure, inserting an SA attention mechanism between the RepVGG module and the SPP module;
6) Modify the upsampling mode of the YOLOv5s neck network, changing nearest-neighbor upsampling to CARAFE upsampling;
7) Replace the Focal Loss function used for the class loss and confidence loss between target boxes and prediction boxes with the Varifocal Loss function;
8) Perform transfer training on the fish dataset D2 to obtain training weights w. GIoU_Loss is used as the loss function; training stops, yielding the weights w, when the model loss curve approaches 0 without obvious fluctuation; otherwise training continues;
9) Input images and detect fish schools: feed the captured fish-school images into the model with training weights w, and the model automatically identifies the number of fish according to the weights.
2. The YOLOv5s-based dense target detection method of claim 1, wherein step 2) comprises the following steps:
2.1) Select N public fish images to construct a dataset D1;
2.2) Label the fish in each image of dataset D1 with the labeling tool Labelimg to construct a fish dataset D2;
2.3) Divide the fish dataset D2 proportionally into a training set D_train and a validation set D_test.
3. The YOLOv5s-based dense target detection method of claim 1, wherein step 4) comprises the following steps:
4.1) Train the multi-branch model: during training, add a parallel 1×1 convolution branch and an identity-mapping branch to each 3×3 convolution layer;
4.2) Equivalently convert the multi-branch model into a single-path model: a 1×1 convolution can be regarded as a 3×3 convolution whose kernel is mostly zeros, and the identity mapping as a special 1×1 convolution; by the additivity of convolution, the three branches of each RepVGG Block module can be merged into a single 3×3 convolution;
4.3) Structural re-parameterization: transfer the weights of the multi-branch network into the single-path network through the actual data flow.
4. The YOLOv5s-based dense target detection method of claim 1, wherein step 5) comprises the following steps:
5.1) Feature grouping: suppose the input feature is X ∈ R^(C×H×W), where C, H, and W denote the number of channels, the height, and the width respectively; feature grouping splits the input X into G groups along the channel dimension, so that each sub-feature gradually captures a specific semantic response during training;
5.2) A channel attention mechanism is used to capture channel correlation; the calculation is:
s = F_gp(X_k1) = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} X_k1(i, j)
X'_k1 = σ(W_1 · s + b_1) · X_k1
where s denotes the channel statistics obtained by global average pooling, X_k1 is one branch of the split along the channel dimension, X'_k1 is the final output of the channel attention, σ is the sigmoid activation function, and W_1 and b_1 are parameters of shape C/2G × 1 × 1;
5.3) A spatial attention mechanism is used to capture spatial correlation; the calculation is:
X'_k2 = σ(W_2 · GN(X_k2) + b_2) · X_k2
where X_k2 is the other branch of the split along the channel dimension, X'_k2 is the final output of the spatial attention, W_2 and b_2 are parameters of shape C/2G × 1 × 1, and GN denotes group normalization;
5.4) Aggregation: after the channel attention and the spatial attention have been computed, the two outputs are fused by Concat: X'_k = [X'_k1, X'_k2] ∈ R^(C/G×H×W), and a channel-shuffle operation is used for inter-group communication.
5. The YOLOv5s-based dense target detection method of claim 1, wherein step 6) comprises the following steps:
6.1) Feature-map channel compression: suppose the upsampling ratio is σ. For an input feature map of shape C × H × W, where C, H, and W denote the number of channels, the height, and the width respectively, a 1 × 1 convolution compresses the number of channels to C_m;
6.2) Content encoding and upsampling-kernel prediction: for the compressed feature map from step 6.1), a convolution layer of kernel size k_encoder × k_encoder is used to predict the upsampling kernels. Suppose the reassembly kernel size is k_up × k_up; the prediction layer has C_m input channels and σ²·k_up² output channels. The channel dimension is then unfolded into the spatial dimension, giving an upsampling kernel of shape σH × σW × k_up²;
6.3) Upsampling-kernel normalization: each k_up × k_up kernel obtained in step 6.2) is normalized channel-wise with softmax so that its weights sum to 1. Each position in the output feature map is mapped back to the input feature map, the k_up × k_up region centered on it is extracted, and its dot product with the predicted upsampling kernel at that point gives the output value; different channels at the same position share the same upsampling kernel.
6. The YOLOv5s-based dense target detection method of claim 1, wherein in step 7) the Varifocal Loss function is:
VFL(p, q) = −q · (q · log(p) + (1 − q) · log(1 − p)) if q > 0
VFL(p, q) = −α · p^γ · log(1 − p) if q = 0
where p is the predicted IACS (IoU-aware classification score) and q is the target IoU score: for positive samples, q is the IoU between the prediction box and the ground-truth box; for negative samples, q is 0.
7. The YOLOv5s-based dense target detection method of claim 1, wherein in step 8) the GIoU_Loss function is:
GIoU_Loss = 1 − GIoU = 1 − (IoU − |A_c − U| / |A_c|), with IoU = I / U
where IoU denotes the intersection-over-union of the two overlapping rectangular boxes; I denotes the area of the overlap of the two rectangles; U denotes the sum of the areas of the two rectangles, A_p + A_g, minus their intersection area I; and A_c is the area of the smallest enclosing rectangle of the two.
CN202210920891.7A 2022-08-02 2022-08-02 Dense target detection method based on YOLOv5s Pending CN115205667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210920891.7A CN115205667A (en) 2022-08-02 2022-08-02 Dense target detection method based on YOLOv5s

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210920891.7A CN115205667A (en) 2022-08-02 2022-08-02 Dense target detection method based on YOLOv5s

Publications (1)

Publication Number Publication Date
CN115205667A true CN115205667A (en) 2022-10-18

Family

ID=83586088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210920891.7A Pending CN115205667A (en) 2022-08-02 2022-08-02 Dense target detection method based on YOLOv5s

Country Status (1)

Country Link
CN (1) CN115205667A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116343045A (en) * 2023-03-30 2023-06-27 南京理工大学 Lightweight SAR image ship target detection method based on YOLO v5
CN116343045B (en) * 2023-03-30 2024-03-19 南京理工大学 Lightweight SAR image ship target detection method based on YOLO v5
CN116958907A (en) * 2023-09-18 2023-10-27 四川泓宝润业工程技术有限公司 Method and system for inspecting surrounding hidden danger targets of gas pipeline
CN116958907B (en) * 2023-09-18 2023-12-26 四川泓宝润业工程技术有限公司 Method and system for inspecting surrounding hidden danger targets of gas pipeline
CN117274192A (en) * 2023-09-20 2023-12-22 重庆市荣冠科技有限公司 Pipeline magnetic flux leakage defect detection method based on improved YOLOv5
CN117132767A (en) * 2023-10-23 2023-11-28 中国铁塔股份有限公司湖北省分公司 Small target detection method, device, equipment and readable storage medium
CN117132767B (en) * 2023-10-23 2024-03-19 中国铁塔股份有限公司湖北省分公司 Small target detection method, device, equipment and readable storage medium
CN117496475A (en) * 2023-12-29 2024-02-02 武汉科技大学 Target detection method and system applied to automatic driving
CN117496475B (en) * 2023-12-29 2024-04-02 武汉科技大学 Target detection method and system applied to automatic driving

Similar Documents

Publication Publication Date Title
CN115205667A (en) Dense target detection method based on YOLOv5s
CN108805070A (en) A kind of deep learning pedestrian detection method based on built-in terminal
CN112884064B (en) Target detection and identification method based on neural network
CN107358257B (en) Under a kind of big data scene can incremental learning image classification training method
CN109949316A (en) A kind of Weakly supervised example dividing method of grid equipment image based on RGB-T fusion
CN114220035A (en) Rapid pest detection method based on improved YOLO V4
CN108154102A (en) A kind of traffic sign recognition method
CN112633277A (en) Channel ship board detection, positioning and identification method based on deep learning
CN113420643B (en) Lightweight underwater target detection method based on depth separable cavity convolution
CN110647802A (en) Remote sensing image ship target detection method based on deep learning
CN109784278A (en) The small and weak moving ship real-time detection method in sea based on deep learning
CN111507275B (en) Video data time sequence information extraction method and device based on deep learning
CN109948696A (en) A kind of multilingual scene character recognition method and system
CN113128335B (en) Method, system and application for detecting, classifying and finding micro-living ancient fossil image
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN113743505A (en) Improved SSD target detection method based on self-attention and feature fusion
CN109903339A (en) A kind of video group personage's position finding and detection method based on multidimensional fusion feature
CN116168240A (en) Arbitrary-direction dense ship target detection method based on attention enhancement
CN115631407A (en) Underwater transparent biological detection based on event camera and color frame image fusion
CN115393635A (en) Infrared small target detection method based on super-pixel segmentation and data enhancement
CN113421222B (en) Lightweight coal gangue target detection method
CN114596480A (en) Yoov 5 optimization-based benthic organism target detection method and system
CN112990066B (en) Remote sensing image solid waste identification method and system based on multi-strategy enhancement
Li et al. Object detection for uav images based on improved yolov6
CN113361496A (en) City built-up area statistical method based on U-Net

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination