CN114140665A - Dense small target detection method based on improved YOLOv5 - Google Patents

Dense small target detection method based on improved YOLOv5

Info

Publication number
CN114140665A
CN114140665A
Authority
CN
China
Prior art keywords
image
training
yolov5
network
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111474306.7A
Other languages
Chinese (zh)
Inventor
陆声链
刘晓宇
李帼
陈明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202111474306.7A
Publication of CN114140665A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dense small target detection method based on an improved YOLOv5 algorithm, which further improves the YOLOv5 algorithm. The ideas are: (1) a coordinate attention (CA) mechanism is added to the YOLOv5 backbone feature-extraction network, embedding positional information into channel attention so that the mobile network obtains information over a larger region without incurring large overhead; (2) in the feature-fusion network of YOLOv5, BiFPN replaces PANet, introducing learnable weights to better balance feature information at different scales; (3) for dense, mutually occluded small targets, the network is trained with Varifocal Loss, so that the network model can accurately identify targets that cluster and overlap over large areas. The method is robust to variation in object color, complex natural environmental conditions, and the like.

Description

Dense small target detection method based on improved YOLOv5
Technical Field
The invention relates to the technical field of target detection, in particular to a dense small target detection method based on improved YOLOv5.
Background
Target detection is a research hotspot in machine vision and artificial intelligence, and a core technology for applications such as face recognition, object classification and automatic sorting. Many researchers have studied target detection and proposed solutions. Early methods mainly extracted hand-crafted image features, including color, texture, shape and spatial-relationship features. Color-histogram matching methods, such as histogram intersection and reference color tables, cannot extract local image features well, because color conveys neither the orientation nor the scale of image content. Common texture-based methods include the gray-level co-occurrence matrix and the semi-variogram, with random-field and fractal models as common models; since texture is a regional concept, these methods tend to over-regionalize and ignore global features. Shape-based methods, such as boundary-feature and geometric-parameter methods, perform poorly on deformable targets. Some researchers have proposed automatic fruit identification based on machine vision: images are acquired according to machine-vision principles; after smoothing, sharpening and other preprocessing, color sample values of the fruit are computed in RGB color space, the image is segmented according to these sample values, and features are finally extracted from the segmentation result. The main problem with such traditional feature-based methods is limited extensibility: different targets often require different hand-crafted features.
In recent years, machine learning, particularly deep learning, has brought breakthrough changes to the field of computer vision. Researchers have proposed convolutional-neural-network-based fruit identification methods. These generally proceed as follows: RGB images of fruit are acquired, preprocessed and labeled to build a dataset; a convolutional neural network is constructed and its parameters are set; the training set is fed into the network for training; and a fruit recognition model is finally obtained. Owing to its strong applicability, deep learning has been popularized and applied to many target-detection tasks in recent years.
In general, the deep-learning target detection methods in wide use today achieve good results on targets that are large in area and volume and not severely occluded. However, accurate automatic detection of small, dense, severely occluded objects, such as leaves, fruits and flowers on trees, or wild animals photographed from high altitude, remains challenging. Target detection under such outdoor natural conditions must also overcome environmental factors such as light, rain and fog. Although some convolutional-neural-network-based methods focus on small-target detection, they have two drawbacks: first, when facing large numbers of dense, overlapping targets, they cannot identify them accurately and the misidentification rate is high; second, they emphasize accuracy on small targets while ignoring the model size and detection speed of the convolutional neural network, so the resulting detection models are difficult to deploy and use on mobile devices.
In real application environments, the characteristics and scenarios of specific targets must also be considered. For example, when detecting fruit on orchard trees, traits such as individual size and color vary over the growth cycle; even fruits of the same variety differ in appearance, pose and degree of occlusion, and fruits of different varieties differ further. Moreover, during growth, complex environmental factors such as light intensity, fertilization and irrigation affect the identification of fruit on the tree. A target detection algorithm therefore needs to accommodate variation of the target object in size, color and ambient conditions.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a dense small target detection method based on an improved YOLOv5 algorithm, which further improves the YOLOv5 algorithm. The ideas are: (1) a coordinate attention (CA) mechanism is added to the YOLOv5 backbone feature-extraction network, embedding positional information into channel attention so that the mobile network obtains information over a larger region without incurring large overhead; (2) in the feature-fusion network of YOLOv5, BiFPN replaces PANet, introducing learnable weights to better balance feature information at different scales; (3) for dense, mutually occluded small targets, the network is trained with Varifocal Loss, so that the network model can accurately identify targets that cluster and overlap over large areas. The method is robust to variation in object color, complex natural environmental conditions, and the like.
The technical scheme for realizing the purpose of the invention is as follows:
a dense small target detection method based on an improved YOLOv5 algorithm comprises the following steps:
S1, image acquisition: the user collects images of the target object with image acquisition equipment, names the collected images according to the Pascal VOC dataset format, and creates three folders named Annotations, ImageSets and JPEGImages;
s2, image preprocessing:
S2-1, image labeling: in the images collected in step S1, label the targets with the image annotation tool LabelImg, marking the position and the category name of each target;
S2-2, image augmentation: if the images collected by the user in step S1 cannot meet the requirement of 2000 images per target category for recognition, augment the images with the Augmentor image data enhancement library: the user selects the storage path of the images and the path of the annotation XML files, specifies output paths for the augmented images and XML files, selects the required image enhancers (e.g., enhancers for brightness, cropping and Gaussian noise), and chooses the augmentation quantity and mode (sequential, combined, random, etc.) to augment the images until the recognition requirement is met, as sketched below;
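By way of illustration, a minimal Augmentor sketch is given below; it is a sketch under assumptions, not the exact pipeline of the invention: the folder name and the 2000-image target are taken from this description, the enhancers and probabilities are illustrative, and since Augmentor's standard pipeline transforms images only, the Pascal VOC XML annotations would have to be updated separately (Augmentor also has no built-in Gaussian-noise operation, so that enhancer would need a custom operation).

```python
# Minimal augmentation sketch using the Augmentor library.
# Assumptions: the folder layout (JPEGImages), the sample count (2000) and the
# specific enhancers/probabilities are illustrative, not fixed by the method.
import Augmentor

p = Augmentor.Pipeline("JPEGImages", output_directory="augmented")
p.random_brightness(probability=0.5, min_factor=0.7, max_factor=1.3)  # brightness enhancer
p.crop_random(probability=0.3, percentage_area=0.9)                   # cropping enhancer
p.rotate(probability=0.5, max_left_rotation=10, max_right_rotation=10)
p.sample(2000)  # generate samples toward the ~2000-image requirement
```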
S2-3, dataset division: divide the augmented images and annotation files into a training set, a test set, a validation set and a train-val set; the training, test and validation sets account for 50%, 25% and 25% of the data respectively, and the train-val set is the union of the training and validation sets, i.e. 75% of the total (a split sketch follows);
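A minimal split sketch in the Pascal VOC convention is shown below; the ImageSets/Main location and the list-file names are assumptions following the standard VOC layout, since the method only fixes the 50/25/25 proportions.

```python
# Sketch: split image IDs 50/25/25 into train/test/val and write
# Pascal VOC-style list files under ImageSets/Main (assumed layout).
import os
import random

ids = [f[:-4] for f in os.listdir("JPEGImages") if f.endswith(".jpg")]
random.shuffle(ids)
n = len(ids)
train, test, val = ids[: n // 2], ids[n // 2 : 3 * n // 4], ids[3 * n // 4 :]
trainval = train + val  # train-val set: union of training and validation (75%)

os.makedirs("ImageSets/Main", exist_ok=True)
for name, subset in [("train", train), ("test", test),
                     ("val", val), ("trainval", trainval)]:
    with open(os.path.join("ImageSets/Main", name + ".txt"), "w") as fh:
        fh.write("\n".join(subset))
```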
S3, setting network model parameters: in the yaml configuration file of the YOLOv5 network model, set the input image size of the convolutional neural network, the number of recognition classes and the number of iterations according to the computer's RAM and GPU memory and the recognition quality and training speed required by the user; the user needs a graphics card model that supports CUDA acceleration;
S3-1, when the selected input image size is 608 × 608 (independent of the original image size), the batch parameter is 8, the number of epochs is 300 and there are 2 classes of detected objects, the user trains the model on a single GPU with at least 6 GB of GPU memory;
S3-2, when the selected input image size is 640 × 640 (independent of the original image size), the batch parameter is 8, the number of epochs is 300 and there are 2 classes of detected objects, the user trains the model on a single GPU with at least 8 GB of GPU memory;
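For reference, a hedged configuration sketch is given below; the file layout and command-line flags follow the public ultralytics/yolov5 repository (flag names vary slightly across releases), and the paths and class names are placeholders rather than values fixed by this method.

```python
# Sketch: write the dataset yaml consumed by YOLOv5's train.py.
# Assumptions: paths and class names are placeholders; the yaml keys
# (train/val/nc/names) follow the public ultralytics/yolov5 convention.
data_yaml = """\
train: images/train
val: images/val
nc: 2                      # number of detected object classes
names: ['class0', 'class1']
"""
with open("data.yaml", "w") as fh:
    fh.write(data_yaml)

# A typical single-GPU run matching the S3-2 settings (640 x 640, batch 8, 300 epochs):
#   python train.py --img 640 --batch 8 --epochs 300 \
#       --data data.yaml --cfg yolov5_improved.yaml --weights yolov5s.pt --device 0
```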
s4, improving the original YOLOv5 network structure to obtain an improved YOLOv5 network structure, wherein the improvement process is as follows:
S4-1: in the original YOLOv5 network structure, a CA coordinate attention mechanism is added after the 3rd, 6th and 9th layers; two 1D global pooling operations aggregate the input features along the vertical and horizontal directions into two separate direction-aware feature maps; these two feature maps, each embedding direction-specific information, are then encoded into two attention maps, each of which captures the long-range dependencies of the input feature map along one spatial direction, so the positional information is preserved in the generated attention maps; both attention maps are then applied to the input feature map by multiplication to emphasize the regions of interest (a module sketch follows);
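A minimal PyTorch sketch of such a coordinate attention block is given below, following the published CA design; the reduction ratio and the ReLU nonlinearity (the original CA paper uses h-swish) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention sketch: two 1D global poolings, a shared 1x1
    transform, and two direction-wise attention maps applied by multiplication."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width  -> (n, c, h, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height -> (n, c, 1, w)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)  # the CA paper uses h-swish here
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                      # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # attention along height
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # attention along width
        return x * a_h * a_w  # emphasize attended positions
```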
S4-2: in the original YOLOv5 network structure, the enhanced feature-fusion network BiFPN is used: starting from a simplified PANet, an extra edge is added wherever the input and output nodes are at the same level, so more features are fused at little additional cost; P5_in is upsampled and stacked with P4_in by Concat_bifpn to obtain P4_td; P4_td is upsampled and stacked with P3_in by Concat_bifpn to obtain P3_out; P3_out is downsampled and stacked with P4_in and P4_td by Concat_bifpn to obtain P4_out; P4_out is downsampled and stacked with P5_in by Concat_bifpn to obtain P5_out (a fusion sketch follows);
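A minimal sketch of the weighted fusion behind a Concat_bifpn-style node is given below, using the fast normalized fusion from the BiFPN paper; the module name, channel handling and fuse convolution are assumptions, since the internals of Concat_bifpn are not spelled out here.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion sketch: learnable non-negative
    weights balance same-shape feature maps before a 3x3 fuse convolution."""
    def __init__(self, num_inputs, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # one weight per input scale
        self.eps = eps
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, inputs):  # inputs: list of tensors with identical shapes
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)  # normalized weights balance the scales
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return self.fuse(fused)

# e.g. P4_out = WeightedFusion(3, c)([p4_in, p4_td, downsample(p3_out)])
```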
S4-3: the loss function is computed with Varifocal Loss to address the class imbalance problem, where p is the predicted IoU-aware classification score (IACS) and q is the target score; for positive samples in training, q is set to the IoU between the generated bbox and the ground-truth box (gt IoU); for negative samples in training, the training target q for all classes is 0, so training concentrates on candidate detections with higher IACS; α and γ are hyper-parameters, α being an adjustable scale factor used to balance the loss between positive and negative examples, with 0 ≤ α ≤ 1, so that training avoids over-attending to negative examples; the loss function weighs hard samples against easy samples and reduces the loss contribution of easy samples, since 0 ≤ p ≤ 1 and γ is set greater than 1 (a sketch follows);
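A minimal sketch of the Varifocal Loss just described is shown below, following the published formulation; the defaults α = 0.75 and γ = 2.0 are the values from the Varifocal Loss paper, not values fixed by this method.

```python
import torch

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-8):
    """Varifocal Loss sketch.
    p: predicted IoU-aware classification score (IACS) in [0, 1], after sigmoid
    q: target score (gt IoU for positive samples, 0 for negatives)
    """
    p = p.clamp(eps, 1.0 - eps)
    bce = -(q * torch.log(p) + (1.0 - q) * torch.log(1.0 - p))
    # positives are weighted by the target score q itself;
    # easy negatives are down-weighted by alpha * p**gamma (gamma > 1)
    weight = torch.where(q > 0, q, alpha * p.pow(gamma))
    return (weight * bce).sum()
```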
S5, training the network model: set the parameters in the improved YOLOv5 configuration files (train.py and yolov5.yaml), place the configured yaml file and the improved YOLOv5 network structure on a computer with the environment set up, train on the labeled images of the training and validation sets, and during training feed the held-out test-set images to the computer to evaluate the training effect at each stage; run tensorboard --logdir runs/train to monitor the mAP of the training in real time; after training finishes, save the trained network model weights (.pt file);
S6, detection with the trained network model weights: prepare the images to be detected on the computer, update the configuration file yaml, the trained weights and the path of the images to be detected in detect.py, and run it to obtain the detection results.
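As a usage illustration, a hedged inference sketch with trained weights is shown below; loading through torch.hub works for stock YOLOv5 weights, whereas weights containing the improved modules (CA, Concat_bifpn) would need the modified repository on the path so the custom layers can be deserialized; the paths and threshold are placeholders.

```python
import torch

# Sketch: run detection with trained weights via the public ultralytics/yolov5
# hub entry. Paths and the confidence threshold are illustrative placeholders;
# weights with custom modules (CA, Concat_bifpn) need the modified repo instead.
model = torch.hub.load("ultralytics/yolov5", "custom",
                       path="runs/train/exp/weights/best.pt")
model.conf = 0.25  # confidence threshold

results = model("test_image.jpg")  # image to be detected
results.print()                    # summary of detections per class
results.save()                     # annotated images saved under runs/detect
```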
According to the dense small target detection method based on the improved YOLOv5 algorithm, the YOLOv5 network structure is improved with a coordinate attention mechanism and enhanced feature fusion, so that smaller individual targets are better identified; with the Varifocal Loss function, targets that cluster and overlap densely over large areas can be accurately identified; compared with the prior art, the invention has the following advantages:
(1) For on-tree fruit recognition, the image dataset is trained with the improved YOLOv5 network structure; the trained model accurately identifies dense small targets, including targets clustered and overlapping over large areas.
(2) For target detection, the dataset is trained with the improved YOLOv5 network structure; the detection model obtained by training is small and can be adapted to various embedded devices.
(3) The method can be applied in outdoor natural environments, offers high recognition accuracy and speed, and meets real-time recognition requirements.
Drawings
FIG. 1 is a flow chart of a dense small target detection method based on the improved YOLOv5 algorithm;
FIG. 2 is a diagram of the CA coordinate attention mechanism;
FIG. 3 is a diagram of a BiFPN enhanced feature extraction network structure;
FIG. 4 is a graph showing the recognition effect of the improved YOLOv5 network model on mature citrus fruits;
FIG. 5 is a graph showing the recognition effect of the improved YOLOv5 network model on mature Nanfeng mandarin oranges;
FIG. 6 is a graph showing the recognition effect of the improved YOLOv5 network model on citrus in the growth period;
FIG. 7 is a graph showing the recognition effect of the improved YOLOv5 network model on Nanfeng mandarin orange in the growth period;
FIG. 8 is a graph showing the recognition effect of the improved YOLOv5 network model on large-area clustered citrus fruits.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Example:
in this embodiment, citrus and Nanfeng mandarin orange are taken as examples to identify fruits on citrus trees in an orchard.
A dense small target detection method based on an improved YOLOv5 algorithm is shown in FIG. 1 and comprises the following steps:
S1, image acquisition: the user uses a digital camera or other image acquisition equipment to collect images of citrus trees bearing fruit, names the images according to the Pascal VOC dataset format, and creates three folders named Annotations, ImageSets and JPEGImages;
s2, image preprocessing:
S2-1, image labeling: in the images collected in step S1, the image annotation tool LabelImg is used to label the citrus fruit, marking the position and the category name of each fruit. In this example, two varieties (categories), citrus and Nanfeng mandarin orange, are selected; they serve only as an illustration, and the method is not limited to these two varieties.
(1) when framing citrus, the label can be named orange; when framing Nanfeng mandarin orange, the label can be named sweet_orange;
(2) when framing densely clustered and overlapping citrus fruits, frame them one by one, drawing each bounding box accurately by hand;
(3) when a citrus fruit to be framed is occluded by more than 95%, the current target is discarded;
(4) when the pixel area of a framed fruit target is smaller than 8 × 8, the current target is discarded (an annotation-check sketch follows);
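A small sketch that checks rule (4) over the Pascal VOC annotations is given below; the 8 × 8 = 64-pixel threshold comes from the rule above, the folder name follows the layout of step S1, and the occlusion rule (3) still has to be judged by the annotator.

```python
# Sketch: flag VOC boxes whose pixel area is below the 8 x 8 threshold of rule (4).
import glob
import xml.etree.ElementTree as ET

MIN_AREA = 8 * 8
for xml_path in glob.glob("Annotations/*.xml"):
    for obj in ET.parse(xml_path).getroot().iter("object"):
        b = obj.find("bndbox")
        w = int(b.find("xmax").text) - int(b.find("xmin").text)
        h = int(b.find("ymax").text) - int(b.find("ymin").text)
        if w * h < MIN_AREA:
            print(f"{xml_path}: discard '{obj.find('name').text}' box ({w}x{h} px)")
```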
S2-2, image augmentation: if the images collected by the user in step S1 cannot meet the requirement of 2000 images per citrus variety for recognition, the user can augment the images with the Augmentor image data enhancement library; the user selects the image storage path and the annotation XML file path, specifies output paths for the augmented images and XML files, selects the required image enhancers (e.g., enhancers for brightness, cropping and Gaussian noise), and chooses the augmentation quantity and mode (e.g., sequential, combined, random) to augment the images until the recognition requirement is met.
S2-3, dataset division: divide the augmented images and annotation files into a training set, a test set, a validation set and a train-val set; the training, test and validation sets account for 50%, 25% and 25% of the data respectively, and the train-val set is the union of the training and validation sets, i.e. 75% of the total;
S3, setting network model parameters: in the yaml configuration file of the YOLOv5 network model, set the input image size of the convolutional neural network, the number of recognition classes, the number of iterations, etc., according to the computer's RAM and GPU memory and the recognition quality and training speed required by the user; the user needs a graphics card model that supports CUDA acceleration;
S3-1, when the selected input image size is 608 × 608 (independent of the original image size), the batch parameter is 8, the number of epochs is 300 and there are 2 classes of detected objects, the user trains the model on a single GPU with at least 6 GB of GPU memory;
S3-2, when the selected input image size is 640 × 640 (independent of the original image size), the batch parameter is 8, the number of epochs is 300 and there are 2 classes of detected objects, the user trains the model on a single GPU with at least 8 GB of GPU memory;
s4, improving the original YOLOv5 network structure to obtain an improved YOLOv5 network structure, wherein the improvement process is as follows:
S4-1, in the original YOLOv5 network structure, a CA coordinate attention mechanism is added after the 3rd, 6th and 9th layers; the CA coordinate attention mechanism is shown in FIG. 2. Two 1D global pooling operations aggregate the input features along the vertical and horizontal directions into two separate direction-aware feature maps. These two feature maps, each embedding direction-specific information, are then encoded into two attention maps, each of which captures the long-range dependencies of the input feature map along one spatial direction, so the positional information is preserved in the generated attention maps. Both attention maps are then applied to the input feature map by multiplication to emphasize the regions of interest.
S4-2, in the original YOLOv5 network structure, the enhanced feature-fusion network BiFPN is used; the BiFPN enhanced feature-fusion structure is shown in FIG. 3. Starting from a simplified PANet, an extra edge is added wherever the input and output nodes are at the same level, so more features are fused at little additional cost. P5_in is upsampled and stacked with P4_in by Concat_bifpn to obtain P4_td; P4_td is upsampled and stacked with P3_in by Concat_bifpn to obtain P3_out; P3_out is downsampled and stacked with P4_in and P4_td by Concat_bifpn to obtain P4_out; P4_out is downsampled and stacked with P5_in by Concat_bifpn to obtain P5_out.
S4-3, Varifocal Loss is used to compute the loss function and address the class imbalance problem, where p is the predicted IoU-aware classification score (IACS) and q is the target score; for positive samples in training, q is set to the IoU between the generated bbox and the ground-truth box (gt IoU); for negative samples in training, the training target q for all classes is 0, so training concentrates on candidate detections with higher IACS; α and γ are hyper-parameters, α being an adjustable scale factor used to balance the loss between positive and negative examples, with 0 ≤ α ≤ 1, so that training avoids over-attending to negative examples; the loss function weighs hard samples against easy samples and reduces the loss contribution of easy samples, since 0 ≤ p ≤ 1 and γ is set greater than 1:
VFL(p, q) = −q · (q · log(p) + (1 − q) · log(1 − p)),  if q > 0
VFL(p, q) = −α · p^γ · log(1 − p),                     if q = 0
S5, training the network model: set the parameters in the improved YOLOv5 configuration files (train.py and yolov5.yaml), place the configured yaml file and the improved YOLOv5 network structure on a computer with the environment set up, train on the labeled images of the training and validation sets, and during training feed the held-out test-set images to the computer to evaluate the training effect at each stage; run tensorboard --logdir runs/train to monitor the mAP of the training in real time; after training finishes, save the trained network model weights (.pt file).
S6, recognition with the trained network model weights: prepare the fruit images to be detected on the computer, update the configuration file yaml, the trained weights and the path of the images to be detected in detect.py, and run it to obtain the detection results.
The above scheme was used to identify fruits on citrus and Nanfeng mandarin orange trees at different growth stages; the recognition results are shown in FIG. 4, FIG. 5, FIG. 6, FIG. 7 and FIG. 8. The result graphs show that, for on-tree fruit recognition, training the two fruit datasets with this method yields a model that accurately identifies individual fruit targets of small size as well as fruit targets clustered and overlapping over large areas.

Claims (2)

1. A dense small target detection method based on an improved YOLOv5 algorithm is characterized by comprising the following steps:
S1, image acquisition: the user collects images of the target object with image acquisition equipment, names the collected images according to the Pascal VOC dataset format, and creates three folders named Annotations, ImageSets and JPEGImages;
s2, image preprocessing:
S2-1, image labeling: in the images collected in step S1, label the targets with the image annotation tool LabelImg, marking the position and the category name of each target;
S2-2, image augmentation: if the images collected by the user in step S1 cannot meet the requirement of 2000 images per target category for recognition, augment the images with the Augmentor image data enhancement library: the user selects the storage path of the images and the path of the annotation XML files, specifies output paths for the augmented images and XML files, selects the required image enhancers, and chooses the augmentation quantity and mode to augment the images until the recognition requirement is met;
S2-3, dataset division: divide the augmented images and annotation files into a training set, a test set, a validation set and a train-val set; the training, test and validation sets account for 50%, 25% and 25% of the data respectively, and the train-val set is the union of the training and validation sets, i.e. 75% of the total;
S3, setting network model parameters: in the yaml configuration file of the YOLOv5 network model, set the input image size of the convolutional neural network, the number of recognition classes and the number of iterations according to the computer's RAM and GPU memory and the recognition quality and training speed required by the user; the user needs a graphics card model that supports CUDA acceleration;
S3-1, when the selected input image size is 608 × 608, the batch parameter is 8, the number of epochs is 300 and there are 2 classes of detected objects, the user trains the model on a single GPU with at least 6 GB of GPU memory;
S3-2, when the selected input image size is 640 × 640, the batch parameter is 8, the number of epochs is 300 and there are 2 classes of detected objects, the user trains the model on a single GPU with at least 8 GB of GPU memory;
s4, improving the original YOLOv5 network structure to obtain an improved YOLOv5 network structure, wherein the improvement process is as follows:
S4-1: in the original YOLOv5 network structure, a CA coordinate attention mechanism is added after the 3rd, 6th and 9th layers; two 1D global pooling operations aggregate the input features along the vertical and horizontal directions into two separate direction-aware feature maps; these two feature maps, each embedding direction-specific information, are then encoded into two attention maps, each of which captures the long-range dependencies of the input feature map along one spatial direction, so the positional information is preserved in the generated attention maps; both attention maps are then applied to the input feature map by multiplication to emphasize the regions of interest;
S4-2: in the original YOLOv5 network structure, the enhanced feature-fusion network BiFPN is used: starting from a simplified PANet, an extra edge is added wherever the input and output nodes are at the same level, so more features are fused at little additional cost; P5_in is upsampled and stacked with P4_in by Concat_bifpn to obtain P4_td; P4_td is upsampled and stacked with P3_in by Concat_bifpn to obtain P3_out; P3_out is downsampled and stacked with P4_in and P4_td by Concat_bifpn to obtain P4_out; P4_out is downsampled and stacked with P5_in by Concat_bifpn to obtain P5_out;
S4-3: the loss function is computed with Varifocal Loss to address the class imbalance problem, where p is the predicted IoU-aware classification score (IACS) and q is the target score; for positive samples in training, q is set to the IoU between the generated bbox and the ground-truth box (gt IoU); for negative samples in training, the training target q for all classes is 0, so training concentrates on candidate detections with higher IACS; α and γ are hyper-parameters, α being an adjustable scale factor used to balance the loss between positive and negative examples, with 0 ≤ α ≤ 1, so that training avoids over-attending to negative examples; the loss function weighs hard samples against easy samples and reduces the loss contribution of easy samples, since 0 ≤ p ≤ 1 and γ is set greater than 1;
S5, training the network model: set the parameters in the improved YOLOv5 configuration files (train.py and yolov5.yaml), place the configured yaml file and the improved YOLOv5 network structure on a computer with the environment set up, train on the labeled images of the training and validation sets, and during training feed the held-out test-set images to the computer to evaluate the training effect at each stage; run tensorboard --logdir runs/train to monitor the mAP of the training in real time; after training finishes, save the trained network model weights (.pt file);
S6, detection with the trained network model weights: prepare the images to be detected on the computer, update the configuration file yaml, the trained weights and the path of the images to be detected in detect.py, and run it to obtain the detection results.
2. The dense small target detection method based on the improved YOLOv5 algorithm according to claim 1, wherein in step S2-2, the enhancers comprise a brightness enhancer, a cropping enhancer and a Gaussian noise enhancer, and the augmentation modes comprise sequential, combined and random modes.
CN202111474306.7A 2021-12-06 2021-12-06 Dense small target detection method based on improved YOLOv5 Pending CN114140665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111474306.7A CN114140665A (en) 2021-12-06 2021-12-06 Dense small target detection method based on improved YOLOv5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111474306.7A CN114140665A (en) 2021-12-06 2021-12-06 Dense small target detection method based on improved YOLOv5

Publications (1)

Publication Number Publication Date
CN114140665A true CN114140665A (en) 2022-03-04

Family

ID=80383824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111474306.7A Pending CN114140665A (en) 2021-12-06 2021-12-06 Dense small target detection method based on improved YOLOv5

Country Status (1)

Country Link
CN (1) CN114140665A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332849A (en) * 2022-03-16 2022-04-12 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium
CN114332849B (en) * 2022-03-16 2022-08-16 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium
CN114998605A (en) * 2022-05-10 2022-09-02 北京科技大学 Target detection method for image enhancement guidance under severe imaging condition
CN114998605B (en) * 2022-05-10 2023-01-31 北京科技大学 Target detection method for image enhancement guidance under severe imaging condition
US11790640B1 (en) * 2022-06-22 2023-10-17 Ludong University Method for detecting densely occluded fish based on YOLOv5 network
CN115063795A (en) * 2022-08-17 2022-09-16 西南民族大学 Urinary sediment classification detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Jia et al. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot
CN107016405B (en) A kind of pest image classification method based on classification prediction convolutional neural networks
CN114140665A (en) Dense small target detection method based on improved YOLOv5
CN103049763B (en) Context-constraint-based target identification method
CN107346420A (en) Text detection localization method under a kind of natural scene based on deep learning
CN107273832B (en) License plate recognition method and system based on integral channel characteristics and convolutional neural network
CN109522889A (en) Hydrological ruler water level identification and estimation method based on image analysis
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN105574550A (en) Vehicle identification method and device
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN104156734A (en) Fully-autonomous on-line study method based on random fern classifier
CN113128335B (en) Method, system and application for detecting, classifying and finding micro-living ancient fossil image
CN108596038A Method for recognizing erythrocytes in feces combining morphological segmentation and a neural network
CN113191334B (en) Plant canopy dense leaf counting method based on improved CenterNet
CN111178177A (en) Cucumber disease identification method based on convolutional neural network
CN110059539A (en) A kind of natural scene text position detection method based on image segmentation
CN109977899B (en) Training, reasoning and new variety adding method and system for article identification
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN110599463A (en) Tongue image detection and positioning algorithm based on lightweight cascade neural network
CN108734200A (en) Human body target visible detection method and device based on BING features
CN114758132B (en) Fruit tree disease and pest identification method and system based on convolutional neural network
CN109615610B (en) Medical band-aid flaw detection method based on YOLO v2-tiny
Zheng et al. Single shot multibox detector for urban plantation single tree detection and location with high-resolution remote sensing imagery
CN104008374B (en) Miner's detection method based on condition random field in a kind of mine image
CN112364687A (en) Improved Faster R-CNN gas station electrostatic sign identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination