CN113869412B - Image target detection method combining lightweight attention mechanism and YOLOv network - Google Patents
- Publication number
- CN113869412B CN113869412B CN202111141568.1A CN202111141568A CN113869412B CN 113869412 B CN113869412 B CN 113869412B CN 202111141568 A CN202111141568 A CN 202111141568A CN 113869412 B CN113869412 B CN 113869412B
- Authority
- CN
- China
- Legal status: Active
Classifications
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
Abstract
The invention discloses an image target detection method combining a lightweight attention mechanism with the YOLOv3 network, covering the training process of the combined target detection algorithm. The lightweight attention mechanism is integrated into the YOLOv3 network to strengthen feature extraction; a depth separable convolution module is incorporated to raise the efficiency of the algorithm and further improve detection precision; and the multi-scale fusion method of the traditional YOLOv3 network is retained to improve the feature extraction capability and overall performance of the model. By combining the lightweight attention mechanism, depth separable convolution and multi-scale fusion within the YOLOv3 network, a target detection method with higher recognition accuracy is designed that effectively completes the task of target detection in images, automatically extracts image features, and achieves higher detection precision while improving efficiency.
Description
Technical Field
The invention relates to the technical field of target detection algorithm research in computer vision, and in particular to an image target detection method combining a lightweight attention mechanism with the YOLOv3 network.
Background
Target detection is an important branch of computer vision and one of its most fundamental problems. It consists of identifying the spatial location or extent of objects of specified categories (such as people, dogs, zebras, elephants and cars) in given images. Target detection also plays a central role in artificial intelligence and information technology, especially in robot vision, face recognition, automatic driving and intelligent monitoring. The main challenges today are accuracy and efficiency, and how to improve efficiency while preserving accuracy is a major focus of current research. Existing target detection algorithms fall into two main classes: single-stage detection frameworks (one-stage detectors) and two-stage detection frameworks (two-stage detectors). A single-stage method computes directly on the complete image to produce detections, whereas a two-stage method first preprocesses the image to extract candidate boxes and then refines them to obtain the final detection result. Two-stage detection is generally more accurate but slower. Commonly used two-stage methods include the region-based convolutional neural network (R-CNN), the Fast R-CNN algorithm, the spatial pyramid pooling network (SPP-Net) and the multi-region convolutional neural network (MR-CNN). Among these, R-CNN is the cornerstone of two-stage target detection and its most representative algorithm.
In its preprocessing stage, a selective search algorithm selects candidate boxes of interest, after which the spatial position of the target is located through a convolutional neural network, a support vector machine and a regression method. Commonly used single-stage detectors include the Single Shot MultiBox Detector (SSD), YOLO, YOLOv2 and YOLOv3. Current research on target detection algorithms can be roughly divided into three directions: improving two-stage algorithms, improving single-stage methods, and combining the two. Although these algorithms perform well in target detection studies, there is room for further improvement, which motivates the target detection method proposed here, combining a lightweight attention mechanism with the YOLOv3 network.
The R-CNN algorithm is a two-stage model and the earliest deep learning method proposed for target detection. It first uses a selective search algorithm to measure feature similarity between adjacent image blocks in the original image, scores similar blocks, and selects candidate boxes over image regions of interest. These are fed as samples into a trained CNN; after feature extraction the features pass through a fully connected layer, and an SVM classifier together with a linear regression model is trained to complete the final target detection task.
Although the R-CNN algorithm improves considerably on conventional target detection methods, and the trained CNN performs well at image feature extraction, its running time is high: candidate-region generation in the first stage uses a conventional method, and when an image yields a large number of candidate regions, the forward computation of the CNN multiplies accordingly, because every candidate region undergoes feature extraction separately. These repeated operations limit the performance of the R-CNN algorithm.
The Fast-R-CNN algorithm improves on the R-CNN algorithm, with the main goal of reducing its running time. Like R-CNN, it relies on generating proposal regions, but the difference is that candidate boxes are no longer passed through the neural network one by one; instead the whole image is passed through the convolutional neural network once to extract features, and the candidate regions are then mapped onto the resulting feature map and fused in a pooling layer. In summary, the most important contributions of Fast-R-CNN are the region-of-interest pooling layer and parallel multi-task training.
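The ROI pooling idea described above can be sketched as follows. This is a simplified, hedged illustration (integer bin boundaries, max pooling over a single-channel 2-D map, hypothetical function names), not the exact implementation used by Fast-R-CNN or by the patent:

```python
# Sketch of ROI max pooling: a region of the shared feature map is divided
# into a fixed out_h x out_w grid of bins, and each bin is max-pooled, so
# regions of any size yield a fixed-size feature. Simplified for illustration.

def roi_max_pool(feature_map, roi, out_h, out_w):
    """feature_map: 2-D list of numbers; roi: (x1, y1, x2, y2) in map coords."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Integer bin boundaries; each bin covers at least one cell.
            ys = y1 + i * h // out_h
            ye = max(ys + 1, y1 + (i + 1) * h // out_h)
            xs = x1 + j * w // out_w
            xe = max(xs + 1, x1 + (j + 1) * w // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

For example, pooling the full extent of a 4 × 4 map into a 2 × 2 grid keeps the maximum of each quadrant, regardless of the region's original size.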
The Fast-R-CNN algorithm has several disadvantages:
1) Like R-CNN, Fast-R-CNN must first select regions of interest and then perform feature extraction; this selection step can only run on a CPU, wasting a great amount of time.
2) Because of this running-time limitation, the Fast-R-CNN algorithm cannot be used in real-time applications, and it does not truly achieve end-to-end training and testing.
The SSD algorithm is a one-stage target detection algorithm whose feature extractor is the VGG-16 network. Given an input image, SSD first applies several convolution layers to obtain feature maps of different sizes, estimates local feature information in these maps with convolution kernels, and simultaneously computes the spatial position and classification probability of the targets to be detected. Because SSD operates over many positions in the image and the sizes of the resulting bounding boxes are inconsistent, redundant boxes appear. To solve this, SSD applies non-maximum suppression to merge bounding boxes with high overlap, and introduces hard negative mining to keep positive and negative samples balanced.
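The non-maximum suppression step mentioned above can be illustrated with a minimal sketch. The box format `(x1, y1, x2, y2, score)`, the threshold value and the function names are illustrative assumptions, not details taken from SSD or the patent:

```python
# Minimal non-maximum suppression: repeatedly keep the highest-scoring box
# and drop every remaining box that overlaps it too strongly.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, iou_threshold=0.5):
    """boxes: list of (x1, y1, x2, y2, score); returns the kept boxes."""
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining
                     if iou(best[:4], b[:4]) < iou_threshold]
    return kept
```

Two nearly coincident detections of the same object are thus merged into the single higher-scoring box, while well-separated detections survive.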
The SSD algorithm has the following disadvantages:
1) The features SSD extracts are relatively shallow, so it often handles samples of lower resolution poorly.
2) Certain parameters in the SSD algorithm must be set by hand and cannot be learned through training, so the debugging process depends heavily on experience and carries some randomness, giving the method poor generalization ability.
The Faster-R-CNN algorithm is obtained by optimizing on the basis of the Fast-R-CNN model, and is therefore also a two-stage target detection method. It combines a region proposal generation module with the Fast-R-CNN module to complete the target detection task. The Fast-R-CNN module computes the feature mapping of the input image and extracts features on that basis. The region proposal module adopts a sliding-window strategy to generate candidate regions on the convolved feature map, and the candidate regions are finally passed through an ROI pooling layer to a fully connected layer for the final fusion operation. The Faster-R-CNN algorithm thereby achieves end-to-end training and improves the detection efficiency of the model.
The Faster-R-CNN algorithm has the following disadvantages:
1) Because the Faster-R-CNN algorithm divides the training process into two phases, it cannot meet real-time efficiency requirements.
2) The Faster-R-CNN algorithm performs poorly on small targets, chiefly because its final prediction uses a single deep feature map, which yields poor generalization ability across scales.
Disclosure of Invention
The invention aims to provide an image target detection method combining a lightweight attention mechanism with the YOLOv3 network, which effectively completes the task of target detection in images, automatically extracts image features, and achieves higher detection precision while improving efficiency.
In order to achieve the above purpose, the present invention provides the following technical solution: an image target detection method combining a lightweight attention mechanism and the YOLOv3 network, characterized by the following training process of the target detection algorithm.
The training process is divided into two phases. The first phase performs feature extraction on the input image at multiple scales, using residual structures built from depth separable convolution and an attention mechanism; the second phase fuses the multi-scale features trained in the first phase and outputs the final prediction. The specific training process is as follows:
Step 1: the network initializes its weights;
Step 2: the input image undergoes multi-scale feature extraction;
Step 3: at each scale, a downsampled feature map is obtained through the depth separable convolution layers and the residual module with the attention mechanism;
Step 4: the features at each scale are passed through a convolution layer to output a prediction;
Step 5: the per-scale output predictions are fused to form the final prediction model.
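The five steps above can be sketched as a skeleton. Every function body below is a stand-in (the patent specifies only the structure: weight initialization, multi-scale extraction, per-scale prediction heads, fusion), so the scale values and numeric outputs are meaningless placeholders:

```python
# Skeleton of the five-step training/inference flow; all computations are
# placeholders standing in for the real backbone and prediction heads.

import random

def init_weights(seed=0):                    # step 1: initialise weights
    random.seed(seed)

def extract_multiscale(image, scales=(52, 26, 13)):   # steps 2-3
    """Stand-in for the separable-conv + attention backbone: one 'feature
    map' (here just a scalar summary) per downsampling scale."""
    return {s: sum(image) / (s ** 2) for s in scales}

def predict_per_scale(features):             # step 4: per-scale heads
    return {s: f * random.random() for s, f in features.items()}

def fuse_predictions(preds):                 # step 5: fuse the scales
    return sum(preds.values()) / len(preds)

init_weights()
features = extract_multiscale(image=[1.0] * 416 * 416)
prediction = fuse_predictions(predict_per_scale(features))
```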
Preferably, the method includes a depth separable convolution structure. This structure implements part of the feature extraction function and is the key module of the lightweight design: in standard convolution, the spatial convolution and the combination of feature channels are performed simultaneously, whereas depth separable convolution separates these two parts into a depthwise convolution process and a pointwise convolution process, and this grouped convolution greatly reduces the computation and parameter count, achieving the lightweight goal. For an input feature map of size D_F × D_F × M, the operation decomposes into a depthwise convolution with M kernels of size D_K × D_K × 1 and a pointwise convolution with N kernels of size 1 × 1 × M, so the computational cost is
O1 = D_K · D_K · M · D_F · D_F + M · N · D_F · D_F
For the traditional standard convolution process, the cost under the same input is
O2 = D_K · D_K · M · N · D_F · D_F
Comparing the two gives the ratio
O1 / O2 = 1/N + 1/(D_K · D_K)
so when the convolution kernel size is 3 × 3, depth separable convolution reduces the computation by nearly 9 times, effectively improving the efficiency of the model.
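The cost comparison can be checked numerically. Symbols follow the text (D_K kernel size, D_F feature-map size, M input channels, N output channels); the concrete sizes chosen below are arbitrary examples, not values from the patent:

```python
# Numeric check of the two cost formulas for one convolution layer.

def separable_cost(dk, df, m, n):
    # depthwise pass (dk*dk*m per position) + 1x1 pointwise pass (m*n)
    return dk * dk * m * df * df + m * n * df * df

def standard_cost(dk, df, m, n):
    return dk * dk * m * n * df * df

dk, df, m, n = 3, 56, 64, 128          # example layer sizes
ratio = separable_cost(dk, df, m, n) / standard_cost(dk, df, m, n)
# ratio = 1/N + 1/DK^2, i.e. close to a 9x reduction for a 3x3 kernel
```

With a 3 × 3 kernel the ratio is dominated by the 1/9 term, matching the "nearly 9 times" claim in the text.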
Preferably, the residual structure with the attention mechanism is the other part of the feature extraction process and improves the feature extraction performance of the backbone network. An input feature image U first undergoes a pointwise convolution, then a 3 × 3 depthwise convolution, giving the extracted feature map F; the SE-Block attention module is then applied to obtain a new map F1, and finally F and F1 are summed to give the output feature map V. The attention mechanism optimizes the coupling of the channel and spatial domains and guides the feature extraction network to learn the regions of interest.
Here F_tr(·, θ) denotes the convolution mapping operation, specifically:
F_tr: X → U, X ∈ R^(H′×W′×C′), U ∈ R^(H×W×C)
F_sq(·) denotes the Squeeze operation, i.e. the compression operation, which averages each channel over its spatial extent:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
F_ex(·, W) denotes the Excitation operation, specifically:
F_ex(z, W) = σ(g(z, W)) = σ(W2 · ReLU(W1 · z))
where z is the output of the compression operation, the activation function σ is the Sigmoid, and r is a hyperparameter (the channel reduction ratio of W1 and W2). The final output is written as X̃ = F_scale(u, s) = s · u, where u and s are the output of the convolution operation and the output of the excitation operation, respectively.
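The Squeeze-Excitation-scale pipeline above can be sketched from scratch on a toy feature map. The weight matrices W1 and W2 here are fixed values supplied by the caller (in the real module they are learned), and the list-based tensor layout is purely illustrative:

```python
# From-scratch sketch of SE-Block: F_sq (global average pool per channel),
# F_ex (bottleneck FC -> ReLU -> FC -> Sigmoid), F_scale (channel rescaling).

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(u, w1, w2):
    """u: list of C channels, each an H x W list of lists.
    w1: (C/r) x C weights, w2: C x (C/r) weights, r the reduction ratio."""
    h, w = len(u[0]), len(u[0][0])
    # F_sq: squeeze each channel to one number by global average pooling
    z = [sum(sum(row) for row in ch) / (h * w) for ch in u]
    # F_ex: bottleneck FC -> ReLU, then FC -> Sigmoid
    hid = [max(0.0, sum(wr[c] * z[c] for c in range(len(z)))) for wr in w1]
    s = [sigmoid(sum(wr[j] * hid[j] for j in range(len(hid)))) for wr in w2]
    # F_scale: rescale every channel by its attention weight s_c
    return [[[s[c] * v for v in row] for row in u[c]] for c in range(len(u))]
```

With C = 2 and r = 2, a channel with larger average activation is rescaled by the same learned gate as its neighbours but keeps its relative magnitude, which is the recalibration effect described in the text.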
Preferably, a common cross entropy function is adopted as the loss function of the prediction model; the difference between the predicted value and the true value is measured by the cross entropy, expressed as
L = −[ y · log y′ + (1 − y) · log(1 − y′) ]
where y is the true label and y′ is the predicted probability that the sample belongs to the class. To further balance the weight distribution of hard samples in actual detection, the overall loss function of the improved network re-weights this cross entropy term so that hard, easily misclassified samples contribute more to the loss.
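The cross entropy above can be written as a small numeric sketch (binary case; the clamping epsilon is an implementation detail added here for numerical safety, not stated in the text):

```python
# Binary cross entropy between true label y (0 or 1) and predicted
# probability p, clamped away from 0 and 1 to avoid log(0).

import math

def cross_entropy(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

A confident correct prediction (y = 1, p = 0.9) incurs a much smaller loss than an uncertain one (y = 1, p = 0.5), which is the gradient signal the training process relies on.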
Preferably, the SE-Block module recalibrates the feature relationships in the network through its compression and excitation processes, increasing the weights of effective channels and decreasing those of ineffective or weakly effective channels.
Preferably, the depth separable convolution corresponds to the conv2d operation of the operator stage.
Preferably, the residual module with the attention mechanism corresponds to the bneck operation of the operator stage.
Compared with the prior art, the invention has the following beneficial effects:
The method combines a lightweight attention mechanism with the YOLOv3 network to improve feature extraction capability; a depth separable convolution module is integrated into the network to raise algorithmic efficiency and further improve detection accuracy; and the multi-scale fusion method of the traditional YOLOv3 network is used to strengthen the model's feature extraction and overall performance. By combining the lightweight attention mechanism, depth separable convolution and multi-scale fusion within the YOLOv3 network, a target detection method with higher recognition accuracy is designed that effectively completes the task of target detection in images, automatically extracts image features, and achieves higher detection precision while improving efficiency.
Drawings
FIG. 1 is a training process for the lightweight attention mechanism and object detection of YOLOv networks of the present invention;
FIG. 2 is a sample of a face object detection image of the present invention;
FIG. 3 is a diagram of a depth separable convolution structure of the present invention;
FIG. 4 is a residual structure of the attention mechanism of the present invention;
FIG. 5 is a schematic view of the SE-Block structure;
FIG. 6 shows the variation curves of different models of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to figs. 1-6, an image target detection method combining a lightweight attention mechanism and the YOLOv3 network includes the following training process of the target detection algorithm.
The training process is divided into two phases. The first phase performs feature extraction on the input image at multiple scales, using residual structures built from depth separable convolution and an attention mechanism; the second phase fuses the multi-scale features trained in the first phase and outputs the final prediction. The specific training process is as follows:
Step 1: the network initializes its weights;
Step 2: the input image undergoes multi-scale feature extraction;
Step 3: at each scale, a downsampled feature map is obtained through the depth separable convolution layers and the residual module with the attention mechanism;
Step 4: the features at each scale are passed through a convolution layer to output a prediction;
Step 5: the per-scale output predictions are fused to form the final prediction model.
In this embodiment, the method includes a depth separable convolution structure. This structure implements part of the feature extraction function and is the key module of the lightweight design: in standard convolution, the spatial convolution and the combination of feature channels are performed simultaneously, whereas depth separable convolution separates these two parts into a depthwise convolution process and a pointwise convolution process, and this grouped convolution greatly reduces the computation and parameter count, achieving the lightweight goal. For an input feature map of size D_F × D_F × M, the operation decomposes into a depthwise convolution with M kernels of size D_K × D_K × 1 and a pointwise convolution with N kernels of size 1 × 1 × M, so the computational cost is
O1 = D_K · D_K · M · D_F · D_F + M · N · D_F · D_F
For the traditional standard convolution process, the cost under the same input is
O2 = D_K · D_K · M · N · D_F · D_F
Comparing the two gives the ratio
O1 / O2 = 1/N + 1/(D_K · D_K)
so when the convolution kernel size is 3 × 3, depth separable convolution reduces the computation by nearly 9 times, effectively improving the efficiency of the model.
In this embodiment, the residual structure with the attention mechanism is the other part of the feature extraction process and improves the feature extraction performance of the backbone network. An input feature image U first undergoes a pointwise convolution, then a 3 × 3 depthwise convolution, giving the extracted feature map F; the SE-Block attention module is then applied to obtain a new map F1, and finally F and F1 are summed to give the output feature map V. The attention mechanism optimizes the coupling of the channel and spatial domains and guides the feature extraction network to learn the regions of interest.
Here F_tr(·, θ) denotes the convolution mapping operation, specifically:
F_tr: X → U, X ∈ R^(H′×W′×C′), U ∈ R^(H×W×C)
F_sq(·) denotes the Squeeze operation, i.e. the compression operation, which averages each channel over its spatial extent:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
F_ex(·, W) denotes the Excitation operation, specifically:
F_ex(z, W) = σ(g(z, W)) = σ(W2 · ReLU(W1 · z))
where z is the output of the compression operation, the activation function σ is the Sigmoid, and r is a hyperparameter (the channel reduction ratio of W1 and W2). The final output is written as X̃ = F_scale(u, s) = s · u, where u and s are the output of the convolution operation and the output of the excitation operation, respectively.
In this embodiment, a common cross entropy function is adopted as the loss function of the prediction model; the difference between the predicted value and the true value is measured by the cross entropy, expressed as
L = −[ y · log y′ + (1 − y) · log(1 − y′) ]
where y is the true label and y′ is the predicted probability that the sample belongs to the class. To further balance the weight distribution of hard samples in actual detection, the overall loss function of the improved network re-weights this cross entropy term so that hard, easily misclassified samples contribute more to the loss.
In this embodiment, the SE-Block module recalibrates the feature relationships in the network through its compression and excitation processes, increasing the weights of effective channels and decreasing those of ineffective or weakly effective channels.
In this embodiment, the depth separable convolution corresponds to the conv2d operation of the operator stage.
In this embodiment, the residual module with the attention mechanism corresponds to the bneck operation of the operator stage.
The lightweight attention mechanism is combined with the YOLOv3 network to improve feature extraction capability; the depth separable convolution module is integrated into the network to raise algorithmic efficiency and further improve detection precision; and the multi-scale fusion method of the traditional YOLOv3 network is used to strengthen the model's feature extraction and overall performance. By combining the lightweight attention mechanism, depth separable convolution and multi-scale fusion within the YOLOv3 network, a target detection method with higher recognition accuracy is designed that effectively completes the task of target detection in images, automatically extracts image features, and achieves higher detection precision while improving efficiency.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (5)
1. An image target detection method combining a lightweight attention mechanism and the YOLOv3 network, characterized by comprising the following training process of the target detection algorithm:
The training process is divided into two phases. The first phase performs feature extraction on the input image at multiple scales, using a depth separable convolution structure and a residual structure with an attention mechanism; the second phase fuses the multi-scale features trained in the first phase and outputs the final prediction. The specific training process is as follows:
Step 1: the network initializes its weights;
Step 2: the input image undergoes multi-scale feature extraction;
Step 3: at each scale, a downsampled feature map is obtained through the depth separable convolution layers and the residual module with the attention mechanism;
Step 4: the features at each scale are passed through a convolution layer to output a prediction;
Step 5: the per-scale output predictions are fused to form the final prediction model,
Wherein the depth separable convolution structure implements part of the feature extraction function and is the key module of the lightweight design: in standard convolution, the spatial convolution and the combination of feature channels are performed simultaneously, whereas depth separable convolution separates these two parts into a depthwise convolution process and a pointwise convolution process, and this grouped convolution greatly reduces the computation and parameter count, achieving the lightweight goal; for an input feature map of size D_F × D_F × M, the operation decomposes into a depthwise convolution of size D_K × D_K × 1 × M and a pointwise convolution of size 1 × 1 × M × N, so the computational cost is
O1 = D_K · D_K · M · D_F · D_F + M · N · D_F · D_F,
Wherein the residual structure of the attention mechanism is another part of the feature extraction process, and is used for improving the feature extraction performance on the backbone network, as an input feature image U, firstly performing point convolution operation, then performing depth convolution operation with the size of 3×3 to obtain a graph F after feature extraction, then collecting the attention mechanism SE-Block module to obtain a new graph F1, and finally summing the graphs F and F1 to obtain a final output feature graph V, specifically, the attention mechanism can optimize the connection of a channel domain and a space domain and can induce the feature extraction network to learn the region of interest,
Wherein: f tr (, θ) represents a convolution mapping operation, specifically:
Ftr:X→U,X∈RH′×W′×C′,U∈RH×W×C
Wherein: f sq (·) represents a squeze operation, i.e. a compression operation, in particular
Wherein: f ex (. Cndot.w) represents the expression operation, i.e. the Excitation operation, in particular
Fex(z,W)=σ(g(z,W))=σ(W2ReLU(W1z))
Wherein: z is the output of the compression operation, the activation function σ is the Sigmoid, and r is the reduction-ratio hyperparameter. The final output is written as x̃_c = F_scale(u_c, s_c) = s_c · u_c, where u_c and s_c are the output of the convolution operation and the output of the excitation operation, respectively.
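The Squeeze, Excitation, and scale steps above, together with the residual sum V = F + F1 described in the claim, can be sketched in NumPy. This is a minimal sketch under stated assumptions: the weight matrices `w1` and `w2` are illustrative stand-ins for the two fully connected layers (with reduction ratio r implicit in their shapes), not trained parameters.

```python
import numpy as np

def se_block(u, w1, w2):
    """Squeeze-and-Excitation over a feature map u of shape (H, W, C).

    w1: (C, C//r) reduction weights, w2: (C//r, C) expansion weights;
    both illustrative, with r the reduction-ratio hyperparameter.
    """
    # Squeeze: global average pooling over the spatial dimensions -> z, shape (C,).
    z = u.mean(axis=(0, 1))
    # Excitation: two FC layers, ReLU then Sigmoid -> channel weights s in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ w1, 0.0) @ w2)))
    # Scale: reweight each channel of u by its attention score (broadcast over H, W).
    return u * s

def se_residual(f, w1, w2):
    # Residual form used in the claim: V = F + F1, where F1 = SE(F).
    return f + se_block(f, w1, w2)
```

Because the Sigmoid output lies in (0, 1), the scaled map can only attenuate channels, which is how the block raises the relative weight of informative channels.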
2. The method for detecting an image object combining a lightweight attention mechanism and a YOLOv network according to claim 1, wherein a common cross-entropy function is adopted as the loss function of the prediction model, and the difference between the predicted value and the true value is calculated with the cross entropy, expressed as follows:
L = −[y log y′ + (1 − y) log(1 − y′)],
wherein y represents the true label and y′ represents the probability that the sample belongs to a certain class; to further balance the weight distribution of difficult samples in actual detection, the overall loss function expression of the improved network is as follows:
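The cross-entropy term of claim 2 can be written directly from the definitions of y and y′; this sketch adds only a small clamp (an assumption, not in the claim) to keep log(0) finite at the endpoints.

```python
import math

def binary_cross_entropy(y, y_pred, eps=1e-12):
    # L = -[y*log(y') + (1-y)*log(1-y')], with y' clamped away from 0 and 1
    # so the logarithms stay finite (the clamp is an implementation detail).
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y * math.log(y_pred) + (1.0 - y) * math.log(1.0 - y_pred))
```

A confident correct prediction gives a loss near zero, while a maximally uncertain prediction of 0.5 gives log 2; the claim's "improved" loss reweights exactly these easy and hard cases.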
3. The method of claim 1, wherein the SE-Block module calibrates the feature relationships in the network by compression and excitation processes to increase the effective weight and decrease the ineffective or less effective weight.
4. The method of image object detection combining a lightweight attention mechanism and a YOLOv network according to claim 1, wherein the depthwise separable convolution corresponds to the conv2d operation in the operator stage.
5. The method of image object detection combining a lightweight attention mechanism and a YOLOv network according to claim 1, wherein the residual module of the attention mechanism corresponds to the bneck operation in the operator stage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111141568.1A CN113869412B (en) | 2021-09-28 | 2021-09-28 | Image target detection method combining lightweight attention mechanism and YOLOv network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113869412A CN113869412A (en) | 2021-12-31 |
CN113869412B true CN113869412B (en) | 2024-06-07 |
Family
ID=78991824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111141568.1A Active CN113869412B (en) | 2021-09-28 | 2021-09-28 | Image target detection method combining lightweight attention mechanism and YOLOv network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113869412B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115063478B (en) * | 2022-05-30 | 2024-07-12 | 华南农业大学 | Fruit positioning method, system, equipment and medium based on RGB-D camera and visual positioning |
CN117809294B (en) * | 2023-12-29 | 2024-07-19 | 天津大学 | Text detection method based on feature correction and difference guiding attention |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101434A (en) * | 2020-09-04 | 2020-12-18 | 河南大学 | Infrared image weak and small target detection method based on improved YOLO v3 |
CN112232214A (en) * | 2020-10-16 | 2021-01-15 | 天津大学 | Real-time target detection method based on depth feature fusion and attention mechanism |
CN112396002A (en) * | 2020-11-20 | 2021-02-23 | 重庆邮电大学 | Lightweight remote sensing target detection method based on SE-YOLOv3 |
WO2021139069A1 (en) * | 2020-01-09 | 2021-07-15 | 南京信息工程大学 | General target detection method for adaptive attention guidance mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |